Data Validation


When your company's data are valid, everything that depends on those data works seamlessly, like clockwork.

Data validation is the process of ensuring that the data you hold are valid and fit for purpose rather than contaminated with rogue data. If you fail to validate your company's data, you are taking a big risk which may backfire in a whole range of catastrophic ways.
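To make the idea concrete, a row-level validation check can be as simple as a set of rules that each value must satisfy. The sketch below is a minimal, hypothetical Python illustration, not Spotless Data's implementation; the column names and rules are invented for the example.

```python
# A minimal illustrative sketch of row-level validation (hypothetical,
# not Spotless Data's engine): each rule returns True when a value is
# fit for purpose, and rows that break any rule are flagged as rogue.
from datetime import datetime

def is_valid_date(value: str) -> bool:
    """True if the value parses as an ISO date, e.g. '2017-04-21'."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_valid_duration(value: str) -> bool:
    """True if the value is a positive number of minutes in one day."""
    return value.isdigit() and 0 < int(value) <= 24 * 60

rules = {"broadcast_date": is_valid_date, "duration_minutes": is_valid_duration}

rows = [
    {"broadcast_date": "2017-04-21", "duration_minutes": "60"},
    {"broadcast_date": "21/04/2017", "duration_minutes": "-5"},  # rogue row
]

for row in rows:
    failures = [col for col, rule in rules.items() if not rule(row[col])]
    print("OK" if not failures else f"rogue data in: {failures}", row)
```

A real pipeline would of course apply far richer rules, but even checks this simple catch the kind of rogue rows described below.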

We are continually surprised at the sheer quantity of errors to be found in data, sometimes supplied by respectable multinational corporations, and caused by a whole variety of reasons. On top of this come the problems created by blending various data sources into a single data repository (such as a data lake or a data warehouse), where a lack of consistency compounds the problems the errors already create.

800 pages of data on one person

We know that businesses simply cannot afford to get their data wrong, whether those data are used for legal compliance, displayed on a website, collected from Internet of Things sensors, or fed into marketing and sales campaigns, internal business intelligence and reporting. When a journalist recently received 800 pages of data about her year as a moderately heavy user of a dating website, most people dwelt on how much data one website held on one person. Yet from next year, within the European Union, such data must legally be provided to anyone who requests them, and this is a trend that is likely to extend worldwide.

The question we asked ourselves is: how many errors were there within those 800 pages of data? Not that we think dating sites are more likely to produce data errors than any other company. Indeed, they may be among that minority of companies which already have the processes in place to validate their data effectively and thus guarantee that they are error-free, not merely for this one customer but for the thousands, if not millions, of customers on their books. Nor, it appears, did the journalist go through those 800 pages with a fine-tooth comb looking for errors. But if somebody were to do so and were to find rogue data that violated one of the strict new data and privacy laws such as GDPR, it could give the company concerned serious problems.

Data blending issues

Our surprise at seeing so many errors in the data we have passed through our machine learning filters is partly because we know what a devastating impact a single error can have. If a hedge fund scrapes large quantities of data from multiple sources and then lets its sophisticated and expensive artificial intelligence programme analyse them and make buying and selling decisions based on them, a single error in the data could cost the fund millions of dollars of its clients' money. Scraping information for later analysis is particularly vulnerable to rogue data because of the inconsistencies that arise when blending data from different sources into one repository. To blend these data successfully into one large whole, every last piece must be properly validated.

At Spotless Data, our goal is simply to ensure that all your data are validated as they are ingested into your repository and before they enter your platforms. We do this through our machine learning filters, which filter out rogue data by modifying or removing them and quarantining anything suspicious. The result is data so spotlessly clean and so fit for purpose that every last piece has been properly validated. All your data thus retain their integrity throughout their lifecycle, from the moment they leave us ready to be ingested into the repository of the company which owns or manages them.
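As a rough illustration of the behaviour described above, the hypothetical Python sketch below shows how a filter might pass a row through unchanged, modify it into a consistent form, or quarantine it for review when blending two sources; the field names, date formats and rules are assumptions made for the example, not the actual Spotless Data engine.

```python
# A hedged sketch of pass / modify / quarantine filtering during blending
# (hypothetical field names and rules, not Spotless Data's implementation).
import re

def filter_row(row: dict) -> tuple[str, dict]:
    """Return ('clean' | 'modified' | 'quarantined', row)."""
    title = row.get("title", "").strip()
    if not title:
        return "quarantined", row              # suspicious: no title at all
    date = row.get("date", "")
    # One source uses slashes as date separators; normalise to ISO dashes.
    if re.fullmatch(r"\d{4}/\d{2}/\d{2}", date):
        return "modified", {**row, "date": date.replace("/", "-")}
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        return "clean", row
    return "quarantined", row                  # unrecognised date format

source_a = [{"title": "News at Ten", "date": "2017-05-30"}]
source_b = [{"title": "Film Night", "date": "2017/05/30"},
            {"title": "", "date": ""}]

for row in source_a + source_b:
    print(filter_row(row))
```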

Genre data

At the heart of our data validation process is the report you receive within a minute of uploading your data to our Python API, which is easy to access as it is simply a webpage. The report contains our suggestions as to where your data have problems that cause them to fail validation, and the best way of fixing those problems. You can take a look at our video on data cleaning a genre column and see for yourself how many errors we found while cleaning a genre column from an external source that we needed to ingest into an EPG.

Blanks and mismatched data can completely mess up an EPG, possibly giving people wrong information about what appears on television and when. Genre information is also very important to the viewer experience: someone who sits down expecting a comedy and finds a war documentary instead is likely to be dissatisfied with the EPG that led them to believe they were about to watch something much more light-hearted. Genres matter for recommendations too; an EPG that recommends a news programme to its soap viewers, based on an incorrect genre, is not likely to be taken very seriously by its viewers in the future. And while a broadcast-time mistake caused by a failure to validate numbers is likely to be spotted quickly, hopefully before the EPG goes live or very shortly afterwards, a genre error is much harder for humans to spot unless they have a profound knowledge of television genres, yet it can have an equally negative effect on the users who discover the mistake when actually using the EPG.
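As a purely illustrative sketch, the Python snippet below shows one way a genre column could be checked against a list of known genres, flagging blanks and mismatches and suggesting a likely correction; the genre list and the use of difflib are assumptions for the example, not a description of how our filters work.

```python
# A minimal sketch of validating a genre column before it reaches an EPG.
# The genre list and the difflib-based suggestion are illustrative only.
import difflib

KNOWN_GENRES = {"Comedy", "Drama", "Documentary", "News", "Soap", "Sport"}

def validate_genre(value: str):
    """Return ('valid' | 'blank' | 'mismatch', suggested value or None)."""
    value = value.strip()
    if not value:
        return "blank", None
    if value in KNOWN_GENRES:
        return "valid", value
    suggestion = difflib.get_close_matches(value, sorted(KNOWN_GENRES), n=1)
    return "mismatch", (suggestion[0] if suggestion else None)

for genre in ["Comedy", "documetary", "", "War"]:
    print(repr(genre), "->", validate_genre(genre))
```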

Data validation with the Spotless Data solution

You can read our introduction to using our API and then try out our service on your My Filters page, though you need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.
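For readers who like to see a request before reading the full introduction, the snippet below is a hedged illustration of uploading a file to a web-based API with Python's requests library. The endpoint URL, field names and authentication header are placeholders rather than the documented Spotless Data API, so please follow the introduction above for the real parameters.

```python
# Hypothetical example of posting a file to a validation API over HTTPS.
# The URL, header and field names below are placeholders, not the real API.
import requests

API_URL = "https://api.example.com/filters/"   # placeholder endpoint
API_KEY = "your-api-key"                       # obtained after signing up

with open("epg_genres.csv", "rb") as data_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Token {API_KEY}"},
        files={"data": data_file},             # the file to be validated
    )

response.raise_for_status()
print(response.json())                         # the validation report
```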

We use the HTTPS protocol, guaranteeing that your data are secure while they are in our care and are not accessible to any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile inside a blue circle, found in the bottom right-hand corner of any page on our site.

    Blog posts about Data Validation

    April 21, 2017, 6:47 a.m.
    Spotless Data version 9 includes data validation, substitution and lookalike improvements for better data quality. We have just launched Version 9 of our unique data quality web-based API solution, which includes a new rule type, known as a data validation Rule, as well as significant enhancements to our rules engine, which have been driven by machine learning. Here are the five fundament...
    April 28, 2017, 6:56 a.m.
    Data warehouses will only work properly when they contain quality data. A data warehouse is a repository or storage area where all the data in one's company are kept in a single place. This includes data from different sources as well as both current and historical data, perhaps from a legacy platform. It can consist of data from the company itself, which if the company is a large one mig...
    Feb. 3, 2017, 7:36 a.m.
    A handshake as a symbol of trust is never more important than with data quality. Can you trust the quality of your data? Nobody doubts that in 2017 most companies need to exploit their data, including their dark data, to use them to their maximum potential. The question Spotless Data wants to ask you is, "can you trust your data?" The fundamental definition of data quality is th...
    Dec. 15, 2016, 7:53 a.m.
    Spotless Data version 5 release focusses on fixing blank entries, reporting, sessions and handling errors. We have been working hard over the autumn to update our Spotless Data unique web-based data quality API solution, which cleans the data at the speed of business and lets you regularly clean large quantities of your big data of the errors caused during data entry. You can thus ensu...
    May 30, 2017, 6:48 a.m.
    Data integration is one of four examples where using Spotless Data's machine learning filters can ensure you have a seamless process. Spotless Data is a unique web-based Data Quality API solution for cleaning your data so that they have the data quality you can trust and your company stands out among its rivals. As a part of our ongoing blogs designed to explain h...
    May 12, 2017, 6:51 a.m.
    Dirty or rogue data will always have a profound effect on any organisation. Spotless Data have recognised five fundamentally different types of cleansing of dirty data which can be done by using our unique web-based data quality API solution. 1. Regex Regex are regular expressions, which define search patterns and identify particular strings found within data sets. For instance, if you k...
    Nov. 29, 2016, 9:51 a.m.
    Our unique browseable API is the swiftest and surest route to data quality. Spotless Data's unique web-based data quality API solution takes care of your big data throughout their lifecycle in order to ensure that they remain free from contamination or corruption from the moment you receive or generate your data until they are no longer required. We have identified six stages within t...
    Dec. 5, 2016, 6:52 a.m.
    High quality data can transform Tableau data visualisation. Tableau, the data visualisation software, is a way of illustrating data that is particularly useful when dealing with data which change over time. It has great mapping functionality, with a number of geographic identifiers built into the software, such as country, region and sometimes postcode. However, when using Tableau in ord...
    Dec. 12, 2016, 7:52 a.m.
    Examining data quality within the context of TV show titles. TV title importance TV titles can be a source of great frustration for websites that base their business model on having easily identifiable TV show names. A title can make or break a TV show, but a successful name from the perspective of a TV production company is not the same thing as a good name from the point of view of websi...
    Oct. 3, 2016, 1:41 p.m.
    Getting the data just right has never been easier thanks to the Spotless Data Validation Solution to Rogue Data. We’ve just been talking to a client who’s had bad data prevent their business from reporting any of their key KPIs for the last three months. All they knew was that they’d implemented a change in one of their systems and suddenly all the data they were reporting on was cle...