More efficient data validation with Spotless

A cogwheel illustrating Spotless' seamless data validation

When your company's data are valid, everything that depends on them works seamlessly, like clockwork.

Data validation is the process of ensuring that your data are valid and fit for purpose rather than contaminated with rogue data. If you fail to validate your company's data, you are taking a big risk that may backfire in a whole range of catastrophic ways.

We never fail to be surprised at the sheer quantity of errors to be found in data, at times data sourced from respectable multinational corporations, and arising for a whole variety of reasons. On top of this come the problems created by blending various data sources into a single data repository (such as a data lake or a data warehouse), where a lack of consistency compounds the problems that the errors have already created.

800 pages of data on one person

We know that businesses simply cannot afford to get their data wrong, whether the data are needed for legal compliance, displayed on a website, collected from Internet of Things sensors or used for marketing and sales campaigns and internal business intelligence and reporting. When a journalist recently received 800 pages of data about her year as a moderately heavy user of a dating website, most people did not get beyond the thought of how much data one website held on one person. Yet as of next year, within the European Union, such data must by law be provided to anyone who requests them, and this trend is more than likely to extend worldwide.

The question we asked ourselves was: how many errors were there within those 800 pages of data? Not that we think dating sites are more likely to produce data errors than any other company. Indeed, they may be among the minority of companies that already have processes in place to validate their data effectively and thus guarantee that they are error-free, not merely for this one customer but for the many thousands, if not millions, of customers on their books. But neither, it appears, did the journalist go through those 800 pages with a fine-tooth comb looking for errors. If somebody were to do so and found rogue data that violated strict new data protection and privacy laws such as GDPR, it could cause that company serious problems.

Data blending issues

Our surprise at seeing so many errors in much of the data that we have passed through our machine learning filters is partly because we know what a devastating impact a single error can have. If a hedge fund scrapes large quantities of data from multiple sources and then lets its sophisticated and expensive artificial intelligence programme analyse the data and make financial buying and selling decisions based on them, a single error in the data could result in the fund losing millions of dollars of its clients' money. Data scraped for later analysis are particularly vulnerable to rogue data issues because of the inconsistencies that arise when blending data from different sources into one repository. Blending these data successfully, so that they form one coherent whole, requires that each and every one of them is properly validated.
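To make the blending problem concrete, here is a minimal sketch in Python of validating records from two sources before they are merged into one repository. The sources, field names, date formats and rules are all invented for illustration, and the checks are simple hand-written rules rather than our machine learning filters.

```python
import math
from datetime import datetime

# Hypothetical records: two sources describe the same kind of event
# but disagree on field names, letter case and date format.
SOURCE_A = [
    {"ticker": "ABC", "price": "101.5", "date": "2017-10-02"},
    {"ticker": "XYZ", "price": "n/a", "date": "2017-10-02"},
]
SOURCE_B = [
    {"symbol": "abc", "close": "102.0", "day": "03/10/2017"},
    {"symbol": "XYZ", "close": "55.25", "day": "03/10/2017"},
]


def normalise(record, ticker_key, price_key, date_key, date_format):
    """Map one source record onto the shared schema, or raise an error."""
    ticker = record[ticker_key].strip().upper()
    if not ticker:
        raise ValueError("blank ticker")
    price = float(record[price_key])  # raises ValueError for text such as "n/a"
    if not math.isfinite(price) or price <= 0:
        raise ValueError("price is not a positive finite number")
    day = datetime.strptime(record[date_key], date_format).date()
    return {"ticker": ticker, "price": price, "date": day.isoformat()}


def blend(sources):
    """Validate every record from every source before it joins the blend."""
    blended, rejected = [], []
    for name, records, mapping in sources:
        for record in records:
            try:
                blended.append(normalise(record, **mapping))
            except (KeyError, ValueError) as error:
                # Keep the source name and reason so the record can be reviewed.
                rejected.append((name, record, str(error)))
    return blended, rejected


blended, rejected = blend([
    ("A", SOURCE_A, dict(ticker_key="ticker", price_key="price",
                         date_key="date", date_format="%Y-%m-%d")),
    ("B", SOURCE_B, dict(ticker_key="symbol", price_key="close",
                         date_key="day", date_format="%d/%m/%Y")),
])
```

In this sketch the record with the price "n/a" never reaches the blended set; it is kept aside with the name of its source and the reason it failed, while the remaining records share one schema and one date format.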

At Spotless Data our goal is simply to ensure that all the data you have are validated as they are ingested into your repository and before they enter your platforms. We do this through our machine learning filters, which filter out rogue data, either modifying or removing them, and quarantine any suspicious data. The result is data that are spotlessly clean and fit for purpose, with every last piece properly validated. All the data thus have integrity throughout their lifecycle, from the moment they leave us ready to be ingested into the repository of the company that owns or manages them.
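The three outcomes described above (modify, remove, quarantine) can be illustrated with a short Python sketch. It is a rule-based stand-in only: the function name, the field names and the thresholds are assumptions made for this example, and it does not represent how our machine learning filters work internally.

```python
def filter_rows(rows, min_minutes=1, max_minutes=600):
    """Split rows into clean, quarantined and removed lists.

    Hypothetical rules for programme rows with a title and a duration:
    - modify: trim whitespace and convert the duration to an integer
    - remove: completely blank rows and duplicates of an earlier clean row
    - quarantine: rows with a missing title or an implausible duration
    """
    clean, quarantined, removed = [], [], []
    seen = set()
    for row in rows:
        title = (row.get("title") or "").strip()
        raw_minutes = str(row.get("minutes") or "").strip()

        if not title and not raw_minutes:
            removed.append(row)  # nothing worth keeping or reviewing
            continue

        try:
            minutes = int(raw_minutes)
        except ValueError:
            quarantined.append(row)  # duration is not a whole number
            continue

        if not title or not (min_minutes <= minutes <= max_minutes):
            quarantined.append(row)  # suspicious, so a person should look
            continue

        key = (title.lower(), minutes)
        if key in seen:
            removed.append(row)  # duplicate of a row already accepted
            continue
        seen.add(key)
        clean.append({"title": title, "minutes": minutes})
    return clean, quarantined, removed


rows = [
    {"title": "  Evening News ", "minutes": "30"},
    {"title": "Evening News", "minutes": "30"},
    {"title": "", "minutes": ""},
    {"title": "War Documentary", "minutes": "-45"},
    {"title": "Late Film", "minutes": "ninety"},
]
clean, quarantined, removed = filter_rows(rows)
```

Keeping quarantined rows separate from removed ones is the important design choice here: a suspicious value is set aside for review rather than silently deleted or silently passed through.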

Genre data

At the heart of our data validation process is the report, which you receive less than a minute after uploading your data to our Python API and which is easy to access because it is simply a web page. The report contains our suggestions as to where your data have problems that cause them to fail validation, and the best way to fix those problems. You can take a look at our video on data cleaning a genre column and see for yourself how many errors we found in a genre column of data from an external source that we needed to ingest into an electronic programme guide (EPG).

Blanks and mismatched data can completely mess up an EPG, possibly giving people wrong information about what appears on television and when. Genre information is also very important to the viewer experience. Someone who expects a comedy but finds themselves watching a war documentary is likely to be dissatisfied with the EPG that led them to expect something much more light-hearted. Genres are also very important for recommendations, and an EPG that recommends a news programme to its soap viewers because of an incorrect genre is unlikely to be taken very seriously by those viewers in future. And while a broadcast time mistake caused by a failure to validate numbers is likely to be spotted very quickly, hopefully before the EPG goes live or very shortly afterwards, a genre column error is much harder for humans to spot unless they have a profound knowledge of all the genres of television or a similar area of entertainment. It can nonetheless have an equally negative effect on users who discover the mistake when actually using the EPG.
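As a rough illustration of what checking a genre column for blanks and mismatched values involves, the Python sketch below compares each value against a controlled list of genres and suggests a correction where one is close enough. The genre list, the function name and the report layout are assumptions for this example; it uses the standard library's difflib for fuzzy matching and is not the method behind our own report.

```python
import difflib

# Hypothetical controlled vocabulary for the genre column.
ALLOWED_GENRES = ["Comedy", "Documentary", "Drama", "News", "Soap", "Sport"]


def check_genre_column(values, allowed=ALLOWED_GENRES):
    """Return one report entry for every value that fails validation."""
    canonical = {genre.lower(): genre for genre in allowed}
    report = []
    for row_number, value in enumerate(values, start=1):
        cleaned = (value or "").strip()
        if not cleaned:
            report.append({"row": row_number, "value": value,
                           "problem": "blank", "suggestion": None})
            continue

        key = cleaned.lower()
        if key in canonical:
            # A known genre, but flag stray whitespace or letter case.
            if value != canonical[key]:
                report.append({"row": row_number, "value": value,
                               "problem": "inconsistent formatting",
                               "suggestion": canonical[key]})
            continue

        # Unknown genre: offer the closest allowed genre, if any is close.
        matches = difflib.get_close_matches(key, list(canonical), n=1)
        suggestion = canonical[matches[0]] if matches else None
        report.append({"row": row_number, "value": value,
                       "problem": "unknown genre", "suggestion": suggestion})
    return report


genre_column = ["Comedy", "comedy ", "", "Comdey", "Quiz", "News"]
report = check_genre_column(genre_column)
for entry in report:
    print(entry)
```

A value with no close match, such as "Quiz" here, gets no suggestion at all, which leaves the decision to someone who knows the genres rather than guessing on their behalf.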

Data validation with the Spotless Data solution

You can read our introduction to using our API and then try out our service on your my filters page, although you need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which explains how to use our API.

We use the HTTPS protocol, guaranteeing that your data are secure while they are in our care and are not accessible to any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the icon of a white square with a smile inside a blue circle, which you can find in the bottom right-hand corner of any page on our site.

If your data quality is an issue, or you have known sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try now.