Data Validation

A tooth wheel illustrating Spotless' seamless data validation

When your company has valid data, everything those data do for it works seamlessly, like clockwork.

Data validation is the process of ensuring that the data one has are valid, so that they are fit for purpose rather than contaminated with rogue data. If you fail to validate your company's data, you are taking a big risk, which may backfire in a whole range of catastrophic ways.

We never fail to be surprised at the sheer quantity of errors to be found in data, at times sourced from respectable multinational corporations, and arising for a whole variety of reasons. On top of this come the problems created by blending various data sources into a single data repository (such as a data lake or a data warehouse), where a lack of consistency compounds the already real problems created by errors.

800 pages of data on one person

We know that businesses simply cannot afford to get their data wrong, whether the data are held for legal compliance, displayed on a website, collected from Internet of Things sensors, or used for marketing and sales campaigns and internal business intelligence and reporting. When a journalist recently received 800 pages of data about her year as a moderately heavy user of a dating website, most people stopped at the thought of how much data one website held on one person. Yet as of next year, within the European Union, such data must by law be provided to anyone who requests them, and this trend is more than likely to extend worldwide.

The question we asked ourselves was: how many errors were there within those 800 pages of data? Not that we think dating sites are more likely to produce data errors than any other company. Indeed, they may be among the minority of companies which already have processes in place to validate their data effectively and thus guarantee that the data are error-free, not merely for this one customer but for the many thousands, if not millions, of customers they have on their books. Nor, it appears, did the journalist go through those 800 pages with a fine-tooth comb looking for errors. However, if somebody were to do so and were to find rogue data errors that violated strict new data protection and privacy laws such as GDPR, it could cause that company serious problems.

Data blending issues

Our surprise at seeing so many errors in much of the data that we have passed through our machine learning filters is partly because we know what a devastating impact a single error can have. If a hedge fund scrapes large quantities of data from multiple sources and then lets its sophisticated and expensive artificial intelligence programme analyse them and make buying and selling decisions based on them, a single error in the data could result in the fund losing millions of dollars of its clients' money. Scraped information is particularly vulnerable to rogue data issues because of the inconsistencies that arise when data from different sources are blended into one repository. Blending these data successfully, so that they form one coherent whole, requires that each and every one of them is properly validated.
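
To make the blending problem concrete, here is a minimal sketch in plain Python. It is an illustration only and not a description of how the Spotless filters work: the feed names, field names and date formats are all hypothetical. It shows two sources that write the same date differently being normalised to one format as they are merged, with any record that does not match its source's expected format set aside rather than blended in.

```python
from datetime import datetime

# Hypothetical per-source date formats; real feeds would need their own entries.
SOURCE_DATE_FORMATS = {
    "feed_a": "%d/%m/%Y",
    "feed_b": "%Y-%m-%d",
}


def blend(sources):
    """Merge records from several sources, normalising 'date' to ISO 8601.

    Returns (blended, rejected). 'rejected' holds (source_name, record)
    pairs whose date is missing or does not match that source's format.
    """
    blended, rejected = [], []
    for name, records in sources.items():
        fmt = SOURCE_DATE_FORMATS[name]
        for record in records:
            try:
                parsed = datetime.strptime(record["date"], fmt)
            except (KeyError, ValueError):
                rejected.append((name, record))
                continue
            blended.append(
                {**record, "date": parsed.date().isoformat(), "source": name}
            )
    return blended, rejected


# Made-up sample data: the last record uses the wrong format for feed_b.
sources = {
    "feed_a": [{"symbol": "ABC", "date": "03/04/2018", "price": "10.5"}],
    "feed_b": [
        {"symbol": "ABC", "date": "2018-04-03", "price": "10.5"},
        {"symbol": "XYZ", "date": "04-03-2018", "price": "7.2"},
    ],
}

blended, rejected = blend(sources)
print(len(blended), "blended,", len(rejected), "rejected")
```

Recording the source name on every blended record is a deliberate choice in this sketch: when an inconsistency surfaces later, it can be traced back to the feed it came from.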

At Spotless Data our goal is simply to ensure that all the data you have are validated as they are ingested into your repository and before they enter your platforms. We do this through our machine learning filters, which filter out rogue data, either modifying or removing them, and quarantine any suspicious data. The result is data so spotlessly clean and so fit for purpose that every last piece has been properly validated. All the data thus have integrity throughout their lifecycle, from the moment they leave us, ready to be ingested into the repository of the company which owns or manages them.
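
The general pattern of modifying what can be repaired and quarantining what cannot can be sketched in a few lines of Python. This is a simple rule-based illustration under stated assumptions, not the machine learning filters themselves: the record fields ("title", "start") and both rules are hypothetical.

```python
from datetime import datetime


def triage_record(record):
    """Classify one record as 'clean', 'modified' or 'quarantined'.

    Hypothetical rules: 'title' must be non-empty after trimming
    whitespace, and 'start' must parse as an HH:MM time.
    """
    title = (record.get("title") or "").strip()
    if not title:
        # A missing title cannot be repaired automatically.
        return "quarantined", record

    cleaned = dict(record)
    status = "clean"
    if title != record.get("title"):
        # Stray whitespace is safe to repair in place.
        cleaned["title"] = title
        status = "modified"

    try:
        datetime.strptime(str(record.get("start", "")), "%H:%M")
    except ValueError:
        return "quarantined", record

    return status, cleaned


def ingest(records):
    """Split records into those safe to ingest and those held back."""
    accepted, quarantine = [], []
    for record in records:
        status, result = triage_record(record)
        if status == "quarantined":
            quarantine.append(result)
        else:
            accepted.append(result)
    return accepted, quarantine


# Made-up sample records.
records = [
    {"title": "Evening News", "start": "18:00"},
    {"title": "  Late Film ", "start": "23:15"},
    {"title": "Quiz Night", "start": "25:00"},
    {"title": "", "start": "07:00"},
]

accepted, quarantine = ingest(records)
print(len(accepted), "accepted,", len(quarantine), "quarantined")
```

Quarantined records are returned unchanged, so that whoever reviews them sees exactly what arrived.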

Genre data

At the heart of our data validation process is the report which you receive less than a minute after uploading your data to our Python API. The report is easy to access, as it is simply a webpage. It contains our suggestions as to where your data have problems that cause them to fail validation, and the best way to fix those problems. You can take a look at our video on data cleaning a genre column and see for yourself how many errors we found in a genre column from an external source which we needed to ingest into an EPG (electronic programme guide).

Blanks and mismatched data can completely mess up an EPG, possibly giving people wrong information about what appears on television and when. Genre information is also very important to the viewer experience. Someone who thinks they are about to watch a comedy but finds they are watching a war documentary instead is likely to be dissatisfied with the EPG that led them to expect something much more light-hearted. Genres are also very important for recommendations, and an EPG that recommends a news programme to its soap viewers, based on an incorrect genre, is not likely to be taken very seriously by those viewers in future. A broadcast time mistake caused by a failure to validate numbers is likely to be spotted very quickly, hopefully before the EPG goes live or very shortly afterwards. A genre column error is much harder for humans to spot unless they have a profound knowledge of television genres or a similar area of entertainment, yet it can have an equally negative effect on users who discover the mistake when actually using the EPG.
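
As a rough illustration of the kind of check a genre column needs, the Python sketch below flags blanks and values that are not on an agreed list of genres. It does not reproduce the Spotless report or API: the allowed genres, the column name and the sample rows are all made up for the example.

```python
# Hypothetical list of agreed genres; a real EPG would supply its own.
ALLOWED_GENRES = {"comedy", "documentary", "news", "soap", "drama"}


def check_genre_column(rows, column="genre"):
    """Return (row_index, raw_value, problem) for every row that fails."""
    problems = []
    for index, row in enumerate(rows):
        raw = row.get(column)
        # Compare case-insensitively and ignore surrounding whitespace.
        value = (raw or "").strip().lower()
        if not value:
            problems.append((index, raw, "blank"))
        elif value not in ALLOWED_GENRES:
            problems.append((index, raw, "not an allowed genre"))
    return problems


# Made-up sample rows.
sample_rows = [
    {"title": "Evening Bulletin", "genre": "News"},
    {"title": "Laugh Hour", "genre": ""},
    {"title": "The Long War", "genre": "Comdy"},
    {"title": "Market Street", "genre": " soap "},
]

report = check_genre_column(sample_rows)
for index, raw, problem in report:
    print(f"row {index}: {raw!r} -> {problem}")
```

A check like this catches blanks and misspellings, but not a correctly spelled genre attached to the wrong programme, which is the harder case described above.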

Data validation with the Spotless Data solution

You can read our introduction to using our API and then try out our service on your my filters page, but for this you need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol, guaranteeing that your data are secure while they are in our care and that they are not accessible by any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile inside a blue circle, which you can find in the bottom right-hand corner of any page on our site.

    Blog posts about Data Validation

    March 16, 2018, 6:48 a.m.
    A brief history of data explores the causes of the huge explosion in big data in a process that reaches back to the 1970s, examines the fundamental problems associated with data and offers some solutions to rogue data issues in 2018. The quantity of data in the world and available to the individual businesses which store them has been building in recent years. Most companies will likely be ...
    March 2, 2018, 7:14 a.m.
    Businesses need to get ready for GDPR, which comes into force in 3 months and affects all companies holding data about EU citizens. Validating the data with Spotless machine learning filters can help all data governors tackling GDPR-based data compliance issues. Spotless Data has recently begun offering our service offline, so that your precious data can be processed without ever leaving you...
    Feb. 16, 2018, 12:05 p.m.
    Achieving data quality you can trust, where everything works seamlessly, requires refining data just as surely as crude oil requires refining. Don't let rogue data destroy your platforms: use the Spotless API solution. Data is increasingly being seen as the new oil. It is unfortunate that the concept of refining these data, which are increasingly big data, does not have in the publ...
    Feb. 9, 2018, 7:25 a.m.
    The need for effective data validation has never been more important due to the explosion of big data across most sectors of the modern economy. We are a group of experienced data scientists who have started Spotless Data because of our experience of how the recent massive explosion of data has primarily affected business sectors with little experience of big data. Retail, manufacturing, hea...
    Dec. 27, 2017, 11:11 a.m.
    Cleaning and ensuring the integration of all your data is now more efficient and profound than ever with the new release, Spotless version 17. We are delighted to announce the new release Version 17 of Spotless' data validation solution for all your rogue data issues. While the main focus of this release has been various bug fixes, we have also added one new feature, which is that Spotl...
    Jan. 11, 2018, 2:44 p.m.
    Due to the rise of driverless vehicles and the Internet of Things, data has suddenly taken a central place in the transport industry. In recent years data have become increasingly critical for the transport industry, and this is set to increase rapidly with the rise of both driverless vehicles and the Internet of Things, the latter being sensors which are used to create smart transportation s...
    Jan. 16, 2018, 6:18 p.m.
    Spotless Data's version 18 release makes it easier to comply with GDPR and other regulations which protect the digital rights and privacy of citizens. We are delighted to announce the release of Spotless version 18. We have focussed this release on GDPR compliance, coming into force on May 25th, by offering our customers the possibility of having everything they need for offline pro...
    Nov. 6, 2017, 11:57 a.m.
    When your company has valid data then all those things the data does for it will work seamlessly, like clockwork. Data validation is the process of ensuring that the data one has are valid so that they are fit for purpose rather than being contaminated with rogue data. If you fail to validate your company's data you are taking a big risk which may backfire in a whole range of catastrophi...
    June 30, 2017, 9:18 a.m.
    Eliminating rogue data from numbers is one of the first steps towards ensuring trustworthy data validation. Validating numbers is always going to be important in data cleaning to ensure you have data quality you can trust in. Therefore the Spotless Data Science team have been working to ensure you can clean any list of numbers which may have errors in it. This type of dirty data can, with a s...
    June 8, 2017, 9:55 a.m.
    Spotless Data's version 10 introduces tags on blogs and a better way to ensure data validation of rogue data affecting numbers. We are delighted to announce the release of version 10 of our unique browseable data quality API solution to ensure you have data you can trust by data cleaning your rogue data of corruptions and inaccuracies and leaving them spotlessly scrubbed up. Version 10...