Driving Data Integrity for all Datasets


The first step towards achieving data integrity for all the different data entering your repository is to use Spotless Data's Machine Learning filters, ensuring trustworthy data quality every time.

Data integrity means having data quality you can trust throughout the data lifecycle, starting from ingestion into a data repository. At Spotless Data we focus on this entry point, the start of the lifecycle, because if data enter with integrity and consistency, achieved through the application of our Machine Learning filters, this trustworthiness will remain until they become legacy data at the end of their lifecycle.

Data has been described as the new oil, and while there certainly are parallels, there are also huge differences. Nobody has ever talked about the integrity of oil. Extracting oil and then refining it was the work of a tiny number of companies who would then sell their finished product to a far vaster number of businesses, who certainly did not have to worry about the quality of the oil they bought.

With data, things are very different, and the need for data integrity sums up these differences perfectly. Most companies, across the whole range of business types that mark our modern societies, produce vast quantities of data themselves, while many also take in data from third-party sources, all of which they then have to store in a data repository, typically a data lake or data warehouse.

These crude data can be considered rogue data, full of inaccuracies and about as useful to modern businesses as unrefined oil is to a fleet of trucks.

Types of Rogue Data

We can divide rogue data issues into two broad categories: basic mistakes, or dirty data causing corruptions within the data, and inconsistencies, caused by data from different sources using either different metadata tags or expressing the same information in different ways. If an EPG service has data which include the broadcast times for all the shows appearing on television in the coming days, and one show carries a date in 2016, this can be considered an outright mistake. If, on the other hand, the data for the EPG come from two different sources, and the first source uses a 05/11/2017 date format while the second uses a 5/11/17 date format, neither format is wrong per se, but they are inconsistent with each other. And perhaps the EPG prefers the 5 November, 2017 date format, in which case both of the previous dates are examples of rogue data lacking integrity, which need to be modified before they enter the platform that feeds the actual EPG.
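As a rough illustration of the kind of normalisation involved, the sketch below parses dates written in either source format and rewrites them in the EPG's preferred style. It is a minimal, self-contained example of the technique, not Spotless Data's implementation; the function and format names are ours.

```python
from datetime import datetime

# The two source formats from the example above. The two-digit-year
# format must be tried first, because %Y will also accept a two-digit
# string and silently parse "17" as the year 17 AD.
SOURCE_FORMATS = ["%d/%m/%y", "%d/%m/%Y"]

def normalise_date(raw: str) -> str:
    """Parse a date in any known source format and rewrite it
    in the EPG's preferred style, e.g. '5 November, 2017'."""
    for fmt in SOURCE_FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        # Build the target string portably (%-d is not universal).
        return f"{dt.day} {dt.strftime('%B')}, {dt.year}"
    raise ValueError(f"unrecognised date format: {raw!r}")

print(normalise_date("05/11/2017"))  # 5 November, 2017
print(normalise_date("5/11/17"))     # 5 November, 2017
```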

Spotless Data's Machine Learning Filters

We at Spotless Data know that raw data from multiple sources, ingested into your data repository before either entering your data platforms or being subjected to data analytics to extract business intelligence, are simply not going to have the data integrity they require. They need this integrity both to do what they were designed for and, in the case of artificial intelligence, to do other things as well which had not even occurred to the data science team that designed these algorithms and artificial intelligence programmes. We also know that in many cases, especially with the Internet of Things (IoT), data need to be ingested very rapidly into platforms to be able to do their work. For instance, a programme that controls traffic lights by ingesting IoT data on traffic along a city's busy roads, so that traffic is not unnecessarily delayed as it is when lights run on fixed timers rather than responding to real traffic movement and jams, needs its information in real time; data from five minutes ago are already out of date.

We have thus developed a solution which is easy to use because it is accessible through an API and, once you have set it up, is also extremely quick at cleaning new data with known dirty-data issues. You can design our API into the ingestion phase of your data repository so that, instead of chaotic data entering your platform and causing mayhem, you have data that are cleaned at the point of ingestion, so that from this point onwards in the data lifecycle they are data with integrity.
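As a sketch of what such an ingestion hook might look like, the example below posts a raw file to a cleaning endpoint before loading the result into the repository. The URL, field names and filter identifier are assumptions made for illustration, not the documented interface; please consult our API introduction for the real one.

```python
import requests

# Hypothetical endpoint and credentials, for illustration only.
API_URL = "https://api.spotlessdata.com/v1/clean"  # assumed URL
API_KEY = "your-api-key"

def clean_at_ingestion(raw_path: str, filter_id: str) -> bytes:
    """Run a raw file through a cleaning filter before it ever
    reaches the data lake, returning the cleaned file contents."""
    with open(raw_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            data={"filter": filter_id},  # assumed field name
            files={"file": f},
            timeout=60,
        )
    response.raise_for_status()
    return response.content

# Ingestion pipeline: clean first, then load into the repository.
cleaned = clean_at_ingestion("epg_feed.csv", filter_id="epg-dates")
with open("epg_feed.clean.csv", "wb") as out:
    out.write(cleaned)
```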

Refining oil requires expensive and sophisticated technology, but once this is in place the refining itself is relatively straightforward and follows known paths. Data are much more complicated, and thousands of different problems can appear.

With this very much in mind, we have come up with two solutions to this general problem of varied data issues. The first is that our filters, which filter out rogue data and correct them so that they pass the data integrity test, use machine learning. This means that they learn from experience. While the data issues facing your company may be fairly specific to you, it is likely that the same problems will occur day after day and week after week, i.e. known sources of dirty data. Our Machine Learning filters gain experience of these particular data cleaning issues and are then able to clean similar data seamlessly every time, leaving data without any errors. The second is that when you initially submit data to our API in a CSV or TSV file, Spotless produces a report outlining what it thinks are the rogue data issues preventing data integrity. It is then you, the owner of the data, who sets the specifications to clean the data.
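A sketch of that two-step workflow might look like the following, where the endpoints and the shape of the specification are illustrative assumptions rather than the documented API.

```python
import json
import requests

BASE = "https://api.spotlessdata.com/v1"  # assumed base URL
HEADERS = {"Authorization": "Bearer your-api-key"}

# Step 1: submit a CSV and receive a report of suspected rogue data,
# e.g. inconsistent date formats or out-of-range years.
with open("epg_feed.csv", "rb") as f:
    report = requests.post(
        f"{BASE}/analyse", headers=HEADERS, files={"file": f}, timeout=60
    ).json()
print(json.dumps(report, indent=2))

# Step 2: as the data owner, set the cleaning specifications yourself
# (hypothetical schema), tweaking them until the output is right.
spec = {
    "columns": {
        "broadcast_date": {"type": "date", "output_format": "%d %B %Y"},
        "year": {"type": "int", "min": 1950, "max": 2030},
    }
}
requests.post(f"{BASE}/filters", headers=HEADERS, json=spec, timeout=60)
```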

We realise that you may, in the beginning, need to tweak these specifications. For this reason, we are giving away 500MB of free data cleaning to all our new customers, so you can test our service and see how well it cleans your data without having to commit yourself, and also tweak the cleaning settings so that the data are exactly what you want them to be.

Using the Spotless Data Solution

You can read our introduction to using our API and then try our service on your My Filters page, but for this you need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our videos on data cleaning an EPG file and data cleaning a genre column, both of which explain how to use our API.

We guarantee that your data are secure while they are in our care and that they are not accessible to any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of Subscription Packages and Pricing.

If data quality is an issue for you, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try our service now.