Data cleaning is fundamental to data integration

Data integration: seamlessly combining datasets for real data quality

Spotless Data's machine learning filters, accessible as an API through a web browser, are the ideal tool for integrating data from multiple sources.

Data integration describes the process of merging or combining two different datasets, or data from two different sources. Whether it is a retailer gathering disparate data about its customers into a single customer view, an EPG (electronic programme guide) which receives its data from various broadcasters, a hedge fund scraping financial news websites to inform its buying and selling decisions and combining these with raw stocks and shares data, a healthcare operator combining data from different hospital departments and from doctors' surgeries, or a manufacturer trying to store all its data in a single data warehouse, all these organisations face peculiarly modern data integration challenges.

Data cleaning is a key component of data integration

In our experience at Spotless, there is very little data integration that does not involve some type of data cleaning. Data warehouses and single customer views are, by definition, integrations of data about customers or a business from different sources. And it is an unusual organisation that does not have to deal with data coming from at least two different sources; the norm is 40, 50 or more.

Because the growth of data over the last few years has tended to be organic rather than planned, even medium-sized organisations mostly find that their different teams or departments have stored and sorted their data using different standards, different formats and different meta tags and descriptions for the same thing. Then there are the data which come from external sources, whose value many companies are increasingly learning to appreciate. There are also legacy data which, for whatever reason, need to be stored and, if necessary, easily retrieved, for example to comply with privacy requirements, and assimilating these legacy data is a significant component of most data integration.

All these data, if they are to be useful to the organisations which hold them, need to be integrated, through good data cleaning, as if they were one continuous dataset. Not only can these data then do all the things they need to do, such as run a sophisticated website or a complex transport system, they can also be converted, through good analytics software, into business intelligence and great reports for internal consumption within the business. Data which you have successfully integrated can be said to have data quality which can be trusted, not solely for the purposes for which they were collected but also for new purposes, such as those identified by an artificial intelligence programme or the brightest new recruit.

The machine learning filters solution to data integration

We recognise that data integration is not easy. But we are also aware of how catastrophic a failure to integrate data accurately can be, and that as current data become legacy data the problems of rogue data can be compounded, leading to a data disaster. However, we are confident that the Spotless machine learning filters solution, which is accessible through a simple API and learns from experience, can tackle a whole range of tricky data integration issues. These include number validation, where columns of numbers from different sources are integrated, with checks made for maximum and minimum values; string validation, which uses regular expressions to ensure that different descriptions of the same thing are integrated consistently; date validation, to ensure that all dates use the same format (2/2/18 and 2nd February 2018 are the same date written in two different ways; neither is necessarily right, but all dates need to follow one convention or the other); unique field checks, which verify the uniqueness of fields and rows, since data from multiple sources can contain unwanted duplicated fields and rows; and lookup checks, which are particularly useful in data integration because they check a field against a predefined set of values stored in a separate CSV file.
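To make these checks more concrete, here is a minimal sketch, in Python, of the kinds of validations described above. It is purely illustrative and is not Spotless's own implementation or API; the column values, regular expression, date formats and lookup file name are assumptions made for the example.

    import csv
    import re
    from datetime import datetime

    # Illustrative sketches of the five check types; not Spotless's actual code or API.

    def validate_number(value, minimum, maximum):
        """Number validation: the value must parse as a number within [minimum, maximum]."""
        try:
            number = float(value)
        except (TypeError, ValueError):
            return False
        return minimum <= number <= maximum

    def validate_string(value, pattern=r"^[A-Za-z0-9 '\-]+$"):
        """String validation: the value must match a regular expression (pattern is an example)."""
        return re.fullmatch(pattern, value) is not None

    def normalise_date(value, input_formats=("%d/%m/%y", "%d %B %Y"), output_format="%Y-%m-%d"):
        """Date validation: parse any accepted format and rewrite it in one agreed format.
        Assumes day-first dates; '2nd February 2018' would need its ordinal suffix stripped first."""
        for fmt in input_formats:
            try:
                return datetime.strptime(value, fmt).strftime(output_format)
            except ValueError:
                continue
        return None  # unparseable: flag the row for review

    def unique_rows(rows):
        """Unique field/row check: drop duplicate rows introduced when sources are merged."""
        seen, deduplicated = set(), []
        for row in rows:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                deduplicated.append(row)
        return deduplicated

    def load_lookup(csv_path, column=0):
        """Lookup check: load the set of permitted values from a separate CSV file."""
        with open(csv_path, newline="") as f:
            return {row[column] for row in csv.reader(f) if row}

    # Example usage (all values invented for illustration):
    print(validate_number("42.5", minimum=0, maximum=100))               # True
    print(validate_string("Acme Ltd"))                                   # True
    print(normalise_date("2/2/18"), normalise_date("2 February 2018"))   # 2018-02-02 2018-02-02
    # allowed = load_lookup("countries.csv")  # hypothetical lookup file
    # print("France" in allowed)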

Spotless for all your data integration needs

Spotless can now offer offline processing of your data to ensure GDPR compliance. You can read our introduction to using our API to validate your data and then try out our service on your my filters page, though for this you need to log in first. You can take advantage of our offer of 500MB of free data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol, guaranteeing that your data are secure while they are in our care and are not accessible to any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking on the white square icon with a smile inside a blue circle, which you can find in the bottom right-hand corner of any page on our site.

If data quality is an issue for you, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try Spotless now.