The four processes for refining your data

Processing data from multiple sources requires a data quality solution

Achieving data quality you can trust, where everything works seamlessly, requires refining your data just as surely as crude oil must be refined. Don't let rogue data destroy your platforms: use the Spotless API solution.

Data are increasingly seen as the new oil. Unfortunately, the concept of refining these data, which are increasingly big data, does not yet hold the same place in the public mind as oil refining, even though it is just as critical to the health of whatever platforms and projects the data serve. At Spotless Data we argue that unless data have been properly refined, they are dirty or rogue data, full of errors and inconsistencies, and not merely useless to the companies attempting to store and use them but actively harmful.

The processes required to refine your data

At Spotless we have identified four critical processes that are required to refine your data and ensure that, far from impeding the success of your company, the data enhance it. Data that have been successfully refined are indeed as valuable as refined oil, and the companies which refine their data successfully will undoubtedly emerge as the winners in this and the coming decades across the business world. These processes are data validation, data cleaning, data integration and data quality. Each one describes a stage in refining data from start to finish, so that you can trust them before they enter your platforms. The difference between rogue data, full of bugs, errors and metatag inconsistencies, and reliable data is as stark as the difference between running a car on refined petrol and on crude oil: the crude will simply destroy the engine. Data that have not been refined have the same hugely negative effects on the platforms they enter, and indeed throughout all the systems of any business unwise enough to use them.

Spotless has developed its machine learning filters to tackle each of these four data processes, which are the keys to refining data and leaving them in a spotless state. These filters are, in effect, the data refineries of the 21st century: artificial intelligence built on machine learning algorithms.

Data validation

All the data which enter your platform should be valid, and the last thing you should do is assume that your data are valid unless they have been through a thorough refining process. Spotless Data's machine learning filters mostly either delete or modify data, based on a set of configurations that you, the data's owner, set up. However, because we know how critical it is that invalid data do not enter your databases, our filters also have the capacity to quarantine data when they are unsure how to proceed. The quarantined data may be only part of a file, in which case the invalid data are removed, the rest of the file is quarantined and an email alert is sent to the user. PagerDuty can also be used in these circumstances, especially if quarantining is happening regularly, which with complex rogue data is far from impossible.
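To make the delete/modify/quarantine distinction concrete, here is a minimal sketch in Python of how a three-way verdict might work. This is an illustration only, not the actual Spotless API; the rule format and all names, such as channel_id, are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    clean: list = field(default_factory=list)        # records that passed every rule
    quarantined: list = field(default_factory=list)  # records the filter is unsure about

def validate(records, rules):
    """Apply each rule to each record.

    A rule returns True (valid), False (invalid: drop the record)
    or None (unsure: quarantine the record for review).
    """
    result = ValidationResult()
    for record in records:
        verdicts = [rule(record) for rule in rules]
        if all(v is True for v in verdicts):
            result.clean.append(record)
        elif any(v is None for v in verdicts):
            result.quarantined.append(record)
        # records with a definite False verdict are silently dropped
    return result

# Hypothetical rule: a channel id must be a positive integer; anything
# unparsable is "unsure" and gets quarantined rather than guessed at.
def channel_id_rule(record):
    value = record.get("channel_id")
    if isinstance(value, int):
        return value > 0
    return None

result = validate([{"channel_id": 7}, {"channel_id": "seven"}], [channel_id_rule])
print(len(result.clean), len(result.quarantined))  # 1 1
```

In a real pipeline, a non-empty quarantine list is what would trigger the email or PagerDuty alert described above.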

Data cleaning

Data cleaning means removing or substituting all the dirty data in a dataset, with any invalid fields replaced by clean data. Examples of cleaning include fixing unique keys which have been corrupted and are hence wrong or useless; using lookup rules to check one or more fields against a predefined set of valid values; fixing corrupted or inconsistent date and time formats (where inconsistency might be the difference between 9/11 and 11/9 as the US and non-US renderings of 11th September or 9th November); correcting inaccurate numbers in a column of numbers; and using our machine learning filters to restore or add missing data based on lookalike models. This last is a superb example of how Spotless learns for itself and becomes better over time.
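As a small illustration of the date-format problem, here is a sketch of how inconsistent dates might be normalised to a single ISO 8601 format. It is a simplified stand-in for what a cleaning filter does, and the list of candidate formats is an assumption for the example.

```python
from datetime import datetime

# Candidate formats tried in order; the first that parses wins. The right
# ordering depends on knowing the feed's origin: a US feed should try
# %m/%d/%Y before %d/%m/%Y, a non-US feed the reverse.
KNOWN_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]

def normalise_date(raw, formats=KNOWN_FORMATS):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if nothing parses."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparsable: a candidate for quarantine rather than a guess

print(normalise_date("11/09/2001"))  # '2001-09-11' with the non-US-first ordering
```

Note that "11/09/2001" parses successfully under both of the first two formats, which is exactly why the 9/11 versus 11/9 ambiguity cannot be resolved by parsing alone and needs a rule about where the feed came from.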

Data integration

Arguably, data integration is the most challenging part of data refining. Typical problems include fields which lack consistency, data which conflict with other data or appear to do so, metadata issues and missing datasets. The main cause of data integration issues is having data from different sources, typically third-party sources. Larger businesses often also need to integrate data from different departments within the organisation, as imposing data or formatting consistency on the staff in those departments can be a challenging task, especially when each department has traditionally been in charge of its own data but the data have suddenly become so big that they must all be integrated together. However, thanks to Spotless you no longer need to worry so much about these inconsistencies in how your company formats and tags its data; we can fix the issue for you, because we have developed business rules for data which ensure referential integrity between different datasets, regardless of where those datasets have come from.

Some of the things we do to ensure the seamless and effective integration of your data include validating and cleaning all fields, making sure that a primary key or compound primary key is unique within a database, and matching foreign keys to primary keys.
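As a rough sketch of the key checks just mentioned, assuming simple rows represented as Python dictionaries (the channel and programme field names are hypothetical):

```python
def duplicate_keys(rows, key_fields):
    """Return the set of (compound) primary keys that occur more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates

def unmatched_foreign_keys(child_rows, fk_field, parent_keys):
    """Return child rows whose foreign key has no matching primary key."""
    return [row for row in child_rows if row[fk_field] not in parent_keys]

channels = [{"id": 1}, {"id": 2}, {"id": 2}]       # duplicated primary key
programmes = [{"title": "News", "channel_id": 3}]  # dangling foreign key

print(duplicate_keys(channels, ["id"]))                          # {(2,)}
print(unmatched_foreign_keys(programmes, "channel_id", {1, 2}))  # the News row
```

Rows flagged by either check are precisely the ones that break referential integrity when datasets from different sources are merged.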

Data quality 

Only data which meet your data quality standards should be allowed to enter your platforms. These data will do what you want them to do, and more besides, given that data are typically used for multiple purposes. Our definition of data quality involves measuring the quality of the datasets themselves. Our machine learning algorithms automatically generate business rules which you can easily customise to meet your specific business requirements.

Data quality might mean that each feed entering your platform is both well-formed and consistent with every other feed entering it. Or that the data entering your data lake are well referenced as well as well-formed. Or that your internal business data, coming from different platforms, are both consistent and valid because of the specific business rules you have defined. Data quality might also mean removing overlaps or filling gaps so that previously corrupted data work properly. Or it might mean that data scraped from the Internet, notoriously difficult as they are in their raw state due to both errors and inconsistencies, now work smoothly because the data that fail to conform to expectations have been removed or quarantined before they can enter the platform.
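One simple way to put a number on data quality, sketched below, is to measure the fraction of rule checks a dataset passes. This is a toy metric invented for illustration, not Spotless's own scoring, and both rules are assumptions made for the example.

```python
def quality_score(records, rules):
    """Fraction of rule checks passed across the whole dataset (1.0 = spotless)."""
    checks = passed = 0
    for record in records:
        for rule in rules.values():
            checks += 1
            passed += bool(rule(record))
    return passed / checks if checks else 1.0

rules = {
    "has_title": lambda r: bool(r.get("title")),
    "valid_duration": lambda r: isinstance(r.get("minutes"), int) and r["minutes"] > 0,
}
feed = [{"title": "News", "minutes": 30}, {"title": "", "minutes": -5}]
print(quality_score(feed, rules))  # 0.5
```

A score like this, tracked per feed over time, is one way of turning "measuring the quality of the datasets" into something you can monitor and alert on.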

Spotless' data refining solution

Spotless Data can now offer offline processing of your data to ensure GDPR compliance. You can read our introduction to using our API to validate your data, and take advantage of our free offer of 500Mb of data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol, so your data are secure while they are in our care and are not accessible by any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile inside a blue circle, which you can find in the bottom right-hand corner of any page on our site.

If data quality is an issue for you, or you have known sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try Spotless now.