Data cleaning is the first step towards ensuring data quality in six key sectors of the economy that are having to deal with the new challenges and opportunities which modern data represent.
Data have always been around, but the quantity was tiny in the past compared to the present, and so was their importance. Of course manufacturers needed precise data to create their products, transport operators needed them to keep a complex railway system or airport running smoothly, and airlines needed them to ensure their aeroplanes were in the right place at the right time. Yet these data were relatively simple, certainly compared to the complex, multivariate data these organisations have to deal with in 2018.
While in recent blogs Spotless has explored how data affect each of these six sectors, there are certain factors they all have in common. These include having to deal with data from multiple sources, which use different metatags and different names for the same thing.
So perhaps one source talks of Amazon Web Services while a second talks of AWS. Most educated human beings realise these are the same set of remote computing services, but part of the problem with the new data these sectors are dealing with is that they are big data: so enormous that it is impossible for humans to analyse them successfully. What is typically required is big data analytics software which can automatically analyse the entirety of the data (or a subset thereof), producing business intelligence that can then be read and absorbed by the humans who need it. And analytics software will fail to realise that AWS and Amazon Web Services are the same thing, treating them as separate entities.
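To make the problem concrete, here is a minimal sketch of the kind of name normalisation involved, assuming a hand-built alias table; the table and function name are illustrative, not part of any particular product.

```python
# Map the different names sources may use for the same entity onto one
# canonical form. The alias table here is illustrative only.
CANONICAL = {
    "aws": "Amazon Web Services",
    "amazon web services": "Amazon Web Services",
}

def canonicalise(name: str) -> str:
    """Return the canonical name for a known alias, else the input unchanged."""
    return CANONICAL.get(name.strip().lower(), name.strip())

records = ["AWS", "Amazon Web Services", "aws ", "Google Cloud"]
print([canonicalise(r) for r in records])
# Without a step like this, an analytics tool counts "AWS" and
# "Amazon Web Services" as two separate entities.
```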
Quite apart from these inconsistency issues, data also contain inaccuracies, often a surprising number of them. So a piece of data may claim that AWS was founded in 1992 (before Amazon even existed) instead of 2002. While this particular error may not matter, there is no question that errors do matter, and being able to identify the errors within a data set makes analysis easier and the resulting business intelligence more reliable and simpler to understand.
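The founding-year example above is exactly the sort of error a simple validation rule can catch. The sketch below is illustrative, with hypothetical field names; it flags any record whose "founded" year predates Amazon itself (1994):

```python
# Illustrative validation rule: flag records whose "founded" year is
# impossible given a known constraint (no Amazon service can predate
# Amazon itself, founded in 1994). Field names are hypothetical.
AMAZON_FOUNDED = 1994

def check_founded(record: dict) -> list:
    """Return a list of error messages for this record (empty if clean)."""
    errors = []
    year = record.get("founded")
    if year is not None and year < AMAZON_FOUNDED:
        errors.append(
            f"'founded' year {year} predates Amazon ({AMAZON_FOUNDED})"
        )
    return errors

print(check_founded({"name": "AWS", "founded": 1992}))  # flags the error
print(check_founded({"name": "AWS", "founded": 2002}))  # passes: []
```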
We are a team of data scientists who, for the last 12 years, have been working with the data of a whole range of clients, from multinationals to SMEs, operating within these six sectors. We have seen time and again how the quantity of data has exploded and how the twin pillars of rogue data, inaccuracies and inconsistencies, have reduced the value of said data, leaving the companies using the data struggling to make any sense of them.
While getting data to aid rather than hinder a company's success is a multi-step process, we at Spotless were struck by how often, in company after company, data cleaning was needed to transform rogue data into quality data that could be trusted. During this time we had plenty of opportunities to look at the various data validation services on the market, most of which involved downloading a software app onto various devices and then attempting to clean the data with it. We came to the conclusion that these programmes lacked an interactive element, so there was no way of telling objectively how effective they were at data cleaning.
For these reasons, we decided to set up a unique API-based data cleaning service, and Spotless Data was born. With an API there is no need to download software to multiple devices, as any device with access to a web browser can use our service. We also introduced an interactive element: clients upload their rogue data and then receive a report from our automated systems indicating what we believe the data issues are, along with suggested solutions. However, we leave it to the client to decide which of our suggestions to take on board, and how, by offering a set of specifications through which they can then actually clean their data. The client chooses the specifications they want to use. After all, the client knows their data, and what those data are required to do, better than anyone.
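The report-and-specifications cycle can be sketched as follows. Note that the issue codes, report fields and specification names below are placeholders invented for illustration; they are not the actual API.

```python
# Hypothetical sketch of the interactive cycle: the service reports
# suspected issues, the client accepts only the fixes they want.
# All field names and issue codes here are illustrative placeholders.

report = [
    {"row": 3, "issue": "alias", "suggestion": "normalise 'AWS'"},
    {"row": 7, "issue": "bad_date", "suggestion": "drop 'founded' = 1992"},
]

def apply_specs(report: list, accepted: set) -> list:
    """Return only the suggested fixes the client has signed off on."""
    return [item for item in report if item["issue"] in accepted]

# The client accepts name normalisation but chooses to keep the dates.
fixes = apply_specs(report, {"alias"})
print(fixes)
```

The point of the design is in that last step: the service proposes, but the client, who knows their data best, disposes.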
While these processes can be automated for known sources of dirty data (and please do ask us about this once you have some experience of using our API), we would always recommend having a human set the initial specifications when dealing with new sources of data.
You can read our introduction to using our API to validate your transport data and then try out our service on your My Filters page, though you need to log in first. You can take advantage of our offer of 500Mb of free data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.
We use the HTTPS protocol, which guarantees your data are secure while in our care and not accessible to any third party, a responsibility we take very seriously.
Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any page on our site.
If data quality is an issue for you, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.