Exploring data preparation issues facing modern companies

When a business employee touches a button to start off processes driven by validated, clean, well-integrated data, everything works smoothly and seamlessly.

The key to success with data is to get as many automated processes in place as possible, and the best place to start is by getting Spotless machine learning filters to validate, clean and integrate the data before they enter your platform.

IBM has recently predicted that there will be 700,000 job vacancies for data scientists in 2020. Given that such a quantity of data experts simply doesn't exist, two consequences are that experienced and well-qualified data scientists will demand higher salaries, and that unqualified IT experts will try to do data science on the hoof, often after being cajoled into doing so by their bosses. Even so, there will be a massive shortfall of data science expertise.

A further consequence is that automated processes will become more critical to those organisations struggling with the challenges of big data. Given that data scientists spend 60-90% of their working lives on data cleaning tasks, and that this is the most repetitive and tedious part of their job, the preparation of big data is an obvious area to automate.

Companies want to embrace clean data, but how to begin?

As the Spotless team of data scientists, our experience is that until a couple of years back the primary challenge with the companies we dealt with was to persuade them to use their big data at all. Scepticism abounded on every side about whether the big data were really as valuable as we were saying. What we were actually saying is that if your company fails to use its big data, it will simply not exist in five years' time, an observation we stand by, and one increasingly supported by similar statements from others. Nowadays, however, we never find our customers questioning the value of the data they have, as everyone recognises that the losers in 2018 and beyond are going to be those who fail to use their big data to the maximum. Our customers now instinctively understand this.

Instead, we just have to explain exactly how they can use their data to stand out from their competitors, who will also be doing whatever they can to use their own data. 62% of businesses are expected to use some form of machine learning on their data in order to maximise their value and do useful things they couldn't do before. Until very recently machine learning was almost always feared by companies as something immensely complicated, beyond the capabilities and resources of their enterprise.

However, with Google this week making their online machine learning courses available to everyone, few doubt that now is a great time to take the plunge and see how this set of techniques can help transform your company and keep it among the winners. Yet this can only be so when the data that underpin and power the machine learning are quality, validated data. Otherwise the data will merely undermine any attempt to implement the machine learning, which is only ever as good as the data which drive it. Data have been said to be the new oil, and just as oil powers aeroplanes, trucks and trains, so data power machine learning and artificial intelligence. But data can only do so when they have been adequately refined, such as with the Spotless API solution.

At Spotless we use our own machine learning filters, which, like all machine learning, develop and grow and learn through their own experiences, in this case of processing data. However, we don't pretend to offer solutions to all your big data needs. All we can do is validate and clean the data and then integrate the various data sets before they enter your data platforms, ensuring an end product whose data quality you can trust.
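To make the clean-then-integrate order concrete, here is a minimal sketch in plain Python. It is not the Spotless filters or API; the field names (email, name, order_id) and both helper functions are hypothetical, chosen only to show why cleaning has to happen before two data sets are joined.

```python
def clean_record(record):
    """Trim stray whitespace and normalise the join key (email) to lower case."""
    cleaned = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }
    cleaned["email"] = cleaned["email"].lower()
    return cleaned


def integrate(customers, orders):
    """Attach each order to its customer, joining on the cleaned email."""
    by_email = {}
    for record in customers:
        cleaned = clean_record(record)
        by_email[cleaned["email"]] = {**cleaned, "orders": []}

    unmatched = []  # orders with no matching customer are reported, not dropped
    for record in orders:
        cleaned = clean_record(record)
        customer = by_email.get(cleaned["email"])
        if customer is None:
            unmatched.append(cleaned)
        else:
            customer["orders"].append(cleaned["order_id"])
    return by_email, unmatched


# Hypothetical sample data: the same customer is spelt differently in each set.
customers = [{"email": " Ann@Example.com ", "name": "Ann"}]
orders = [
    {"email": "ann@example.com", "order_id": "A1"},
    {"email": "bob@example.com", "order_id": "B7"},
]

merged, unmatched = integrate(customers, orders)
```

Without the cleaning step, the stray spaces and capital letters in the customer's email would stop order A1 from ever matching, and the integrated data would quietly lose it.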

The reality is that people use data for many things, from displaying a website to working out how much new stock a retailer needs to specifying how a car should be built. Yet the one thing that almost all businesses want from their data is sufficient big data analytics to be able to offer great reporting. We find this a useful benchmark: data which are ready for analytics and reporting (processes which can themselves be carried out with one of the many analytics and reporting programmes available on the market) are likely to be clean and valid data.

The deceptiveness of clean looking data

However, the fact that your Tableau software is not giving off lots of nulls and errors when you try to turn your data into graphs for reporting is no guarantee that your data are valid, clean and well-integrated. Deceptively clean-looking data can be far more dangerous to companies which are new to exploiting their big data than really dirty-looking, rogue data, obviously full of errors and inconsistencies. The latter just make your Tableau software completely useless, whereas the former appear to give you legitimate information, on which you then base extremely important decisions about your company or, say, its marketing campaign. This is almost certain to result in complete disaster, wasting your company's valuable money, time and resources.

This problem can be, and often is, exacerbated if the report the executive is reading appears to be good news. Internal bias means we all want our companies to be doing well rather than badly. If we read information that we want to hear, and naively believe that the unvalidated data underpinning the news are accurate, we are unlikely to question the erroneous report. So we at Spotless work to the rule that the only valid, clean, well-integrated data are those which have undergone a thorough cleansing process. It isn't the companies which merely use data that are going to survive the coming years, but those which use validated and clean data.
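As an illustration of how data can look clean while still being wrong, here is a minimal sketch in plain Python. It is not how the Spotless filters work; the column names (order_id, amount, order_date), the sample rows and the three rules are assumptions made up for this example.

```python
from datetime import date


def find_issues(rows):
    """Return (row index, problem) pairs for rows that break simple business rules."""
    issues = []
    seen_ids = set()
    for index, row in enumerate(rows):
        # Rule 1: an order should appear only once.
        if row["order_id"] in seen_ids:
            issues.append((index, "duplicate order_id"))
        seen_ids.add(row["order_id"])

        # Rule 2: an order amount should never be negative.
        if row["amount"] < 0:
            issues.append((index, "negative amount"))

        # Rule 3: the date must exist in the calendar.
        try:
            date.fromisoformat(row["order_date"])
        except ValueError:
            issues.append((index, "impossible date"))
    return issues


# Hypothetical sample rows: every field is filled in, so nothing looks missing.
rows = [
    {"order_id": "A1", "amount": 19.99, "order_date": "2018-03-01"},
    {"order_id": "A1", "amount": 19.99, "order_date": "2018-03-01"},
    {"order_id": "A2", "amount": -250.0, "order_date": "2018-03-02"},
    {"order_id": "A3", "amount": 42.0, "order_date": "2018-02-30"},
]

# The only check a quick glance at a report effectively performs: any blanks?
has_blanks = any(
    value is None or value == "" for row in rows for value in row.values()
)

issues = find_issues(rows)
```

Here has_blanks is False, so a chart built on these rows would draw without complaint, yet three of the four rows fail a rule: one is a duplicate, one has a negative amount and one is dated 30 February.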

Helping start your data preparation with Spotless Data

You can read our introduction to using our API to validate your data. You can take advantage of our offer of 500MB of free data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol, guaranteeing your data are secure while they are in our care and that they are not accessible by any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.


If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.