Machine learning and data quality

A robot reading a book requires data quality to work properly

Data quality is the key to successful machine learning, and our machine learning filters are the key to quality data.

Machine learning is where computers learn to do things that they were not specifically designed to do. Traditional definitions of data quality describe it as data which can do what they were designed to do. Spotless Data has recognised that this is an out-of-date definition which fails when it comes to machine learning and data integrity in general.

This traditional definition also fails when it comes to future-proofing the data one has or is likely to have. To be able to do new things with big data without having to completely overhaul the platform that took such effort and resources to create: now that is data quality! It is also the essence of machine learning, as well as a sign that all the data in one's data repository have been validated.

The concept of machine learning

The concept of machine learning, which emerged in the late 1950s from the fields of pattern recognition and computational learning theory, is to create algorithms which can learn from the data they have, analyse those data and make predictions from them.
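To make the idea of "learning from data and then predicting" concrete, here is a minimal sketch in Python. It is not taken from any particular product: it fits a straight line to a handful of made-up example points by ordinary least squares and then uses the fitted line to predict a value it has never seen. The function names and the sample figures are assumptions chosen purely for illustration.

```python
def fit_line(xs, ys):
    """Learn a slope and intercept from example points (ordinary least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the learnt line to predict y for a new x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: hours of viewing vs. programmes finished.
hours = [1.0, 2.0, 3.0, 4.0]
programmes = [2.0, 4.0, 6.0, 8.0]

model = fit_line(hours, programmes)
print(predict(model, 5.0))  # prediction for an input the model was never shown
```

Nothing in the code states the rule "two programmes per hour"; the model recovers it from the examples, which is the essential point of machine learning. Real systems use far richer models, but the learn-then-predict shape is the same.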

Advanced machine learning programmes can even write their own algorithms to do new things that had not occurred to their data scientist human programmers, or that the humans would have liked to do in theory but had not a clue how to do in practice. Some of these algorithms are so complex that no human being can understand them, either because they are too complicated or simply because the data are too big and it would take years or a lifetime to grasp the data and fully understand what the Machine Learning programme is doing with them.

Examples of machine learning

Well-known examples of machine learning include:

1. Driverless Vehicles: where the machine learning software installed in the vehicle learns how to drive and navigate roads and obstacles for itself, much as each human driver has had to do when learning to drive.

2. Fraud Detection: where machine learning programmes sift through vast quantities of transaction data to spot and flag possible fraud or attempts at fraud, a perennial problem for all retailers.

3. Recommendations: where the machine learning trawls through the habits of users to recommend new things based on what they like. Classic examples are television, where the data concerning the programmes you watch or display an interest in watching allow the machine learning software to identify other shows you would like; and Facebook, where the machine learning programme works out which news items appear on your timeline based on your activity and commenting on the site.
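The recommendations example can be sketched in a few lines of Python. This is a deliberately simple illustration, assuming a made-up viewing history. It is not how any real broadcaster or social network does it. The sketch scores each unseen programme by how much the people who watched it overlap in taste with the target viewer.

```python
from collections import Counter


def recommend(history, user, top_n=2):
    """Suggest programmes watched by viewers whose taste overlaps with `user`."""
    seen = history[user]
    scores = Counter()
    for other, programmes in history.items():
        if other == user:
            continue
        overlap = len(seen & programmes)  # programmes both viewers have watched
        if overlap == 0:
            continue  # no shared taste, so this viewer tells us nothing
        for programme in programmes - seen:
            scores[programme] += overlap
    # Highest score first; ties broken alphabetically so the output is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [programme for programme, _ in ranked[:top_n]]


# Hypothetical viewing history (all names invented for illustration).
history = {
    "ana": {"News at Ten", "Space Docs", "Cook-Off"},
    "ben": {"News at Ten", "Space Docs", "Quiz Night"},
    "cat": {"Cook-Off", "Garden Hour"},
    "dan": {"Football Live"},
}

print(recommend(history, "ana"))
```

Note how sensitive this is to data quality: if "Space Docs" were also recorded as "space docs " for one viewer, the overlap would be missed and the recommendation would change.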

Data quality as the pre-requisite to machine learning

An essential and fundamental truth concerning machine learning is that even the best-designed algorithms, and everything else machine learning can do, are only going to be as good as the data the machine learning software works with. Data quality underpins the success or failure of machine learning.

Ambitious programmers who give their machine learning programmes large quantities of big data to work with are bound to be disappointed if the machine learning appears to have learnt nothing, or writes its own algorithms that then don't work. The cause lies in the poor quality of the data: dirty, and full of corruptions, mismatches, duplications and other inaccuracies. Unless the data the machine is working with are quality data which can be trusted, it is hard to see how any machine learning programme could achieve anything very much that would be considered useful or worthwhile.
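As a rough illustration of what "dirty" means in practice, the following Python sketch counts three of the problems named above in a small batch of records: duplicated keys, missing values and dates that do not match the expected format. The field names, the sample rows and the `profile` function are all assumptions made for this example.

```python
from datetime import datetime


def profile(rows, key, date_field, date_format="%Y-%m-%d"):
    """Count duplicate keys, rows with missing values and malformed dates."""
    report = {"rows": len(rows), "duplicates": 0, "missing": 0, "bad_dates": 0}
    seen = set()
    for row in rows:
        identifier = row.get(key)
        if identifier in seen:
            report["duplicates"] += 1
        seen.add(identifier)
        # A value counts as missing if it is None or blank after trimming.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            report["missing"] += 1
        try:
            datetime.strptime(str(row.get(date_field)), date_format)
        except ValueError:
            report["bad_dates"] += 1
    return report


# Hypothetical transaction records, three of them deliberately dirty.
rows = [
    {"id": "t1", "amount": "19.99", "date": "2017-03-01"},
    {"id": "t1", "amount": "19.99", "date": "2017-03-01"},  # duplicate
    {"id": "t2", "amount": "", "date": "2017-03-02"},       # missing amount
    {"id": "t3", "amount": "5.00", "date": "03/02/2017"},   # mismatched date format
]

print(profile(rows, key="id", date_field="date"))
```

A report like this only detects problems. It does not fix them, but it shows how quickly a modest batch can accumulate errors that would mislead any algorithm trained on it.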

But if the data have already been cleaned before the machine learning software has to deal with them, i.e. at the point of entry, then a different and much brighter story will emerge. This is where Spotless Data comes in, ensuring that the machine learning software has spotlessly clean data to work with in the first place. Then, ambitious programmers are likely to be astounded by the sheer ability of the machine learning programme both to learn for itself and to do those things it had not been designed for but which are tremendously useful and profitable for the company which owns it.

Spotless Data's unique web-based API solution to dirty data can be built into your machine learning software in its design or build phase. Alternatively, you can simply pass your data through our API before they enter the data lake or data warehouse where the machine learning software will start to work with them. The result is machine learning software and algorithms that will allow your company to stand out among its competitors and attract the lion's share of the pool of potential customers.
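The general pattern of cleaning data before they reach the warehouse can be sketched as follows. This Python example does not call the Spotless Data API, whose endpoints and parameters are not described in this article. Instead, `clean_record` is a hypothetical local stand-in for whatever cleaning step you use, and the "warehouse" and "quarantine" are plain lists.

```python
def clean_record(record):
    """Hypothetical cleaning step: return a cleaned copy, or None to reject."""
    title = (record.get("title") or "").strip()
    genre = (record.get("genre") or "").strip().lower()
    if not title or not genre:
        return None  # cannot be repaired automatically
    return {"title": title, "genre": genre}


def ingest(records, warehouse, quarantine):
    """Clean every record at the point of entry, before it is stored."""
    seen = set()
    for record in records:
        cleaned = clean_record(record)
        if cleaned is None:
            quarantine.append(record)  # keep the original for manual review
            continue
        key = (cleaned["title"].lower(), cleaned["genre"])
        if key in seen:
            continue  # drop duplicates that only differed in spacing or case
        seen.add(key)
        warehouse.append(cleaned)


# Hypothetical incoming programme listings.
incoming = [
    {"title": " Space Docs ", "genre": "Documentary"},
    {"title": "Space Docs", "genre": "documentary "},  # same item, messier
    {"title": "", "genre": "Drama"},                   # no title
]

warehouse, quarantine = [], []
ingest(incoming, warehouse, quarantine)
print(len(warehouse), len(quarantine))
```

The design choice worth noting is that rejected records are set aside rather than silently discarded, so nothing dirty reaches the machine learning software and nothing is lost either.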

Take a look at our introduction to using our browseable API. You can also test our API on your "my filters" page, though you will need to be logged in first to see it. Please do sign up for our service using your email address or your Facebook, Google or GitHub account. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing where we offer 500Mb of free data cleaning to each new customer. If you would like to contact us you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

Spotless Data, the one stop data quality solution API!

If data quality is an issue for you, or you know that you have sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try now.