Managing your data lake

A diver in a data lake is much more secure when the lake has data quality

Swimming in a lake of data requires, above all, that the data be transformed into quality data, for instance using Spotless Machine Learning Filters.

It is widely recognised that the simplest way to ensure data quality is to minimise the variety of sources your data come from. As Berkeley machine learning professor Michael Jordan puts it: "data variety leads to a decline in data quality". However, minimising the variety of data in your data lake is not the ideal solution, given the tremendous value of data, even dark data, and how it drives most businesses in 2017. Reducing the variety of your data is one way you might improve its quality. But if your competitors ensure their data quality by other means, such as using Spotless Data's API solution to transform dirty data, through data cleaning, into data which is spotlessly free of duplications, mismatches, inaccuracies and other corruptions, then they are likely to give their customers more satisfaction and become the leaders while you struggle to avoid bankruptcy.

Sources of data

Big data rarely come from a single source; indeed, they can often come from 40, 50 or more different sources, creating tremendous variety in the data and often resulting in the same item being described in many different ways. For instance, Amazon Web Services can also be written as AWS, but they are the same thing! Yet the winners are those who successfully integrate these many data sources rather than those who try to reduce their number.
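
To make this concrete, here is a minimal sketch, in Python, of how the same item arriving under different names from different sources might be mapped to a single canonical label before it reaches your data lake. The alias table and field names are purely illustrative assumptions for this example, not part of any Spotless Data product.

```python
# A minimal sketch of alias normalisation across data sources.
# The alias table and field names are illustrative only.

ALIASES = {
    "aws": "Amazon Web Services",
    "amazon web services": "Amazon Web Services",
    "gcp": "Google Cloud Platform",
    "google cloud": "Google Cloud Platform",
}

def canonical_provider(name: str) -> str:
    """Map a provider name from any source to one canonical label."""
    key = name.strip().lower()
    return ALIASES.get(key, name.strip())

records = [
    {"source": "billing", "provider": "AWS"},
    {"source": "crm", "provider": "Amazon Web Services"},
]

for record in records:
    record["provider"] = canonical_provider(record["provider"])

print(records)  # both records now carry "Amazon Web Services"
```

Even a lookup table this simple removes one common class of mismatch before the data are integrated; in practice the table would be far larger and maintained alongside the rest of your data-cleaning rules.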

The sheer volume of the data and their disparate sources make this essentially a data management issue, whether the data are intended for human beings or for computers, the latter case being machine learning.

Machine learning demands data quality

Our own definition of data which pass the data quality test is that they can do not only what they were designed to do but other, new things as well. These are data you can trust. This is particularly important in machine learning, and the artificial intelligence it drives, because a machine learning programme may well want to do things it has not specifically been programmed to do, but it can only do so if the data quality allows it.

While artificial intelligence will always have a place in processing big data so enormous that even a large team of humans could not process them in the time-frame available, denying these programmes the data quality that allows them to innovate seems to negate their very purpose. Once machine learning programmes have mastered and understood the big data in a way no human has been able to, their innovations may appear obvious, at least to them, and will ultimately be based entirely on the logic of the algorithms that drive them. Yet the human programmers, not having mastered the complexities of the big data at the design phase, may have no idea what innovations the machine learning programmes will come up with once they are released and set to work.

Data quality is the heart of machine learning

This is why data quality is so important and will remain at the heart of machine learning, at least for the foreseeable future. To achieve this data quality, businesses tend either to build an internal data lake, a single repository where all the data are stored within the confines of the organization, or to use some form of cloud system to achieve the same goal.

To give some idea of what a data lake might look like for a big organization, a large bank recently employed 150 people over a two-year period simply to build its data lake, with most of that time and effort spent ensuring the data quality in the lake was correct. Banks, like many other commercial organizations, use data from multiple sources, which creates not merely mismatching problems but also problems of inaccuracy. There are rarely cast-iron guarantees that the data you want to put into your data lake, often on a daily basis, are accurate, and a single mistake can have disastrous consequences, corrupting thousands of other pieces of data unless it is rectified by cleansing into quality data. Data lakes full of poor quality data are so common they even have their own name: the data swamp.
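
As a rough illustration of the kind of check that keeps a daily feed from turning a data lake into a data swamp, the sketch below validates incoming rows and quarantines the bad ones rather than loading them. The column names and rules are assumptions made for this example, not a prescription.

```python
import csv

# Illustrative schema for a daily feed; real feeds will have their own columns.
REQUIRED_COLUMNS = {"customer_id", "amount", "currency"}

def validate_rows(path: str):
    """Split a daily CSV feed into clean rows and rejects before loading it into the lake."""
    clean, rejects = [], []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"feed is missing columns: {missing}")
        for row in reader:
            try:
                float(row["amount"])        # amount must be numeric
                assert row["customer_id"]   # id must be present
                clean.append(row)
            except (TypeError, ValueError, AssertionError):
                rejects.append(row)         # quarantine rather than load bad data
    return clean, rejects
```

The design choice is simply to reject and review suspect rows instead of letting a single bad record ripple through thousands of downstream pieces of data.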

Using Spotless to ensure your data lake works seamlessly

Take a look at our introduction to using our browseable API. You can also test the API on your My Filters page, though you will need to be logged in first to see it. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API. You can sign up to Spotless Data using your email address or your Facebook, Google or GitHub account.
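
For illustration only, a request to a file-cleaning API over HTTP might look something like the sketch below. The URL, filter id, token and field names are placeholders invented for this example; the browseable API and your My Filters page show the real endpoints and parameters to use.

```python
import requests  # third-party HTTP client

# Placeholder values: the URL, filter id and token below are illustrative,
# not Spotless Data's documented API. Check the browseable API for real names.
API_URL = "https://api.example.com/v1/filters/123/clean"
TOKEN = "your-api-token"

with open("epg_listings.csv", "rb") as data_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Token {TOKEN}"},
        files={"file": data_file},
    )

response.raise_for_status()
print(response.json())  # e.g. a job id or a link to the cleaned file
```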

To help you see for yourself how smoothly our API works, we are giving away 500MB of free data cleansing to each new customer.

We guarantee that your data remain secure and are not made available to any third parties while they are in our care. If problems do come up during the data cleansing process, an automated flag alerts our data scientists, who will manually review the problem and, if necessary, contact you via your log-in details to resolve the issue together.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile inside a blue circle, found in the bottom right-hand corner of any page on our site.

If data quality is an issue for you, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try Spotless now.