This brief history of data explores the causes of the explosion in big data, a process that reaches back to the 1970s, examines the fundamental problems associated with data, and offers some solutions to rogue data issues in 2018.
The quantity of data in the world, and available to the individual businesses which store them, has been growing rapidly in recent years. Most companies will likely be unable to say precisely when their data became big data: so vast that nothing can be done with them unless automated processes have been configured as part of their overall usage. This includes ensuring that the data are all validated, cleaned and well integrated at the entry point to the data platforms, for instance by using Spotless Data's machine learning filters.
Most companies do, though, recognise the day they realised that they had to start using their big data, either to seize new opportunities (the early adopters) or merely to stay afloat (the late adopters) in a world where the companies which use their big data most effectively are emerging as the winners, while the rest face annihilation. Some companies are still not using their data to best advantage, but with the arrival of GDPR in May giving EU citizens ownership of the data held about them, companies will either have to start using these data or delete them before they enter their platforms.
Data began moving out of filing cabinets and into digital form with the rise of the personal computer from the mid-1970s. The next development came in the 1990s, when the Internet enabled computers to share data with one another, and the real explosion in the amount of data in the world began.
The massive popularity of smartphones, with over 60% of web pages now accessed through these devices, has accelerated the process of data becoming big data: it has enabled far more people to access the Internet and so to generate vast quantities of data within the repositories of the websites they visit and use.
In recent years the cloud has been another significant innovation fuelling the expansion of data. The cloud is fundamentally a way of safeguarding data: if the office burns down or thieves break in, the only loss is the hardware, as the data are already stored safely online in a cloud repository and can be downloaded onto a new device with no hassle at all. Yet few doubt that one consequence of cloud storage has been for companies to store ever-increasing amounts of data, typically more nowadays than can fit on any individual device.
The roots of Spotless Data, and of our involvement with data and the issues they generate, go back over a decade, when a team of software developers began working with another team who were building a TV search engine. Data, and the quality of those data, were absolutely critical to the success of that project, as is the case with so many projects in 2018.
The data we were using came from two sources. First, publicly available, free TV listings, which contained some inaccuracies (our subsequent experience indicates that while free data are likely to include errors and corruptions, paying for data is no guarantee that they will be valid and clean). Secondly, data scraped from multiple websites by employees who, like anyone doing the same task repeatedly, occasionally make mistakes. If ten employees each make one mistake a day, that is fifty mistakes a week: far too many for the data to safely enter the data platform.
Early in that initial project, during basic routine testing, we noticed that the search engine was giving poor results, and so began our ongoing battle with the rogue data causing the errors. Across the whole variety of projects these two teams have worked on together ever since, this battle against rogue data has recurred time and again, giving Spotless Data substantial experience of particular types of rogue data.
We have developed some solutions to eliminate these dirty data, which include:
(a) Number validation, to check numbers of decimal places and minimum and maximum values, which can be defined by the data owner.
(b) String validation, to ensure maximum and minimum lengths, which again can be defined by the data's owner, as well as regular expression validation.
(c) Lookup rules, which check a field against a predefined set of data stored in a separate file.
(d) Unique field validation checks, which make sure that one or more fields in the file are unique, deleting any duplicate rows.
(e) Session validation rules, for where two or more date fields with the same format are present in a row.
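To make the five rule types concrete, here is a minimal sketch of row-level checks in plain Python. This is an illustration only, not Spotless Data's actual implementation: every function name, field name, limit and lookup value below is a hypothetical assumption.

```python
import re

# (c) hypothetical lookup set, standing in for data stored in a separate file
LOOKUP_CHANNELS = {"BBC One", "ITV", "Channel 4"}

def validate_row(row, seen_ids):
    """Return a list of rule violations for one row; an empty list means clean."""
    errors = []

    # (a) number validation: at most two decimal places, min and max values
    price = row.get("price")
    if price is None or round(price, 2) != price or not (0.0 <= price <= 999.99):
        errors.append("price out of range")

    # (b) string validation: length limits plus a regular expression
    title = row.get("title", "")
    if not (1 <= len(title) <= 200) or not re.fullmatch(r"[\w ,.!?'-]+", title):
        errors.append("bad title")

    # (c) lookup rule: the field must appear in the predefined set
    if row.get("channel") not in LOOKUP_CHANNELS:
        errors.append("unknown channel")

    # (d) unique-field validation: flag rows whose key has already been seen
    if row["id"] in seen_ids:
        errors.append("duplicate id")
    seen_ids.add(row["id"])

    # (e) session validation: the start date must not be after the end date
    if row.get("start", "") > row.get("end", ""):
        errors.append("start after end")

    return errors
```

A caller would keep only the rows for which `validate_row` returns an empty list, sending the rest aside for repair or deletion.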
So we realised early on how important validating, cleaning and integrating data are to ensuring data quality you can trust. Data integration typically means integrating different datasets, such as data scraped from different websites, with one's own data, which in a large organisation can themselves be inconsistent, especially where data governance has not historically been strong and departments have traditionally organised their own data. The result is that you can trust your data again, as your company used to when your data were stored in filing cabinets.
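As a minimal sketch of that kind of integration, the snippet below merges two keyed datasets, preferring the trusted in-house value wherever the sources disagree. The datasets, field names and merge policy are invented for illustration; real integration would also involve the validation and cleaning described above.

```python
# Two hypothetical sources describing the same programmes, keyed by programme id.
in_house = {101: {"title": "News at Ten", "channel": "ITV"}}
scraped = {101: {"title": "news at ten", "genre": "News"},
           102: {"title": "Film Night", "genre": "Cinema"}}

def integrate(primary, secondary):
    """Merge two keyed datasets; on conflicting fields, the primary source wins."""
    merged = {}
    for key in primary.keys() | secondary.keys():
        record = dict(secondary.get(key, {}))  # start with the secondary fields
        record.update(primary.get(key, {}))    # the primary source overwrites conflicts
        merged[key] = record
    return merged
```

Here the merged record for programme 101 keeps the in-house title but gains the scraped genre, while programme 102, absent from the in-house data, is taken wholly from the scraped source.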
In July 2017 it was estimated that 90% of the data in the world had been generated in the previous two years, with 2.7 zettabytes then in existence. Thanks to GDPR, 2018 is already looking to be the year when businesses worldwide begin to treat their data seriously, both as an asset and as a liability. Spotless Data has the ideal solution in our machine learning filter, available and easy to use through an API, which filters out and, where required, modifies all the bad, rogue data so that your now valid data are ready to enter your data platforms and be used.
The Internet has created large amounts of data, much of which is of use to third-party companies, either for general purposes, such as a hedge fund wanting economic news to inform its financial decisions, or for specific purposes, such as a site specialising in a particular genre of television wanting to link to videos of that genre which content providers have made available online. These firms either build their own scrapers or use web scraping software to create vast quantities of unstructured data.
All these advances have created the big data world of today, where getting data to work for your company is the key to transforming it. But these disparate data from many sources, tagged with different metadata as well, will contain errors and inconsistencies, and can truly be called rogue data. If data are the new oil, then, just as with real oil, they need refining to be of any use at all: rogue data will corrode your platforms as surely as crude oil would wreck the engine of your car.
You can read our introduction to using our API to validate your data, and take advantage of our free offer of 500Mb of data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on cleaning an EPG file, which also explains how to use our API.
We use the https protocol, guaranteeing that your data are secure while in our care and not accessible by any third party, a responsibility we take very seriously.
Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile inside a blue circle, found in the bottom right-hand corner of any page on our site.
If data quality is an issue for you, or you have known sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try it now.