Without veracity, expressed through data quality, your big data are worse than useless, a positive menace.
The three traditional Vs used to describe big data are Volume, Variety, and Velocity. Between them, they portray the basic characteristics of data so vast and complex that they have earned this memorable name, one of the buzzwords of the present era. For us at Spotless Data, however, these three Vs alone are unsatisfactory. They are all apt descriptions of big data, but we believe there needs to be a fourth V, Veracity, meaning that the big data speak the truth, if they are going to be useful and dynamic drivers of change and prosperity for your business. The reality is that any particular big data may or may not be fit for purpose, and this fitness depends upon their Veracity: that they are truthful, quality data which both you and your Artificial Intelligence software programmes can trust and do new things with.
Without Veracity, your big data, which seem to offer so much promise, may turn your pristine data lake into a mud pit, give a false view of your company, derail your sales and marketing campaigns and make your website, products and services a laughing stock. All of this loses you customers, costs your business time and money, and does genuine damage to its reputation and brand. This is why using Spotless Data's Machine Learning filters to remove the rogue data that rob your big data of veracity is so critical. It ensures you don't merely have big data, but that they do both what you want them to do and whatever new things your artificial intelligence software discovers it can do with them.
First of all, let us take a look at the three traditional Vs of big data.
The name says it all. Big data are simply staggering in size, or volume. Yet it does not require a crystal ball to see that in a few years' time the amount of big data available in 2017, both worldwide and to individual companies, will seem minute, such is the expected explosion of big data in the coming decade. "What?" people will say, "Facebook only had 250 billion images stored on its website in 2017?" While that number appears unquestionably enormous to us, it will seem small by 2027.
The driving factor behind the explosion in big data has been the Internet and its child, social media. Until the World Wide Web became popular, data not stored in filing cabinets were kept on individual computers that were not in touch with each other. While the early Internet began to generate much greater quantities of data, it was only with the rise of social media that billions of people started adding to the general pool of data, typically by sharing status updates and images, especially from their smartphones. All this was of great interest to advertisers. With the gradual but implacable rise of the Internet of Things, with its smart homes and smart cities, we can expect a far higher volume of data to appear very shortly.
Given that big data are so enormous, it should be no surprise that they are also, inevitably, hugely varied. Indeed, part of what makes simple data big is their astonishing variety. So, like volume, variety can be considered an inherent quality of big data. Social media and other Internet activities, such as online shopping, mean that most of us leave massive data trails behind us every day of our lives, and these footprints comprise many different types of data. Even SMEs nowadays have to deal with 50 or more different data types. And as technology evolves, much of today's data becomes legacy data, in formats that differ from the new ones being developed and deployed.
One example which illustrates very well how variety is key to big data comes from the fintech world of lending money to individuals. Traditionally, financial institutions have made these loans by creating credit ratings for people based on factors such as their credit history and the size of their previous loans.
The most forward-looking financial institutions are now planning to use much more varied methods based on big data to make these important loan decisions (important to all of us, considering how the great financial crash of 2008 was caused by toxic loans, many of them personal). Factors such as a person's educational qualifications, employment history, place of residence, social media history, browsing history and even how much sleep they get are all now considered. Among other things, the loan companies and banks are checking whether the person taking out the loan is actually telling the truth on their application form. To process such vast quantities of variable big data they are using Artificial Intelligence, simply because no human could process data on this scale with the quick turnaround modern business requires.
However, blending data from a high variety of sources is a notorious cause of rogue data, because of the different standards and specifications each source uses. Data blending is an activity where the best solution is to begin with data cleaning, to ensure quality data, using Spotless Machine Learning filters, which are highly specialised in this type of work and get better over time as they learn from experience.
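To make the blending problem concrete, here is a minimal Python sketch (the sources, field names and date formats are invented for the example, and this is not the Spotless API itself): two feeds record the same date field in different formats, and each must be normalised to a single standard before the records can safely be blended.

```python
from datetime import datetime

# Hypothetical feeds: each records the "joined" date in its own format,
# a classic source of rogue data when the two are blended.
source_a = [{"customer": "Ana", "joined": "2017-03-14"}]   # ISO year-month-day
source_b = [{"customer": "Ben", "joined": "14/03/2017"}]   # day/month/year

def normalise(record, fmt):
    """Parse the source-specific date format into one ISO standard."""
    record = dict(record)
    record["joined"] = datetime.strptime(record["joined"], fmt).date().isoformat()
    return record

# Normalise each source with its own format before blending them.
blended = ([normalise(r, "%Y-%m-%d") for r in source_a] +
           [normalise(r, "%d/%m/%Y") for r in source_b])

print(blended)  # every "joined" value is now "2017-03-14"
```

Without the normalisation step, a naive merge would leave two incompatible date formats in the same column, and any later sort or comparison on that column would silently go wrong.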
Speed is at the heart of big data, whether it is a hedge fund programme hedging its bets on the stock market, a social media company rooting out extremist content and fake news, or an EPG updating its daily television content. Velocity is becoming ever more essential in our 21st-century society and is unquestionably one of the defining features of useful, as opposed to useless, big data.
However, whereas volume and variety define big data per se, it is possible to have big data sitting in a data warehouse which are already out of date and do not do very much. Velocity thus defines a useful feature of big data, to the point where we can characterise big data without velocity as useless.
Veracity or truthfulness is the fourth V. Big data without veracity are, like big data without velocity, truly worse than no big data at all because the damage they can do when they contain rogue data greatly outweighs any advantage they may offer.
The basic theory behind big data is that they are stored in a repository of some kind, typically a data lake or a data warehouse. They are then used either directly, for instance as data in an EPG for television shows appearing on particular TV channels over the coming week, or as data about the prices of goods on a shopping website; or indirectly, where analytics software is applied to them so that they can serve as business intelligence for internal reporting or for deciding such important things as the next data-driven marketing or sales campaign. They can also be used directly by Machine Learning programmes, both to do the things the data scientists who designed the tools intended and, in the best of cases, other, new things which had not occurred to their programmers.
If the data do not have veracity, they may produce an EPG which, instead of telling your customer what is on a TV channel at a particular time, may misinform them so that when they switch on to watch their favourite show, they find that it started twenty minutes ago. Or they find that it was actually on yesterday, in which case they won't even know why it isn't on unless they check out your competitor's more accurate EPG.
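As a hypothetical sketch of the kind of check a cleaning filter might run on EPG data (the channel, times and show titles are invented), the following Python snippet flags consecutive shows on the same channel whose time slots overlap, exactly the fault that would misinform a viewer about when their favourite show starts:

```python
# Hypothetical EPG rows: (channel, start, end, title), times as "HH:MM" strings.
schedule = [
    ("BBC One", "20:00", "21:00", "Drama"),
    ("BBC One", "20:40", "21:30", "Quiz"),   # starts before "Drama" has ended
    ("BBC One", "21:30", "22:00", "News"),
]

def find_overlaps(rows):
    """Flag consecutive shows on the same channel whose slots overlap."""
    problems = []
    rows = sorted(rows, key=lambda r: (r[0], r[1]))  # by channel, then start
    for prev, cur in zip(rows, rows[1:]):
        # "HH:MM" strings compare correctly as text within one day.
        if prev[0] == cur[0] and cur[1] < prev[2]:
            problems.append((prev[3], cur[3]))
    return problems

print(find_overlaps(schedule))  # → [('Drama', 'Quiz')]
```

A real filter would of course handle midnight wrap-arounds, time zones and multi-day listings, but even this toy check catches the overlap that would have sent your viewer to the wrong time slot.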
On a shopping website, your customers may be delighted to find the new iPhone on sale at $300, though you certainly will not be happy to have to honour those customers' purchase orders at a $700 loss. They will be less delighted to see pop star Adele's latest hit song also priced at $300, particularly when your rivals are selling it for the more reasonable price of $2.50. These kinds of mistakes are almost certainly caused by blanks and overlaps in your big data, something Spotless Data's Machine Learning filters specialise in removing and replacing.
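The pricing faults above can be caught with a simple range check. Here is a small Python sketch of the idea (the catalogue rows and the per-category price ranges are invented assumptions, not real Spotless logic): it flags any item whose price is blank or falls outside a plausible range for its category.

```python
# Hypothetical catalogue rows; None marks a blank left by a bad data feed.
catalogue = [
    {"item": "iPhone",      "category": "phone", "price": 300.0},
    {"item": "Hit Single",  "category": "song",  "price": 300.0},  # overlap error
    {"item": "Other Song",  "category": "song",  "price": 2.50},
    {"item": "Older Phone", "category": "phone", "price": None},   # blank
]

# Assumed plausible price range (low, high) per category, for this sketch only.
expected = {"phone": (100.0, 1500.0), "song": (0.50, 20.0)}

def suspect_rows(rows):
    """Return items whose price is blank or outside its category's range."""
    bad = []
    for row in rows:
        lo, hi = expected[row["category"]]
        if row["price"] is None or not lo <= row["price"] <= hi:
            bad.append(row["item"])
    return bad

print(suspect_rows(catalogue))  # → ['Hit Single', 'Older Phone']
```

The $300 song is flagged because a song price has spilled over from the phone column, and the blank is flagged for replacement, the two faults described above.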
Ensuring that your big data have not only the volume, variety, and velocity to keep your business competitive, and even ahead of its rivals, in the 21st century, but also the veracity which can only come from data quality you can trust, is the key first step towards managing your big data and ensuring success for any modern business.
You can read our introduction to using the Spotless API and then try our service on your My Filters page, though you need to be logged in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, both of which explain how to use our API.
We are giving away 500MB of free data cleaning to every newly signed-up customer so that you can see for yourself how seamlessly and swiftly our API filters work.
We guarantee your data are secure while they are in our care and that they are not accessible by any third party, a responsibility we take very seriously.
If your data quality is an issue, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.