How to get your big data working

Data quality is essential when getting big data to work properly.

What is big data?

Big data are data sets so large and complex that producing meaningful information from them is a challenge. Large quantities of raw data captured and stored in your data lake or data warehouse have limited value in themselves. They only become big data worth the name once they have been analysed and curated to extract useful insights, giving them the data quality you can trust to do the work they are supposed to do, whether that is creating useful business intelligence or revealing meaningful patterns.

For instance, a large quantity of data about prostate cancer is of no use by itself if the data just sit in your data lake and nobody can make sense of them. Only once the data have been analysed and useful information extracted from them, such as who is most vulnerable to the illness or possible ways of treating it, can they genuinely be called big data. Typically there are so much data that it might take a team of data scientists years to plough through them if they did not have automated applications to help them sift the data and perform the analytics that decipher them at a reasonable speed.

Many useful things that big data can do require a quick response. For instance, a hedge fund needs to analyse up-to-date stock market data in order to make the best decisions on what to buy and sell. If its data are so large and complex that its most recent analysis covers only last month, it will make decisions which lose money, because it needs the analytics and business intelligence for this month's data. So big data are large quantities of data which are analysed in a timely manner to pull out the required insights, which can then help the owners of the data either to use them effectively or to sell them on to others who can.

Recognising the importance of high-speed processing in ensuring trustworthy, quality data, the Spotless API can now process 1 million records per job (500,000 records per job for reference rules), ensuring that your data cleaning happens at the speed of business.

Big data examples

Given that both the valuation and the reputation of companies are increasingly dependent on the quality of their data, ensuring that your big data are trustworthy is not an abstract consideration; it needs to be at the heart of how your business operates. Examples of well-known companies which have prospered due to their skilful handling and effective use of big data include:

1. Uber

The transportation company use their big data to set prices, allocate resources and figure out where demand for their services exists, often in areas where public transport is poor.

2. Netflix

The entertainment company use their big data to offer TV recommendations and to predict what the viewer wants to watch, trying to discover which the next immensely popular TV show will be. Spotless matches TV show titles because so many clients complain that inconsistent TV show titles cause problems on their platforms. Being unable to tell that Only Fools & Horses and Only Fools and Horses, or Coronation St and Coronation Street, are different names for the same shows is a classic example of how corrupted big data and rapidly changing data (e.g. a TV listings service where the entire data set changes each week) can create problems for your company. The solution lies in our reference set of 17,000 TV titles, which correctly matches apparently different titles belonging to the same show, as shown in the sketch after this list.

3. Airbnb

The homestay company use their big data on how people travel and spend their holidays to target hosts letting their homes at what Airbnb have identified as peak times and popular destinations, while ensuring they do so at competitive prices. They have also developed their own recommendation and machine learning software, which predicts possible fraud in Airbnb transactions.

4. Palantir

The security company, co-founded by PayPal's Peter Thiel, analyses big data from DNA databases, social media, surveillance records and informants to fight fraud and terrorism, mixing automated methods with human intervention in the form of security experts.
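To make the Netflix example above concrete, here is a minimal sketch of how apparently different titles can be matched to a single canonical title. It assumes a simple normalisation approach and a tiny hard-coded reference list; the actual Spotless matching rules, built on 17,000 reference titles, are more sophisticated than this.

```python
import re

# A minimal sketch of title matching using simple normalisation; the real
# Spotless reference-matching rules are more sophisticated and not shown here.
REFERENCE_TITLES = ["Only Fools and Horses", "Coronation Street"]

def normalise(title: str) -> str:
    """Reduce a title to a canonical comparison key."""
    title = title.lower().replace("&", "and")
    title = re.sub(r"\bst\b\.?", "street", title)   # expand a common abbreviation
    title = re.sub(r"[^\w\s]", " ", title)          # strip punctuation
    return " ".join(title.split())                  # collapse whitespace

def match_title(raw: str):
    """Return the canonical reference title for a raw listing title, if any."""
    key = normalise(raw)
    for reference in REFERENCE_TITLES:
        if normalise(reference) == key:
            return reference
    return None  # no match: flag the row for manual review

print(match_title("Only Fools & Horses"))  # -> Only Fools and Horses
print(match_title("Coronation St"))        # -> Coronation Street
```

Anything that fails to match any reference title is returned as unmatched, so it can be flagged for manual intervention rather than silently treated as a new show.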

Big data and the Internet of Things

Another example of big data is what is known as the Internet of Things (IoT), which includes normal everyday items such as fridges, cookers, doorbell webcams, locks, heating and air-conditioning systems and televisions, as well as more novel devices such as Amazon Echo. All of these connect to the Internet and both require and generate vast quantities of data. It is estimated that 50 billion things will connect to the Internet by 2020, with an expected compound annual growth rate of 24% and a market worth an estimated $147.5 billion, according to Markets and Markets.

IoT requires both the hardware to do all the things promised, such as switching on your oven from your smartphone an hour before you arrive home, and the software not only to send the message from the phone but also to monitor the process, ensuring that if the oven fails to ignite, risking a gas leak and a fatal explosion, the whole oven system can be shut down. This can be done by, for instance, tracking the oven's temperature, creating an alert such as an SMS if the oven isn't heating in the way expected, and having the systems in place to shut the oven down from the owner's smartphone.

The simple answer is to have a disaster/recovery plan that foresees the possible problems your IoT product may cause if it goes wrong, and quality data that can flag when that plan needs to be implemented; though of course the disaster/recovery plan is itself based on data. All data quality programmes, whether through Spotless Data's data quality API at the speed of business or through software or cloud data quality solutions, take time to ensure data quality. However, IoT problems, such as the cooker that has not lit correctly, occur in real time, so the data structures in place need to be of a high enough quality for the customer to trust the IoT product. If the gas in the oven has just been extinguished, there needs to be an automated system in place that flags the problem and switches off the gas supply to the oven. Merely informing the owner via email or SMS that there is a problem is not sufficient if the owner is 100 miles away.
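As a rough illustration of the oven scenario just described, the sketch below shows a monitoring loop that checks the temperature at intervals, shuts off the gas if the oven is not heating as expected and only then notifies the owner. The sensor and actuator functions, and the threshold values, are hypothetical placeholders for whatever a real IoT platform would provide.

```python
import time

# A minimal sketch of the oven-monitoring logic described above. The sensor and
# actuator callables (read_temperature, close_gas_valve, send_sms) are
# hypothetical placeholders, as are the threshold values.
TARGET_TEMPERATURE_C = 180.0     # assumed cooking temperature
EXPECTED_RISE_PER_CHECK = 5.0    # assumed minimum gain (deg C) per interval
CHECK_INTERVAL_SECONDS = 60

def monitor_ignition(read_temperature, close_gas_valve, send_sms):
    """Shut off the gas automatically if the oven is not heating as expected."""
    previous = read_temperature()
    while True:
        time.sleep(CHECK_INTERVAL_SECONDS)
        current = read_temperature()
        if current >= TARGET_TEMPERATURE_C:
            return  # the oven has heated normally; stop monitoring ignition
        if current - previous < EXPECTED_RISE_PER_CHECK:
            close_gas_valve()   # act first: do not wait for the owner to respond
            send_sms("Oven failed to ignite; the gas supply has been shut off.")
            return
        previous = current
```

The key design point is that the system acts automatically and informs the owner afterwards, rather than relying on the owner to respond to an alert.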

We are still in the early days of IoT, and the companies which have solved their data problems, achieving data of such quality and trustworthiness that their Internet-connected things reliably do what they are supposed to do, are not sharing their methods and processes. Meanwhile, many companies that specialise in data issues are trying to move into the Internet of Things, collaborating with manufacturers who know how to make the products but lack experience and expertise in data. So for any business investing in the Internet of Things, securing your data and ensuring that they are quality data, fit for purpose and even able to respond to the unexpected, are the two things which will divide the winners from the losers. The simple solution is to use Spotless Data's unique web-based API to ensure that your data are of sufficient quality to do not merely what they are programmed for but also to deal with the new and the unexpected.

How can you get your big data working?

If only getting your big data working were as simple as capturing the data, through employees, user-generated content and third-party sources such as web scraping, storing them in your data lake and then letting some analytics software analyse the data and magically produce all the insights and information that your company or organisation requires to thrive and become the world-beater you have always dreamed of.

When your big data are also quality data that can be trusted, they can be a tremendous asset to your business, helping it rise above its competitors. Yet when the data are dirty, i.e. full of corruptions, inaccuracies, mixed formats and general mess, your imagined pristine data lake is actually a mud pit. Even the best and most expensive data analytics software will struggle to make head or tail of dirty big data.

The solution is to cleanse the big data using Spotless Data's unique web-based data quality API solution and then upload them to your data warehouse, in a process known as extract, transform, load (ETL), creating a single place where you can ask questions and access information about all your data. Analytics software does not clean or modify the data; it analyses them as they are and draws conclusions from that analysis. Spotless's data quality service, by contrast, checks the data to make sure they are accurate and, where they are not, modifies the corrupt values if possible and flags them for manual intervention when an automatic correction is not possible. Spotless is also ideal for dealing with data from different sources, including legacy platforms. This is essential in building a capable data warehouse for your big data and in avoiding duplications, which are often caused by different ways of saying the same thing. For instance, United States, US and USA all refer to the same country, and for your big data to work accurately in the data warehouse these duplicate terms must be recognised as such.
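As a small illustration of the duplication problem just described, the sketch below maps several spellings of the same country onto one canonical value during the transform step of ETL. The alias table is purely illustrative and is not an actual Spotless rule set.

```python
# A minimal sketch of resolving duplicate country names during the transform
# step of ETL; this alias table is illustrative, not a Spotless rule set.
COUNTRY_ALIASES = {
    "us": "United States",
    "u.s.": "United States",
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

def canonical_country(value: str) -> str:
    """Map a raw country value onto a single canonical form."""
    return COUNTRY_ALIASES.get(value.strip().lower(), value.strip())

records = [
    {"customer": "A", "country": "USA"},
    {"customer": "B", "country": "United States"},
    {"customer": "C", "country": "US"},
]

for record in records:
    record["country"] = canonical_country(record["country"])

# All three rows now share one country value, so they will no longer be
# counted as three different countries in the data warehouse.
print({record["country"] for record in records})  # {'United States'}
```

Values that are not in the alias table pass through unchanged, so genuinely new countries are preserved rather than silently overwritten.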

Data validation of big data

You can read our introduction to using our API and then try out our service on your My Filters page, though you will need to log in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also like to watch our video on cleaning an EPG file, which also explains how to use our API.
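For readers who prefer to see the flow in code, here is a rough sketch of what submitting a file to a web-based data quality API over HTTPS might look like from Python. The endpoint URL, headers and field names below are hypothetical placeholders rather than the actual Spotless API calls, which are described in the introduction and video mentioned above.

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint, header and field names, for illustration only; the
# real calls are documented in the introduction to the API linked above.
API_URL = "https://api.example.com/v1/cleaning-jobs"
API_KEY = "your-api-key-here"

with open("epg.csv", "rb") as data_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": data_file},               # the file to be cleaned
        data={"filter": "tv-title-matching"},    # which cleaning rules to apply
        timeout=60,
    )

response.raise_for_status()
print(response.json())  # e.g. a job id to poll for the cleaned output
```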

We use the HTTPS protocol, guaranteeing that your data are secure while they are in our care and not accessible to any third party, a responsibility we take very seriously. If you would like to contact us, you can speak to one of our team by clicking the white smile icon within a blue circle, which you can find in the bottom right-hand corner of any page on our site.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing.

If data quality is an issue for you, or you have known sources of dirty data but your files are too big and the problems too numerous to fix manually, please do log in and try our API now.