Big data analytics part 2

Big data with data quality and data integration being controlled from a keyboard

Big data analytics are complicated but much less so when the data have been properly cleaned.

Continuing the blog on big data and analytics. Part 1 can be found here.

Different varieties of data

While data come from various sources and in differing file formats, an issue which is generally fixed by cleaning the data and placing them all into a single data warehouse, there is a separate issue of the different types or varieties of data. This is particularly common when examining big data, often including dark data, not least because there are so many of these data. The process of unifying these different data sources into a single whole is known as data blending.

Ventana Research has pointed out that 80% of businesses have to deal with at least 20 different data sources, indicating that the need for data blending is an issue for the vast majority of businesses. These sources might include transactional data (the record of actual sales, etc.) and data about customers (how often they log into your website, records of their phone calls to your company, even the location they sign in from on, for instance, travel websites). They can also include social media information (such as what customers are saying about the company and its actions on Twitter).

The goal is to get these 20 or more sources into a single, blended dataset which all those in the organisation who need such information can easily access, from the CEO and other top-level executives to people on the ground dealing with customers, such as those working in a call centre or a branch office. Data blending requires not simply the merging of different sources but also that the data being merged are clean: they must pass the data quality test of being data that can be trusted, so that analysing them produces useful rather than confusing insights for the company. Getting the analytics wrong can be worse than doing no analytics at all, and avoiding this requires both big data that have been cleaned to a high quality and analysts who are capable of doing their job well and able to spot when something doesn't seem right. A good working relationship between the data science team and the analytics team will also help ensure the rapid rectification of mistakes when they are made.
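As a rough illustration of what blending two sources might involve, here is a minimal Python sketch. The sample records, the field names (`customer_id`, `amount`, `logins`) and the helper functions are all assumptions made up for this example; a real blend would cover many more sources and far messier keys.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
transactions = [
    {"customer_id": "C1", "amount": 40.0},
    {"customer_id": "C2", "amount": 15.5},
    {"customer_id": "C1", "amount": 9.5},
]
web_logins = [
    {"customer_id": "c1 ", "logins": 12},  # dirty key: wrong case, trailing space
    {"customer_id": "C3", "logins": 4},
]


def normalise_key(raw):
    """Clean a customer id so the same customer matches across sources."""
    return raw.strip().upper()


def blend(transactions, web_logins):
    """Return one record per customer, combining both sources."""
    blended = defaultdict(lambda: {"total_spend": 0.0, "logins": 0})
    for row in transactions:
        blended[normalise_key(row["customer_id"])]["total_spend"] += row["amount"]
    for row in web_logins:
        blended[normalise_key(row["customer_id"])]["logins"] += row["logins"]
    return dict(blended)


blended = blend(transactions, web_logins)
```

Without the key cleaning step, the login record for "c1 " would fail to match the sales for "C1" and the same customer would appear twice, which is exactly the kind of mismatch that makes a blended dataset untrustworthy.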

Data preparation

Generally speaking, your analytics team should not be relying exclusively on your IT department to prepare the data. Equally, no one group within your organisation, whether it is the marketing and sales teams, the data science team, the analytics team or the high-level executives, can prepare data on its own. Unless data preparation is a collaborative effort between these and any other relevant teams within your business, the data that have been prepared are likely to be inadequate, leaving your company exposed to the advantage conferred on its rivals by their better data preparation. The successful analysing of big data is no longer a choice in the vast majority of companies, because those who "choose" not to engage with big data will simply fail in competitive markets.

The data quality issue

A fundamental issue is the use of data quality technology, such as the Spotless data quality API solution, to ensure that the data match rather than mismatch, and so have been validated and are clean. At the heart of this process are both avoiding mismatches and transforming the data, but this is not something that can be done as a one-off before the various teams get back to their normal work. Instead, the cleansing needs to be integrated as an automatic process that is continuously extracting, cleansing, transforming and validating the data, given that the input of data is also going to be ongoing. For this reason, it makes sense to incorporate Spotless data quality into the structural set-up of your data processes, so that your organisation has quality data not merely once but the whole time and at any given moment.
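To make the cleanse, transform and validate steps more concrete, here is a generic Python sketch of one pass over a batch of incoming rows. It does not use or represent the Spotless API; the column names, the rules and every function name are assumptions invented for the example. In an automated set-up a function like `process_batch` would be run on each new batch as it arrives.

```python
import re

# Hypothetical rule: dates are expected in YYYY-MM-DD form (an assumption).
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def cleanse(row):
    """Trim surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def transform(row):
    """Convert the amount to a number; use None when it cannot be parsed."""
    out = dict(row)
    try:
        out["amount"] = float(row["amount"])
    except (TypeError, ValueError, KeyError):
        out["amount"] = None
    return out


def validate(row):
    """Return a list of problems; an empty list means the row passed."""
    problems = []
    if not row.get("customer_id"):
        problems.append("missing customer_id")
    if row.get("amount") is None or row["amount"] < 0:
        problems.append("bad amount")
    if not DATE_PATTERN.match(row.get("date") or ""):
        problems.append("bad date")
    return problems


def process_batch(rows):
    """Split a batch into clean rows and rejected rows with their problems."""
    clean, rejected = [], []
    for raw in rows:
        row = transform(cleanse(raw))
        problems = validate(row)
        if problems:
            rejected.append((raw, problems))
        else:
            clean.append(row)
    return clean, rejected


batch = [
    {"customer_id": " C1 ", "amount": "19.99", "date": "2017-05-01"},
    {"customer_id": "", "amount": "abc", "date": "01/05/2017"},
]
clean, rejected = process_batch(batch)
```

Keeping the rejected rows alongside the reasons they failed, rather than silently dropping them, gives the teams involved something concrete to review when the cleansing turns up problems.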

Importance of automation

The essential aim of automation should be to reduce the amount of time spent on preparing data in order to increase the amount of time spent on actually analysing them. Given that the various sources of data are continuously updated, automation is also essential if this updating is to happen without the sheer cost of continuous human intervention.

It is sufficient that the analysts deal with the continuously updated data, which they should ideally be able to access simply by reloading the pages they are analysing, so that they get the latest updates in real time and, if necessary, change their recommendations for future actions based on them. Failure to get the latest information can have catastrophic effects. For instance, suppose a company releases a new special offer as part of its marketing campaign and the offer is then consistently mocked and attacked on social media, such as on Twitter and on the company's Facebook page. A failure to realise that the campaign is not going as well as was imagined when the offer was devised would mean continuing the campaign in its present form, with a continued negative impact on the company's brand.

Access to real-time data, including from Facebook and Twitter, can allow the company to tweak the campaign and speedily analyse the results of the tweak (say, over a period of a few days). It may be that with one or two tweaks the public response to the offer can be transformed from profoundly negative to highly positive. Meanwhile, a competitor going through the same experience but without the automated data updates may just decide to abandon the offer once it realises the adverse effect the offer is having. Given the way the first company has turned its campaign around by tweaking the offer, this would be a mistake, resulting in negative effects on the brand, wasted effort in devising the offer and a need to do something else instead.

Updated social media data are but one source, and decisions shouldn't necessarily be based entirely on this one source. If a lot of social media users mock a campaign but the transactional data show that the pick-up on the special offer has been massive, and it quickly becomes apparent to the analysts that while 500 people mock the campaign 500,000 people are voting with their wallets to buy the latest special offer, then cancelling the offer is the last thing the company wants to do. The company may, though, want to take a look at the negative criticism and see if there is any way to present the offer slightly differently in response, or it may choose to present a robust defence of the offer on social media. The important thing is to be responsive to what the latest data are saying. A robust company response will itself cause reactions and, whether these are positive or negative, having the most recent data to see those reactions in real time allows the analysts to judge how things are going on a daily basis.
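The arithmetic behind the 500 versus 500,000 example can be sketched in a few lines of Python. The function name is hypothetical, and treating one mocking post and one purchase as comparable units is a deliberate simplification made for illustration only.

```python
def negative_share(negative_mentions, purchases):
    """Share of all observed reactions that are negative mentions."""
    total = negative_mentions + purchases
    return negative_mentions / total if total else 0.0


# Figures taken from the example above.
share = negative_share(500, 500_000)
print(f"{share:.2%}")  # prints 0.10%
```

On these figures the critics make up roughly one in a thousand of the people reacting to the offer, which is why the transactional data should carry more weight than the social media data in this particular case.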

Spotless Data

Here is an introduction to our service. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API. You can sign up to Spotless Data using your email address or your Facebook, Google or GitHub account.

We're giving away 500MB of free data cleaning so you can test our product and demonstrate for yourself how well it works in reality. We guarantee your data will be secure and won't be available to any third parties for as long as they remain in our care. If issues with the data cleansing process do arise, an automated flag alerts our data science team, who will then review the issue manually and, if necessary, contact you via your log-in details to talk about the problem.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.