Can you trust the quality of your data?

A handshake as a symbol of trust is never more important than with data quality.

Nobody doubts that in 2017 most companies need to exploit their data, including their dark data, to their maximum potential. The question Spotless Data wants to ask you is, "Can you trust your data?" The fundamental definition of data quality is that the data are fit for their intended purpose; i.e. they are data which you can trust and which have integrity.

Poor quality data examples

A classic example of data failing this quality test was when a book on sale at Amazon was priced at more than a million dollars. The reason was that two rival booksellers using Amazon to sell their wares competed to sell secondhand copies of this book, using tactics which Amazon had failed to anticipate. If data cannot handle the unexpected, they are not trustworthy, and here Amazon, in spite of the legion of data scientists in their employ, were unable to trust their data. The worst that could happen in that particular case was mockery of Amazon and damage to its reputation.

More serious cases of lack of trust in poor quality data can arise when your data concern real people, who have a legally protected right to a certain amount of privacy in many parts of the world. If you cannot trust your data, given that you have a great deal of data about your customers, will said customers trust your company?

A company created an algorithm which was able to work out when its female customers were pregnant, by noting factors such as said customers switching to non-perfumed lotions to avoid morning sickness. Through intrusive advertising, it inadvertently let the father of a teenage daughter know that she was pregnant before she had told him herself. When the case became public knowledge due to the fury of the outraged father, many pregnant women who were not yet ready to tell their loved ones, a not unusual situation, will have stopped using that company's services. What we see here is that while the data were of a high enough quality to identify when female customers became pregnant, the company that managed the data had failed to foresee, and to adjust its algorithm for, the fact that not every pregnant woman has already told her nearest and dearest about this extremely important event, or wants them to know.

Sometimes poor quality data can simply make your company a laughing stock, such as when CNN covered Scotland's independence referendum. Viewers will forgive them for wrongly claiming, after the vote and before the results came out, that the Yes vote was going to win when in reality it lost, but CNN also claimed that while 58% of people voted yes, 52% voted no, figures which add up to an impossible 110%. Viewers of the site lost no time in tagging the story with the dreaded #mockthemedia and #fakenews tags. Such erroneous data make any company look stupid, and generally the smaller the company, the more vulnerable its reputation is to such data sloppiness.
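An error like the 58% and 52% figures is the kind of thing a very simple automated check can catch before publication. The following is only a minimal sketch of such a check, not a description of any system CNN or Spotless Data uses; the function name, the tolerance value and the sample figures are all hypothetical.

```python
def shares_are_consistent(shares, tolerance=0.5):
    """Return True when percentage shares sum to 100, within a tolerance.

    `shares` maps an answer (e.g. "yes") to its percentage of the vote.
    The tolerance allows for rounding in the published figures and is an
    assumed value, chosen only for illustration.
    """
    total = sum(shares.values())
    return abs(total - 100.0) <= tolerance


# The figures quoted in the broadcast: they sum to 110, so the check fails.
reported = {"yes": 58.0, "no": 52.0}

# A hypothetical, internally consistent pair of figures for comparison.
plausible = {"yes": 45.0, "no": 55.0}

for label, shares in (("reported", reported), ("plausible", plausible)):
    status = "ok" if shares_are_consistent(shares) else "INCONSISTENT"
    print(f"{label}: total = {sum(shares.values())}% -> {status}")
```

A check this small will not tell you which of the two numbers is wrong, only that the pair cannot both be right, which is usually enough to stop the figures going out unreviewed.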

Given that there is so much data that even small companies have to deal with, and so many problems with corruption and inaccuracy within the majority of data sets, it is no surprise that there are many data quality products out there which promise to scrub up your data. The industry as a whole was worth an estimated $1.35 billion at the end of 2015.

The data quality industry

How well do these data quality offerings work? Probably the main complaint is that these services cost more than many companies are willing to pay, lack flexibility in their pricing models and fail to offer a free tier of services so that potential customers can try them out without committing themselves financially. Other issues are that one has to download software onto a particular computer or store one's data in a cloud service. Gartner estimates that only 3% of companies use cloud-based data quality services. The slowness of the service is another common complaint, with customers reporting that small amounts of data were cleaned quickly but that large amounts of data seemed to take forever to cleanse. This is not good if you are a TV listings company which receives large quantities of new data every week and needs to clean them within a few hours, so that your competitors are not publishing next week's schedules before your business publishes them. Customers, eager to know what is on telly next week, will tend to choose the service which publishes this information first.

Lack of customer support is another commonly cited issue, in particular insufficient support when, for whatever reason, problems occur during the cleansing process.

Unique API data quality solution

With these factors in mind, Spotless Data has developed its new and unique web-based API solution which, by spreading the load over various server farms, ensures that we clean your data at the speed of business. You can clean all your data simply by going online, uploading them to Spotless Data's API on its website and then, when you receive the email that they are ready, downloading them again, suitably modified and cleansed to ensure spotless quality. Spotless has a tier of free services which you can try without having to commit your company. If there are any problems in the cleaning, these issues are immediately sent to our team of data scientists, who will manually review them and either quickly fix them themselves or let you know of the problem so that, between you and them, a solution can be found. This level of support is simply not available with software data quality products, where if there is a problem you will have to fix it yourself.

A common complaint is that data quality products are outdated and not really prepared for the new challenges of data quality in 2017, such as big data, machine learning and IoT. Spotless Data, by contrast, is a new company which has only recently developed its API product and which is constantly updating said product to ensure a better service, for instance by developing a new session validation solution as well as better error handling to help ensure the validation of all your data.

Another common complaint about data quality products is that many of the vendors have more than one offering, often with little integration between the various offerings, making it confusing to decide which one will best serve the interests of one's business. While Spotless has four types of rules for improving data quality (regex, reference, duplication and session), these are all accessed through Spotless's one, simple-to-use API solution, creating quality data which you can trust.
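To give a feel for what the four kinds of rule check, here is a rough local sketch in plain Python. It is not the Spotless API and does not reflect how Spotless implements its rules. The sample rows, the ID pattern, the genre list and all function names are hypothetical, and we assume for illustration that a "session" rule means that timed sessions, such as programmes on one TV channel, should not overlap.

```python
import re
from datetime import datetime

# Hypothetical sample rows, loosely modelled on a TV listings feed.
rows = [
    {"id": "A001", "genre": "Drama", "start": "2017-03-01 20:00", "end": "2017-03-01 21:00"},
    {"id": "A002", "genre": "Drmaa", "start": "2017-03-01 20:30", "end": "2017-03-01 21:30"},
    {"id": "a-3", "genre": "Comedy", "start": "2017-03-01 21:30", "end": "2017-03-01 22:00"},
    {"id": "A001", "genre": "Drama", "start": "2017-03-01 20:00", "end": "2017-03-01 21:00"},
]

ID_PATTERN = re.compile(r"[A-Z]\d{3}")      # assumed format: one capital letter, three digits
KNOWN_GENRES = {"Drama", "Comedy", "News"}  # assumed reference list of valid genres
TIME_FORMAT = "%Y-%m-%d %H:%M"


def regex_failures(rows):
    """Regex rule: indexes of rows whose id does not match the pattern."""
    return [i for i, r in enumerate(rows) if not ID_PATTERN.fullmatch(r["id"])]


def reference_failures(rows):
    """Reference rule: indexes of rows whose genre is not in the reference list."""
    return [i for i, r in enumerate(rows) if r["genre"] not in KNOWN_GENRES]


def duplicate_rows(rows):
    """Duplication rule: indexes of rows that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


def session_overlaps(rows):
    """Session rule (as assumed here): pairs of row indexes whose times overlap."""
    spans = sorted(
        (datetime.strptime(r["start"], TIME_FORMAT),
         datetime.strptime(r["end"], TIME_FORMAT), i)
        for i, r in enumerate(rows)
    )
    if not spans:
        return []
    overlaps = []
    latest_end, latest_idx = spans[0][1], spans[0][2]
    for start, end, idx in spans[1:]:
        if start < latest_end:  # starts before the latest-ending session so far has finished
            overlaps.append((latest_idx, idx))
        if end > latest_end:
            latest_end, latest_idx = end, idx
    return overlaps


print("regex failures:", regex_failures(rows))
print("reference failures:", reference_failures(rows))
print("duplicates:", duplicate_rows(rows))
print("session overlaps:", session_overlaps(rows))
```

In this sketch each rule only reports the offending rows; deciding whether to correct, drop or escalate them is a separate step, and a real feed would also need the session check run per channel rather than over the whole file.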

Take a look at our introduction to using our browseable API. You can also test our API on your My Filters page, though you will need to be logged in first to see it. Please do sign up for our service using your email address or your Facebook, Google or GitHub account.

You can also view our videos on data cleaning an EPG file and data cleaning a genre column which explain how to use our API.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. We are offering 500 MB of free data cleansing to all new customers so you can try out our solution for yourself. If you would like to contact us, you can speak to one of our team by clicking on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

Spotless Data, the One Stop Data Quality Solution API!

If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.