The importance of data cleaning for user-generated content

User-generated content is a classic example of data that needs cleaning

User-generated content is one area where data quality has never been more important, which is why using Spotless Data to ensure data integrity makes so much sense.

Corrupted data

At Spotless Data we estimate that 5% of the data held by companies is corrupted and lacks integrity, though a recent report estimated that manually entered data can have an error rate of anywhere between 2.3% and 26.9%. Suppose I own a company with 500,000 clients or users and, like Google, I estimate that each customer is worth $80 to my business. If my primary contact with these customers is through a submitted email address, and 5% of those email addresses are badly formatted, then I will have lost 25,000 customers and $2 million in income. This might represent all my profit, or be the difference between turning a profit and making a loss. Those customers might also end up using one of my competitors' services instead. This is especially likely if, having given up on my company for not responding to their initial submission, they then submit the same faulty email address to a competitor who, unlike me, makes sure they have data quality they can trust. That competitor can then correctly identify the intended email address and start building a relationship with those customers.
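The arithmetic behind that example can be sketched in a few lines of Python. The function name and parameters below are hypothetical, chosen only for illustration; the figures are the ones quoted above (500,000 customers, $80 per customer, and error rates of 5%, 2.3% and 26.9%).

```python
from typing import Tuple


def lost_revenue(customers: int, value_per_customer: float, bad_rate: float) -> Tuple[int, float]:
    """Estimate customers and income lost to unusable contact details.

    Assumes every customer with a badly formatted email address is lost
    entirely, which is the simplification used in the example above.
    """
    lost_customers = round(customers * bad_rate)
    return lost_customers, lost_customers * value_per_customer


# The 5% estimate, plus the lower and upper bounds of the reported
# error range for manually entered data.
for rate in (0.05, 0.023, 0.269):
    lost, income = lost_revenue(500_000, 80.0, rate)
    print(f"error rate {rate:.1%}: {lost:,} customers, ${income:,.0f} in income")
```

At the 5% rate this reproduces the 25,000 customers and $2 million given above; the other two rates show how widely the loss could vary across the reported range.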


Traditionally a new television show would rely on reviews by professionals to make or break it. Nowadays the recommendations of other viewers are considered to have more value in the success or failure of a show, whether they are commenting on the show as it happens via Twitter or sharing their favourite current television shows on a whole range of sites and forums. Manually entered data, sometimes known as User-Generated Content (UGC), is also crucial as the primary point of contact between a website and its users; this is often an email address but sometimes a telephone number or physical address as well. If your website attracts 500,000 users, the quality of the site will hopefully persuade 50,000 of them to give you their email address so that you can log them into your website and send them recommendations and other relevant information. Yet without data validation, all this UGC can impede rather than help your company.

New web users

Traditional users of the Internet have tended to be well-educated people residing in developed countries. In 2016, approximately half the planet is connected to the Internet, with an estimated 230 million additional users accessing the web for the first time, most of whom will not be digital natives, that is, young people who are living in developed societies. We should not, though, glibly assume that people in developing countries are not going to spend money online and are therefore not worth attracting as customers. For instance, a report by the digital publicity firm iLifebelt recently highlighted that even in a relatively undeveloped region such as Central America, six out of ten people buy goods or services online, while four out of ten do so on a regular basis. The days when only the developed world made money for web companies are now long past!

Most of these new web users will probably be accessing the web through either smartphones or cheap tablets. A report released this month by Statcounter shows that, for the first time, more people worldwide are accessing the web through these smaller devices than through traditional PCs. This is even more marked in the developing world: whereas in the developed UK 55.6% of Internet usage is still through PCs, in developing India the figure is as low as 25%, meaning that 75% of users access the web through smartphones or tablets. Another recent report shows that while the market in tablets that cost less than $200 is thriving, the market in higher-quality and more expensive tablets is struggling.

What does all this mean for you?

What does all this mean for those who are reliant on their customers providing them with the personal details and the UGC that their businesses require to compete and prosper as web companies in 2016? It probably means having to deal with and correct more contaminated or dirty data than ever before.

To take one example, when a user submits their email address to a website they have to enter the address a second time in a repeat email field, which is meant to help the user notice any mistake they made in typing it. Such an approach works better with users who understand how an email address should be formatted. They will notice that a@spotlessdata,com is badly formatted because the ending has to be .com, with a full stop rather than a comma, whereas an inexperienced user may not even realise that an email address requires an @, and is much more likely to repeat the mistake they made the first time. We should not expect too much help from auto-correct or predictive text either, as these are designed to help the user rather than any companies that are trying to collect data from them.
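As a rough illustration of why a format check catches what a repeat email field misses, here is a minimal Python sketch. It is not a description of how Spotless Data validates email addresses; the pattern and the function names are assumptions made for this example, and the pattern is deliberately much simpler than the full email address standard.

```python
import re

# Illustrative pattern only (an assumption, not a complete validator):
# a non-empty local part, exactly one "@", and a domain that ends in a
# full stop followed by at least two letters. Commas and whitespace are
# rejected everywhere.
EMAIL_PATTERN = re.compile(r"[^@\s,]+@[^@\s,]+\.[A-Za-z]{2,}")


def looks_like_email(address: str) -> bool:
    """Return True if the address passes the simple format check."""
    return EMAIL_PATTERN.fullmatch(address.strip()) is not None


def split_by_validity(addresses):
    """Separate addresses into those that pass the check and those that need review."""
    valid, suspect = [], []
    for address in addresses:
        (valid if looks_like_email(address) else suspect).append(address)
    return valid, suspect


submitted = ["a@spotlessdata.com", "a@spotlessdata,com", "aspotlessdata.com"]
valid, suspect = split_by_validity(submitted)
print("valid:", valid)
print("needs review:", suspect)
```

A user who types a@spotlessdata,com twice passes the repeat email check, because both entries match each other, but fails this one, because the check looks at the format rather than at whether the two entries agree.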

As the number of new users who are not well-educated in the norms of the web and who access it through cheaper devices continues to increase, we can surely expect the rates of dirty data submitted through UGC to increase as well. Be the one, rather than your competitors, to address this issue by getting your data cleaned. While a really big company can do this in-house, it simply ceases to be cost-effective to do so in a smaller company. Trying to figure out contaminated email addresses when you have a list of 25,000 of them will take many employee-months of work, while more automated in-house solutions are themselves not cheap and require a level of technical expertise that your company may not have. We at Spotless Data offer a unique automated service for data you can trust, the only one of its kind, using a browseable web-based Data Quality API solution based on the latest innovations in data science and engineering. Try our products; we think you will be impressed.

