TV, media and data ingestion

Using quality data has become the key to success in the TV industry, which is why it makes sense to use our machine learning filters for all your data cleaning.

The television industry has changed immensely in recent years and is now simply drowning in data. The bottom line is that these data must be of good quality before your platforms ingest them if you are to make the most of them.

The Spotless Data machine learning filters solution

We at Spotless find that the bottom line is persuading television and media companies of the importance of using the data that they already have, or can easily obtain, to maximise their business potential. However, this can only be done successfully by addressing the quality of the data themselves. Indeed, what inspired us to set up Spotless as a startup which applies artificial intelligence to data through machine learning filters was seeing media and television companies enthusiastic about making data work for them but needing help to clean those data properly so that they could trust their quality. While they were willing to recruit a data science team or outsource the work to a company specialising in data science, data cleaning can be hugely laborious, occupying 60-90% of a data science team's time. Recruiting an experienced data science team is harder still when the job mostly involves cleaning data, and when businesses everywhere are waking up to the importance of data science for driving growth and keeping pace with more data-competent rivals.

Our Machine Learning Filters learn from experience and are available through a Python API. We have thus created a new solution to the problem of rogue data in television, making data validation and cleaning a swift and seamless activity which you can build into the entry point of your media platforms.

Data variety

Depending on your business model, various types of data need ingesting into your platforms if you are to maximise their value. These include metadata, EPG data, viewing data, VOD data and CRM data.

Ingesting metadata

Metadata are the tags which describe the data about television shows. So whereas the name of a TV show is, say, Peppa Pig, the metadata are what identify this piece of data as the name of a TV show. Metadata are particularly important when the data you are ingesting into your platform come from various sources which themselves use different names for the same field, e.g. one organisation might call the name of a show a title, a second might call it a show title while a third might call it a TV show title. If you ingest your TV data from these three sources without first ensuring that their metadata all match each other, chaos will ensue. So if you define a TV show name as a title for your platform, then show title and TV show title are rogue data, not because they are inherently wrong but because they don't fit the metadata rules of your platform.
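The three-source example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Spotless API: the alias table and the function name normalise_record are assumptions made for this example, and a real platform would define its own field names.

```python
# Hypothetical alias table: each source's field name -> the platform's own name.
FIELD_ALIASES = {
    "title": "title",
    "show title": "title",
    "tv show title": "title",
}


def normalise_record(record: dict) -> dict:
    """Rename known aliases to the platform's field names; reject unknown fields."""
    cleaned = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key.strip().lower())
        if canonical is None:
            # An unknown field is rogue metadata under the platform's rules.
            raise KeyError(f"Unrecognised metadata field: {key!r}")
        cleaned[canonical] = value
    return cleaned


# The same show as described by three different sources.
sources = [
    {"title": "Peppa Pig"},
    {"show title": "Peppa Pig"},
    {"TV show title": "Peppa Pig"},
]
normalised = [normalise_record(record) for record in sources]
```

After this step all three records carry the same field name, so they can be ingested together without the platform treating them as three different kinds of data.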

Other examples of metadata for TV include the broadcast time, the channel the show appears on and the genre of the show, which gives the viewers an idea of the content of the show so that if they like a particular genre, this meta tag helps them find other similar shows they want to watch.

Getting the metadata right is an essential prerequisite for ensuring that the other data ingested into your platform integrate properly and are fit for purpose. Because we know this, our Machine Learning filters make sure that your metadata always match.

Ingesting EPG data

EPG data let people know what is on television and when. They typically come from multiple sources, and popular programmes may be shown by multiple broadcasters. So one source may call the successful children's TV show Peppa Pig by this name, a second may call it Peppa while a third may call it Peppa Pig SE04 EO49. The last thing the EPG provider wants is for this successful show to appear to be three different shows, even though it appears this way in the initial raw data.

If the EPG platform recognises the show as Peppa Pig then the other two examples can be considered rogue data; however, simply removing these rogue data would be the wrong thing to do, as an EPG with blank spaces instead of whatever is on that channel is definitely not fit for purpose. What is required is a modification of the titles so that all three pieces of data give the show the same name. Arguably the third example, Peppa Pig SE04 EO49, is best as it distinguishes this episode from the others. However, care is needed to make sure that when the show lacks the series and episode data within its name the EPG still recognises it as the same show and is, ideally, able to give it SE and EO numbers or deal with the problem in some other way.
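As a rough sketch of this kind of title modification, the Python below separates the show name from the series and episode suffix and maps known variants to one canonical name. The suffix pattern simply follows the example as written in this article, and the alias table and the canonicalise function are hypothetical names introduced for illustration.

```python
import re

# Suffix pattern as written in the article's example ("SE04 EO49").
SUFFIX = re.compile(r"\s+SE(?P<series>\d+)\s+EO(?P<episode>\d+)$")

# Hypothetical alias table: lower-cased variant -> canonical show name.
ALIASES = {"peppa": "Peppa Pig", "peppa pig": "Peppa Pig"}


def canonicalise(raw_title: str):
    """Return (show, series, episode); series and episode are None when absent."""
    title = raw_title.strip()
    series = episode = None
    match = SUFFIX.search(title)
    if match:
        series = int(match.group("series"))
        episode = int(match.group("episode"))
        title = title[: match.start()]  # drop the suffix, keep the show name
    show = ALIASES.get(title.lower(), title)
    return show, series, episode


results = [canonicalise(t) for t in ["Peppa Pig", "Peppa", "Peppa Pig SE04 EO49"]]
```

All three raw titles now resolve to the same show, and the records that lack series and episode numbers are easy to spot because those values come back as None.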

Time and date information is another example of where EPG data from multiple sources can be complicated. While 8 pm on October 31st is the same time as 20:00 on 31/10 or even 20:00 on 10/31 (if the source of the data is in the US), most computer systems will see them as distinct data. You can either program your computer systems with a complex algorithm that recognises these date and time varieties as all being the same date and time, or you can get the data cleaned before they enter your platform by using our Machine Learning filters. These convert all the time and date formats into the time and date format your organisation has chosen to use in its EPG. There is no right format other than the one you have chosen, though in our experience data sources often do contain corrupted data which are wrong per se, such as identifying Frost the police drama as being in the genre of News and Weather. Rogue data consist of different formats that don't match the one you have chosen, and on our My filters page (please log in to see this page) you can specify your preferred number and other formats so that you get your data exactly the way you want them. Our session validation page is a particularly useful document for the problems of gaps and overlaps which can plague EPGs.
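To give a feel for what such a conversion involves, here is a minimal Python sketch using only the standard library. It assumes each source declares its own format (since 10/31 and 31/10 cannot be told apart from the text alone), that the year is supplied separately because the examples above omit it, and that the chosen target format is ISO 8601. The source labels, the year 2024 and the normalise function are all assumptions for illustration; month names and am/pm parsing also depend on an English locale.

```python
import re
from datetime import datetime

# Hypothetical per-source formats, matching the three examples in the text.
SOURCE_FORMATS = {
    "uk_text": "%I %p on %B %d",
    "uk_numeric": "%H:%M on %d/%m",
    "us_numeric": "%H:%M on %m/%d",
}

# Matches the ordinal suffix in strings such as "31st" or "2nd".
ORDINAL = re.compile(r"(?<=\d)(st|nd|rd|th)\b")


def normalise(text: str, source: str, year: int) -> str:
    """Parse a source-specific date string and return it as ISO 8601 text."""
    cleaned = ORDINAL.sub("", text.strip())
    # The year is appended before parsing so that the date is unambiguous.
    parsed = datetime.strptime(f"{cleaned} {year}", SOURCE_FORMATS[source] + " %Y")
    return parsed.isoformat(timespec="minutes")


# The year 2024 is an arbitrary value chosen for this example.
a = normalise("8 pm on October 31st", "uk_text", 2024)
b = normalise("20:00 on 31/10", "uk_numeric", 2024)
c = normalise("20:00 on 10/31", "us_numeric", 2024)
```

All three inputs end up as the same string, which is what lets the EPG treat them as one broadcast slot.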

Ingesting viewing data

Viewing data describe the number of people who watch a particular TV show, thus indicating which are the most popular shows. This information is particularly important for advertising, as advertisers will expect to pay according to the number of viewers who watch their ads. However, viewing data are no less important to an ad-free public service broadcaster like the BBC, which is tasked with broadcasting shows that people watch. Viewing data can also make or break new television shows, as shown by the countless successful series which go on to become cult viewing (e.g. Breaking Bad, Game of Thrones) as well as by the many failures which don't get a second series or even get pulled halfway through the first one. Viewing data are also used to decide when a show will be broadcast, with the most popular shows appearing at peak viewing times, normally mid-evening during the week when people are relaxing after a hard day's work, and on the most popular channels when a broadcaster has multiple channels. The rivalry between different broadcasters in a single country will also mean they all put their most popular programmes on at the same time, in the hope of catching as many viewers from their rivals as possible. All these testify to the power of viewing data.

Until fairly recently broadcasters were reliant on third-party organisations, such as Ofcom in the UK, to publish viewing figures. Things have started changing rapidly, as broadcasters now have far more data available to them than before. By ingesting the various types of data which indicate viewing into their platforms, they can work out their own viewing figures. They can then make sense of them, typically through analytics software, and make critical decisions such as how much to charge advertisers for a particular show, what to broadcast and when to broadcast it.

Thanks to the rise of the Internet of things, smart televisions and set-top boxes can now transmit customer viewing data and habits directly to broadcasters or to those who made and sold the televisions and set-top boxes, who then sell the information on to the broadcasters, though always taking care to respect the privacy of those whose data they are using. As with any case where multiple data sources are being ingested into the platform, viewing data can be full of mismatches and inaccuracies. This is why building the Spotless Data Quality API solution into the entry point of your data platform makes so much sense: your company can get its viewing data right, and thus accurately make those important decisions based on viewing data which are so critical to the wellbeing of your business.

Ingesting VOD data

VOD is video-on-demand, where the viewer can watch a film or a TV show when they want to, rather than being tied into the traditional TV schedules. While VOD has existed for as long as video recorders, nowadays a great deal of VOD is found through services such as Netflix, YouTube, and the BBC's iPlayer. A VOD service with a wide variety of shows on offer is a place where TV recommendations are particularly useful. This can be seen to the right of any video playing on YouTube, where a list of other shows you might like to watch, based on the show you are watching but also on your viewing history, is displayed. Often recommendations access one's personal history of other choices, whether through cookies or because the VOD website has persuaded you to sign up for its services so that a longer-term history of your VOD watching is available to it.

VOD data are rarely from one single source. Ensuring that the different data sources are blended into one through matching TV show titles, show descriptions, show genres and other information such as the year of release and the original broadcaster is a challenge that all companies which have VOD face in one form or another. Spotless data filters can help the customer make sense of the choices they face here. One example of this is in TV search, where mismatched titles will negatively affect search results and fail to give the seamless results that distinguish the better VOD companies from their more chaotic rivals. If you are showing Peppa Pig on VOD it makes sense to show the latest episode as the first search result rather than an episode originally broadcast three years ago. Achieving the results you want for your VOD company requires top quality data as the starting point; from there everything else will be that much easier.
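The search example depends on titles already matching, as the following small Python sketch suggests. The catalogue rows and their broadcast dates are invented purely for illustration, and the search function is a hypothetical helper rather than part of any real VOD system.

```python
from datetime import date

# Hypothetical catalogue rows after title cleaning: (title, first broadcast date).
# The dates are made up for this example.
catalogue = [
    ("Peppa Pig", date(2015, 3, 2)),
    ("Peppa Pig", date(2018, 5, 14)),
    ("Frost", date(2017, 1, 9)),
]


def search(query, rows):
    """Return rows whose title matches the query, newest broadcast first."""
    wanted = query.strip().casefold()
    hits = [row for row in rows if row[0].casefold() == wanted]
    return sorted(hits, key=lambda row: row[1], reverse=True)


results = search("peppa pig", catalogue)
```

Had one of the Peppa Pig rows been ingested under a mismatched title, it would have been missed by the exact match and the latest episode might never have appeared at the top of the results.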

Ingesting CRM data

CRM, or customer relationship management, is a type of management which uses data analysis to improve the relationship one has with one's customers, with the primary aims of retaining existing customers, reaching out to new customers and driving sales growth. Viewing data are the primary but certainly not the only data which can be used in CRM, depending on the particular television-related products and services one is offering one's customers. A television production company wants its customers to watch the shows it produces, while a TV broadcaster wants customers to use its channel(s) as their preferred channel(s), so that Sky, the European satellite broadcasting company, doesn't care whether you like sport or films as long as you watch your preferred genre on its channels. A VOD service wants its users to watch on its particular service, while an EPG provider doesn't care which broadcaster viewers want to watch as long as they decide what to watch using its EPG rather than those of its rivals. For a newspaper that writes original content about TV shows, knowing which shows are most popular, not merely on the terrestrial channels but on any television device, is critical, whereas it doesn't care which broadcaster or device its customers use to access the TV or to decide what to watch. So all these companies require CRM data but will then want to use them in different ways, and may require the same data sources to be cleaned in slightly different ways. This is why, after analysing our customers' data and producing an instant report, we let our customers define the specifications of how the data will be cleaned and modified, so that when they ingest the data into their platform the data do the things they want for their business.

Correlating viewing data with CRM data

Viewing data are arguably the most important feature of CRM, and any CRM analysis which does not incorporate viewing data will have only limited value. Given that the basis of CRM data is analytics, it is essential that the viewing data are in a state where they can be successfully analysed to extract the CRM intelligence that is required to maximise customer satisfaction. This is where Spotless Data comes in.

One of the biggest changes to CRM in the TV sector has been that nowadays communication between customer and company goes both ways, meaning that customers are providing viewing data, whether through comments at the end of an article about a TV show, TV forums (including Facebook and Google Plus) or tweets on Twitter. Media companies now have a myriad of ways to engage with their customers and to find out what they are actually viewing and what they think about these shows, often as they actually watch them.

Instead of relying solely on a review of the latest TV series combined with the audience ratings, the company which broadcast it can now actually find out what its customers think of the show. So if, of the 30,000 tweets during and after a new drama, 29,980 are praising it, this likely means the broadcaster is onto a winner, whereas if there are only 700 tweets, many of them negative, it may be worth ditching the show already, or at least removing it from peak viewing time, assuming the other data the broadcaster has match this Twitter assessment. Modern viewing data can also give an idea of the demographics of a show, so one that is exclusively popular with the middle-aged and elderly shouldn't be screened at half eleven on a Friday night.

If you ingest your raw data straight into your platform and set your expensive, top-quality data analytics software to work on them to produce CRM insights, you may be surprised at how little value they appear to have. However, if the same data are properly cleaned to remove rogue data and ensure data quality before your platform ingests them, the picture these data portray once the analytics software has worked on them is much more likely to be accurate and trustworthy enough to support those important decisions, such as whether to continue with the show and whether it warrants a peak viewing spot which will drive customers to your platform, increase advertising revenue and allow your media company to stand out among its rivals.

Using Spotless Data's machine learning filters

You can read our introduction to using the Spotless API and then try using our service on your My filters page but, for this, you need to be logged in first. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our videos on data cleaning an EPG file and data cleaning a genre column, which both explain how to use our API.

We are giving away 500 MB of free data cleaning to each newly signed-up customer so that you can see for yourself how seamlessly and swiftly our API filters work.

We guarantee your data are secure while they are in our care and that they are not accessible by any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

If your data quality is an issue, or you know that you have sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.