The data validation of transport data

A self-driving car is a classic example of an area where the transport industry is having to deal with an explosion of data

With the rise of driverless vehicles and the Internet of Things, data have suddenly taken a central place in the transport industry.

In recent years data have become increasingly critical for the transport industry, and their importance is set to grow rapidly with the rise of both driverless vehicles and the Internet of Things, the latter being the networked sensors used to create smart transportation systems, especially within cities. Transport data fall into two basic groups: data coming from the vehicles themselves and data coming from outside vehicles, whether from smart roads, smart street lights, smart traffic lights or smart pollution sensors, among others.

As these data are transferred from stationary and moving devices to one or more data repositories, such as a data lake, they will almost certainly be rogue data, that is, data full of inconsistencies, corruptions and other inaccuracies. Some of these will be genuine errors, while many others will arise where data fail to match each other because of meta tag inconsistencies, blanks or overlaps.
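To make the idea of rogue data concrete, here is a minimal sketch of the kind of check that could run as readings arrive at a repository. It is an illustration only and not the Spotless filter itself: the `Reading` record, its field names and the `find_problems` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Reading:
    # Hypothetical record layout, assumed for this illustration only.
    sensor_id: str
    timestamp: str          # expected to be an ISO 8601 string
    value: Optional[float]  # None stands for a blank measurement


def find_problems(readings: List[Reading]) -> List[Tuple[int, str]]:
    """Return (index, description) pairs for blanks, bad timestamps and overlaps."""
    problems = []
    seen = set()
    for index, reading in enumerate(readings):
        if not reading.sensor_id.strip():
            problems.append((index, "blank sensor_id"))
        if reading.value is None:
            problems.append((index, "blank value"))
        try:
            datetime.fromisoformat(reading.timestamp)
        except ValueError:
            problems.append((index, "unreadable timestamp"))
        # Two readings from the same sensor at the same instant overlap.
        key = (reading.sensor_id, reading.timestamp)
        if key in seen:
            problems.append((index, "overlaps an earlier reading"))
        seen.add(key)
    return problems
```

A check like this only reports problems; deciding whether to drop, repair or quarantine a flagged reading is a separate choice that depends on how the data will be used.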

Part of the problem is that transport data come from many different sources and go to many different places to be analysed and used. The sources include vehicle location data, vehicle sensor data and road sensor data, the last of which includes data from smart devices such as traffic light sensors. Fortunately, there is a solution to all your data validation needs in the Spotless API solution, accessible via a web browser, meaning that having quality data has never been easier.

Vehicle location data

Vehicle location data include both GPS data and automatic number-plate recognition (ANPR) data, and generally refer to the location data of large numbers of vehicles. While government organisations certainly collect and use these data, and indeed have been subjected to legal demands for access to their location data, these data are also used by private companies. While failing to get ANPR data right might result in uninsured drivers not being apprehended, failing to get the location data of driverless vehicles right could cause serious accidents. Rogue location data cannot do what they are supposed to do, and need data cleaning if they are to be effective.

This is why we have designed Spotless' machine learning filters, which can be implemented at the entry point of whichever data repository the data are about to enter. The data then undergo data validation and are cleaned as they enter the repository, and so are useful for the purposes for which they were designed.

Vehicle sensor data

Ford has reported that the sensors in its vehicles produce roughly 25 gigabytes of data per hour. The latest vehicles have up to 100 sensors in them, and the data they produce are sent to one or more repositories via a Wi-Fi connection. These sensors may not all be from the same company, and individual pieces of data may end up in multiple data repositories, where they can be used for multiple different purposes.

At Spotless we have long argued that trustworthy data are those which can be used for multiple purposes rather than being suitable for just one. Single-purpose data might, for instance, successfully identify that a vehicle is uninsured, but when a different organisation wants to use the same data to estimate the number of cars using a stretch of road at a particular hour, it is unable to do so.

Given the sheer quantity of vehicle sensor data and the multiple purposes for which they are likely to be used, cleaning them is very important indeed. Our goal at Spotless is simply to clean and validate your data to the point where you can use them for almost any purpose. The data from different sensors can then be merged and used together smoothly, even where the sensors have been made by different companies with different meta tags and different description tags (e.g. one company may say steering wheel and another may say driving wheel, but they are referring to the same part of the vehicle).
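The steering wheel and driving wheel example can be sketched in a few lines of Python. This is only an illustration of the general idea of mapping each manufacturer's tag to one canonical name before merging; the `TAG_ALIASES` table and both function names are hypothetical, and a real alias table would have to be built from each manufacturer's documentation.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical alias table: each variant tag maps to one canonical tag.
TAG_ALIASES: Dict[str, str] = {
    "driving wheel": "steering wheel",
}


def canonical_tag(tag: str) -> str:
    """Lower-case the tag, collapse whitespace, then apply any known alias."""
    cleaned = " ".join(tag.lower().split())
    return TAG_ALIASES.get(cleaned, cleaned)


def merge_by_tag(*sources: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Merge (tag, value) pairs from several sensors under canonical tags."""
    merged: Dict[str, List[float]] = {}
    for source in sources:
        for tag, value in source:
            merged.setdefault(canonical_tag(tag), []).append(value)
    return merged
```

For example, `merge_by_tag([("Steering Wheel", 1.0)], [("driving  wheel", 2.0)])` groups both values under the single tag `steering wheel`, so the readings from the two manufacturers can be analysed together.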

Road sensor data

Road sensor data are any data from outside a vehicle that relate to transport, whether from cat's eyes in the road, traffic lights, street lamps or other similar devices found in or beside a road. They can include data from pollution sensors that measure contaminants, the results of which can, if necessary, lead to a street being shut down or to restrictions on which cars can use it, in order to reduce pollution levels.

As with other transport data, these data are likely to come from different sensors produced by different companies. With pollution sensors, if the data are wrong and produce false readings, this might mean roads being unnecessarily closed or cars unnecessarily banned from using them, which will have a tremendous economic cost far outweighing the cost of cleaning the data to ensure they are accurate. However, the opposite case is even worse: roads are not closed or restricted even though pollution levels are higher than those considered safe, which could easily result in the deaths of children or old people from asthma or other breathing problems.
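One simple way to catch a false pollution reading before it triggers a road closure is to compare each sensor with its neighbours. The sketch below is an illustration under stated assumptions, not a description of how Spotless works: the function name, the dictionary layout and the 50% tolerance are all hypothetical, and it assumes the sensors sit close enough together that their readings should broadly agree.

```python
from statistics import median
from typing import Dict, List


def flag_suspect_sensors(readings: Dict[str, float], tolerance: float = 0.5) -> List[str]:
    """Flag sensors whose reading differs from the median of the other
    sensors by more than `tolerance`, expressed as a fraction of that median."""
    suspects = []
    for sensor_id, value in readings.items():
        others = [v for other_id, v in readings.items() if other_id != sensor_id]
        if not others:
            continue  # nothing to compare a lone sensor against
        baseline = median(others)
        if baseline > 0 and abs(value - baseline) / baseline > tolerance:
            suspects.append(sensor_id)
    return suspects
```

A flagged sensor is not necessarily wrong, since pollution can be genuinely local, so a sensible use of this check is to prompt a second look rather than to discard the reading automatically.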

Road sensor data have short-term uses, such as keeping the roads congestion-free or making sure a traffic light is not on green with no cars waiting while the red light on the other corner has a huge queue. The very same data also have long-term uses, such as measuring the number of cars passing a particular spot in a given period (which used to be done by people standing on the pavement and counting!) in order to decide whether a new ring road is needed. However, if the data are inaccurate, this might mean imposing an expensive and politically controversial new road on the basis of faulty data when such a road is actually not required.
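The long-term counting use can be illustrated with a short Python sketch, which also shows why cleaning matters: an exact duplicate record would otherwise inflate the count. The function name and the `(vehicle_id, timestamp)` record layout are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def vehicles_per_hour(detections: Iterable[Tuple[str, str]]) -> Counter:
    """Count detections per clock hour, ignoring exact duplicate records.

    Each detection is a hypothetical (vehicle_id, ISO 8601 timestamp) pair.
    """
    unique = set(detections)  # drops records that were delivered twice
    counts: Counter = Counter()
    for _vehicle_id, stamp in unique:
        moment = datetime.fromisoformat(stamp)
        # Truncate to the start of the hour so all passes in that hour group together.
        counts[moment.replace(minute=0, second=0, microsecond=0)] += 1
    return counts
```

Only exact duplicates are removed here, so a vehicle that genuinely passes the same spot twice in an hour is still counted twice.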

The data validation of transport data with the Spotless Data solution

You can read our introduction to using our API to validate your transport data and then try out our service on your My Filters page, although you need to log in first. You can take advantage of our offer of 500Mb of free data cleaning to see how much you like our service. If you haven't already done so, you can sign up using your email address or your Facebook, Google or GitHub account. You may also view our video on data cleaning an EPG file, which also explains how to use our API.

We use the HTTPS protocol, guaranteeing your data are secure while they are in our care and that they are not accessible by any third party, a responsibility we take very seriously.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by pressing on the white square icon with a smile within a blue circle, which you can find in the bottom right-hand corner of any of the web pages on our site.

If your data quality is an issue, or you have known sources of dirty data but your files are just too big and the problems too numerous to fix manually, please do log in and try now.