Ensuring transparency in machine learning


The best way to ensure transparency in artificial intelligence and machine learning is through quality data.

The recent spat between Elon Musk and Mark Zuckerberg brought the alleged dangers of artificial intelligence and machine learning to the world's attention. Musk evoked old fears, explored previously by the science-fiction writer Isaac Asimov and in the Terminator films, that robots powered by artificial intelligence could rise up, take control and even kill us all.

Regulation and compliance in machine learning

More practically, Musk proposed, as have various other commentators recently, that artificial intelligence and machine learning need tight government regulation to "be safe". Others have demanded transparency for artificial intelligence and machine learning so that governments and other interested parties can understand what these programmes do and why they do them.

At Spotless we believe that the best and most fundamental way to address both the possible dangers and the lack of transparency in artificial intelligence and machine learning is to ensure that the data these programmes use, and the algorithms which drive and underpin them, are of such spotless cleanliness that anyone, given enough time, can understand them. This is the best way to address both safety and transparency, while also allowing organisations to comply with the authorities in the territories where those building artificial intelligence and machine learning applications operate.

Essentially, there are two elements that make up artificial intelligence programmes: big data, sets so huge that it would take a team of humans years to read and understand them, and machine learning applications, which use these data to do things the applications were not specifically designed to do.

An example - chess

A great yet simple example of these two elements of artificial intelligence can be seen in certain games, such as chess, which in recent years have made world headlines as specialised computer programmes beat the world's greatest players.

The programmes can examine and learn from literally millions of possible game scenarios before playing an actual chess game. They can then play and beat the human world champion, even though those who programmed the chess-playing software could not beat the world champion themselves. Thus the machine learning programme was engaging in behaviour it was not specifically designed for, i.e. the brilliant moves that ensured it won the games. It was able to do so because of what it had learnt from studying such a vast number of games that no human, even in a whole lifetime, could have studied a tenth of them, let alone recalled the details of every game while playing against the world champion.

Trying to create a world-beating chess application is relatively simple compared to many of the other tasks we try to get artificial intelligence software to do, but it follows the same basic principles. The key is to build a data lake, or better still a data warehouse, and fill it with the millions of games the artificial intelligence/machine learning software can examine before the real game starts. If the data science team fills the warehouse with game scenarios, real or imagined, in a format they have produced themselves, then the repository has data quality, and the artificial intelligence/machine learning application will find it straightforward to read and grasp the intricacies of all the games to which it has access. If, however, the team simply gathers records of all the publicly available chess games from history and puts them straight into a data lake, it is unlikely the software could beat even the worst chess player on the data science team.
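The agreed format the paragraph above describes can be as simple as a record type with validation attached. Here is a minimal sketch in Python; the field names, allowed values and checks are illustrative assumptions, not any team's actual warehouse schema.

```python
from dataclasses import dataclass

# Results a chess game record is allowed to carry in this hypothetical schema.
ALLOWED_RESULTS = {"1-0", "0-1", "1/2-1/2"}

@dataclass
class GameRecord:
    white_player: str
    black_player: str
    result: str        # must be one of ALLOWED_RESULTS
    moves: list[str]   # moves in a single agreed notation, e.g. SAN

    def validate(self) -> list[str]:
        """Return a list of data quality problems found in this record."""
        problems = []
        if self.result not in ALLOWED_RESULTS:
            problems.append(f"unknown result: {self.result!r}")
        if not self.moves:
            problems.append("no moves recorded")
        return problems

record = GameRecord("Kasparov", "Karpov", "1-0", ["e4", "e5", "Nf3"])
print(record.validate())  # an empty list means the record passes these checks
```

Validating every record against one schema on the way into the warehouse is what stops the lake turning into a swamp later.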

The artificial intelligence programme would be baffled time and again. For example, while in the majority of games the two players are white and black, occasionally games are between white and red. What, the software may ask itself, does this third colour represent? In reality it just represents dirty data: black and red are the same colour in chess, the one which opposes white and moves second. The data science team could simply set an instruction in their algorithm that red and black are synonymous terms. Yet if they do this for every dirty data anomaly they can identify, they will soon have hugely unwieldy algorithms, all in order to avoid cleaning the data! For instance, there are at least four different chess notation systems that allow artificial intelligence/machine learning software or a human to understand what move has taken place, and that is just in English. Top-quality games will have been notated in many different languages, with Russian, home to many human world champions, not even sharing the same alphabet as English. Even in a case as simple as chess, standardising the data is a necessity, not a luxury, if the data quality is to be trusted.
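The red-versus-black fix above is a standardisation step, and doing it once at load time keeps the mapping out of every downstream algorithm. A minimal sketch, assuming a simple synonym table (the labels here are illustrative, not a complete chess data model):

```python
# Map raw colour labels onto the canonical white/black pair.
# "red" appears in some historical game records but means the same
# side as "black": the player who opposes white and moves second.
COLOUR_SYNONYMS = {
    "red": "black",
    "black": "black",
    "white": "white",
}

def normalise_colour(raw: str) -> str:
    """Return the canonical colour for a raw label, or raise on unknown dirt."""
    colour = COLOUR_SYNONYMS.get(raw.strip().lower())
    if colour is None:
        raise ValueError(f"unrecognised colour label: {raw!r}")
    return colour

print(normalise_colour("Red"))    # black
print(normalise_colour("WHITE"))  # white
```

Raising on an unrecognised label, rather than guessing, is deliberate: unknown values are exactly the dirty data that should be flagged for review, not silently passed through.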

The possibilities of dirty data contaminating the data lake and turning it into a data swamp are massive even with simple examples. How much more is this going to be the case with more complex, real-life issues such as an Artificial Intelligence programme working for a hedge fund examining financial data on a daily basis to make the decisions which will make or break the fund?

A regulatory approach to machine learning

Many people tend to think that artificial intelligence and machine learning are things that big US-based companies such as Facebook and Amazon engage in, but the reality is that many thousands of start-ups throughout the world are using machine learning to develop new products and services that did not exist before, not to mention the many established non-IT companies which have lots of big data and are now starting to use machine learning to analyse them for business intelligence.

We at Spotless are an interested party: our own data quality API solution is based on machine learning filters and a patent-pending algorithm, which literally filter out the dirty data and learn as they go along. This is particularly useful for our regular clients, who come back time and again with the same type of dirty data issues. Over time the machine learning filters become experienced at dealing with the particular problems those clients present, offering a better service than software which does not learn as it goes along.
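The idea of a filter that "learns as it goes along" can be illustrated with a toy sketch: it records corrections approved on earlier runs and reapplies them automatically the next time the same dirt appears. This is a deliberate simplification for illustration only, not Spotless Data's patent-pending algorithm.

```python
class LearningFilter:
    """Toy filter that remembers approved fixes and reapplies them."""

    def __init__(self):
        self.known_fixes: dict[str, str] = {}

    def teach(self, dirty: str, clean: str) -> None:
        """Remember a correction approved on a previous run."""
        self.known_fixes[dirty] = clean

    def clean(self, values):
        """Apply remembered fixes; pass unknown values through unchanged."""
        return [self.known_fixes.get(v, v) for v in values]

f = LearningFilter()
f.teach("N/A", "")        # corrections from a client's first batch
f.teach("Utd", "United")
print(f.clean(["Utd", "N/A", "City"]))  # ['United', '', 'City']
```

Each batch a client approves grows the fix table, which is why a returning client with the same kind of dirty data sees fewer manual interventions over time.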

However, far from representing a threat to human civilisation, we believe our own contribution to the field of artificial intelligence/machine learning will make for better quality, more transparent artificial intelligence, by helping others have clean data which works and which can be more easily understood both by humans and by other machine learning programmes. This is likely the case for 99% of the artificial intelligence/machine learning applications being developed right now.

Most artificial intelligence being built to help humanity

Of course, only a tiny minority of artificial intelligence is likely to actually pose a threat to human civilisation.

Take two scenarios. In the first, an artificial intelligence programme comes up with novel solutions when tasked with improving efficiency in a supermarket chain's supply and demand network, analysing all the factors involved in ensuring that the food and other products customers want are always on the shelf at a price they are willing to pay. In the second, an artificial intelligence programme is set the task of examining all the battles in history, using services such as Google Earth to examine terrain, and suggesting strategies that would help an army win future battles anywhere in the world. We can surely say that the second programme might represent more of a threat to humanity than the first, which poses no threat except to inefficiency.

Unfortunately, criminals are also likely to build their own artificial intelligence/machine learning programmes. But criminals also use cars, and, as a society, we regulate their use and install CCTV; we don't propose banning cars.

The vast majority of machine learning applications, including our own Spotless data quality API offering, do not represent any existential threat but are simply trying to earn a living for their human owners and employees by offering better products and services, more efficiently deployed than ever before.

If governments want regulation, as is their right, the best that artificial intelligence and machine learning-driven companies can offer in the way of transparency, setting aside the risk of competitors viewing their hard work and stealing their intellectual property, is to say to the authorities: "Here are our big data, so spotlessly clean that a vast team of analysts, or more realistically your own government artificial intelligence programme, can examine them at leisure and understand what they mean. Here are our algorithms, and these are the new things they can do which hadn't occurred to us, because our human teams weren't large enough to understand the big data, but which did occur to our artificial intelligence/machine learning applications, and which allow us to provide these products and services to help make a better and safer world."

Using Spotless Data's machine learning filters

Here is our introduction to using our browser-based API. You can try out our service on your My Filters page; however, you will need to be logged in to access this. You can sign up to Spotless Data using your email address, or your Facebook, Google or GitHub account. You can also view our videos on data cleaning an EPG file and data cleaning a genre column, which explain how to use our API.

To show you how smoothly our API filters work, we are giving away 500MB of free data cleansing to each new customer, so you can test it and see why it works for your organisation.

We guarantee that your data are secure and cannot be accessed by any third parties while in our care, a responsibility we take very seriously. If problems do arise during our data cleaning process, an automated flag alerts our data scientists, who will then manually review the problem. If necessary they will contact you via your log-in details so that you can resolve the issue together, though with our easy-to-use filters, and with you, the customer, defining the variables to be cleaned, this happens only very occasionally.

Here is a quick link to our FAQ. You can also check out our range of subscription packages and pricing. If you would like to contact us, you can speak to one of our team by clicking the white square icon with a smile inside a blue circle, found in the bottom right-hand corner of any page on our site.

If data quality is an issue for you, or you have known sources of dirty data but your files are just too big, and the problems too numerous, to fix manually, please do log in and try now.