The recent spat between Elon Musk and Mark Zuckerberg brought the alleged dangers of Artificial Intelligence (AI) and Machine Learning (ML) to the world's attention. Musk evoked old fears of AI, explored previously by the science-fiction writer Isaac Asimov and in the Terminator films, that robots powered by AI could rise up, take control and even kill us all.
More practically, Musk proposed, as have various other commentators recently, that AI and ML need tight government regulation in order to "be safe". Others have demanded transparency for AI and ML so that governments and other interested parties can understand what these programmes do and why they do them.
At Spotless we believe that the best and most fundamental way to address both the possible dangers and any lack of transparency in AI and ML is to ensure that the data used by these programmes, and the algorithms which drive and underpin them, are of such spotless cleanliness that anyone, given enough time, can understand them. This is the best way to address both safety and transparency, and it also allows organizations building AI and ML applications to fulfill their compliance duties to the authorities in the territories in which they operate.
Essentially there are two elements that make up AI programmes: big data, which are so huge that it would take a team of humans years to read and understand them, and ML applications, which use these data to do things the applications were not specifically designed to do.
A good and simple illustration of these two elements of AI can be found in certain games, such as chess, which in recent years have made world headlines as specialized computer programmes beat the world's greatest players.
The programmes are able to examine and learn from literally millions of possible game scenarios before playing an actual chess game. They are then able to play and beat the human world champion even though those who programmed the chess-playing software were clearly not capable of beating the world champion themselves. Thus the ML programme was clearly engaging in behaviour it was not specifically designed to exhibit, i.e. the brilliant moves that ensured it won the games. It was able to do so because of what it had learnt from studying a vast number of games: so many that no human, even in a whole lifetime, could study a tenth of them, let alone recall the details of every one while playing against the world champion.
Creating a world-beating chess software application is actually relatively simple compared to many of the other tasks we try to get AI software to do, but it does follow the same basic principles. The key is to build a data lake, or even better a data warehouse, and to fill it with the millions of games which the AI/ML software can examine before the real game starts. If the data science team are able to fill the warehouse with game scenarios, real or imagined, in a format they have defined themselves, then the repository has data quality and the AI/ML application will find it straightforward to read and grasp the intricacies of all the games it has access to. However, if the data science team simply gathers together records of all the publicly available chess games from history and puts them straight into a data lake, it is unlikely that the AI/ML software could beat even the worst chess player within the data science team.
The AI programme would be baffled time and again. For example, while in the majority of games the two players are white and black, occasionally games are recorded as being between white and red. What does this third colour represent, the software may ask itself? In reality it just represents dirty data, as black and red here denote the same side in chess, the one which opposes white and moves second. Realising this, the data science team could simply set an instruction within the algorithm they are writing that red and black are synonymous terms. Yet if they do this for every dirty data anomaly they can identify, they will soon have hugely unwieldy algorithms, all in order to avoid cleaning the data! For instance, there are at least four different chess notation systems which allow AI/ML software or a human to understand what move has taken place, and that is in English alone. Top-quality chess game records have been written in many different languages, with Russian, home to many human world champions, not even sharing the same alphabet as English. Even in a simple case like chess, standardizing the data is a necessity and not a luxury if you are to trust the quality of the data.
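To make the chess example concrete, here is a minimal sketch in Python of cleaning records before they enter the repository, rather than teaching the algorithm about every anomaly. It is an illustration only and not a description of the Spotless filters: the record layout, the function names and the synonym table are all assumptions made up for this example. The German piece letters (K, D, T, L, S) are used as one example of a non-English notation.

```python
# Hypothetical synonym table: every colour label seen in the raw records
# maps to one canonical value, so "red" is cleaned to "black" once, up front.
COLOUR_SYNONYMS = {
    "white": "white",
    "w": "white",
    "black": "black",
    "b": "black",
    "red": "black",
}

# German piece letters mapped to their English equivalents
# (Koenig, Dame, Turm, Laeufer, Springer -> King, Queen, Rook, Bishop, Knight).
GERMAN_TO_ENGLISH_PIECES = {"K": "K", "D": "Q", "T": "R", "L": "B", "S": "N"}


def normalise_colour(raw):
    """Return the canonical colour, or fail loudly on an unknown label."""
    key = raw.strip().lower()
    if key not in COLOUR_SYNONYMS:
        raise ValueError(f"unrecognised colour label: {raw!r}")
    return COLOUR_SYNONYMS[key]


def normalise_move(move, piece_map):
    """Translate a leading piece letter; pawn moves and castling pass through.

    Simplification: only the first character is translated, so promotion
    suffixes and other notation systems are not handled here.
    """
    move = move.strip()
    if move and move[0] in piece_map:
        return piece_map[move[0]] + move[1:]
    return move


def clean_record(record, piece_map=None):
    """Return a cleaned copy of one game record (assumed layout, see above)."""
    piece_map = piece_map or {}
    return {
        "first_player": normalise_colour(record["first_player"]),
        "second_player": normalise_colour(record["second_player"]),
        "moves": [normalise_move(m, piece_map) for m in record["moves"]],
    }


raw_game = {
    "first_player": "White",
    "second_player": " RED ",
    "moves": ["Sf3", "e4", "Dd8", "O-O"],
}
cleaned_game = clean_record(raw_game, GERMAN_TO_ENGLISH_PIECES)
```

Raising an error on an unknown colour label, rather than guessing, is a deliberate choice in this sketch: an anomaly nobody has seen before is surfaced for a human to look at instead of slipping quietly into the data lake.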
The possibilities of dirty data contaminating the data lake and turning it into a data swamp are massive even with simple examples. How much more is this going to be the case with more complex, real-life issues such as an AI programme working for a hedge fund examining financial data on a daily basis in order to make the decisions which will make or break the fund?
Many people tend to think that AI and ML are things that only big US-based companies such as Facebook and Amazon engage in, but the reality is that there are many thousands of start-ups throughout the world which are using ML to develop new products and services that did not exist before, not to mention the many established non-IT companies who have lots of big data and are now starting to use ML to analyse them and develop business intelligence. This means that any regulatory approach will need to be worldwide and not merely US-centric.
We at Spotless are an interested party, recognising that our own data quality API solution is based on Machine Learning filters and a patent-pending algorithm, which literally filter out the dirty data and which learn as they go along. This is particularly useful for our regular clients, who come back time and again with the same type of dirty data issues: the ML filters become experienced at dealing with the particular problems these clients present to Spotless, and so offer a better service than a software programme which is not based on ML and does not learn as it goes along.
However, far from representing a threat to human civilization we believe our own contribution to the field of AI/ML will make for better quality, more transparent AI by helping others have clean data which works and which can be more easily understood both by humans and by other Machine Learning programmes. This is likely the case for 99% of the AI/ML applications being developed right now.
Of course only a tiny minority of AI is likely to actually offer a threat to human civilization.
Take two scenarios. In the first, an AI programme comes up with novel solutions when it is tasked with improving efficiency in the supply and demand network of a supermarket chain, analysing all the factors involved in ensuring that the food and other products which customers want are always available on the supermarket shelf at a price the customer is willing to pay. In the second, an AI programme is set the task of examining all the battles in history, as well as using services such as Google Earth to examine terrains, in order to suggest strategies that would help an army to win future battles anywhere in the world. We can surely say that the second AI programme might represent more of a threat to humanity than the first, which does not represent any kind of threat except to inefficiency.
Unfortunately criminals are also likely to build their own AI/ML programmes. But criminals also use cars, and as a society we respond by regulating the use of cars and installing CCTV; we don't propose banning cars.
The vast majority of ML applications, including our own Spotless data quality API offering, do not represent any kind of an existential threat but are simply trying to earn a living for their human owners and employees by offering better products and services, more efficiently deployed than ever before.
If governments want regulation, as is their right, the best that AI and ML-driven companies can offer in the way of transparency, ignoring any problems of their competitors being able to view their hard work and thus steal their intellectual property, is to say to the authorities: "Here are our big data, which are so spotlessly clean that a vast team of analysts, or more realistically your own government AI programme, can examine them at their leisure and thus understand what the data mean. Here are our algorithms, and these are the new things they can do which hadn't occurred to us, because our human teams weren't large enough to understand the big data, but which did occur to our AI/ML applications, and which allow us to provide these products and services to help make a better and safer world."
Here is our introduction to using our browser-based API. You can try out our service on your My Filters page; however, you will need to be logged in to access this, which you can do here using your email address or your Facebook or GitHub account.
In order to show you how smoothly our API filters work we are giving away 100MB of free data cleansing to each new customer so you can test it and see why it works for your organisation.
We guarantee that your data are secure and cannot be accessed by any third parties during the time they are in our care, a responsibility we take very seriously. If problems do arise during our data cleansing process, an automated flag alerts our data scientists, who will then manually review the problem and, if necessary, contact you via your log-in details so that you and they can resolve the issue together. With our easy-to-use filters and with you, the customer, defining the variables to be cleaned, this only happens very occasionally.