Beyond the ethical dilemmas surrounding the development of facial recognition systems3, the Buolamwini case clearly shows how a system based on artificial intelligence can acquire a bias and perform the task it was designed for better on one group of individuals than on another. This idea, which we have illustrated with images, can be extrapolated to the other types of data we were discussing: if we wanted to train a system to translate text from English to Spanish, we would need many texts written in both languages.
To infer a person's mood from their voice, we would need audio recordings of people speaking, each with a corresponding label indicating whether the speaker is happy or sad. If we imagined a system that automatically detects pathologies from radiographic images, we would need pairs of images with their corresponding medical diagnoses. Or, if we wanted to train a model to detect faces in images, we would need a database of photos of people, with labels indicating where each person's face is located.
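As a minimal illustration of what such labelled data looks like in practice, the sketch below pairs a few made-up acoustic features with a mood label, echoing the voice example above; the feature names, values, and labels are purely hypothetical.

```python
# A minimal sketch of "labelled data": each training example pairs an input
# (here, a few invented acoustic features extracted from a recording: pitch,
# energy, speech rate) with the label the system should learn to predict.
labelled_examples = [
    # (features,               label)
    ((210.0, 0.82, 4.1), "happy"),
    ((190.5, 0.75, 3.8), "happy"),
    ((110.2, 0.31, 2.2), "sad"),
    ((125.8, 0.40, 2.5), "sad"),
]

# Separating the inputs from their labels is usually the first step before training.
inputs = [features for features, label in labelled_examples]
labels = [label for features, label in labelled_examples]
```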
As we can see, data plays an essential role in training systems through machine learning, since it is the source of information that tells the system when it has reached correct conclusions and when it has not. Something fundamental in this process, and not always taken into account, is that a system is rarely built to make predictions on the very data with which it was trained. On the contrary, models are expected to draw correct conclusions about data never seen during “learning” – the test data – whose labels are not known.
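A short sketch can make this separation concrete. Assuming a standard toolkit such as scikit-learn and a synthetic dataset (both are illustrative assumptions, not part of the text above), the model learns only from the training portion of the data and is then judged on examples it never saw:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic labelled dataset: 1,000 examples, 20 features, 2 classes.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 25% of the examples: the model never sees them while learning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # "learning" happens only on the training set

# What matters is how well the model does on data never seen during training.
predictions = model.predict(X_test)
print("Accuracy on unseen test data:", accuracy_score(y_test, predictions))
```

The held-out test set stands in for the new, unlabelled data the system will face once deployed: good performance there, rather than on the training set itself, is what indicates the model has actually generalised.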