An artificial intelligence (AI) system is only as good as its training.
For machine learning (ML) and deep learning (DL) systems, the training datasets are a crucial element that defines how the system will operate. Feed a model skewed or biased information and it will produce a flawed inference engine.
MIT recently removed a dataset that had been popular with AI developers. The training set, 80 Million Tiny Images, was scraped from Google in 2008 and used to train AI software to identify objects. It consists of images labeled with descriptions. During the learning phase, an AI system ingests the dataset and 'learns' how to classify images. The problem is that many of the images are questionable and many of the labels are inappropriate. For example, women are described with derogatory terms, body parts are identified with offensive slang, and racial slurs were sometimes used to label minorities. Such training should never be allowed.
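The mechanics matter here: whatever label strings appear in the training set become the only answers the trained system can give. A minimal sketch, using entirely hypothetical data and a 1-nearest-neighbour lookup as a stand-in for any supervised learner, illustrates how the label space is inherited directly from the dataset:

```python
import math

# Toy labelled dataset: each "image" is a small feature vector.
# The label strings here are placeholders; in a real set like
# 80 Million Tiny Images, offensive labels would flow into the
# model's vocabulary in exactly the same way.
training_data = [
    ([0.9, 0.1], "dog"),
    ([0.8, 0.2], "dog"),
    ([0.1, 0.9], "cat"),
]

def predict(features):
    """Return the label of the nearest training example."""
    def distance(example):
        vec, _ = example
        return math.dist(vec, features)
    _, label = min(training_data, key=distance)
    return label

# The model's entire vocabulary of answers comes from the dataset:
label_space = {label for _, label in training_data}
```

The point of the sketch is that `label_space` is computed from the data, not chosen by the developer; if a slur sits in the dataset, it becomes a legitimate output class of the finished system.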
AI developers need vast amounts of data to train their systems. Collections are often assembled out of convenience, without consideration for appropriate content, copyright restrictions, compliance with licensing agreements, people's privacy rights, or respect for society. Unfortunately, many of the available sets were haphazardly created by scraping the internet, social sites, copyrighted content, and human interactions without approval or notice.
Many of the most widely used training datasets have issues. A large number were created by unethically acquiring content, some contain derogatory or inflammatory information, and others are unrepresentative because they exclude groups that would benefit from inclusion.
The problem has become worse over time. Flawed datasets that were made openly available to the developer community early on became so popular that they are now considered standards. These benchmarks are used to compare accuracy and performance across different AI systems and configurations.
Too few datasets are vetted for inclusiveness, accuracy, or socially acceptable content. Using such flawed records is simply unethical, because the resulting systems can be racially charged, biased, and promote inequality.
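Even a basic vetting step would catch the most egregious labels before they reach the learning phase. A minimal sketch, with hypothetical record and label names, of filtering a dataset against a curated denylist:

```python
# Placeholder for a curated list of slurs and derogatory terms;
# a real vetting pipeline would source this from expert review.
denylist = {"offensive_term"}

# Hypothetical raw records as they might arrive from a web scrape.
raw_records = [
    {"image": "img_001.png", "label": "bicycle"},
    {"image": "img_002.png", "label": "offensive_term"},
    {"image": "img_003.png", "label": "tree"},
]

def vet(records, banned):
    """Keep only records whose label passes the denylist check."""
    return [r for r in records if r["label"].lower() not in banned]

clean_records = vet(raw_records, denylist)
```

A denylist is only one layer; it catches known terms but not questionable imagery or consent problems, so it complements rather than replaces human review of a dataset's content and provenance.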
We cannot have good AI if the commonly used datasets create unethical systems. All datasets should be vetted, and both their creators and the product developers who use them held responsible. Just as chefs are held accountable for the ingredients they put into their dishes, so should the AI community be held responsible for allowing poor data to produce harmful AI systems.
Matthew Rosenquist is an industry-recognized strategic security expert with 28 years of experience. He has protected billions of dollars of corporate assets, consulted across industry verticals, built mature security organizations and world-class teams, managed security operations, and developed security products and practical strategies for sustainable security postures. A trusted advisor and evangelist for academia, businesses, and governments around the world, he is a public advocate for best practices and for communicating the risks and opportunities emerging in cybersecurity. He delivers keynotes, speeches, interviews, and consulting sessions at conferences and to audiences around the globe, serves on advisory boards, has attracted a large social following of security peers, and is quoted in news outlets, magazines, and books.