The effectiveness of big data analytics, artificial intelligence, and other data-dependent technologies relies heavily on the quality of the data they process. While researchers continue to explore and test many methods of data cleansing, few have proven as effective as data cleaning with artificial intelligence (AI).
You can have the most sophisticated algorithms. You can have the most powerful hardware architecture. You can even hire the most capable data scientists and researchers money can buy, and still fail in your analytics or AI initiative. Why? The lack of clean, quality data is the most likely answer. The phrase “Garbage in, garbage out” isn’t thrown around by data scientists for nothing -- it reminds us of the cardinal fact that the quality of the data determines the quality of the results achieved by any application that relies on it.
Processing unfiltered, unorganized, and unclean data with any AI or analytics application leads to inaccurate, less-than-ideal results. And since these applications drive critical business decisions, any inaccuracies in their output translate into suboptimal plans, which may lead to ineffective action. Depending on the complexity of the applications, the outcome of using unclean data may be much worse than mere ineffective action. For instance, AI systems trained on bad data can produce extremely skewed responses, as demonstrated by the experiment of creating a “psychopathic AI” by training it on highly biased data. This experiment -- along with many other cases of AI applications going wrong -- points to the quality and kind of training data as a recurring and major cause of bad outcomes.
Thus, ensuring that the data used for analytics and AI training is free from error, bias, and other flaws is necessary for the risk-free operation of these tools. The practice of data cleaning with AI is now emerging as the best way of eliminating bad data and ensuring that all data is usable by -- among other tools and technologies -- AI itself.
Bad data, while lacking a precise definition, takes many forms. Bad data is any data that gives a less-than-clear picture of the situation it represents, making any decision based on it less than ideal for that situation. Since bad information is worse than no information, bad data leads to decisions made on wrong assumptions and premises, resulting in anything from slightly off to outright disastrous decision-making. The following are a few of the key characteristics of bad data:
As mentioned above, misinformation is worse than no information. For multiple reasons, mostly arising from manual data entry, the information stored in a business’s databases is often inaccurate. Thus, it does not accurately represent the state or situation it purports to represent. These errors are not easy to spot unless they are unreasonably skewed, and even then, such data is usually identified only after the outcome turns out to be less than ideal. The impact of erroneous data can be severe, depending on the criticality of the data and the operations it is used in. For instance, using erroneous data in healthcare applications may cause serious harm to human life, to say nothing of the impact on the organization concerned.
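Simple statistical checks can surface the "unreasonably skewed" entries described above before they reach an analytics pipeline. The sketch below is a hypothetical example using pandas: it flags suspect values with Tukey's interquartile-range rule, which is less distorted by the outlier itself than a mean-based z-score check would be.

```python
import pandas as pd

# Hypothetical heart-rate entries; 700 bpm is almost certainly a typo for 70.
rates = pd.Series([72, 68, 75, 700, 80])

# Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are suspect.
q1, q3 = rates.quantile(0.25), rates.quantile(0.75)
iqr = q3 - q1
suspect = rates[(rates < q1 - 1.5 * iqr) | (rates > q3 + 1.5 * iqr)]
print(suspect.tolist())  # [700]
```

Flagged values still need human review -- a check like this only tells you which entries deserve a second look, not what the correct value is.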
Incomplete data is just as bad as inaccurate data. This is because, similar to inaccurate data, it fails to give decision-makers the full picture -- or in this case, gives an incomplete one. Missing information can, in some cases, be accounted for or at least estimated through techniques like interpolation, but the estimates may still end up far from accurate. Thus, incomplete information also leads to weakly founded decisions.
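As a minimal sketch of the interpolation idea, assuming a simple numeric series with gaps (the sales figures are hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical daily sales series with gaps from missed data entry.
sales = pd.Series([100.0, np.nan, 110.0, np.nan, np.nan, 125.0])

# Linear interpolation estimates each missing value from its neighbors.
filled = sales.interpolate(method="linear")
print(filled.tolist())  # [100.0, 105.0, 110.0, 115.0, 120.0, 125.0]
```

Note that interpolation assumes the missing values vary smoothly between their neighbors -- a reasonable guess for a time series, but exactly the kind of estimate that can still land far from the true figure.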
Inconsistencies can occur in databases when data is sourced and compiled from different locations, hardware, or platforms. This means that a database used for analytics, or one fed to an AI algorithm, may contain information stored in different formats, leading to errors of interpretation. This can skew the analysis of the subject or situation under consideration, again leading to wrong or suboptimal decisions and outcomes.
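A common fix for such inconsistencies is normalizing every record to a single canonical format. The sketch below assumes the source systems' date conventions are known in advance; `KNOWN_FORMATS` and `normalize_date` are illustrative names, not a standard API.

```python
from datetime import datetime

# Hypothetical date conventions used by three different source systems.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def normalize_date(value: str) -> str:
    """Try each known source format and emit a single ISO-8601 form."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

print([normalize_date(d) for d in ["2023-01-15", "15/01/2023", "01-15-2023"]])
```

One caveat: a string like "01-02-2023" is genuinely ambiguous between day-first and month-first conventions, so knowing which system a record came from matters as much as the parsing code itself.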
Data that is used for analytics or any other purpose should be as relevant as possible to the decision being made or the problem being solved. Relevance could be in terms of date, location, or any other variable parameter. For instance, the data from a customer sentiment survey in Europe would be irrelevant -- and hence invalid -- if used for gauging customer sentiment in the Asia-Pacific region. Although such data is not exactly inaccurate, it is only accurate and valid for making decisions under specific constraints and conditions.
Although duplication of data may seem a harmless flaw, non-unique records can be as big a problem as any other. Datasets, especially those commercially acquired from different sources, may contain non-unique entries. Analyzing large volumes of data from such databases can produce the same misleading results as using inaccurate data.
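Removing such non-unique entries is one of the more mechanical cleaning steps. The sketch below, with hypothetical customer records, treats the email address as the unique key and drops the repeats:

```python
import pandas as pd

# Hypothetical customer records compiled from two purchased datasets.
customers = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "name":  ["Alice", "Bob", "Alice"],
})

# Treat the email address as the unique key and keep the first occurrence.
unique = customers.drop_duplicates(subset="email", keep="first")
print(len(unique))  # 2
```

The hard part in practice is choosing the key: two records that differ only by a typo ("Alice" vs "Alicia") are duplicates in spirit but will not be caught by an exact match, which is where fuzzy matching and AI-based record linkage come in.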
Using bad data for analytics means making bad decisions, or at the very least, uninformed decisions. In a business context, it means unprofitable decisions and actions. The annual cost of bad data has been estimated at over $3 trillion in the US alone. Decisions made using such data are nothing short of blind guesses, which defeats the purpose of deliberate decision-making. Relying on analytics built on bad data is a big risk, and using similarly bad data for training and driving AI algorithms can lead to disastrous outcomes. This is because AI algorithms, in addition to analyzing data just as analytics tools do, can also act on the results of those analyses without human supervision. An AI's ability to act autonomously means that the effects of using bad data become evident only after the fact. And, depending on the task assigned to the AI, the effects could be devastating. For instance, using AI to suggest medications for a patient can go horribly wrong if the data pertaining to that patient's medical history and health conditions is not in order. Similarly, AI systems used to manage enterprise-wide operations by supporting decision-making must have good, high-quality data. Anything less than accurate, complete, consistent, and valid data can lead to bad decisions.
Data cleansing, despite being important, can be hard, time-consuming, and inefficient, and yet still potentially ineffective. To make the process more effective, data scientists can perform data cleaning with AI. Businesses can use AI to cleanse large volumes of data in significantly less time while ensuring the consistency, completeness, and validity of the data. AI can also help apply statistical techniques like interpolation and imputation to deal with incomplete datasets, ensuring that even missing values are reasonably estimated from existing values to maintain the integrity of the dataset. Businesses can automate the entire process of data gathering, data validation, and data cleaning with AI to ensure that the right information is always ready to be accessed by the concerned personnel.
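As a minimal sketch of the imputation technique mentioned above -- here simple mean imputation with pandas, on hypothetical sensor columns -- each missing value is replaced by its column's average:

```python
import pandas as pd
import numpy as np

# Hypothetical sensor readings with gaps (np.nan marks missing values).
df = pd.DataFrame({
    "temp":     [20.0, np.nan, 24.0],
    "humidity": [30.0, 35.0, np.nan],
})

# Mean imputation: fill each NaN with the mean of its own column.
imputed = df.fillna(df.mean())
print(imputed)
```

Mean imputation is the simplest strategy; more sophisticated, model-based imputers can estimate a missing value from the other columns of the same row, which is closer to what AI-driven cleaning tools do at scale.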
Thus, there is a clear interdependence between good data and AI. Businesses should realize this and, while investing in better ways of utilizing data for business decision-making, they should also invest in tools for cleaning the data they use. Although AI may not be able to fully take over the role of data scientists, performing data cleaning with AI can definitely make their jobs easier and help them become more productive.
Naveen is the Founder and CEO of Allerin, a software solutions provider that delivers innovative and agile solutions that enable businesses to automate, inspire, and impress. He is a seasoned professional with more than 20 years of experience, including extensive experience in customizing open-source products for cost optimization of large-scale IT deployments. He is currently working on Internet of Things solutions with big data analytics. Naveen completed his programming qualifications at various Indian institutes.