Data Scientists emerged about four years ago as THE must-have employee. Everyone in tech scrambled to brush off the old statistics books from courses they’d taken in college, spent some serious time relearning Python Pandas and R, learned the latest in Machine Learning theory, and bought new lab coats for good measure. I know I did.
If you were a Hadoop developer it was also the place to be, because everyone knew that you couldn’t be a good data scientist if you couldn’t map/reduce. It may even have staved off the imminent collapse of Hadoop companies for a few years more, with Indian programmer mills churning out new Hadoop programmers and data science “specialists” by the thousands to take advantage of the next big thing.
Companies bought into it, big time. Every company worth its place on the Nasdaq board paid these data scientists BIG BUCKS, with the idea that before you knew it, their companies would be surging against their competitors, and sales managers and C-Suite executives could count on powering up their iPads in the morning to see exactly how well their company was operating right then and there. Dashboards became the next big status symbols — senior executives would get the ultra-deluxe dashboards with the 3D visualizations and real time animated scatter plots, while their more junior counterparts would get the flat-tone 2D versions and minimal summary versions.
And yet, for all of that, nothing really changed. The data scientists (most with advanced degrees and years of experience in areas such as pharmaceutical analysis or advanced materials engineering) would come to the realization that the quality of the data they had to work with … well, not to put too bad a spin on it, sucked. People had assumed that simply because you had a thousand databases scattered hither and yon in various silos, you had a huge amount of data within your organization, and that all of it was valuable.
What they discovered instead was that much of it was stale and poorly formatted, often with data models suited to whatever application the programmer who had created the data had needed at the time. They discovered that much of it was in spreadsheets, where it had been modified repeatedly without any process or control (or oversight), and that far from having records of truth, they had a lot of one-off data sets that were poorly documented, had column names like MFGRTL3QREVPRJ, and had absolutely no consistency of keys.
Put another way, the data that they had was pretty much useless for any kind of analysis, let alone the kind of analysis that people who specialized in test result analysis for drug trials did routinely.
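The problems listed above (stale records, undocumented columns, inconsistent keys) can at least be measured before anyone builds a dashboard on top of them. What follows is a minimal sketch, not a real profiling tool: the rows, the `cust_id` and `updated` column names, the data dictionary and the three-year staleness threshold are all assumptions made up for illustration.

```python
from datetime import date

# Hypothetical rows, as they might come out of an exported spreadsheet.
rows = [
    {"cust_id": "C-001", "MFGRTL3QREVPRJ": "1200", "updated": "2012-03-01"},
    {"cust_id": "c001", "MFGRTL3QREVPRJ": "", "updated": "2019-06-15"},
    {"cust_id": "1", "MFGRTL3QREVPRJ": "n/a", "updated": "2011-11-30"},
]

# Hypothetical data dictionary: the columns somebody has actually documented.
documented_columns = {"cust_id", "updated"}


def shape(value):
    """Reduce a key to its pattern: digits become 9, letters become A."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def profile(rows, documented, today, stale_after_days=3 * 365):
    """Count a few symptoms of poor data quality in a list of row dicts."""
    columns = set().union(*(row.keys() for row in rows))
    stale_rows = sum(
        (today - date.fromisoformat(row["updated"])).days > stale_after_days
        for row in rows
    )
    return {
        # Columns nobody has written a definition for.
        "undocumented_columns": sorted(columns - documented),
        # Rows not touched within the (assumed) staleness window.
        "stale_rows": stale_rows,
        # More than one shape suggests keys were assigned inconsistently.
        "key_shapes": sorted({shape(row["cust_id"]) for row in rows}),
    }


report = profile(rows, documented_columns, today=date(2020, 1, 1))
print(report)
```

A report like this does not fix anything, but it gives you a number to put in front of the people who insist the data is fine.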
Now, you’re being paid $150,000 a year to provide dashboards for account executives who don’t know the first thing about statistics, but who are desperate for something that will land them that million dollar plus commission. Your data is both messy and pretty much useless, but entreaties to them about rebuilding their databases are met with cries of horror, because it would be a multi-million dollar initiative that is seen as unnecessary. You can of course just lie about what they are getting and rig up a random number generator that probably provides more accurate data than what they have now, but the thing about people who work with data is that they really have a problem trying to be dishonest, because it goes against their basic objectives of trying to be accurate. So what do you do?
Now, I can put on my semantics evangelist hat and tell you that you should develop a semantic data hub. You should, actually, it’s not that hard to do and there are some real benefits to doing so in this space, but I’ll also say that it is not a magical solution. It makes it easier to get the data into a form where you can do something with it (if nothing else, figure out what’s garbage so you can get rid of it), but the reality is that this is not a data science problem — it’s a data quality and ontological engineering problem.
So, shifting over to those of you who are wearing the executive suits, it’s worth making a few things much clearer. You have a data problem. Your data scientists have all kinds of useful tools to bring to the table, but without quality data, what they produce will be meaningless. This is not their fault. It’s yours, and every day that you waste expecting the fancy dashboards that’ll win you that ten million dollar contract is a day where you’re watching money go out the door.
Your job is not simple. What you need to do is first determine the information that you actually want to track, then spend time talking with your data scientists and your data ontologists to figure out what data you need. Do not assume that you can point to a database and that the data will magically be there.
Databases for the most part are used by programmers to write applications, not to provide deep metrics within your company. Sit down and work out what resources you do have, with the understanding that people who depend upon those databases for their own work are going to be VERY reluctant to give you access, especially access that could impact their responsibilities. Furthermore, understand that most databases are at best poorly documented (most are not documented at all), and much of that data will consequently need to be ferreted out from cryptic references. This is called forensic computing, and most programmers hate doing it, because it means getting into the heads of other programmers who 1) are no longer there, 2) are of unproven levels of competency, and 3) have likely forgotten what they wrote ten years ago.
Relational data lakes do not solve this problem. The only thing that data lakes do solve is making all of the data accessible to the same computer processes. This is a necessary part of such forensic computing, but it is neither the hardest nor the most expensive part. The most expensive part is figuring out what that data actually means, and getting disparate data sets to even recognize when they are talking about the same things. There’s no off-the-shelf solution for that, and if anyone tells you there is, they are blowing smoke.
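To make the "same things" problem concrete, here is a minimal sketch of the easy part: lining up keys from two silos that were formatted differently. Everything in it is an assumption for illustration, including the two sample exports and the normalization rule (keep the digits, drop leading zeros). A rule like that has to be worked out per data set, and it can be wrong.

```python
import re

# Hypothetical exports from two silos that describe some of the same customers.
crm = {"C-001": "Acme Corp.", "C-002": "Globex"}
billing = {"c001": "ACME Corporation", "0003": "Initech"}


def normalize_key(raw):
    """Keep only the digits and strip leading zeros (an assumed convention)."""
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0") or "0"


def reconcile(left, right):
    """Pair up keys that normalize to the same value; list the leftovers."""
    left_by_key = {normalize_key(k): k for k in left}
    right_by_key = {normalize_key(k): k for k in right}
    shared = left_by_key.keys() & right_by_key.keys()
    matched = {k: (left_by_key[k], right_by_key[k]) for k in shared}
    only_left = sorted(left_by_key[k] for k in left_by_key.keys() - shared)
    only_right = sorted(right_by_key[k] for k in right_by_key.keys() - shared)
    return matched, only_left, only_right


matched, only_crm, only_billing = reconcile(crm, billing)
print(matched)       # candidate matches that still need human review
print(only_crm)      # keys with no counterpart in billing
print(only_billing)  # keys with no counterpart in the CRM
```

Note what the sketch does not do: it pairs C-001 with c001, but someone still has to confirm that "Acme Corp." and "ACME Corporation" are the same customer. That confirmation is the expensive part.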
Again, I’ll put in a plug here for semantic solutions: graph triple stores, RDF, ontology management, query and the whole nine yards. It isn’t an out-of-the-box solution, but it is a tool that can make this kind of forensic analysis feasible and can put the means for managing this process into the hands of programmers.
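For readers who have not seen the triple idea before, the sketch below shows the shape of it in plain Python. It is not RDF and not a triple store, only an illustration of recording facts as (subject, predicate, object) statements and querying them by pattern. The `crm:`, `billing:` and `ex:` identifiers are hypothetical.

```python
# Each fact is a (subject, predicate, object) triple. All identifiers are made up.
triples = {
    ("crm:C-001", "ex:name", "Acme Corp."),
    ("billing:c001", "ex:name", "ACME Corporation"),
    ("crm:C-001", "ex:sameAs", "billing:c001"),
    ("billing:c001", "ex:sourceColumn", "MFGRTL3QREVPRJ"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def names_for(store, entity):
    """Collect names for an entity and anything directly linked by ex:sameAs."""
    linked = {entity}
    linked |= {o for _, _, o in match(store, subject=entity, predicate="ex:sameAs")}
    linked |= {s for s, _, _ in match(store, predicate="ex:sameAs", obj=entity)}
    return sorted(
        o for e in linked for _, _, o in match(store, subject=e, predicate="ex:name")
    )


print(names_for(triples, "crm:C-001"))
```

The point of the design is that the statement "these two records are the same thing" becomes data in its own right, recorded once and queryable, rather than knowledge buried in one programmer's join.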
However, understand that this will often require you to rethink the whole process of data flow, of understanding how you are capturing information in the first place and how to funnel it into the appropriate channels early. It requires that your programmers and database administrators give up a certain degree of autonomy and work from a centralized (if federated) store, and it means that you as an executive need to become more familiar with the world of data governance and provenance.
This is a pretty radical shift for people in business, more than a few of whom see getting their hands dirty dealing with IT as beneath them. However, businesses today are transforming (and for the most part have transformed) into data management companies that happen to sell goods or services. The role of a CEO today is as much about knowing what the data inputs and data outputs of their organization are as it is about managing sales. It means being able to ensure that the quality of their data is the best that it can be, not just for the sake of regulatory compliance but because the integrity of that data is ultimately what will make you succeed in the marketplace.
This means working with your executive data teams to determine the scope of what you need to know, what you would like to know, and what’s irrelevant, then to establish the processes necessary to gather the data that is relevant to your business needs. Simply pointing a socket at a database and extracting its contents is not going to do anything but increase your overall disk storage costs, and hiring a data scientist to analyse crap data is only going to produce crappy analysis. It will be pretty, mind you, full of gradients and three-dimensional effects, but useless.
Kurt Cagle is a self-described consulting Ontological Engineer and Principal at Semantical LLC. He also writes extensively on data management, futurism and data science issues under the hashtag #theCagleReport on Medium, Future Syn, Data Science Central, LinkedIn and elsewhere. He can be reached at email@example.com.
Kurt is the founder and CEO of Semantical, LLC, a consulting company focusing on enterprise data hubs, metadata management, semantics, and NoSQL systems. He has developed large scale information and data governance strategies for Fortune 500 companies in the health care/insurance sector, media and entertainment, publishing, financial services and logistics arenas, as well as for government agencies in the defense and insurance sector (including the Affordable Care Act). Kurt holds a Bachelor of Science in Physics from the University of Illinois at Urbana–Champaign.