Why Programmers Are Not Data Scientists (and Vice Versa)

Kurt Cagle 30/06/2020

Hot jobs come in waves, and not surprisingly, the information technology sector is as prone to following fashions as teenagers are.

There is a good reason for this, of course. The hot IT jobs are where the money is, and if you want to play in that market, then you need to have the skills or training to participate. Otherwise, you run the risk of watching your income fall as you're relegated to lesser paying jobs, or worse, are forced into IT management, doomed never to touch a compiler again, while never quite managing to play in the big leagues with the C Suite (I may be exaggerating a bit here, though not necessarily by much).

Over the years the role of programmers as generalists has faded even as their importance as tool creators to assist others has grown dramatically.

Programmers Are Tool Makers, Data Scientists Are Tool Users

Around 2015, as the Big Data / Hadoop hoopla was beginning to fade, the big tech industry analysts came to the realization that with all of this big data out there, you needed someone to make sense of that data. Data analytics tools have been around forever, but for the most part they were specialty tools that mathematicians, biostatisticians, actuaries, and others of that particular ilk used - SAS, SPSS, Mathematica, MATLAB.

Then along came R. R was not intended as a general programming language, but rather as a data analysis language, though it traces its lineage to the S language and borrows its scoping semantics from Scheme. R is not new - it debuted in 1993 and helped to facilitate a lot of the heavy lifting that statisticians needed to create pipelines and work with datasets. Statisticians use it because when you're crunching numbers, being able to parameterize functions is important, and sometimes a command line interface (a CLI) is precisely the right tool for the job.

However, it's worth noting that for all that R is a programming language, the purpose of running R is generally not to build applications - it is to generate reports based upon the analysis of data by people who understand how to analyze data. They have statistical training, they generally have a pretty good idea about concepts such as distributions, margins of error and data sampling, and they are usually asking a particular question - why does the data look the way that it does? What is the story that the data is trying to tell?

Now, it turns out that there's another group out there that asks much the same question: business intelligence analysts. Note the word analyst here. Programmers, in general, ask a different question: how can I build a tool to solve a problem? There is still a certain aspect of analysis here - decomposing a problem so that it can be recomposed via some kind of modular framework into components that ultimately end up working as a functional unit - but the focus, in general, is not on the data itself, it's upon the tools that manipulate that data. Analysts basically look at the data, using the tools that programmers make, in order to draw conclusions that are consistent with sound statistical principles.

To a typical business person, this distinction might not really make that much difference. A programmer and an analyst are, to borrow a term from the television show Bones, squints, people who spend all of their time doing weird mysterious things with computers that require close focus, and hence, reading glasses. However, and again to borrow from a somewhat dubious source (MBTI), programmers are INTJs while analysts are INTPs.

Programmers are engineers. They are fascinated by building things and they see systems as basically gigantic tinker toy sets that allow them to build ever more complex things. Analysts, on the other hand, are interested in understanding how and why systems work, and as such are much more focused on ascertaining the patterns that allow for classification. They often work together - theoretical physicists (INTPs) work out the theory, then experimental physicists (INTJs) build the tools necessary to verify or disprove that theory. But their mindsets ultimately are fairly different. This is important from a business standpoint.

How Big Data Led To Big Data Scientists

Fifteen years ago, business intelligence systems were all the rage, and ultimately a class of business person called the business analyst sprang up. Business analysts wore suits, but you could always tell that they weren't "real" managers because they were the ones always hunched over their keyboards building up complex business models in Excel. They were the ones that gravitated towards the BI suites when they came out because those suites allowed them to do better statistical analysis and to work with intricately connected datasets, though even they would readily admit that the BI tools were not quite as accurate as the vendors told them they could be.

Once big data put large amounts of what had previously been siloed application data and transactional logs into the hands of the business analysts, those same people came to realize that what they were doing was not really all that different from what scientists were doing - modeling, analysing, presenting. Their shift to the dark side was complete. They had become data nerds, and the big business analysis firms picked up on the trend, dubbing this new generation of analysts data scientists.

On the face of it, data scientist is a redundancy: the scientific method is based upon the accumulation of evidence and testability, which means that all scientists are in fact data scientists. As a marketing term, however, it did what the marketers hoped it would do - it gave a veneer of scientific respectability to finance, which, along with psychology and sociology, has always struggled to attain some kind of scientific legitimacy. Suddenly, it became fashionable for people to walk around wearing lab coats and designer nerd glasses, not because they needed either, but because they were scientists, dammit.

Data science became the next big career, and for the first time in a long while, people with PhDs in mathematics were making really good money. Companies wanted their own data scientists to make sense of all of this data that they were generating because surely that Big Hadoop Data Lake they'd spent the last five years creating held some kind of insights in it. Otherwise, they would have wasted their money, and no manager worth their bonus would dare admit that they made a mistake wasting money on dead-end technologies.

Systems thinking is the domain of the data analyst - understanding the inputs and outputs of what keeps a system functioning.

The Importance of Systems Thinking

The business analysts (who were, in fact, the true subject matter experts in their domain) began to be pushed aside by doctorates who could do differential geometry in their heads but had likely never dealt with a business model in their lives, and who were then told to do magic. You can see where this is going. The computations became fancier, the analysis likely became more rigorous, but the modeling, which ALWAYS comes down to understanding your domain of expertise, became sloppier, especially once machine learning neural networks came into play.

Have you ever watched a flock of birds in flight? That's a neural net. Each bird gets sensory input that tells it where its nearest neighbors are, where the ground and other obstacles are, where predators are if it is on the outside edge of the flock, and likely some basic sense about magnetic fields in the immediate area. Any given bird does not have a complete understanding of the whole flock, but it adjusts its actions based upon a few inputs with varying weights. This works reasonably well in most circumstances, and is usually pretty good for determining the actions of a system of autonomous but dependent agents (such as companies in an economy), but if the data that any given sensor (such as a bird or a company) receives is inadequate or not modeled properly, it can result in catastrophe for that particular sensor, and quite possibly for the whole flock or economy.
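To make the "few inputs with varying weights" idea concrete, here is a minimal sketch of a single bird deciding its next heading. It is one weighted-sum unit, not a full neural network, and every name and weight in it (next_heading, the 0.6 / 0.3 / 0.1 split) is an assumption made up for illustration rather than anything measured from real flocks.

```python
def next_heading(current, neighbor_avg, obstacle_escape,
                 w_neighbors=0.6, w_obstacle=0.3, w_inertia=0.1):
    """Blend a few local inputs (all headings in degrees) into one action.

    current         -- the bird's present heading
    neighbor_avg    -- average heading of its nearest neighbors
    obstacle_escape -- heading that steers away from the nearest obstacle
    The weights are illustrative assumptions and should sum to 1.
    """
    return (w_inertia * current
            + w_neighbors * neighbor_avg
            + w_obstacle * obstacle_escape)


# A bird with good inputs drifts toward the flock and away from the obstacle.
good = next_heading(current=0.0, neighbor_avg=10.0, obstacle_escape=20.0)

# The same bird with a bad obstacle reading (a poorly modeled input)
# computes a very different action from exactly the same rule.
bad = next_heading(current=0.0, neighbor_avg=10.0, obstacle_escape=-160.0)

print(good, bad)
```

The rule itself never changes between the two calls. Only the quality of one input does, which is the point of the analogy: the arithmetic is the easy part, and knowing whether the inputs mean what you think they mean is the hard part.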

Many ecologists and economists are systems thinkers - they understand the tools of modeling, and they also understand a particular subject domain, which means that they are pretty good at knowing the limitations of those tools within that domain. Programmers are subject matter experts as well, but primarily in the domain of building tools or algorithms. I can tell you as a programmer how to translate the mathematics of differential geometry (which is what autonomous agent modeling really is) into a numerical method approximation, but understanding what particular variables are important (or even independent) is likely not something I can do very well because I don't understand the domain.

Programmers look at the command line and think that because they understand the function set they can do data science. This has even been reinforced by the appearance of statistical and deep learning tools in Python (which many programmers are comfortable with), and to be honest by the potential to go from the average wage for a Python programmer (now around $85K in the US) to that of a data scientist (around $110K in the US). Some actually do make the jump successfully, but these are generally people who came into programming from other domains and as such have a good basic grasp of the squishier factors of their respective fields.

At the same time, many people with PhDs who enter as data scientists struggle with the fact that the expectations on them include data processing functions that they ordinarily would not even have had to think about, because they controlled the data collection in previous jobs or in academia. Clinical data in controlled environments is relatively clean; business data, especially systemic enterprise data dumped into what could be called a data swamp, is anything but. Such data invariably has hidden assumptions to it, a mixture of encoding formats and frequently poor modeling, and because it was collected primarily as an artifact of a specific process, usually one other than analytics, making that data say anything useful can be challenging even when you do know the domain, let alone when you don't.

On a related note, I want to stress again the importance of enterprise knowledge graphs and metadata/identity management. One of the central problems that both data engineers and data scientists face is the need for organizational data that is consistent across teams and requires the least amount of retranslation possible. This doesn't necessarily mean that every database needs to have the same labels and definitions, but it does mean that where they diverge there is some means of accessing information through a common ontology.
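As a rough sketch of what "divergent labels, common ontology" can look like in practice, the snippet below keeps each source's own field names and maps them onto shared terms at read time. The source names, field names, and the to_common helper are all hypothetical, invented for this illustration; a real enterprise knowledge graph would carry far more than a lookup table.

```python
# Hypothetical per-source mappings from local labels to shared ontology terms.
COMMON_TERMS = {
    "sales_db": {"cust_id": "customer_id", "amt": "order_total"},
    "support_db": {"CustomerNumber": "customer_id", "TicketCount": "open_tickets"},
}


def to_common(source, record):
    """Return a copy of record with its keys renamed to the shared terms.

    Labels with no mapping are passed through unchanged, so nothing is
    silently dropped while the ontology is still incomplete.
    """
    mapping = COMMON_TERMS[source]
    return {mapping.get(key, key): value for key, value in record.items()}


sale = to_common("sales_db", {"cust_id": 42, "amt": 19.5})
ticket = to_common("support_db", {"CustomerNumber": 42, "TicketCount": 3})

# Both records can now be joined on the same key without retranslation.
print(sale["customer_id"] == ticket["customer_id"])
```

Neither database had to change its schema. The consistency lives in the mapping, which is exactly the piece that someone has to own and curate.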

This means that ontologists and curators still play a role (and a growing one) in the overall mix of data professionals in an organization. I hope to address this more in an upcoming article, but keep in mind that the people who manage and organize the metadata of that organization are the ones that make the consistency of results possible.

The days of the lone analyst are long over. In most cases, you need a team of different people: data engineers, analysts, visualizers and storytellers, to be effective.

Data Science Is A Team Effort

From a management perspective, there are many lessons to be gained here. Don't try to turn your programmers into data analysts unless they have a solid analytics background already. If you are getting into deep data analytics as part of your company flow, make sure that you have a good knowledge engineer and data quality crew hired first to handle ingestion from data sources, and bring in your analyst primarily to help the knowledge engineer know what kind of data they need to have to perform their jobs properly.

Once you've proven out this process, then you can start bringing in other analysts, keeping in mind that their goal is both to make sense of the data that they're being handed and ultimately to build models that make predictions about future behavior possible. Hiring a data strategist first is an even better solution because ultimately the goal of such models is to inform decisions about future actions, and having someone who oversees this process is essential.

Recognize that an end-to-end data strategy requires thinking about the data lifecycle as a team effort. In my experience, there is a world of difference between a data analyst, who can build and interpret a given model based upon existing data, and a data whisperer, someone who can take this interpretation and put it into terms that a lay audience can readily understand. The data whisperer can also work with data visualizers, programmers who are adept at creating meaningful visualizations of this information (an R graphic may be suitable for a dissertation or paper, but will likely be meaningless to the typical business person).

Finally, in the post-Covid environment affecting everyone, it may be more useful to work with an outsourced data analytics team at first, to get an idea of how best to utilize data analytics in your own organization before committing to an in-house team. If you use such teams to generate quarterly reports, it's often more cost effective to go external, but if you're at a stage where your data analytics are actually fueling other initiatives, building an internal team makes far more sense.

Conclusions

There are few things stopping programmers from becoming data analysts and vice versa, but it is important within any business to understand what the difference is between the two roles, and how one can (and should) support the other. A data analyst (even one who is part of a broader team) is ultimately the definition of a subject matter expert, someone who can place the necessary context of a given field into perspective to determine the answers to questions. Programmers, for the most part, are tool builders who provide the tools necessary for analysts to better perform their own roles, as well as to help to visualize and otherwise prepare the analysis for dissemination. Both should know their way around a command line, but what they do with it can be very, very different.

Comments (6)

Kyle Higgins

Data science is more fun, but less stable.

about 4 years ago
Rob Sinclair

Thanks for the explanation

about 4 years ago
Nayan Pate

Programming is more exciting for me. I like the adrenaline.

about 4 years ago
Sam Green

Well explained, thank you !

about 4 years ago
Robbie Beames

Data scientist is a fairly new job title and the labor force of data scientist is pretty small.

about 4 years ago
Jenny Grace

Spot on Kurt !

about 4 years ago

Kurt Cagle

Tech Expert

Kurt is the founder and CEO of Semantical, LLC, a consulting company focusing on enterprise data hubs, metadata management, semantics, and NoSQL systems. He has developed large scale information and data governance strategies for Fortune 500 companies in the health care/insurance sector, media and entertainment, publishing, financial services and logistics arenas, as well as for government agencies in the defense and insurance sector (including the Affordable Care Act). Kurt holds a Bachelor of Science in Physics from the University of Illinois at Urbana–Champaign.
