There are so many articles and social media posts out there attempting to define the difference between data analyst and data scientist roles, a goal that I’ve never quite understood.
This distinction varies so widely from industry to industry and company to company that it seems impossible to draw clear, generalizable lines between the two titles. One of the few commonalities that I’ve seen in these distinctions is the designation of data analysts as second-rate data professionals, which I believe leads many job-seekers to reject data analyst roles altogether. As someone who thoroughly enjoys her data analyst role, in part due to its challenges and constant opportunities for learning, I hope to counteract some of these generalizations by providing insight into what this title means at my organization. What I write below is based on my experiences and will be generalizable to other research organizations in varying degrees — yet again, there are few universal truths when it comes to data-related jobs and context is always important!
Here are 10 ways data analyst roles are different in a research organization:
Many organizations have a promotion structure that goes something like this: data analyst → data scientist → senior data scientist → principal data scientist or manager. Our organization has no comparable structure. We have research assistants, data analysts and project directors, and then principal investigators who lead teams consisting of the other three positions. There is also a rigid promotion ceiling based upon the highest degree you hold — research assistants have Bachelor’s degrees, data analysts and project directors have Master’s degrees, and principal investigators need PhDs. This strict ceiling is why, as much as I love my job, I will need to move to another organization when I’m ready to move up the ladder.
With the exception of the data analyst, none of these roles is explicitly data-focused, and each can involve a wide range of day-to-day tasks. Research assistant positions are entry-level, and I was hired as one immediately following college graduation. Much of the work that I did in my first year as a research assistant was similar to that typically associated with a data analyst — I ran lots of frequency tables and simple regression models, among other tasks. I spent two years as a research assistant while I was in graduate school, and was able to grow in this role to work on more complicated data analysis tasks in my second year. During that second year I was doing more work conventionally associated with data scientists — complex time series analyses, random forest analyses, helping to determine analysis plans for our data.
In terms of entry-level positions for aspiring data professionals, research assistant positions are somewhat unique in that much of the work is not specifically related to data. In addition to developing various data-related skills, I was also working on the qualitative coding of crisis counselor behavior in Lifeline crisis chats and the occasional literature review — tasks which have provided me with important substantive knowledge and given me a deeper understanding of the data I work with now. Plenty of research assistant positions do not involve any data work at all, and can consist of any range of tasks that support research studies. It’s therefore very important to get a clear sense of what’s expected of you before accepting this kind of role.
As a data analyst now, I deal almost exclusively with data-related tasks requiring a wide range of skill levels. I still run plenty of frequency tables and basic regression models, but most of my time is spent determining and executing various analysis plans for our studies.
There may be a range of team members in attendance at any given meeting where I’m presenting results, but I’m almost always targeting my presentation to our principal investigator and an experienced project director. Both have advanced degrees and decades of experience analyzing data and interpreting analysis results. While neither does much data analysis at this point in their careers and I may have to explain the finer details of an analytic technique, I know I’m speaking to an audience who is perfectly aware of what “statistical significance” means or why we need to group by crisis center for that regression analysis.
The drawback of this environment is that you don’t get as much experience communicating to a general audience. It’s challenging to present complex statistical analyses in a way where the key takeaways are clear and your methods seem trustworthy to an audience with zero background in statistics. Plenty of data-related positions require this kind of communication, and this is an important skill to build if you can.
That being said, there is also a significant benefit to this team environment — since your team largely understands what you’re doing, you’re going to face many tough questions about the specific choices you made along the way, and answering them will help you learn. If you can’t provide a good answer for why you did something, or fully explain what part of your output means or what the algorithm was doing, you need to go back and do your research so you can address those questions in the next meeting.
While there are less consequential deliverables along the way, such as annual reports for projects or short internal presentations for weekly meetings, all of the work that you’re doing is towards the end deliverable of an article for publication in an academic journal. This has a couple of implications for your work.
First, your analyses are fundamentally different from those conducted in many other kinds of organizations in that your goal is to use statistics to explain rather than predict. Let’s take logistic regression as a simple example. In industry, this regression analysis may be used as a classifier — say, which customers are likely to cancel a subscription? Once identified, these customers can be targeted with a special offer. In this case, it’s important to determine which variables create the best model, but we don’t really care about understanding all of the factors determining this subscription cancellation risk.
On our team, however, we’re never applying a regression model to new data. Instead, we’re trying to understand the factors that explain the difference between a 1 outcome and a 0 outcome. For example, we might use a logistic regression analysis to determine if someone referred to treatment ended up in treatment. There are probably important differences that we can see by looking at the coefficients in our model — we can identify disparities by factors such as race, gender, presenting condition, or income with our model, publish these results so that health practitioners are aware of these disparities, and help to build targeted interventions to help close those gaps. Our model is used to help us understand factors associated with getting treatment, but not to help us predict which individuals will do so.
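The coefficient-reading step described above can be sketched in a few lines. The coefficients and predictor names below are invented for illustration — they don’t come from any real analysis of ours:

```python
import math

# Hypothetical coefficients from a fitted logistic regression of
# "entered treatment" (1) vs. "did not" (0) on referral characteristics.
coefs = {"has_insurance": 0.62, "age_over_40": -0.30}

# In an explanatory analysis, each coefficient is exponentiated into an
# odds ratio: exp(beta) is the multiplicative change in the odds of the
# outcome for a one-unit change in that predictor, other covariates fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, or_ in sorted(odds_ratios.items()):
    print(f"{name}: OR = {or_:.2f}")
# has_insurance: OR = 1.86 -> roughly 86% higher odds of entering
# treatment for insured referrals; age_over_40: OR = 0.74 -> roughly
# 26% lower odds for the older group.
```

In an explanatory workflow like this, the odds ratios and their confidence intervals are the result that ends up in the paper — not a predicted label for any individual.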
Second, as the data analyst on a project you are going to have to write the methods and results sections of any resulting journal articles. This means that writing skills are vitally important as there is a high standard for the writing published in peer-reviewed journals. It also means that you need to think critically about how you analyze data and how you represent results so that your approach and findings can be communicated clearly to a readership who knows nothing about your project (but lots about statistics!).
Like many organizations, we collect much of our data ourselves. This process looks different, though, and involves numerous partnerships. There is no infrastructure for automated or routine data collection — rather, studies are approved by an Institutional Review Board (IRB) and data collection occurs over a specified period of time. Data collection is primarily done by our research assistants, and often involves administering surveys or qualitatively coding transcripts. These data collection processes typically involve collaboration with organizations such as schools or crisis centers.
We also receive a significant amount of data directly from these other organizations, such as data on call volume to crisis centers, web visits to an organization’s website, hospital admissions data from various healthcare providers, and mortality data from state organizations. We typically receive these data in the form of a csv file or Excel workbook, and often spend lots of time on data cleaning.
A final bucket of data that we use is publicly available data, such as Google Trends and Twitter data, or aggregated mortality data available through CDC WONDER.
To be fair, there are plenty of situations where small-n large-p problems emerge outside of research, and plenty of research groups that infrequently face these issues. For research such as ours that involves time-intensive data collection with each subject, however, these problems are inevitable. If it is ethically necessary to have a survey conducted by a person rather than a computer (which is the case when surveying suicidal individuals, as my team sometimes does), you’re probably going to end up with a sample size of a few hundred persons with dozens to hundreds of data points on each participant.
This means that much of your time may be spent combating model convergence issues, and that your choice of potential modeling techniques may be limited.
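A minimal sketch of one such convergence problem: with a small, cleanly separated sample, the logistic-regression likelihood has no finite maximum, so an unpenalized fit drifts off toward infinity, while a ridge (L2) penalty — one common workaround — keeps the estimate finite. The one-predictor model and toy data here are my own illustration, not our team’s actual approach:

```python
import math

def fit_logistic_1d(xs, ys, l2=0.0, lr=0.5, steps=5000):
    """Gradient ascent on the (optionally ridge-penalized) log-likelihood
    of a one-predictor, no-intercept logistic model."""
    b = 0.0
    for _ in range(steps):
        grad = sum((y - 1 / (1 + math.exp(-b * x))) * x
                   for x, y in zip(xs, ys))
        grad -= l2 * b          # ridge penalty pulls the estimate back
        b += lr * grad / len(xs)
    return b

# A perfectly separated toy sample: the outcome is 1 exactly when x > 0,
# the kind of pattern small samples produce by chance.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

b_unpenalized = fit_logistic_1d(xs, ys)           # keeps growing with more steps
b_penalized = fit_logistic_1d(xs, ys, l2=1.0)     # settles at a finite value

print(f"unpenalized: {b_unpenalized:.2f}, penalized: {b_penalized:.2f}")
```

Running the unpenalized fit longer just produces a larger coefficient — the software’s “model failed to converge” warning is the honest summary. Penalization, exact methods, or simpler models are the usual escape routes.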
While much of the work that you’re tasked with stems directly from projects conceived by your principal investigator, there’s lots of room to suggest sub-analyses or entirely new questions for your team to consider. Although there are some serious data collection limitations, as noted above, research teams usually have lots of interesting data to dive into. And there’s no limit to what can be done with publicly available datasets such as Twitter or Google Trends.
Since your end goal is publication and a furthering of some body of literature rather than a specific business goal, that side question you find interesting can really add value to a publication or your team’s understanding of a topic. For example, my team was already using Google Trends data to assess suicidality and help-seeking behavior when the COVID-19 pandemic hit. We were all interested in assessing the impact that the pandemic may have had on these search terms, and had the freedom to begin and prioritize a separate project assessing the impact of COVID-19 on suicidality and help-seeking via Google Trends data in the pandemic’s early stages.
Some data professionals are blessed with data that perfectly represents their population of interest — maybe you are interested in patients in your hospital and have EMR data on all of them, or are interested in your customers and have data on all of their transactions. To be fair, the situation usually isn’t this great, but it’s particularly tricky in research. We’re interested either in the general population, or some segment of it, but only have data on a biased sample.
One of my team’s primary research goals is evaluating the National Suicide Prevention Lifeline. It is often necessary to follow up with callers to the Lifeline in order to do so, but who do you think is most likely to be willing to participate? Probably those who had a really terrible experience, or a really fantastic one, or who really need any small financial incentive that we offer. These groups aren’t representative of the overall caller base for the Lifeline, but our evaluation and feedback impact the service that all callers receive. The Lifeline, as one of the most accessible mental health resources in the US, is a vitally important service for millions of suicidal individuals. Failing to represent the true population in a study therefore affects the lives of real people in important ways, rather than reducing the impact of a marketing campaign or some other consequence with purely business implications.
That being said, far more data projects than people realize fail to represent the true diversity of populations in crucially important ways, such as on the basis of race or gender, with important consequences (think resumé review algorithms that prioritize applicants who look like previous hires — white and male), so this should actually be a consideration in most data-driven positions.
While plenty of researchers do use Python, many of us initially learned to code in SAS. Over the past several years, there’s been a shift towards R in many organizations and graduate programs. My team is split between R and SAS, as was my graduate coursework, and some proficiency in both is expected. I’ve pushed to do every possible task in R, both because I find it far more efficient, intuitive, and collaboration-friendly than SAS, and because I know that my days of coding in SAS will probably be over as soon as I leave research.
A/B testing doesn’t exist. Randomized controlled trials do, though. As do a rich selection of other study designs that are predominantly used in epidemiological contexts, such as case-control studies.
A model is more likely to be discussed in terms of the odds ratios, relative risks, or rate ratios that it produces than in terms of its accuracy. The statistical fundamentals are the same, but the context in which we speak about them is everything.
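As a small illustration of that vocabulary, here is how a relative risk and an odds ratio fall out of the same 2×2 table. The counts are invented:

```python
# Hypothetical 2x2 table: rows are exposure groups, columns are outcomes.
#                 treated   not treated
# referred           40          60
# not referred       20          80
a, b = 40, 60   # referred:     treated / not treated
c, d = 20, 80   # not referred: treated / not treated

risk_referred = a / (a + b)                          # 0.40
risk_not_referred = c / (c + d)                      # 0.20
relative_risk = risk_referred / risk_not_referred    # 2.00

# The odds ratio compares odds, p / (1 - p), rather than risks — it is
# what a logistic regression coefficient exponentiates to.
odds_ratio = (a / b) / (c / d)                       # (40/60) / (20/80) = 2.67

print(f"RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}")
```

Both numbers describe the same table, but they answer slightly different questions — which is why the framing around a model matters as much as the arithmetic inside it.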
This can actually be one of the biggest perks of a research job — while everyone around me was scrambling to meet year-end deadlines around the holidays, my workload remained unchanged.
Each project you work on will have an annual report that’s due based on when the grant began, but these are entirely separate from the calendar year or the organization’s fiscal year. There tends to be a natural spacing-out of project timelines, so unless your annual report deadlines are close together or you’re trying to submit several journal articles simultaneously, the workload can be far more consistent and less stressful than in other positions.
Data analyst positions at research organizations have their advantages and drawbacks like any other position. I think that these positions tend to be poorly understood, though, especially by those just entering the data science field, and that their advantages tend to be underestimated. In a market overwhelmed with junior data science talent, I recommend broadening your job search to include relevant data analyst positions. You may be surprised by the challenges and rewards to be found in these roles.
Emily is a data analyst working in psychiatric epidemiology in New York City. She is a suicide-prevention professional who is enthusiastic about taking a data-driven approach to the mental health field. Emily holds a Master of Public Health from Columbia University.