The Importance of SQL in Practicing Data Science

Several times last year I was asked, “What do you think is the most important skill in data science?” I always replied, “SQL.” Although this answer was always met with a nod of agreement, I was often told it wasn’t the typical one. That’s understandable; the Python vs. R debate is apparently sexier. But on day one of your first data job, they’re going to introduce you to their data warehouse. That is where the data you’ll analyze lives, and you’ll get to it by writing SQL queries.

I’ve used SQL at multiple jobs throughout my career, but I wanted to make sure that other companies were doing the same. A current Google job listing for a Data Scientist, for example, asks for experience with SQL.

The major cloud providers now offer relational databases in the cloud as well, including Google Cloud SQL and Azure Database for PostgreSQL. The data is getting bigger, but SQL is here to stay (and scale). The industry is growing too, and a whole host of companies, such as Redgate and SentryOne, now offer SQL Server monitoring to help optimize SQL databases.

As we saw in my article on data science FAQs, 51% of job openings titled “Data Scientist” in the US asked for SQL.

Companies are using it, there is demand for the skill, and it’s here to stay. Even if for some reason you do not need SQL at your first job, I’m sure you’ll need it at some point during your career if you’re in data science.

What about Unstructured Data?

Yes, I sometimes work with unstructured data in the Big Data environment. But if I’m in there and find a variable that would be relevant for repeated use, we’ll typically have the big data team make it available in the data warehouse.

I pull some grouped data (using Hive, whose query language is quite similar to SQL) onto my local machine and do an analysis.
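To make "pulling grouped data" concrete, here is a minimal sketch of aggregating in the database and bringing only the summary rows back to a local session. It assumes a hypothetical `events` table and uses Python's built-in `sqlite3` with an in-memory database as a stand-in for Hive or the warehouse; the table, columns, and values are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real Hive/warehouse connection (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical event-level table with made-up rows.
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "purchase", 20.0),
        (1, "purchase", 5.0),
        (2, "purchase", 12.5),
        (2, "refund", -12.5),
        (3, "purchase", 40.0),
    ],
)

# Let the database do the grouping so only one row per group comes back.
grouped_query = """
    SELECT event_type,
           COUNT(*)    AS n_events,
           SUM(amount) AS total_amount
    FROM events
    GROUP BY event_type
    ORDER BY event_type
"""
grouped_rows = conn.execute(grouped_query).fetchall()

for event_type, n_events, total_amount in grouped_rows:
    print(event_type, n_events, total_amount)

conn.close()
```

The same `GROUP BY` statement should carry over to Hive or a relational warehouse largely unchanged, which is part of why the skill transfers so well.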

At some point in the future (length of time depending on the current priorities) I’ll have that data available for me to use in one of the tables in the database. And life keeps moving on.

I could spend more time in the big data environment, but queries run much faster in the relational database.

Maybe Someone Other than Yourself Pulls your Data for You

We all know that it is important to understand how the tables are related and the logic behind the data. I want to write the query that feeds the model I’m putting my name on, so that I understand all of the intricacies and nuances of the fields. Having a full understanding of the potential biases and caveats that go along with my model allows me to communicate them to the business. I also like to think I’m pretty creative when thinking about variables. This is partly due to having a good understanding of the different tables in the relational database.

There are often questions you’ll want to be able to answer by yourself. If something doesn’t seem right with your data, you’ll want to be able to dig into the discrepancy to find out what is going on. You don’t want to be blindly following data that someone else provided, and you don’t want to get held up if the data doesn’t seem quite right. You’ll want to dig in and look for answers immediately.
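As an illustration of digging into a discrepancy yourself, the sketch below runs two common sanity checks: duplicated keys that would inflate a join, and rows that have no match in the related table. The `customers` and `orders` tables and their contents are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; tables and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 30.0), (12, 3, 20.0);
""")

# Check 1: customer ids that appear more than once would duplicate orders in a join.
duplicate_customers = conn.execute("""
    SELECT customer_id, COUNT(*) AS n_rows
    FROM customers
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

# Check 2: orders whose customer_id has no matching customer would silently
# drop out of an inner join.
orphan_orders = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print("duplicate customer ids:", duplicate_customers)
print("orders without a customer:", orphan_orders)

conn.close()
```

Either result being non-empty would explain totals that look too high or too low after a join, which is the kind of answer you can only get quickly if you can query the tables directly.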

Maybe You Just Want to Use Python or R

Cool, I pull data from the database into Python and R too. However, I start my query in SQL. For complex queries that join multiple tables, I find it makes sense to write the query in SQL first. The errors when I misspell something are much easier to catch and track down when I’m working directly in SQL than when I write a query directly in Python and then find that it doesn’t run for some reason. Python just lets me know that there is an error; it’s not going to give me hints about what the problem was like I’d get in SQL.
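One way to follow this workflow is to keep the SQL as a plain string that was first written and tested in a SQL client, then run it unchanged from Python. The sketch below assumes hypothetical `customers` and `orders` tables and uses the standard-library `sqlite3` module with an in-memory database in place of a real warehouse connection; the helper name `fetch_revenue_by_region` is also made up for illustration.

```python
import sqlite3

# The query is developed and debugged in a SQL client first, then pasted here as-is.
REVENUE_BY_REGION_SQL = """
    SELECT c.region,
           COUNT(o.order_id) AS n_orders,
           SUM(o.total)      AS revenue
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""


def fetch_revenue_by_region(conn):
    """Run the pre-tested join and return one dict per region."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in conn.execute(REVENUE_BY_REGION_SQL)]


# In-memory SQLite stands in for the warehouse; tables and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 30.0), (12, 3, 20.0), (13, 1, 5.0);
""")

revenue_by_region = fetch_revenue_by_region(conn)
for row in revenue_by_region:
    print(row)

conn.close()
```

Keeping the query text identical in both places means any change can be checked in the SQL client before it reaches the Python code.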

Although you can use Python or R syntax that is not SQL to speak to the database, you still have to understand the schema and how relational databases work to be successful querying this way. SQL itself is fairly easy to learn, even for total newbies.

Summary

The learning curve is gentle, so you’ll be writing queries in no time if you decide to learn.

Learn it once, use it again and again in your career.

A version of this article first appeared here
