Scraping Data from Wikipedia Tables

Emily Halford 05/05/2021

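First, we load the packages that will do the heavy lifting. A minimal setup, assuming the standard tidyverse and rvest combination used throughout this walkthrough:

```r
# Packages for web scraping (rvest) and data wrangling (tidyverse)
library(tidyverse)
library(rvest)
```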


Next, we need to give R the URL for the webpage that we’re interested in:

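Since this walkthrough scrapes the Houston Wikipedia page, that step looks something like the following (storing the address in a variable called url is just a convenient choice):

```r
# Address of the Wikipedia page we want to scrape
url <- "https://en.wikipedia.org/wiki/Houston"
```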


We then use the read_html() function to translate this webpage into information that R can use, followed by the html_nodes() function to focus exclusively on the table objects contained within the page:

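That pipeline looks roughly as follows; houston_wiki and tables are illustrative names:

```r
# Translate the webpage into an R-readable document
houston_wiki <- read_html(url)

# Keep only the <table> nodes on the page
tables <- houston_wiki %>%
  html_nodes("table")
```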


It looks like the Houston Wikipedia page contains 19 tables, although some of these class descriptions are more informative than others:

[Output: an xml_nodeset of 19 table nodes, each listed with its CSS class]


Next, we pull our table of interest out of these available tables. The nth() function specifies that we want the 4th table from the above list. Determining the right table to specify here may take some trial and error when there are multiple tables on a webpage. You can make an educated guess by looking at the webpage itself, or simply view different tables until you find the one you’re looking for:

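A sketch of that step, assuming the result is stored as pop_table and that html_table() is used to parse the node into a data frame:

```r
# Grab the 4th table node and parse it into a data frame
pop_table <- tables %>%
  nth(4) %>%
  html_table()
```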


We get the following output, and the Wikipedia table is now in R! As often happens with web scraping, however, this table isn’t really usable yet. All four columns have the same name, the first and last rows don’t contain data, and there is an extra column in the middle of our data frame:

[Output: the raw scraped table, with four identically named columns, an extra middle column, and non-data first and last rows]


Let’s do some quick clean-up to make this table more usable. We can’t do much of anything until our columns have unique names, and we also need to restrict the table to its 2nd-19th rows:

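Something along these lines works; the names year, population, and percent_change match the cleaning steps below, while blank is just a stand-in label for the empty middle column:

```r
# Assign unique column names, then keep only the data rows
colnames(pop_table) <- c("year", "population", "blank", "percent_change")

pop_table <- pop_table %>%
  slice(2:19)
```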


We’re not quite there yet, but the output is looking much better:

[Output: the renamed table, restricted to its data rows]


Let’s do some final cleaning. First, we’ll get rid of the blank column. All columns are also currently stored as character variables, whereas year should be a date and population and percent_change should be numeric. We remove unnecessary strings from the percent_change and population columns, then convert all columns to their appropriate formats:

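A sketch of that cleaning, assuming the strings to strip are thousands-separator commas in population and percent signs in percent_change, and pinning each year to January 1st so it can be stored as a date:

```r
pop_table <- pop_table %>%
  # Drop the empty middle column
  select(-blank) %>%
  mutate(
    # Strip commas and percent signs, then convert to numeric
    population = as.numeric(str_remove_all(population, ",")),
    percent_change = as.numeric(str_remove_all(percent_change, "%")),
    # Store year as a proper date
    year = as.Date(paste0(year, "-01-01"))
  )
```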


It’s as simple as that. Everything in the table now looks as we would expect it to:

[Output: the cleaned table, with year stored as a date and population and percent_change stored as numerics]


The population data is now fully usable and ready to be analyzed. Web scraping is a great tool for accessing a wide range of data sources, and it is far preferable to manually copying values from online tables given its reproducibility and reduced likelihood of human error. The code in this article can additionally be built upon to scrape numerous tables at once, as sketched below, allowing for even greater efficiency and access to even more data.
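For instance, purrr’s map() (loaded with the tidyverse) can apply html_table() to every table node on the page in one pass, returning a list of data frames to clean individually:

```r
# Parse all 19 tables on the page in a single pass
all_tables <- houston_wiki %>%
  html_nodes("table") %>%
  map(html_table)
```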


Emily Halford

Data Science & Mental Health Expert

Emily is a data analyst working in psychiatric epidemiology in New York City. She is a suicide-prevention professional who is enthusiastic about taking a data-driven approach to the mental health field. Emily holds a Master of Public Health from Columbia University.

   