Obtaining Data: A Journalist’s Guide to Scraping PDFs and Converting Them to CSVs

This post is the first of a series dedicated to obtaining, cleaning, analyzing and visualizing the causes of death in North Carolina.

We start our journey with two basic questions:
1) What are the most common causes of death in North Carolina by county?
2) What kind of trends can we see over time?

The North Carolina Department of Health and Human Services maintains detailed mortality statistics for the state, and that’s where we’ll start. For our first prototype, we will focus on the 2015 data found on their website. 

The first problem I ran into was that all of the detailed mortality statistics were stored in individual county pages accessible via a drop-down menu. Manually opening and saving each of the 100 PDFs would be a pain. Time for some web scraping!

Building the Scraper

The first step to building a scraper is finding a pattern in the URL that you can follow to loop through the pages. The pattern for these links is easy to see:

http://www.schs.state.nc.us/data/vital/dms/2015/[COUNTYNAME].pdf

For each link, we just need to replace [COUNTYNAME] with the county's name in all lowercase with no spaces. (New Hanover, the only county with more than one word in its name, simply becomes newhanover in the link.)
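Python, which we'll be using to build the scraper in a moment, makes that conversion a one-liner if you'd ever rather script it than eyeball it. A quick sketch (the county name here is just an example):

# Lowercase the name and strip the space to match the URL pattern
county = "New Hanover"
print(county.lower().replace(" ", ""))  # newhanover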

I chose Python to write this scraper because it's lightweight and, since I have a Mac, it comes preinstalled on my computer. Building a basic scraper isn't terribly difficult, even if you've never written Python before. If you Google “download file from url python,” the second link will get you the few lines you need.

If you open up the text editor of your choice, you can start with these few lines of Python to scrape the site:

import urllib

# Python 2's urllib; in Python 3, use urllib.request.urlretrieve instead
testfile = urllib.URLopener()
testfile.retrieve("http://www.schs.state.nc.us/data/vital/dms/2015/alamance.pdf", "alamance.pdf")
print("script complete!")

Save this file as a .py file and you’ve got a Python script! To run it, simply navigate to the directory where your .py file is located and run:

python [yourfilename].py

You should see “script complete!” printed in your terminal, and the alamance.pdf file will appear in your directory.

We don’t want to have to manually write out a new line for each county, so we’ll have to stick it in a loop. To do that, we’ll need a Python array with a string for each county. Our array will look something like this, with all 100 counties included:

counties = [
    "alamance",
    "alleghany",
    ...
]

How can we make this array as efficiently as possible without having to handwrite it?

I already had a list of each N.C. county in a spreadsheet from a previous project. In one column, I used the LOWER() function to convert each name to all lowercase. I also made it a little easier to paste those counties directly into the array format by adding some quotes and commas in a second column.

Getting the quotes and commas into the second column is a little tricky.

Because quotes in most programming languages (including Google Sheets formulas) indicate strings, if you want to use a quote character inside a CONCATENATE() function, you have to escape it. If you Google “google sheets escape quotes,” the first link will tell you that Google Sheets has a CHAR() function that translates Unicode code points into characters. The code for a double quote is 34, so we can use CHAR(34) to stand in for a quote in our CONCATENATE() formula.
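Put together, the formula for the second column looks something like this (a sketch, assuming your original county names live in column A):

=CONCATENATE(CHAR(34), LOWER(A1), CHAR(34), ",")

Drag that formula down the column and it spits out a quoted, comma-terminated version of every county name.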

Now we can paste that column into our Python array between the brackets (remember, you'll have to remove the space in “new hanover”!), and we're ready to write our loop!

for i in counties:
    testfile.retrieve("http://www.schs.state.nc.us/data/vital/dms/2015/" + i + ".pdf", i + ".pdf")
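Putting all the pieces together, the whole scraper is just a few lines. Here's a sketch with the county list truncated (yours should have all 100 entries):

import urllib

# All 100 N.C. county names: lowercase, no spaces
counties = [
    "alamance",
    "alleghany",
    # ...
]

# Download each county's PDF into the current directory
testfile = urllib.URLopener()
for i in counties:
    testfile.retrieve("http://www.schs.state.nc.us/data/vital/dms/2015/" + i + ".pdf", i + ".pdf")
print("script complete!")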

If you run this script, you’ll get all of your PDFs in your current directory. Of course, we want our data in a delimited format—so now we’ll have to use Tabula to translate those PDFs into CSVs.

Using Tabula

If you install Tabula and pull in each PDF individually, you can get the CSVs you need. But I was having trouble with how long my computer was taking to process the PDFs that way, so I wrote a script to run Tabula through my terminal instead.

If you go to the Tabula GitHub, you’ll find a Ruby version of Tabula that you can install on your computer. You’ll need JRuby for it to work.

I used RVM to install JRuby, so first you should install RVM, which is just a matter of pasting these two commands in your terminal:

gpg --keyserver hkp://keys.gnupg.net --recv-keys 409B6B1796C275462A1703113804BB82D39DC0E3

And then:

\curl -sSL https://get.rvm.io | bash -s stable

Once you have RVM installed, just run this command to install JRuby:

rvm install jruby

And now we can install Tabula:

jruby -S gem install tabula-extractor

Now, if you navigate to the directory where your PDFs are located, you can try translating one of your PDFs with this command:

tabula --outfile alleghany.csv alleghany.pdf

You’ll now see alleghany.csv in your directory, but it’s a little smaller than you might expect. That’s because, by default, Tabula only translates the first page of the PDF. We’ll need to add the --pages option to tell the script to grab more pages. The Alleghany PDF has 52 pages, so to get all of them translated we’ll run:

tabula --pages 1-52 --outfile alleghany.csv alleghany.pdf

Now we know the format our commands need to run in, but it would be a huge pain to have to write down the page counts for each PDF and then write each command! But hey, that’s what code is for: automating tedious tasks.

Automating Tabula

It was at this point that I switched to writing a shell script, since what I needed was a loop around the tabula command, which runs in the shell. If you know Ruby, you could also use the example in the tabula-extractor repository to write the whole thing in Ruby.

At this point, I needed just a few elements in my shell script: an array of the county names, an array of the page counts of each respective PDF, and a loop to put it all together. You can Google “shell script array” and “shell script loop” to figure out these basic syntax structures, but I’ll save you some time.

The county array we need looks like this (remember to change “new hanover” to “newhanover”!):

declare -a COUNTIES=(
    alamance
    alexander
    alleghany
    ...
)

We can paste in the county list we had before, minus the quotes and commas, to quickly populate the array. Once you have the array in a new file, save that file as [yourfilename].sh (I called mine process.sh).

Our pages array will have the same syntax, but the question is, how can we get the script to count the pages in each PDF for us?

If you Google “count pages in pdf shell script,” the first result will give you some options. The long script that comes up as the first option wasn’t working for me, so I decided to go with the pdfinfo package option:

foo=$(pdfinfo pdffile.pdf | grep Pages | awk '{print $2}')

This pipes pdfinfo’s metadata output through grep to find the line that lists the page count, then uses awk to pull out just the number (the second field on that line).

How do you install the pdfinfo package? Googling “how to install pdfinfo” will tell you that it’s part of the xpdf package, which you can install with this command if you have Homebrew:

brew install homebrew/x11/xpdf

If you don’t have Homebrew, you should be able to install the xpdf package manually from the xpdf project’s website.
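Once xpdf is installed, you can sanity-check pdfinfo against one of your PDFs. Its output is a list of metadata fields, including the Pages line we care about:

pdfinfo alleghany.pdf | grep Pages

Since the Alleghany PDF has 52 pages, that should print something like Pages: 52.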

Now we can write a loop over our COUNTIES array to print out each PDF’s page count:

for ((i=0;i<${#COUNTIES[@]};++i)); do
    echo $(pdfinfo "${COUNTIES[i]}".pdf | grep Pages | awk '{print $2}')
done

echo "pages printed."

Now you can save your .sh file and run it in the terminal with the command:

bash [yourfilename].sh

You should see a list of all the page counts in your terminal. Copy that list and make yourself a pages array! You can then comment out the page-printing loop by putting a # at the start of each of its lines.

declare -a PAGES=(
    52
    23
    13
    ...
)

Now we’re ready to write our master loop! It’ll look like this:

for ((i=0;i<${#COUNTIES[@]};++i)); do
    printf "%s has %s pages\n" "${COUNTIES[i]}" "${PAGES[i]}"
    tabula --pages 1-"${PAGES[i]}" --outfile "${COUNTIES[i]}".csv "${COUNTIES[i]}".pdf
done

echo "TABULA COMPLETE."

You don’t necessarily need the printf or echo lines, but it’s nice to see the progress of your script as it runs. This process will take a little while to complete, but when it’s done, you’ll have a CSV version of every PDF in your folder!

For our next post, we’ll take a look at the format of these CSVs and see how we can clean them up for analysis.
