Gathering Text from the Web

Hi everyone! I don't really feel like working too hard today, so I decided to write a blog post about how my student Will and I used rvest to mine articles from several different news sources for a project. All the scripts and the current state of this project can be found on our OSF page, which is also connected to the GitHub folder with the files.

First, we picked four web sources to scrape - The New York Times, NPR, Fox News, and Breitbart - because of their known political associations, and we focused specifically on their politics sections. To get started, you need the rvest library. After you load the library, you can set the url you want to pull articles from.

library(rvest)
#Specifying the url for the desired website to be scraped
url <- 'https://www.nytimes.com/section/politics'

Now, this url is just where we expect to find a list of links to the individual articles written by the Times. Many rvest tutorials focus on pulling information from a single page; in this post, I am showing you how to use loops to pull a bunch of separate pages/posts. This approach would also work well for pulling from blog-style sites.

Next, we read in the main webpage:

#Reading the HTML code from the website - headlines
webpage <- read_html(url)
headline_data <- html_nodes(webpage,'.story-link a, .story-body a')

> headline_data
{xml_nodeset (48)}
 [1] <a href="https://www.n ...
 [2] <a href="https://www.n ...
 [3] <a href="https://www.n ...

Specifically, read_html pulled in the entire webpage, and the html_nodes function helped us find what we were looking for. In this part, we used the Selector Gadget extension to find the right parts of the page. If you know a bit of CSS, you can view the page source on your target page and find the class/id properties you are searching for. For the non-web people, essentially, this tool allows you to find the specific parts of a website you want to extract. In our case, we were looking for the story headlines and their individual page links - the a href tags, which are how links are written on the web.
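(A quick aside: if all you want are the links, rvest can also pull a single attribute straight out of the nodes with html_attr. Here is a minimal sketch using the headline_data from above - we went the html_attrs route instead, which I walk through next.)

##pull just the href attribute from each node (NA where a node has no href)
links <- html_attr(headline_data, "href")
links <- links[!is.na(links)]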

From there, we extracted the attributes of the story links, which created a big list of the headlines and other attributes about them. I really only wanted the links to the individual stories though - not all the information about them. html_attrs created mini-lists of all the attributes for each part of the page we had scraped.

attr_data <- html_attrs(headline_data) 

> attr_data
[[1]]
                                                                                     href 
"https://www.nytimes.com/2018/05/07/us/politics/don-blankenship-trump-west-virginia.html" 
                                                                                data-rref 
                                                                                       "" 

To get only the links, we tried this:

urlslist <- unlist(attr_data)
urlslist <- urlslist[grep("http", urlslist)]
urlslist <- unique(urlslist)
urlslist

> urlslist
 [1] "https://www.nytimes.com/2018/05/07/us/politics/don-blankenship-trump-west-virginia.html"                               
 [2] "https://www.nytimes.com/2018/05/06/us/politics/giuliani-says-trump-would-not-have-to-comply-with-mueller-subpoena.html"

unlist flattened the list of lists into one giant vector of attribute data. Then I used the grep function to find the urls: grep("http", urlslist) returns the position of each item with http in it. I wanted the actual urls, not just the item numbers, so I stuck that inside urlslist[...]. The unique function was necessary, as links often repeated, and we really only needed them once.
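If the difference between positions and values is new to you, here is a tiny standalone example (the vector is made up, not from our data):

##grep returns the positions of the matches; indexing with it returns the values
x <- c("https://www.nytimes.com/a.html", "#top", "https://www.nytimes.com/b.html")
grep("http", x)    #returns 1 3
x[grep("http", x)] #returns the two urls themselves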

A warning: websites don’t always use absolute links. Sometimes they use references to folders or relative links. We found this with two of our sites, and solved that problem in a couple of ways. The solution will depend on how exactly the website references their other pages.

urlslist3 <- urlslist3[grep("http|.html", urlslist3)]
##fix the ones without the leading foxnews.com
urlslist3F <- paste("http://www.foxnews.com", urlslist3[grep("^http", urlslist3, invert = T)], sep = "")
urlslist3N <- urlslist3[grep("^http", urlslist3)]
urlslist3 <- c(urlslist3N, urlslist3F)
urlslist3 <- unique(urlslist3)

On Fox, we could find the urls in our attributes with http OR (that's the pipe |) .html. On Breitbart, we had to use the folder name by doing urlslist4 <- urlslist4[grep("http|/big-government", urlslist4)]. Then we created the absolute link by sticking the homepage on the front when necessary with the paste function. The urlslist3N line found all the ones that already had http at the front (that's what the ^ means) and didn't need fixing. Then we combined the fixed and non-fixed ones and kept only the unique set.
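To make the Breitbart version concrete, here is roughly how the same pattern would look (a sketch following the Fox code above - the base url we paste on the front is our assumption, so check what the site actually uses):

##same fix for Breitbart: keep links with http or the politics folder name
urlslist4 <- urlslist4[grep("http|/big-government", urlslist4)]
##paste the homepage onto the relative links (base url assumed here)
urlslist4F <- paste("http://www.breitbart.com", urlslist4[grep("^http", urlslist4, invert = T)], sep = "")
urlslist4N <- urlslist4[grep("^http", urlslist4)]
urlslist4 <- unique(c(urlslist4N, urlslist4F))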

From there, we started a blank data frame for storing the final data. Then the real magic occurs.

##start a data frame
NYtimesDF <- matrix(NA, nrow = length(urlslist), ncol = 3)
colnames(NYtimesDF) <- c("Source", "Url", "Text")
NYtimesDF <- as.data.frame(NYtimesDF)

##for loops
for (i in 1:length(urlslist)){
  
  ##read in the URL
  webpage <- read_html(urlslist[i])
  
  ##pull the specific nodes
  headline_data <- html_nodes(webpage,'.story-content') 
  
  ##pull the text
  text_data <- html_text(headline_data)
  
  ##save the data
  NYtimesDF$Source[i] <- "NY Times"
  NYtimesDF$Url[i] <- urlslist[i]
  NYtimesDF$Text[i] <- paste(text_data, collapse = "")
} ##end for loop

For a good loop tutorial, see here. What this code does is loop over the url list you created at the start. For each separate post page it:

  1. pulls in the entire page by reading that one url,

  2. pulls out just the story (again, we used Selector Gadget to figure out how to get the article text instead of the headlines this time),

  3. uses html_text to get the text in our text section,

  4. saves the data for further use. Notice we used paste with the collapse argument to make sure it did not return a list but rather one giant cell of text.

We ran this for DAYS (about twice a day for a month). Websites often use things like "see more" or "older articles" to collapse the site - or in the case of Fox (I think), when you scroll past the current information, more is automatically added (like Facebook). This process saves loading time for the user. We couldn't really force that action to happen from this script, so we simply ran it on multiple days to get newer data. The use of unique really allowed us to make sure we weren't getting duplicate data - and if I had to write this again, I would make sure we also pulled in the old data and filtered out duplicates at the beginning rather than the end (but either way works). If you check out our whole script, you can see some other things we did to make this work more efficiently, such as adding all the sub-pages that Fox uses to post politics articles, as they don't all make it to the homepage (or it's going by so fast we weren't getting them even at twice a day).
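Here is a minimal sketch of that "filter at the beginning" idea (it assumes you save your results to a csv such as "NYtimesDF.csv" between runs - not necessarily what our script does):

##drop urls we already scraped on an earlier run before looping
if (file.exists("NYtimesDF.csv")) {
  oldDF <- read.csv("NYtimesDF.csv", stringsAsFactors = FALSE)
  urlslist <- setdiff(urlslist, oldDF$Url)
}
##after the loop, you could append the new rows and save for next time:
#write.csv(rbind(oldDF, NYtimesDF), "NYtimesDF.csv", row.names = FALSE)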

At the moment, we are still analyzing the data, but the analysis script in our GitHub folder can give you a preview of the next blog post to come about working with text data. Enjoy!
