This tutorial covers how to extract and process text data from web pages or other documents for later analysis. The automated download of HTML pages is called crawling. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM, the document object model, of the website) is called scraping. For these tasks, we use the package “rvest”.
Create a new R script (File -> New File -> R Script) named “Tutorial_1.R”. In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute with Ctrl-Return. If you want to execute only one line, you can simply press Ctrl-Return on the respective line. If you want to execute a block of several lines, select the block and press Ctrl-Return.
Tip: Copy individual sections of the source code directly into the console and run them step by step. Get familiar with the function calls by looking them up in the Help function.
First, make sure your working directory is the data directory we provided for the exercises.
# important option for text analysis
options(stringsAsFactors = FALSE)
# check working directory. It should be the destination folder of the extracted
# zip file. If necessary, use `setwd("your-tutorial-folder-path")` to change it.
getwd()
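As a small illustration of why this option matters for text data, the following sketch compares the two settings on a hypothetical two-document data frame (the example strings are placeholders, not tutorial data). Since R 4.0.0 the default is already FALSE, so the option mainly matters on older installations.

```r
# Hypothetical example texts, used only for illustration.
texts <- c("first document", "second document")

# With stringsAsFactors = TRUE the text column is stored as a factor ...
as_factor <- data.frame(text = texts, stringsAsFactors = TRUE)
# ... with stringsAsFactors = FALSE it stays a character vector.
as_character <- data.frame(text = texts, stringsAsFactors = FALSE)

class(as_factor$text)     # "factor"
class(as_character$text)  # "character"

# String functions such as nchar() work directly on the character column.
nchar(as_character$text)
```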
If not done yet, please install the webdriver package for R and install the PhantomJS headless browser. This needs to be done only once.
install.packages("webdriver")
library(webdriver)
install_phantomjs()
Now we can start an instance of PhantomJS and create a new browser session that waits for URLs and renders the corresponding websites.
require(webdriver)
pjs_instance <- run_phantomjs()
pjs_session <- Session$new(port = pjs_instance$port)
In a first exercise, we will download a single web page from “The Guardian” and extract text together with relevant metadata such as the article date. Let’s define the URL of the article of interest and load the rvest package, which provides very useful functions for web crawling and scraping.
url <- "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"
require("rvest")
The function read_html, which accepts a URL as a parameter, provides a convenient way to download and parse a webpage. The function downloads the page and interprets the HTML source code as an HTML / XML object.
To make sure that we get the dynamically rendered HTML content of the website, we first load the URL in our PhantomJS session and then use the rendered source.
# load URL to phantomJS session
pjs_session$go(url)
# retrieve the rendered source code of the page
rendered_source <- pjs_session$getSource()
# parse the dynamically rendered source code
html_document <- read_html(rendered_source)
NOTICE: If the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and download the static HTML source code directly to retrieve the information from there. In this case, replace the block of code above with a simple call of html_document <- read_html(url), where the read_html() function downloads the unrendered page source code directly.
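The static case can be tried without a network connection. The following is a minimal sketch that assumes a hypothetical page source written to a temporary file; read_html() parses it the same way it would parse an unrendered page fetched from a URL.

```r
library(rvest)

# Hypothetical page source saved to a temporary file, standing in for a static web page.
static_file <- tempfile(fileext = ".html")
writeLines(
  "<html><head><title>Static example</title></head><body><p>No JavaScript needed.</p></body></html>",
  static_file
)

# read_html() takes a URL, a file path, or literal markup and parses it without rendering.
static_document <- read_html(static_file)

# Query the parsed document for the content of the <title> element.
page_title <- static_document %>%
  html_node(xpath = "//title") %>%
  html_text()

cat(page_title)
```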
HTML / XML objects are a structured representation of HTML / XML source code, which allows us to extract single elements (headlines, e.g. <h1>, links, e.g. <a>, …), their attributes (e.g. <a href="http://...">) or text wrapped in between elements (e.g. <p>my text...</p>). Elements can be extracted from XML objects with XPATH expressions.
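The three kinds of information named above (elements, attributes, wrapped text) can be seen on a tiny document. This sketch parses a hypothetical HTML snippet; the link target and texts are made up for illustration.

```r
library(rvest)

# Hypothetical snippet with one link and one paragraph.
snippet <- read_html('<div><a href="https://example.org/page">my link</a><p>my text...</p></div>')

# Select a single element: the first <a> node in the tree.
link_node <- html_node(snippet, xpath = "//a")

# Read an attribute of that element.
link_target <- html_attr(link_node, name = "href")

# Read the text wrapped between the opening and closing tags.
link_text <- html_text(link_node)
paragraph_text <- snippet %>%
  html_node(xpath = "//p") %>%
  html_text()
```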
XPATH (see https://en.wikipedia.org/wiki/XPath) is a query language to select elements in XML tree structures. We use it to select the headline element from the HTML page. The following XPATH expression queries for first-order headline elements h1, anywhere in the tree //, which fulfill a certain condition [...], namely that the class attribute of the h1 element must contain the value content__headline.
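Why a contains() condition is useful rather than an exact comparison can be sketched on a hypothetical page in which the headline carries two class names. The markup and the second class name are assumptions for illustration, not taken from the Guardian page.

```r
library(rvest)

# Hypothetical page: the main headline has two classes, the sidebar headline a different one.
page <- read_html('<html><body>
  <h1 class="content__headline content__headline--large">Main headline</h1>
  <h1 class="sidebar__title">Sidebar</h1>
</body></html>')

# An exact comparison fails because the attribute value holds more than one class name.
exact_match <- html_nodes(page, xpath = "//h1[@class='content__headline']")

# contains() matches any h1 whose class attribute includes the given substring.
contains_match <- html_nodes(page, xpath = "//h1[contains(@class, 'content__headline')]")

length(exact_match)        # 0
length(contains_match)     # 1
html_text(contains_match)  # "Main headline"
```

Note that contains() is a plain substring test, so it would also match a class such as content__headline-secondary.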
The next expression uses the R pipe operator %>%, which takes the input from the left side of the expression and passes it on to the function on the right side as its first argument. The result of this function is either passed on to the next function, again via %>%, or assigned to a variable if it is the last operation in the pipe chain. Our pipe takes the html_document object and passes it to the html_node function, which extracts the first node fitting the given XPATH expression. The resulting node object is passed to the html_text function, which extracts the text wrapped in the h1 element.
title_xpath <- "//h1[contains(@class, 'content__headline')]"
title_text <- html_document %>%
  html_node(xpath = title_xpath) %>%
  html_text(trim = TRUE)
Let’s see what the title_text variable contains:
cat(title_text)
## Angela Merkel and Donald Trump head for clash at G20 summit
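The pipe chain used to extract the headline is equivalent to nested function calls. The following sketch shows this equivalence on a hypothetical character vector, using only base R functions plus the %>% operator from magrittr (the same operator that rvest makes available).

```r
library(magrittr)  # provides %>%; rvest re-exports the same operator

# Hypothetical input with stray whitespace.
raw_lines <- c("  first line ", " second line  ")

# Nested calls are read from the inside out ...
nested <- paste0(trimws(raw_lines), collapse = "\n")

# ... while the pipe reads left to right: each result becomes the first
# argument of the next function.
piped <- raw_lines %>%
  trimws() %>%
  paste0(collapse = "\n")

identical(nested, piped)  # TRUE
```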
Now we modify the XPATH expressions to extract the article info, the paragraphs of the body text and the article date. Note that there are multiple paragraphs in the article. To extract not only the first but all paragraphs, we use the html_nodes function and glue the resulting text vectors of each paragraph together with the paste0 function.
intro_xpath <- "//div[contains(@class, 'content__standfirst')]//p"
intro_text <- html_document %>%
  html_node(xpath = intro_xpath) %>%
  html_text(trim = TRUE)
cat(intro_text)
## German chancellor plans to make climate change, free trade and mass migration key themes in Hamburg, putting her on collision course with US
body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
body_text <- html_document %>%
  html_nodes(xpath = body_xpath) %>%
  html_text(trim = TRUE) %>%
  paste0(collapse = "\n")
## A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the manage
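The difference between html_node (first match only) and html_nodes (all matches) can be checked on a small document. This is a sketch with a hypothetical three-paragraph article body; only the class name is borrowed from the example above.

```r
library(rvest)

# Hypothetical article body with three paragraphs.
article <- read_html('<div class="content__article-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</div>')

paragraph_xpath <- "//div[contains(@class, 'content__article-body')]//p"

# html_node() returns only the first matching node.
first_only <- article %>%
  html_node(xpath = paragraph_xpath) %>%
  html_text(trim = TRUE)

# html_nodes() returns all matching nodes, so html_text() yields one string per paragraph.
all_paragraphs <- article %>%
  html_nodes(xpath = paragraph_xpath) %>%
  html_text(trim = TRUE)

# paste0() with collapse joins the paragraph strings into a single text.
joined <- paste0(all_paragraphs, collapse = "\n")

length(all_paragraphs)  # 3
cat(joined)
```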
date_xpath <- "//time"
date_object <- html_document %>%
  html_node(xpath = date_xpath) %>%
  html_attr(name = "datetime") %>%
  as.Date()
cat(format(date_object, "%Y-%m-%d"))
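The date conversion relies on as.Date() reading only the leading year-month-day part of the datetime attribute. The sketch below assumes a hypothetical ISO 8601 attribute value (the actual value on the page may differ) and also shows as.POSIXct() as one possible alternative when the time of day is needed.

```r
# Hypothetical value of a <time datetime="..."> attribute in ISO 8601 format.
datetime_value <- "2017-06-26T12:30:33+0100"

# as.Date() parses the leading "%Y-%m-%d" part and ignores the trailing
# time and UTC offset.
date_only <- as.Date(datetime_value)

# To keep the time of day, as.POSIXct() with an explicit format can be used;
# %z reads the signed UTC offset.
timestamp <- as.POSIXct(datetime_value, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

cat(format(date_only, "%Y-%m-%d"))  # 2017-06-26
```

Because the offset is dropped by as.Date(), an article published shortly after midnight local time keeps its local calendar date.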
The variables title_text, intro_text, body_text and date_object now contain the raw data for any subsequent text processing.
To apply this approach to other pages, investigate the URL patterns of the site and look into the source code with the “inspect element” functionality of your browser to find appropriate XPATH expressions.
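Once suitable XPATH expressions have been found, the extraction steps of this tutorial can be wrapped in a function and reused. The following is a sketch only: scrape_article is a hypothetical helper, its defaults are the expressions from the Guardian example, and the test page with its class names is invented for illustration.

```r
library(rvest)

# Hypothetical helper: extracts title, body text and date from a parsed document.
# The default XPath expressions come from the Guardian example and will usually
# need to be adapted for other sites.
scrape_article <- function(html_document,
                           title_xpath = "//h1[contains(@class, 'content__headline')]",
                           body_xpath = "//div[contains(@class, 'content__article-body')]//p",
                           date_xpath = "//time") {
  title <- html_document %>%
    html_node(xpath = title_xpath) %>%
    html_text(trim = TRUE)
  body <- html_document %>%
    html_nodes(xpath = body_xpath) %>%
    html_text(trim = TRUE) %>%
    paste0(collapse = "\n")
  date <- html_document %>%
    html_node(xpath = date_xpath) %>%
    html_attr(name = "datetime") %>%
    as.Date()
  # One row per article, with text kept as character columns.
  data.frame(title = title, body = body, date = date, stringsAsFactors = FALSE)
}

# Hypothetical page from another site with different class names.
example_page <- read_html('<html><body>
  <h1 class="headline">Example title</h1>
  <time datetime="2020-01-15T08:00:00+0000">15 January 2020</time>
  <div class="story"><p>One.</p><p>Two.</p></div>
</body></html>')

article <- scrape_article(
  example_page,
  title_xpath = "//h1[@class='headline']",
  body_xpath = "//div[@class='story']//p"
)
```

Returning a one-row data frame makes it straightforward to combine the results of several pages with rbind().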
2020, Andreas Niekler and Gregor Wiedemann. GPLv3. tm4ss.github.io