1 Contents

In the digital age, large collections of text are an increasingly attractive data source for analysis in the social sciences. Corpora ranging from thousands to several million retro-digitized or born-digital documents cannot be investigated with conventional, manual methods alone. (Semi-)automatic computational analysis, also known as text mining, gives social scientists an opportunity to extend their toolbox.

To realize complex designs in empirical social research, scientists need basic knowledge of computational algorithms so that they can select those appropriate for their needs. Specific projects may further require adaptations to standard procedures, language resources or analysis workflows. Instead of relying on off-the-shelf analysis software, using a scripting language is a very powerful way to fulfill such requirements. The course gives an overview of text mining in connection with data acquisition, preprocessing and methodological integration using the statistical programming language R (www.r-project.org).

In sessions alternating between lectures and tutorials, we teach theoretical and methodological foundations, introduce exemplary studies and get hands-on with programming to carry out different analyses.

2 Goals

Participants will learn about the opportunities and limits of text mining methods for analyzing qualitative and quantitative aspects of large text collections. With example scripts provided in the programming language R, participants will learn how to carry out the individual steps of such an analysis on a specific corpus. We cover a range of text mining methods, from simple lexicometric measures such as word frequencies, key term extraction and co-occurrence analysis to more complex machine learning approaches such as topic models and supervised text classification. The goal is to provide a broad overview of several technologies already established in the social sciences. Participants will be able to identify their own priorities and lay the foundations for further independent study tailored to their individual needs.

3 Requirements

We expect a willingness to learn about the algorithmic foundations of computational and statistical text analysis. For the hands-on part, we rely on scripts in the programming language R. We therefore strongly recommend some basic knowledge of R to take part successfully in the tutorial sessions of the course.

If you already have some experience with another programming language, learning R will be easy for you. However, since R is a statistical programming language, some of its concepts differ considerably from those of other languages.

For participants without basic knowledge of R, we strongly recommend learning at least the basics in preparation for the course.

For a very brief overview of common R commands see: Tutorial_0_R-Intro.html

4 Tutorials

The course consists of 8 tutorials written in R Markdown, which are described further in the paper cited in the License section.

You can use knitr to create the tutorial sheets as HTML notebooks from the R Markdown source code.

  1. Web crawling and scraping
  2. Text data import in R
  3. Frequency analysis
  4. Key term extraction
  5. Co-occurrence analysis
  6. Topic models (LDA)
  7. Text classification
  8. Part-of-Speech tagging / Named Entity Recognition
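
As a minimal sketch of how one tutorial sheet might be rendered to HTML from R, the following uses `rmarkdown::render()`, which runs knitr on the code chunks and then converts the result with pandoc. The helper names `html_name` and `render_tutorial` are hypothetical, and the source file name `Tutorial_0_R-Intro.Rmd` is an assumption derived from the name of the rendered sheet; the rmarkdown package and pandoc are assumed to be installed.

```r
# Assumption: each source file has the same name as its rendered sheet,
# with the extension .Rmd instead of .html.
html_name <- function(rmd_file) {
  sub("\\.Rmd$", ".html", rmd_file)
}

# Hypothetical helper: render one tutorial sheet to HTML if its source exists.
render_tutorial <- function(rmd_file) {
  if (!file.exists(rmd_file)) {
    message("Source file not found: ", rmd_file)
    return(invisible(NULL))
  }
  # rmarkdown::render() knits the R code chunks and then calls pandoc
  rmarkdown::render(rmd_file, output_file = html_name(rmd_file))
}

# Assumed file name; adjust it to the tutorial you want to build.
render_tutorial("Tutorial_0_R-Intro.Rmd")
```

If the source file is not in the working directory, the helper only prints a message and returns without rendering anything.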

5 License

This course was created by Gregor Wiedemann and Andreas Niekler. It was freely released under GPLv3 in September 2017. If you use it, or parts of it, for your own teaching or analysis, please cite:

Wiedemann, Gregor; Niekler, Andreas (2017): Hands-on: A five day text mining course for humanists and social scientists in R. Proceedings of the 1st Workshop on Teaching NLP for Digital Humanities (Teach4DH@GSCL 2017), Berlin.

6 Session info

The tutorials load some of the following packages.

package_list <- c(
  "dplyr",
  "ggplot2",
  "igraph",
  "irr",
  "LDAvis",
  "LiblineaR",
  "Matrix",
  "NLP",
  "openNLP",
  "openNLPdata",
  "pals",
  "png",
  "quanteda",
  "readtext",
  "reshape2",
  "rvest",
  "topicmodels",
  "tsne",
  "webdriver",
  "wordcloud",
  "wordcloud2"
)
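# Attach every package in the list; require() returns FALSE (with a warning)
# for any package that is not installed
lapply(package_list, require, character.only = TRUE)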
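
Since `require()` only reports packages that are not installed, it can help to find out beforehand which ones are missing. The following is a minimal sketch using base R only; the helper `missing_packages` is hypothetical and not part of the course material.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet
# installed in the current R library paths.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Example with made-up package names, independent of the local installation:
missing_packages(c("pkgA", "pkgB"), installed = c("pkgA"))

# To install whatever is missing from the list above, one could run:
# install.packages(missing_packages(package_list))
```

The install call is left commented out so that nothing is downloaded unless you choose to run it.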

These are the package versions we used for the tutorials.

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] wordcloud2_0.2.1    wordcloud_2.6       RColorBrewer_1.1-2  webdriver_1.0.5    
##  [5] tsne_0.1-3          topicmodels_0.2-11  rvest_0.3.6         xml2_1.3.2         
##  [9] reshape2_1.4.4      readtext_0.76       quanteda_2.1.1      png_0.1-7          
## [13] pals_1.6            openNLPdata_1.5.3-4 openNLP_0.2-7       NLP_0.2-0          
## [17] Matrix_1.2-18       LiblineaR_2.10-8    LDAvis_0.3.2        irr_0.84.1         
## [21] lpSolve_5.6.15      igraph_1.2.5        ggplot2_3.3.2       dplyr_1.0.2        
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2         maps_3.3.0         jsonlite_1.7.1     RcppParallel_5.0.2
##  [5] showimage_1.0.0    stats4_4.0.2       yaml_2.2.1         slam_0.1-47       
##  [9] pillar_1.4.6       lattice_0.20-41    glue_1.4.2         digest_0.6.25     
## [13] colorspace_1.4-1   htmltools_0.5.0    plyr_1.8.6         tm_0.7-7          
## [17] pkgconfig_2.0.3    purrr_0.3.4        scales_1.1.1       processx_3.4.4    
## [21] tibble_3.0.3       generics_0.0.2     usethis_1.6.3      ellipsis_0.3.1    
## [25] withr_2.2.0        magrittr_1.5       crayon_1.3.4       evaluate_0.14     
## [29] ps_1.3.4           stopwords_2.0      fs_1.5.0           tools_4.0.2       
## [33] data.table_1.13.0  lifecycle_0.2.0    stringr_1.4.0      munsell_0.5.0     
## [37] callr_3.4.4        compiler_4.0.2     rlang_0.4.7        debugme_1.1.0     
## [41] grid_4.0.2         dichromat_2.0-0    htmlwidgets_1.5.1  base64enc_0.1-3   
## [45] rmarkdown_2.3      gtable_0.3.0       curl_4.3           R6_2.4.1          
## [49] knitr_1.29         fastmatch_1.1-0    modeltools_0.2-23  rJava_0.9-13      
## [53] stringi_1.5.3      parallel_4.0.2     Rcpp_1.0.5         vctrs_0.3.4       
## [57] mapproj_1.2.7      tidyselect_1.1.0   xfun_0.17

2020, Andreas Niekler and Gregor Wiedemann. GPLv3. tm4ss.github.io