UNIVERSITY OF MARYLAND

A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution


References:
Part 1:
Grajzl P, Murrell P (2021). A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution I: generating and interpreting the estimates. Journal of Institutional Economics 17, 1–19. https://doi.org/10.1017/S1744137420000326
Part 2:
Grajzl P, Murrell P (2021). A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution II: applications. Journal of Institutional Economics 17, 201–216. https://doi.org/10.1017/S1744137420000363

Overview

This webpage provides the data used and produced in research for the two-part paper, "A machine-learning history of English caselaw and legal ideas prior to the Industrial Revolution" (download Part 1 and Part 2 here). The Structural Topic Model estimates reported in the paper were produced using R's stm package. The first table below contains the two R data files generated at the two key stages of data processing and analysis. The second table contains a subset of the most pertinent objects, which are extractable directly from the two R data files but are presented here in more universal data formats.


To replicate the commands listed in the second table, you may first have to install and load the stm package using the following code.
install.packages('stm')
library(stm)


Please note that, due to package updates, some results may not be exactly identical to the originals. At the time of this site's creation, version 1.3.6 of stm was current, whereas the original .rda and .rds files were created with version 1.3.3.


Key conceptual elements of a topic model
Topic: probability distribution over vocabulary.
Document: probability distribution over topics.
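These two distributions correspond to the theta and beta objects documented below. As an illustrative sketch (not part of the original replication code), the following assumes the .rds file has been downloaded to a path of your choosing and checks that each row of the two objects is a proper probability distribution. In the stm model object, the log word probabilities are stored as the logbeta element of the beta list.

```r
# Illustrative path; replace with the location of your downloaded file
stmModel <- readRDS("/insertYourPathHere/ERSSTMestimates.rds")

# A topic is a distribution over the vocabulary:
# each row of exp(logbeta) should sum to approximately 1
logbeta <- stmModel$beta$logbeta[[1]]
summary(rowSums(exp(logbeta)))

# A document is a distribution over topics:
# each row of theta should sum to approximately 1
summary(rowSums(stmModel$theta))
```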

R Data Files Basic Documentation
Processed Data: ERSProcessedData.rda. Output of the prepDocuments function. Includes document-level vocabulary distributions and document metadata.

Contains the following R objects (see the official stm documentation for details):
vocab - a character vector containing, in alphabetical order, the words used in the statistical analysis
meta - a table containing metadata for every document (e.g., author, year) as well as document texts after the pre-processing described in the paper but before further processing in R. These are therefore not the original documents.
The STM Model: ERSSTMestimates.rds. Output of the stm function. Includes the estimated model parameters and topic-correlation data.

Contains the following R objects (see the official stm documentation for details):
beta - a matrix giving, for every topic and word, the log probability of the word under that topic
vocab - same as in the .rda file
theta - see "Topic-Proportion Matrix" below
version - the version number of the package with which the model was estimated

NOTE: R is not required to use the files in the "Extracted Data + Description" column. The "R Code" column simply shows how experienced users can generate these files directly from the R data files.

Extracted Data + Description | R Code
Bag of Words: corpusVocab.csv

CSV file containing every word in the corpus not eliminated by the textProcessor function, together with a corpus-level frequency count for each word.
How to obtain from the .rda object:

# First load the .rda file (by default its contents load into a variable named "out")
load("/insertYourPathHere/ERSProcessedData.rda")
bag_of_words <- sort(c(out$vocab, out$words.removed))
corpusVocab <- data.frame(bag_of_words, out$wordcounts)
write.csv(corpusVocab, file = "corpusVocab.csv")
The Topics: topics.txt

The words most heavily used by each topic, according to four different criteria (highest probability, FREX, lift, and score).
How to obtain from the .rds object:

# First load the .rds file into a variable name of your choosing
stmModel <- readRDS("/insertYourPathHere/ERSSTMestimates.rds")
sink("topics.txt")
# The following command requires the 'stm' package to be installed and loaded
labelTopics(stmModel, n = 30, frexweight = 0.25)
sink()
Topic-Proportion Matrix: mtheta.csv

Matrix of estimated topic prevalences across documents. Rows correspond to documents and columns to topics.
How to obtain from the .rds object:

# First load the .rds file (if you haven't already)
stmModel <- readRDS("/insertYourPathHere/ERSSTMestimates.rds")
mtheta <- stmModel$theta
write.csv(mtheta, file = "mtheta.csv")
Topic-Correlation Matrix: topicCorr.csv

Matrix of estimated document-level topic correlations.
How to obtain from the .rds object:

# First load the .rds file (if you haven't already)
stmModel <- readRDS("/insertYourPathHere/ERSSTMestimates.rds")
# The following command requires the 'stm' package to be installed and loaded
topCorr <- topicCorr(stmModel)
write.csv(topCorr$cor, file = "topicCorr.csv")
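As an optional follow-up (a sketch, not part of the original replication code), the estimated correlation structure can also be visualized directly: stm provides a plot method for topicCorr objects that draws a network connecting positively correlated topics. Note that this plot method additionally requires the igraph package to be installed.

```r
# Assumes stmModel has been loaded as above and 'stm' is installed and loaded
topCorr <- topicCorr(stmModel)
plot(topCorr)  # network plot: edges link positively correlated topics
```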