This webpage provides the data used and produced in the research for the paper "A Macroscope of English Print Culture, 1530-1700, Applied to the Coevolution of Ideas on Religion, Science, and Institutions" (access here). The Structural Topic Model estimates reported in the paper were obtained with R's stm package. The first table below contains the two R data files generated at the two key stages of data processing and analysis. The second table contains a subset of the most pertinent objects that are extractable directly from those two R data files, presented here in more universal data formats.
To replicate any commands listed in the second table, you may first have to install and load the stm package:

```r
install.packages("stm")
library(stm)
```
Please note that, due to package updates, some results may not be exactly identical to the originals. At the time of this site's creation, version 1.3.6 of stm was current, whereas the original .rda and .rds files were created with version 1.1.3.
Key conceptual elements of a topic model
Topic: probability distribution over vocabulary.
Document: probability distribution over topics.
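These two definitions can be illustrated with a toy example in base R. Note that the vocabulary, topic count, and probabilities below are invented for illustration and are not drawn from the paper's model; the only point is that each topic is a probability distribution over the vocabulary and each document is a probability distribution over topics, so each must sum to one.

```r
# Toy illustration (invented numbers): 2 topics over a 3-word vocabulary.
vocab <- c("church", "nature", "law")
topics <- rbind(
  topic1 = c(0.7, 0.1, 0.2),  # a topic: probabilities over the vocabulary
  topic2 = c(0.1, 0.6, 0.3)
)
colnames(topics) <- vocab

doc <- c(topic1 = 0.8, topic2 = 0.2)  # a document: probabilities over topics

# Both kinds of distribution sum to one.
stopifnot(all(abs(rowSums(topics) - 1) < 1e-9))
stopifnot(abs(sum(doc) - 1) < 1e-9)
```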
R Data Files | Basic Documentation |
---|---|
Processed Data: TCPProcessedData.rda | Output of the prepDocuments method. Includes document-level vocabulary distributions and document metadata. Contains the following R objects (see the official stm documentation for details): vocab - a character vector containing, in alphabetical order, the words used in the statistical analysis; meta - a table containing metadata for every document (e.g. author, year) as well as the document texts as they stood before R processing but after the pre-processing described in the paper. This means that these are not the original documents. |
The STM Model: TCPSTMestimates.rds | Output of the stm method. Includes estimated parameters for analysis and topic correlation data. Contains the following R objects (see the official stm documentation for details): beta - a matrix giving, for every word and topic, the log probability that the word belongs to the topic; vocab - same as in the .rda file; theta - see "Topic-Proportion Matrix" below; version - the version number of the stm package with which the model was estimated |
NOTE: No use of R is required to use the files in the "Extracted Data + Description" column. The "R Code" column is simply provided to show how experienced users can load these files directly into R.
Extracted Data + Description | R Code |
---|---|
Bag of Words: corpusVocab.csv A CSV file containing every word in the corpus not eliminated by the textProcessor method, along with a corpus-level frequency count for each word. |
How to obtain from the .rda file: First load the RDA object (by default its variable name will be "out"): `load("/insertYourPathHere/TCPProcessedData.rda")` |
The Topics: topics.txt Words that are most used by each topic, according to four different criteria. |
How to obtain from the .rds file: First load the RDS object under a variable name of your choosing: `stmModel <- readRDS("/insertYourPathHere/TCPSTMestimates.rds")` |
Topic-Proportion Matrix: mtheta.csv Matrix of estimated topic prevalences across documents. Rows contain documents and columns contain topics. |
How to obtain from the .rds file: First load the RDS object (if you haven't already): `stmModel <- readRDS("/insertYourPathHere/TCPSTMestimates.rds")` |
Topic-Correlation Matrix: topicCorr.csv Matrix of estimated document-level topic correlations. |
How to obtain from the .rds file: First load the RDS object (if you haven't already): `stmModel <- readRDS("/insertYourPathHere/TCPSTMestimates.rds")` |
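The extraction steps behind the four files above can be sketched as follows. This is a minimal sketch assuming the standard stm API (`labelTopics`, the model's `theta` slot, and `topicCorr`); the file paths are placeholders, and the exact frequency-counting code for corpusVocab.csv is an assumption and may differ from what the authors used.

```r
library(stm)

# Load the processed data (by default this creates "out",
# with out$documents, out$vocab, and out$meta).
load("/insertYourPathHere/TCPProcessedData.rda")

# Bag of words with corpus-level frequency counts (assumed construction):
# in stm's format, each element of out$documents is a 2-row matrix whose
# first row holds vocabulary indices and second row holds counts.
freq <- rep(0, length(out$vocab))
for (d in out$documents) freq[d[1, ]] <- freq[d[1, ]] + d[2, ]
write.csv(data.frame(word = out$vocab, count = freq),
          "corpusVocab.csv", row.names = FALSE)

# Load the fitted model.
stmModel <- readRDS("/insertYourPathHere/TCPSTMestimates.rds")

# Top words per topic under four criteria
# (Highest Prob, FREX, Lift, Score - matching topics.txt).
sink("topics.txt"); print(labelTopics(stmModel)); sink()

# Document-by-topic proportion matrix (rows = documents, columns = topics).
write.csv(stmModel$theta, "mtheta.csv", row.names = FALSE)

# Document-level topic correlation matrix.
write.csv(topicCorr(stmModel)$cor, "topicCorr.csv", row.names = FALSE)
```

Because this sketch reads the two data files from disk, it only runs after downloading them and substituting real paths.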