JSTORr 
Simple exploratory text mining and document clustering of journal articles from JSTOR's Data for Research service.
Objective
The aim of this package is to provide some simple functions in R to explore changes in word frequencies over time in a specific journal archive. It is designed to solve the problem of finding patterns and trends in the unstructured text content of a large number of scholarly journal articles from the JSTOR archive.
Currently there are functions to explore changes in:
- a single word (i.e. plot the relative frequency of a 1-gram over time)
- two words independently (i.e. plot the relative frequency of two 1-grams over time)
- sets of words (i.e. plot the relative frequency of a single group of multiple 1-grams over time)
- correlations between two words over time (i.e. plot the correlation of two 1-grams over time)
- correlations between two sets of words over time (i.e. plot the correlation of two sets of multiple 1-grams over time)
- all of the above with bigrams (a sequence of two words)
- the most frequent words by n-year ranges of documents (i.e. top words in all documents published in 2-, 5- or 10-year ranges, whatever you like)
- the top n words correlated with a word by n-year ranges of documents (i.e. the top 20 words associated with the word 'pirate' in 5-year ranges)
- various methods (k-means, PCA, affinity propagation) to detect clusters in a set of documents containing a word or set of words
- topic models with the lda package for a full R solution, or with the Java-based MALLET program (if installing that is an option; currently implemented here for Windows only)
This package will be useful to researchers who want to explore the history of ideas in an academic field, and investigate changes in word and phrase use over time, and between different journals.
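To make the first items in the list above concrete, here is a minimal base R sketch of what "relative frequency of a 1-gram over time" and "correlation of two 1-grams" can mean. It does not use or reproduce this package's own functions; the data frame, its column names and all the counts are made up purely for illustration.

```r
# Toy data (invented for illustration): one row per article
docs <- data.frame(
  year   = c(1990, 1990, 1995, 1995, 2000),
  total  = c(1000, 2000, 1500, 500, 4000),  # all words in the article
  pirate = c(2, 0, 3, 1, 12),               # occurrences of 'pirate'
  ship   = c(1, 2, 6, 0, 20)                # occurrences of 'ship'
)

# Relative frequency of each 1-gram within each article
docs$pirate_rel <- docs$pirate / docs$total
docs$ship_rel   <- docs$ship / docs$total

# Mean relative frequency per publication year
by_year <- aggregate(cbind(pirate_rel, ship_rel) ~ year, data = docs, FUN = mean)

# Pearson correlation of the two words across articles
r <- cor(docs$pirate_rel, docs$ship_rel)

# Plot one word's relative frequency over time
plot(by_year$year, by_year$pirate_rel, type = "b",
     xlab = "Year", ylab = "Mean relative frequency of 'pirate'")
```

Dividing by the article length before averaging keeps long articles from dominating the trend, which is one reasonable choice among several.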
How to install
First, make sure you've got Hadley Wickham's excellent devtools package installed. If you haven't got it, you can get it with these lines in your R console:
install.packages(pkgs = "devtools", dependencies = TRUE)
Then, use the install_github() function to fetch this package from GitHub:
library(devtools)
# download and install the package (do this only once ever per computer)
install_github("benmarwick/JSTORr")
Error messages relating to rJava on Windows can probably be fixed by following exactly the instructions here. On OS X, try R CMD javareconf at the command line, then run install.packages("rJava", type = 'source') in R.
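If you re-run your setup script often, the install steps above can be wrapped so that they only do work when something is missing. This is just a sketch: needs_install() and setup_jstorr() are hypothetical helper names, not part of this package or of devtools, and it assumes the installed package is named JSTORr.

```r
# TRUE when a package is not yet installed (hypothetical helper)
needs_install <- function(pkg) !requireNamespace(pkg, quietly = TRUE)

# One-time setup that is harmless to call again (hypothetical helper)
setup_jstorr <- function() {
  if (needs_install("devtools")) {
    install.packages(pkgs = "devtools", dependencies = TRUE)
  }
  # Assumes the package installs under the name "JSTORr"
  if (needs_install("JSTORr")) {
    devtools::install_github("benmarwick/JSTORr")
  }
  invisible(TRUE)
}
```

Call setup_jstorr() once per computer, then load the package with library(JSTORr) in each new session.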
First, go to JSTOR's Data for Research service and make a request for data. The DfR service makes available large numbers of journal articles in a format that is convenient for text mining. When making a request for data to use with this package, you must choose:
- CSV as the 'output format', not XML, which is the default
- Word Counts and bigrams as the 'Data Type'
Second, once you've downloaded and unzipped the 'full dataset' zip file from DfR, you can start R and work through the steps in the next section. Using RStudio is highly recommended when working with this package, because it makes the plot output much easier to manage.
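Before running the package functions, it can help to peek at one of the unzipped files in plain R to confirm the download is in CSV format. The sketch below writes a tiny stand-in file to a temporary folder and reads it back. The file name and the WORDCOUNTS and WEIGHT column headers are assumptions about the DfR CSV layout, and the counts are invented, so check them against your own files.

```r
# Write a tiny stand-in for one per-article word-count file
# (file name and header are assumptions about the DfR CSV layout)
tmp <- file.path(tempdir(), "wordcounts_example.CSV")
writeLines(c("WORDCOUNTS,WEIGHT", "the,120", "pirate,7", "ship,4"), tmp)

# Read it the same way you would read a real file from the unzipped folder
counts <- read.csv(tmp, stringsAsFactors = FALSE)

# Named vector: word -> number of occurrences in this article
word_counts <- setNames(counts$WEIGHT, counts$WORDCOUNTS)

# Relative frequency of each word within this article
rel_freq <- word_counts / sum(word_counts)
print(head(sort(rel_freq, decreasing = TRUE)))
```

To inspect a real file, replace tmp with the path to one of the CSV files in your unzipped DfR folder.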