2017-05-22
JSTORr

Simple exploratory text mining and document clustering of journal articles from JSTOR's Data for Research service.

Objective

The aim of this package is to provide some simple functions in R to explore changes in word frequencies over time in a specific journal archive. It is designed to solve the problem of finding patterns and trends in the unstructured text content of a large number of scholarly journal articles from the JSTOR archive.

Currently there are functions to explore changes in:

  • a single word (i.e. plot the relative frequency of a 1-gram over time)
  • two words independently (i.e. plot the relative frequency of two 1-grams over time)
  • sets of words (i.e. plot the relative frequency of a single group of multiple 1-grams over time)
  • correlations between two words over time (i.e. plot the correlation of two 1-grams over time)
  • correlations between two sets of words over time (i.e. plot the correlation of two sets of multiple 1-grams over time)
  • all of the above with bigrams (a sequence of two words)
  • the most frequent words by n-year ranges of documents (i.e. top words in all documents published in 2, 5 or 10 year ranges, whatever you like)
  • the top n words correlated with a word by n-year ranges of documents (i.e. the top 20 words associated with the word 'pirate' in 5 year ranges)
  • various methods (k-means, PCA, affinity propagation) to detect clusters in a set of documents containing a word or set of words
  • topic models with the lda package for a full R solution, or with the Java-based MALLET program (if installing that is an option; currently implemented here for Windows only)
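To make "relative frequency of a 1-gram over time" concrete, here is a minimal sketch in base R. It does not use any JSTORr function: the data frame `counts`, its column names and the helper `relative_frequency()` are all made up for illustration, and the package's own data structures may look quite different.

```r
# Toy data: one row per (document, word) pair. All names and values are
# hypothetical; this is not the data structure that JSTORr uses.
counts <- data.frame(
  doc   = c("a", "a", "b", "b", "c", "c"),
  year  = c(1990, 1990, 1990, 1990, 1991, 1991),
  word  = c("pirate", "ship", "pirate", "sea", "ship", "sea"),
  count = c(2, 8, 1, 9, 5, 5),
  stringsAsFactors = FALSE
)

# Relative frequency of `target` per year: occurrences of the word divided by
# the total number of words in all documents published in that year.
relative_frequency <- function(counts, target) {
  total <- tapply(counts$count, counts$year, sum)
  hits  <- counts[counts$word == target, ]
  found <- tapply(hits$count, hits$year, sum)
  # Start from zero so that years without the word are still reported.
  freq  <- setNames(rep(0, length(total)), names(total))
  freq[names(found)] <- found / total[names(found)]
  freq
}

pirate <- relative_frequency(counts, "pirate")
ship   <- relative_frequency(counts, "ship")
```

The result is a numeric vector named by year, so something like plot(as.numeric(names(pirate)), pirate, type = "b") would draw the trend for one word, and calling the helper twice gives the two series needed for the "two words independently" case.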

This package will be useful to researchers who want to explore the history of ideas in an academic field and investigate changes in word and phrase use over time and between different journals.

How to install

First, make sure you've got Hadley Wickham's excellent devtools package installed. If you haven't got it, you can get it with these lines in your R console:

install.packages(pkgs = "devtools", dependencies = TRUE)
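If you are not sure whether devtools is already there, a check along the following lines may help. This is only a sketch: `missing_packages()` is a hypothetical helper written for this post, not part of devtools or JSTORr, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Return the packages from `pkgs` that are not installed in this R library.
# Note that requireNamespace() loads a package's namespace when it is found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # Uncomment to install whatever is missing:
  # install.packages(to_install, dependencies = TRUE)
}
```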

Then, use the install_github() function to fetch this package from GitHub:

library(devtools)
# download and install the package (do this only once per computer)
install_github("benmarwick/JSTORr")

Error messages relating to rJava on Windows can probably be fixed by following exactly the instructions here. On OSX, try running R CMD javareconf at the command line, then run install.packages("rJava", type = "source") in R.


First, go to JSTOR's Data for Research service and make a request for data. The DfR service makes available large numbers of journal articles in a format that is convenient for text mining. When making a request for data to use with this package, you must choose:

  • CSV as the 'output format', not XML, which is the default
  • Word Counts and bigrams as the 'Data Type'

Second, once you've downloaded and unzipped the zip file that is the 'full dataset' from DfR, you can start R and work through the steps in the next section. Using RStudio with this package is highly recommended, since it makes the plot output much easier to manage.
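Before running the package on the full dataset, it can be useful to look at a single word-count file by hand. The sketch below assumes that each file is a two-column CSV with a header row of WORDCOUNTS and WEIGHT; that layout is an assumption about the DfR output, so check it against your own download. `read_wordcounts()` is a hypothetical helper, not a JSTORr function, and the example writes its own small stand-in file so that it runs without any DfR data.

```r
# Create a small stand-in for one word-count file so the example is
# self-contained. The header names are an assumption about the DfR layout.
tmp <- tempfile(fileext = ".CSV")
writeLines(c("WORDCOUNTS,WEIGHT", "the,120", "pirate,7", "ship,3"), tmp)

# Read one file into a data frame with one row per word, replacing the
# header names with friendlier column names.
read_wordcounts <- function(path) {
  read.csv(path, stringsAsFactors = FALSE,
           col.names = c("word", "count"))
}

wc <- read_wordcounts(tmp)
wc[order(-wc$count), ]   # most frequent words first
```

With a real download, the same helper could be applied to every file returned by list.files() on the unzipped folder, using full.names = TRUE so that the paths can be read directly.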

Hidden content of this post

JSTORr-master.zip
Size: 5.02 MB
