On this page I provide useful hints and tips (recipes) for working with text data in WEKA. The information is organized as a list of blog posts and references, plus additional material such as code and text collections.
I suggest reading my posts on text classification with WEKA in publication order:
- Text Mining in WEKA: Chaining Filters and Classifiers explains how and why you should chain filters and classifiers when evaluating your text classifiers with cross-validation. The explanation uses the Explorer tools, and it serves as a quick introduction to the process of building a text classifier in WEKA, along with the FilteredClassifier class.
- Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters describes how to complete the life-cycle of the learning process by adding feature selection to it, using the MultiFilter class.
- Command Line Functions for Text Mining in WEKA presents how to perform the previous experiments with the FilteredClassifier and MultiFilter classes, but now in the command line interface instead of in WEKA's Explorer.
- A Simple Text Classifier in Java with WEKA presents and discusses two small programs as examples of how to integrate WEKA into your Java code for text mining.
- URL Text Classification with WEKA, Part 1: Data Analysis shows an application of text classification to processing URL text as a complement to URL database-based filtering in Web Filters. This first post just explains how I have built the dataset, while an upcoming post will explain my ongoing experiments.
- Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers discusses three ways of mapping the set of terms used to represent the training and test sets of a text dataset in order to enable learning, namely using batch filters, the FilteredClassifier class, and the InputMappedClassifier class.
- Language Identification as Text Classification with WEKA explains how to build an automated language guesser for texts as a complete example of a Text Mining process with WEKA, and to demonstrate a more advanced usage of the StringToWordVector class.
- Baseline Sentiment Analysis with WEKA shows how to configure and run an experiment on sentiment analysis and opinion mining using WEKA, and especially the TextDirectoryLoader and NGramTokenizer classes.
- Comparing baselines of keyword and learning based sentiment analysis provides a basic example of using SentiWordNet for a keyword-based approach to sentiment classification, and compares it with a learning-based approach based on WEKA.
- Sample Code for Text Indexing with WEKA shows how to index a text dataset using your own Java code and the StringToWordVector filter in WEKA.
- Performance Analysis of N-Gram Tokenizer in WEKA analyzes the performance of the WEKA class NGramTokenizer, which depends on the complexity of the regular expression used during the tokenization step.
- Do you want me to deal with some specific topic? Just let me know.
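As a taste of what the NGramTokenizer post covers, here is a minimal pure-Java sketch (no WEKA dependency; the class and method names are my own) that splits a string on a word-delimiter regex and then emits all word n-grams between a minimum and maximum size, which is essentially what WEKA's NGramTokenizer does:

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Split text on non-word delimiters, then emit every word n-gram
    // with a size between min and max, in NGramTokenizer style.
    public static List<String> tokenize(String text, int min, int max) {
        String[] words = text.trim().split("\\W+");
        List<String> grams = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder sb = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) {
                    sb.append(' ').append(words[i + j]);
                }
                grams.add(sb.toString());
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Unigrams and bigrams of a short sentence.
        System.out.println(tokenize("the quick brown fox", 1, 2));
    }
}
```

Note that the delimiter regex (`\W+` here) is exactly the knob whose complexity drives the performance differences analyzed in that post.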
I have some other posts on WEKA as well; all my posts related to WEKA can be found using the label WEKA.
Interesting references for working with WEKA include:
- Use WEKA in your Java code provides an excellent introduction to using the Instances, Filter, Classifier, Clusterer, Evaluation and AttributeSelection classes in your own code.
- WEKA programmatic use describes the learning process life-cycle and, more importantly, it explains how to deal with attributes in your Java code.
- Text Categorization with WEKA deals with transforming a directory structure of classes (directories) and documents (files inside those directories) into the ARFF format for further processing. The code is available at ARFF files from Text Collections.
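The directory-to-ARFF transformation that last reference describes can be sketched in plain Java. This is a simplified illustration with made-up names, not the code from that post: each document becomes one data row with a string attribute holding its text and a nominal class attribute holding its category:

```java
public class ArffBuilder {
    // Build a minimal ARFF document for a text collection: one string
    // attribute for the document text, one nominal class attribute.
    // docs[i] = {documentText, className}
    public static String toArff(String relation, String[] classes,
                                String[][] docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {")
          .append(String.join(",", classes)).append("}\n\n@data\n");
        for (String[] d : docs) {
            // Escape single quotes inside the document text.
            String text = d[0].replace("'", "\\'");
            sb.append("'").append(text).append("',").append(d[1]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toArff("reviews", new String[]{"pos", "neg"},
            new String[][]{{"great movie", "pos"}, {"awful", "neg"}}));
    }
}
```

In WEKA itself this job is done for you by the TextDirectoryLoader class, which maps each subdirectory name to a class value; the sketch above only shows the shape of the resulting ARFF file.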
For testing your classifiers and integrating WEKA into your own code, I provide additional resources. You will find most of them at my tmweka GitHub repository.