On this page I provide useful hints and tips (recipes) for working with text data in WEKA. The information is organized as a list of blog posts and references, plus additional material such as code and text collections.
I suggest reading my posts on text classification with WEKA in publication order:
- Text Mining in WEKA: Chaining Filters and Classifiers explains how and why you should chain filters and classifiers when evaluating your text classifiers with cross-validation. The explanation uses the Explorer tools, and it serves as a quick introduction to the process of building a text classifier in WEKA, along with the FilteredClassifier class.
- Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters describes how to complete the life-cycle of the learning process by adding feature selection to it, using the MultiFilter class.
- Command Line Functions for Text Mining in WEKA presents how to perform the previous experiments with the FilteredClassifier and MultiFilter classes, but now in the command line interface instead of in WEKA's Explorer.
- A Simple Text Classifier in Java with WEKA presents and discusses two small programs as examples of how to integrate WEKA into your Java code for text mining.
- URL Text Classification with WEKA, Part 1: Data Analysis shows an application of text classification to processing URL text as a complement to URL database-based filtering in Web Filters. This first post just explains how I have built the dataset, while an upcoming post will explain my ongoing experiments.
- Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers discusses three ways of mapping the set of terms used to represent the training and test sets of a text dataset in order to enable learning, namely using batch filters, the FilteredClassifier class, and the InputMappedClassifier class.
- Language Identification as Text Classification with WEKA explains how to build an automated language guesser for texts as a complete example of a Text Mining process with WEKA, and to demonstrate a more advanced usage of the StringToWordVector class.
- Baseline Sentiment Analysis with WEKA shows how to configure and run an experiment on sentiment analysis and opinion mining using WEKA, and especially the TextDirectoryLoader and NGramTokenizer classes.
- Comparing baselines of keyword and learning based sentiment analysis provides a basic example of using SentiWordNet for a keyword-based approach to sentiment classification, and compares it with a learning-based approach based on WEKA.
- Sample Code for Text Indexing with WEKA shows how to index a text dataset using your own Java code and the StringToWordVector filter in WEKA.
- Performance Analysis of N-Gram Tokenizer in WEKA analyzes the performance of the WEKA class NGramTokenizer, which depends on the complexity of the regular expression used during the tokenization step.
- Do you want me to deal with some specific topic? Just let me know.
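As a taste of what the NGramTokenizer post covers, here is a minimal pure-Java sketch (no WEKA dependency; the class and method names are my own) that splits a string on a word-delimiter regex and then emits all word n-grams between a minimum and maximum size, which is essentially what WEKA's NGramTokenizer does:

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Split text on non-word delimiters, then emit every word n-gram
    // with a size between min and max, in NGramTokenizer style.
    public static List<String> tokenize(String text, int min, int max) {
        String[] words = text.trim().split("\\W+");
        List<String> grams = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder sb = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) {
                    sb.append(' ').append(words[i + j]);
                }
                grams.add(sb.toString());
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Unigrams and bigrams of a short sentence.
        System.out.println(tokenize("the quick brown fox", 1, 2));
    }
}
```

Note that the delimiter regex (`\W+` here) is exactly the knob whose complexity drives the performance differences analyzed in that post.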
I have some other posts on WEKA as well; all my posts related to WEKA can be found using the label WEKA.
Interesting references for working with WEKA include:
- Use WEKA in your Java code provides an excellent introduction to using the Instances, Filter, Classifier, Clusterer, Evaluation and AttributeSelection classes in your own code.
- WEKA programmatic use describes the learning process life-cycle and, more importantly, it explains how to deal with attributes in your Java code.
- Text Categorization with WEKA deals with transforming a directory structure of classes (directories) and documents (files inside those directories) into the ARFF format for further processing. The code is available at ARFF files from Text Collections.
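The directory-to-ARFF transformation that last reference describes can be sketched in plain Java. This is a simplified illustration with made-up names, not the code from that post: each document becomes one data row with a string attribute holding its text and a nominal class attribute holding its category:

```java
public class ArffBuilder {
    // Build a minimal ARFF document for a text collection: one string
    // attribute for the document text, one nominal class attribute.
    // docs[i] = {documentText, className}
    public static String toArff(String relation, String[] classes,
                                String[][] docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {")
          .append(String.join(",", classes)).append("}\n\n@data\n");
        for (String[] d : docs) {
            // Escape single quotes inside the document text.
            String text = d[0].replace("'", "\\'");
            sb.append("'").append(text).append("',").append(d[1]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toArff("reviews", new String[]{"pos", "neg"},
            new String[][]{{"great movie", "pos"}, {"awful", "neg"}}));
    }
}
```

In WEKA itself this job is done for you by the TextDirectoryLoader class, which maps each subdirectory name to a class value; the sketch above only shows the shape of the resulting ARFF file.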
For testing your classifiers and integrating WEKA into your own code, I provide additional resources. You will find most of them at my tmweka GitHub repository.