A Simple Text Classifier in Java with WEKA

2014-12-10

In previous posts [1, 2, 3], I have shown how to use the WEKA classes FilteredClassifier and MultiFilter to properly build and evaluate a text classifier. For that purpose, I used the Explorer GUI provided by WEKA and its command-line interface.
In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feel for the power of this data mining library. However, where you can take full advantage of its power is in your own Java programs. Now it is time to deal with that.
Following Salton, and Belkin and Croft, the process of text classification involves two main steps:

Representing your text database in order to enable learning, and to train a classifier on it.
Using the classifier to predict text labels of new, unseen documents.

The first step is a batch process, in the sense that you can repeat it periodically (as your labelled data set improves over time: bigger sizes, new labels or categories, corrected predictions via user feedback). The second step is the moment in which you take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual: modern text classifiers can retrain on the added documents as soon as they get them, in order to keep or improve accuracy over time.
In consequence, to demonstrate the text classification process we need two programs: one to learn from the text dataset, and another to use the learnt model to classify new documents. Let us start with a very simple text learner in Java, using WEKA. The class is named MyFilteredLearner.java, and its main() method demonstrates its usage, which involves:

Loading the text dataset.
Evaluating the classifier.
Training the classifier.
Storing the classifier.

The most interesting parts of the process are:

We read the dataset by simply using the getData() method of an ArffReader object that wraps a BufferedReader.
We programmatically create the classifier by combining a StringToWordVector filter (in order to represent the texts as feature vectors) and a NaiveBayes classifier (for learning), using the FilteredClassifier class discussed in previous posts.

The process of creating the classifier is demonstrated in the next code snippet:

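// The class label is the first attribute of the dataset
trainData.setClassIndex(0);
// The filter turns the last attribute (the text) into word features
filter = new StringToWordVector();
filter.setAttributeIndices("last");
// Chain the filter and the learner into a single classifier
classifier = new FilteredClassifier();
classifier.setFilter(filter);
classifier.setClassifier(new NaiveBayes());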

So we set the class of the dataset to be the first attribute, then we create the filter and set the attribute to be transformed from text into a feature vector (the last one), and then we create the FilteredClassifier object and add the previous filter and a new NaiveBayes classifier to it. Given the settings above, the dataset must have the class as the first attribute and the text as the second (and last) one, as in my usual example, the SMS spam subset (smsspam.small.arff).
You can compile and run MyFilteredLearner with the following commands, which produce the output shown:
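To make "representing the texts as feature vectors" a bit more concrete, here is a minimal sketch of the bag-of-words idea in plain Java. It is not WEKA's implementation: the class WordVectorSketch and its methods are hypothetical names made up for this illustration, and the real StringToWordVector filter offers many more options (tokenizers, stemming, TF-IDF weighting, limits on the number of words kept).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WEKA: a toy bag-of-words vectorizer.
class WordVectorSketch {

    // Lower-case the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Give every distinct word a column index, in order of first appearance.
    static Map<String, Integer> buildVocabulary(List<String> texts) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String text : texts) {
            for (String token : tokenize(text)) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }
        return vocab;
    }

    // Turn one text into a 0/1 presence vector; unknown words are ignored.
    static int[] vectorize(String text, Map<String, Integer> vocab) {
        int[] vector = new int[vocab.size()];
        for (String token : tokenize(text)) {
            Integer index = vocab.get(token);
            if (index != null) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("Win a free prize now", "See you at home now");
        Map<String, Integer> vocab = buildVocabulary(texts);
        System.out.println(vocab.keySet());
        System.out.println(Arrays.toString(vectorize("free prize at home", vocab)));
    }
}
```

The vocabulary is built from the training texts only, and a new text is mapped onto those same columns. That is the reason for wrapping the filter and the learner together in a FilteredClassifier. None of this code is needed in MyFilteredLearner itself, where the filter does the work.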

$>javac MyFilteredLearner.java
$>java MyFilteredLearner smsspam.small.arff myClassifier.dat
===== Loaded dataset: smsspam.small.arff =====

Correctly Classified Instances         187               93.5    %
Incorrectly Classified Instances        13                6.5    %
Kappa statistic                          0.7277
Mean absolute error                      0.0721
Root mean squared error                  0.2568
Relative absolute error                 25.8792 %
Root relative squared error             69.1763 %
Coverage of cases (0.95 level)          94      %
Mean rel. region size (0.95 level)      51.75   %
Total Number of Instances              200

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                 0,636    0,006    0,955      0,636   0,764      0,748  0,943     0,858     spam
                 0,994    0,364    0,933      0,994   0,962      0,748  0,943     0,986     ham
Weighted Avg.    0,935    0,305    0,936      0,935   0,930      0,748  0,943     0,965
===== Evaluating on filtered (training) dataset done =====
===== Training on filtered (training) dataset done =====
===== Saved model: myClassifier.dat =====
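As a sanity check on how to read these figures, the following sketch recomputes some of them from a 2x2 confusion matrix. The counts used (21 spam and 166 ham messages classified correctly, 12 spam messages missed, 1 ham message flagged as spam) are not printed in the output; I have inferred them from the reported rates and totals, so treat them as an assumption. MetricsSketch is a hypothetical helper and not part of WEKA. The decimal commas in the output above are presumably a locale effect, and the sketch prints with decimal points.

```java
import java.util.Locale;

// Hypothetical helper: summary metrics from a 2x2 confusion matrix,
// taking "spam" as the positive class.
class MetricsSketch {

    static double accuracy(int tp, int fn, int fp, int tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    static double fMeasure(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    // Cohen's kappa: agreement beyond what the class proportions alone would give.
    static double kappa(int tp, int fn, int fp, int tn) {
        double n = tp + fn + fp + tn;
        double observed = (tp + tn) / n;
        double expected = ((double) (tp + fn) * (tp + fp)
                + (double) (fp + tn) * (fn + tn)) / (n * n);
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Inferred counts (an assumption, see the text above).
        int tp = 21, fn = 12, fp = 1, tn = 166;
        System.out.println(String.format(Locale.ROOT, "accuracy  %.3f", accuracy(tp, fn, fp, tn)));
        System.out.println(String.format(Locale.ROOT, "precision %.3f", precision(tp, fp)));
        System.out.println(String.format(Locale.ROOT, "recall    %.3f", recall(tp, fn)));
        System.out.println(String.format(Locale.ROOT, "F-measure %.3f", fMeasure(tp, fp, fn)));
        System.out.println(String.format(Locale.ROOT, "kappa     %.4f", kappa(tp, fn, fp, tn)));
    }
}
```

With these counts the sketch gives 0.935 accuracy, 0.955 precision, 0.636 recall and 0.764 F-measure for spam, and a kappa of 0.7277, which agrees with the report. The low spam recall is the practical message here: about a third of the spam messages get through.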

The evaluation has been performed with default values except for the number of folds, which has been set to 4, as shown in the next code snippet:

Evaluation eval = new Evaluation(trainData);
eval.crossValidateModel(classifier, trainData, 4, new Random(1));
System.out.println(eval.toSummaryString());

In case you do not want to evaluate the classifier on the training data, you can omit the call to the evaluate() method.
Now let us deal with the classification program, which is more complex, but only in the part that creates an instance. The class is named MyFilteredClassifier.java, and its main() method demonstrates its usage, which involves:
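If you are wondering what the "4" and the Random(1) stand for, the sketch below shows the general idea of k-fold cross-validation: the instances are shuffled with a seeded random generator and dealt into k folds, and each fold is used once for testing while the others are used for training. This is a simplified illustration with a hypothetical class name (KFoldSketch); WEKA's own fold construction may differ, for example by stratifying the folds on the class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: split instance indices into k folds after a seeded shuffle.
class KFoldSketch {

    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // The same seed always gives the same shuffle, so runs are repeatable.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled indices round-robin into the k folds.
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(200, 4, 1L);
        for (int f = 0; f < folds.size(); f++) {
            // Fold f is the test set in round f; the other folds form the training set.
            System.out.println("fold " + f + ": " + folds.get(f).size() + " test instances");
        }
    }
}
```

With 200 instances and 4 folds, each round trains on 150 instances and tests on the remaining 50, and every instance is tested exactly once.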

Reading the text to be classified from a file.
Reading the model or classifier from a file.
Creating the instance.
Classifying it.
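The second step in this list, reading the model from a file, can be sketched with standard Java object serialization. This assumes the model file was written the same way, which the code shown in this post does not confirm. WEKA classifiers are Serializable, but the ToyModel class below is only a hypothetical stand-in so that the example runs without WEKA.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper: save and load any Serializable model object.
class ModelStoreSketch {

    // Stand-in for a trained classifier (an assumption made for this sketch).
    static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String defaultLabel;

        ToyModel(String defaultLabel) {
            this.defaultLabel = defaultLabel;
        }
    }

    static void save(Object model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("myClassifier", ".dat");
        file.deleteOnExit();
        save(new ToyModel("ham"), file);
        // The caller casts the loaded object back to the type it expects.
        ToyModel restored = (ToyModel) load(file);
        System.out.println(restored.defaultLabel);
    }
}
```

In the real program the cast would be to the classifier type that was saved, and the classes on the classpath must match the ones used when the model was written.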

Creating the instance is performed in the makeInstance() method, whose code is the following:

// Create the attributes, class and text
FastVector fvNominalVal = new FastVector(2);
fvNominalVal.addElement("spam");
fvNominalVal.addElement("ham");
Attribute attribute1 = new Attribute("class", fvNominalVal);
Attribute attribute2 = new Attribute("text",(FastVector) null);
// Create list of instances with one element
FastVector fvWekaAttributes = new FastVector(2);
fvWekaAttributes.addElement(attribute1);
fvWekaAttributes.addElement(attribute2);
instances = new Instances("Test relation", fvWekaAttributes, 1);
// Set class index
instances.setClassIndex(0);
// Create and add the instance
DenseInstance instance = new DenseInstance(2);
instance.setValue(attribute2, text);
// instance.setValue((Attribute)fvWekaAttributes.elementAt(1), text);
instances.add(instance);

The classifier learnt with MyFilteredLearner.java expects an instance to have two attributes: the first one is the class, a nominal attribute with values "spam" or "ham"; the second one is a String, which is the text to be classified. Instead of creating a single instance, we create a whole new dataset whose first instance is the one that we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the Instances object (and not in each instance).
So first we create the attributes by using the FastVector class provided by WEKA. The case of the nominal attribute ("class") is relatively simple, but the case of the String one is a bit more complex, because it requires the second argument of the constructor to be null, cast to FastVector. Then we create an Instances object by using a FastVector to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the FastVector class is deprecated in the WEKA development version.
The last step is to create an actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a DenseInstance object. However, if you use the stable version, then you can use Instance (see the stable version documentation), and you must change this code to:

Instance instance = new Instance(2);

As a note, I have left a different way of setting the value of the second attribute commented out in the code. Note also that we do not set the value of the first attribute, as it is unknown.
The rest of the methods are (more or less) straightforward if you follow the documentation (weka - Programmatic Use, and weka - Use WEKA in your Java code). You get the class prediction on your text with the following lines:

double pred = classifier.classifyInstance(instances.instance(0));
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));
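The cast to int in the last line may look odd at first. For a nominal class, classifyInstance() returns the index of the predicted value as a double, and that index has to be looked up in the list of class values. The following stand-alone sketch mimics that lookup with a plain list; PredictionLabelSketch and labelFor() are hypothetical names, and the list order is taken from the attribute definition in makeInstance().

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: map a numeric prediction to its class label.
class PredictionLabelSketch {

    // The prediction is an index into the class values, stored as a double.
    static String labelFor(double prediction, List<String> classValues) {
        return classValues.get((int) prediction);
    }

    public static void main(String[] args) {
        // Same order as in makeInstance(): "spam" first, then "ham".
        List<String> classValues = Arrays.asList("spam", "ham");
        System.out.println("Class predicted: " + labelFor(1.0, classValues));
    }
}
```

This is why the order of the values in the test schema must match the order used in the training data: the same index would otherwise be printed with the wrong label.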

And if you feed this classifier with a file (smstest.txt) that stores the text "this is spam or not, who knows?", and with the model learnt with MyFilteredLearner.java (which is stored in myClassifier.dat), then you get the following result:

$>javac MyFilteredClassifier.java
$>java MyFilteredClassifier smstest.txt myClassifier.dat
===== Loaded text data: smstest.txt =====
this is spam or not, who knows?
===== Loaded model: myClassifier.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'

@attribute class {spam,ham}
@attribute text string

@data
?,' this is spam or not, who knows?'
===== Classified instance =====
Class predicted: ham

It is interesting to see that the class assigned to the instance before classifying it is "?", which means undefined or unknown.
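To see where that data line comes from, here is a small sketch that renders a (class, text) pair as an ARFF data row, printing "?" for a missing class and wrapping the text in single quotes. It is a simplification written for this post's two-attribute schema, with hypothetical names (ArffRowSketch, quote(), row()); WEKA's own quoting rules cover more cases than the two escapes handled here.

```java
// Hypothetical helper: render one row of the "class, text" dataset in ARFF style.
class ArffRowSketch {

    // Wrap the text in single quotes, escaping backslashes and single quotes.
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    // A null class value stands for "not set" and is written as "?".
    static String row(String classValue, String text) {
        String cls = (classValue == null) ? "?" : classValue;
        return cls + "," + quote(text);
    }

    public static void main(String[] args) {
        // Unlabelled instance, as in the classifier output above.
        System.out.println(row(null, " this is spam or not, who knows?"));
        // Labelled instance, as it would appear in a training file.
        System.out.println(row("ham", "see you at home"));
    }
}
```

The first call prints the same data line as the output above. Quoting matters because the text itself contains a comma, which would otherwise be read as an attribute separator.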