

In the next picture, you can see the AttributeSelection configuration window with the evaluator set up to Information Gain, and the search set up as Ranker, with the default options.

The Ranker evaluator has two main properties:
In consequence, if we want to keep those attributes scoring over 0, we have just to write that number in the threshold area of the window we get when we click on the Ranker at the previous window:

By clicking OK on all the previous windows, we get a configuration of the AttributeSelection filter which involves keeping those attributes with Information Gain score over 0. If we apply that filter to our current collection, we get the following result:

As you can see, we get a ranked list of 82 attributes (plus the class one), in which the top scoring attribute is the token "to". This attribute occurs in 69 messages (value 1), but many of them are spam ones, so it is quite predictive for this particular class. We can see as well that we only keep a 5.93% of the original attributes (82 over 1382).
Now we can go to the "Classify" tab and select the rule learner PART ("weka > classifiers > rules > PART") to be evaluated on the training collection itself ("Test options" area, "Use training set option"), getting the next result:

We get an accuracy of 95.5%, much better than before. Of course, these results cannot be compared because this quick experiment is a test on the training collection, not done with 3-fold CV and the FilteredClassifier. But if we want to run a CV experiment, how to do it as we have 2 filters instead of one, in our set up?
What we need now is to start with the original text collection in ARFF format (no STWV yet), and to use the MultiFilter that WEKA provides for these situations. We start then with the original collection, and go to the "Classify" tab. If we try to choose any classic learner (J48 for the C4.5 decision tree learner, SMO for Support Vector Machines, etc.), it will be impossible because we have just one attribute (the text of the SMS messages) along with the class, but we can use the weka.classifiers.meta.FilteredClassifier. After selecting it, we will see something similar to the next picture:

If we click on the name of the classifier at the "Classifier" area and we select weka.classifiers.rules.PART as the classifier (with default options), we get the next set up in the FilteredClassifier editor window:

Then we can choose the weka.filters.MultiFilter in the filter area, which starts with a dummy AllFilter. Time to set up our filter combining STWV and AttributeSelection. We click on the filter name area and we get a new filter edition window with an area to define the filters to be applied. If we click on it, we get a new window that allows to add, configure and delete filters. The selected filters will be applied in the order we add them, so we start deleting the AllFilter and adding a STWV filter with the default options, getting something similar to the next picture:

Filters are added by clicking on the "Choose" button to select them, and clicking on the "Add" button to add them to the list. We can now add the AttributeSelection filter with the Information Gain evaluator and the Ranker with threshold 0 search, by editing the filter when clicking on the "Edit" button with the AttributeSelection filter selected in the list. If you manually re-dimension the window, you can see a set up similar to this one:

The set up is nearly finished. We close this window by clicking on the "X" button, and click on the "OK" button at the MultiFilter and FilteredClassifier windows. In the "Classify" tab at the explorer, we select "Cross-validation" in the "Test options" area, entering 3 as the number of folds, and we select the class attribute as "spamclass". Having done this, we can just click on the "Start" button to get the next result:

So we get an accuracy of 83.5%, which is worse than the one we got without using feature selection (which was 86.5%). Oh oh, all this clever (?) set up to get a drop of 3 points in accuracy! :-(
But what happens if, instead of using a relatively weak learner on text problems like PART, we turn to Support Vector Machines? WEKA includes the weka.classifiers.functions.SMOclassifier, which implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. If we choose this classifier with default options, we get a quite different results:
So we get an improvement of accuracy on a more accurate learner, what is nice. However, the difference is just 0.5% (1 message in our 200 instances collection), so it is moderate. Moreover, we get one more false positive, what is bad for this particular problem. In spam filtering, it is much worse to make a false positive (sending a legitimate message to the spam folder) than the opposite, because the user has the risk to miss an important message.
扫码加好友,拉您进群



收藏
