2014-12-10
Text Mining in WEKA: Chaining Filters and Classifiers

One of the most interesting features of WEKA is its flexibility for text classification. Over the years, I have had the chance to run many experiments on text collections with WEKA, most of them in supervised tasks commonly referred to as Text Categorization, that is, classifying text segments (documents, paragraphs, collocations) into a set of predefined classes. Examples of Text Categorization tasks include assigning topic labels to news items, classifying email messages into folders, or, closer to my research, classifying messages as spam or not (Bayesian spam filters) and web pages as inappropriate or not (e.g. pornographic content vs. educational resources).
WEKA's support for Text Categorization is impressive. A prominent feature is that the package supports breaking text utterances into indexing terms (word stems, collocations) and assigning them weights in term vectors, a required step in nearly every text classification task. This tokenization and indexing process is achieved with a very flexible filter named StringToWordVector. Let me show an example of how it works.
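To make the idea concrete, here is a minimal Python sketch of what StringToWordVector does conceptually (not WEKA's actual implementation, whose tokenizer and options are far richer): break each message into tokens and build term-presence vectors over the collection's vocabulary.

```python
import re

def tokenize(text):
    # Rough stand-in for WEKA's tokenizer: runs of letters, digits and a few symbols.
    return re.findall(r"[A-Za-z0-9£&']+", text)

def string_to_word_vectors(messages):
    # Build the vocabulary from the whole collection, then map each
    # message to a binary presence/absence vector (WEKA's default weight).
    vocabulary = sorted({tok for msg in messages for tok in tokenize(msg)})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vectors = []
    for msg in messages:
        vec = [0] * len(vocabulary)
        for tok in tokenize(msg):
            vec[index[tok]] = 1
        vectors.append(vec)
    return vocabulary, vectors

vocab, vecs = string_to_word_vectors(["Free entry to win", "win a prize"])
```

The resulting vectors are what the learning algorithms actually see: one attribute per indexing token, plus the class attribute.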
I will start with a simple text collection: a small sample of the publicly available SMS Spam Collection. Some colleagues and I built this collection for experimenting with Bayesian SMS spam filters; it contains 4,827 legitimate messages and 747 mobile spam messages, for a total of 5,574 short messages collected from several sources. I will use a small subset in order to better illustrate my points in this post. The subset consists of the first 200 messages, formatted in the WEKA ARFF format:
@relation sms_test
@attribute spamclass {spam,ham}
@attribute text String

@data
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
ham,'Ok lar... Joking wif u oni...'
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
ham,'U dun say so early hor... U c already then say...'
ham,'Nah I don\'t think he goes to usf, he lives around here though'
spam,'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv'
...
ham,'Hi its Kate how is your evening? I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can. :) xxx'

In the first 200 messages of the collection, 33 are spam and 167 are legitimate ("ham"). This collection can be loaded in the WEKA Explorer, showing something similar to the following window:

The point is that messages are represented as string attributes, so you have to break them into words in order to allow learning algorithms to induce classifiers with rules like:
if ("urgent" in message) then class(message) == spam
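This kind of rule is trivially executable; here is the same hypothetical "urgent" rule as a runnable Python predicate (it is an illustration, not a rule actually learnt from this collection):

```python
def classify(message):
    # A single token-presence rule: flag the message as spam if the
    # word "urgent" appears among its tokens.
    return "spam" if "urgent" in message.lower().split() else "ham"

classify("urgent please call now")  # -> "spam"
```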
Here is where the StringToWordVector filter comes to help. You can just select it by clicking the "Choose" button in the "Filter" area, and browsing the folders to "weka > filters > unsupervised > attribute". Once selected, you should be able to see something like this:

If you click on the name of the filter, you will get a lot of options, which I leave for another post. For my goals in this post, you can just apply the filter with its default options to get an indexed collection of 200 messages and 1,382 indexing tokens (plus the class attribute), shown in the next picture:

If you want to see colors showing the distribution of attributes (tokens) according to the class, you can just select the "spamclass" attribute as the class for the collection in the bottom-left area of the WEKA Explorer. So, you can see that the attribute "Available" occurs in just one message, which happens to be a legitimate (ham) one:

Now, we can run our experiments in the Classify tab. We can just select cross-validation with 3 folds (1), choose the attribute to be used as the class (the "spamclass" one) (2), and select a rule learner like PART in the classifier area (3). You can find that classifier in the "weka > classifiers > rules" folder after clicking the "Choose" button in the "Classifier" area. This setup is shown in the next figure:

The selected evaluation method, cross-validation, instructs WEKA to divide the training collection into 3 sub-collections (folds) and perform three experiments. Each experiment uses two of the folds for training and the remaining one for testing the learnt classifier. The sub-collections are sampled randomly, in such a way that each instance belongs to exactly one of them, and the class distribution (16.5% spam in our subset) is preserved inside each fold (stratified sampling).
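Stratified fold sampling can be sketched in a few lines of Python (WEKA does this internally; this is just to illustrate the mechanics): shuffle each class separately, then deal its instances round-robin so every fold keeps roughly the same class proportions.

```python
import random

def stratified_folds(labels, k=3, seed=1):
    # Returns k lists of instance indices; each instance lands in
    # exactly one fold, and each class is spread evenly across folds.
    random.seed(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        random.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

labels = ["spam"] * 33 + ["ham"] * 167   # our 200-message subset
folds = stratified_folds(labels)
```

With 33 spam messages and 3 folds, each fold gets exactly 11 spam instances, preserving the 16.5% spam proportion.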
So, if we click on the "Start" button, we will get the output of our experiment, featuring the classifier learnt over the full collection and the values of the typical accuracy metrics averaged over the three experiments, along with the confusion matrix. The classifier learnt over the full collection is the following:
PART decision list
------------------

or <= 0 AND
to <= 0 AND
2 <= 0: ham (119.0/3.0)

£1000 <= 0 AND
FREE <= 0 AND
call <= 0 AND
Reply <= 0 AND
i <= 0 AND
all <= 0 AND
final <= 0 AND
50 <= 0 AND
mobile <= 0 AND
ur <= 0 AND
text <= 0: ham (26.0/2.0)

i <= 0 AND
all <= 0: spam (30.0/3.0)

: ham (25.0/1.0)
Number of Rules : 4
This notation can be read as:
if (("or" not in message) and ("to" not in message) and ("2" not in message)) then class(message) == ham
...
otherwise class(message) == ham
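The full decision list can also be transcribed as a Python function over a message's set of tokens (note the tokens are case-sensitive attributes here, and a decision list is evaluated top to bottom, first matching rule wins):

```python
def part_classify(tokens):
    # Rule 1: none of "or", "to", "2" present -> ham
    if not tokens & {"or", "to", "2"}:
        return "ham"
    # Rule 2: none of these spam-flavoured tokens present -> ham
    if not tokens & {"£1000", "FREE", "call", "Reply", "i", "all",
                     "final", "50", "mobile", "ur", "text"}:
        return "ham"
    # Rule 3: neither "i" nor "all" present -> spam
    if not tokens & {"i", "all"}:
        return "spam"
    # Default rule
    return "ham"

part_classify({"win", "FREE", "to"})  # rules 1 and 2 fail, rule 3 fires: "spam"
```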

And the confusion matrix is the next one:
=== Confusion Matrix ===
a b <-- classified as
17 16 | a = spam
12 155 | b = ham

This means the PART learner gets 17+155 = 172 correct classifications and makes 16+12 = 28 mistakes, which leads to an accuracy of 86%.
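The arithmetic, spelled out from the confusion matrix above (correct classifications sit on the diagonal, mistakes off it):

```python
# (actual class, predicted class) -> count, copied from the matrix above
confusion = {("spam", "spam"): 17, ("spam", "ham"): 16,
             ("ham", "spam"): 12, ("ham", "ham"): 155}

correct = confusion[("spam", "spam")] + confusion[("ham", "ham")]   # 172
total = sum(confusion.values())                                     # 200
accuracy = correct / total                                          # 0.86
```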

But we have done it wrong!

Do you remember the "Available" token, which occurs in only one of the messages? In which fold is it? When it is in a training fold, we are using it for training (making the learner try to generalize from a token that does not occur in the test fold). And when it is in the test fold, the learner should not even know about it! Moreover, what happens with attributes that are highly predictive for the full collection (according to their statistics when computing e.g. the Information Gain metric)? They may have worse (or better) statistics when a subset of their occurrences is not seen, as those occurrences can be in the test fold!
The right way to perform a correct text classification experiment with cross-validation in WEKA is to embed the indexing process in the classifier itself, that is, to chain the indexing filter (StringToWordVector) and the learner, so that indexing and training are performed on each training sub-set of the cross-validation run. For this, you have to use the FilteredClassifier class provided by WEKA.
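The essence of this chaining can be sketched in plain Python (this is an illustration of the principle, not WEKA's FilteredClassifier code; tokenization is a simple whitespace split): the vocabulary, playing the role of the filter, is built from the training fold only, and the test fold is projected onto that vocabulary, so unseen tokens never leak into training.

```python
def build_vocabulary(train_messages):
    # The "filter" is fitted on the training fold only.
    return sorted({tok for msg in train_messages for tok in msg.split()})

def vectorize(messages, vocabulary):
    # Tokens absent from the training vocabulary are simply dropped,
    # exactly as an unseen test-fold word would be.
    return [[1 if tok in msg.split() else 0 for tok in vocabulary]
            for msg in messages]

train = ["Free entry to win", "Ok see you later"]
test = ["win a Free prize"]          # "a" and "prize" are unseen tokens
vocab = build_vocabulary(train)
test_vectors = vectorize(test, vocab)
```

Only "win" and "Free" from the test message survive the projection; "a" and "prize" contribute nothing, just as they would in a real deployment where the classifier meets words it was never trained on.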
In fact, this is not that difficult. Let us go back to the original text collection, which features two attributes: the message (as a string) and the class. Then you can go to the Classify tab and choose the FilteredClassifier learner, which is available in the "weka > classifiers > meta" folder and shown in the next picture:

Then you must choose the filter and the classifier you are going to apply to the collection, by clicking on the classifier name in the "Classifier" area. I choose StringToWordVector and PART with their default options:

If we now run our experiment with 3-fold cross-validation and the filtered classifier we have just configured, we get different results:
=== Confusion Matrix ===
a b <-- classified as
13 20 | a = spam
7 160 | b = ham

This gives an accuracy of 86.5%, slightly better than the one obtained with the wrong setup. However, we catch four fewer spam messages, and the True Positive rate on spam drops from 0.515 to 0.394. This setup is more realistic and better mimics what will happen in the real world, where we will find highly relevant but unseen events, and our statistics may change dramatically over time.
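The two True Positive rates quoted here fall straight out of the spam rows of the two confusion matrices (33 spam messages in total):

```python
# Spam recall = correctly caught spam / all spam (33 messages)
tp_rate_wrong = 17 / 33    # wrong setup (indexing before splitting)
tp_rate_right = 13 / 33    # chained setup (FilteredClassifier)
```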
So now we can run our experiment safely, as no unseen events will be used in classification. Moreover, if we apply any Information Theory based filter, such as ranking the attributes according to their Information Gain, the statistics will be correct, as they will be computed on the training set of each cross-validation run.
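For completeness, here is what computing Information Gain for a binary token attribute looks like, using only training-fold counts (the counts below are hypothetical, chosen just to exercise the formula):

```python
from math import log2

def entropy(pos, neg):
    # Shannon entropy of a two-class distribution, in bits.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

def info_gain(spam_with, ham_with, spam_without, ham_without):
    # Class entropy before the split, minus the weighted entropy of the
    # two partitions induced by the token's presence/absence.
    total = spam_with + ham_with + spam_without + ham_without
    before = entropy(spam_with + spam_without, ham_with + ham_without)
    with_frac = (spam_with + ham_with) / total
    after = (with_frac * entropy(spam_with, ham_with)
             + (1 - with_frac) * entropy(spam_without, ham_without))
    return before - after

# e.g. a token seen in 10 spam and 2 ham training messages,
# and absent from 12 spam and 109 ham ones (hypothetical figures):
gain = info_gain(10, 2, 12, 109)
```

If these counts were taken from the full collection instead of the training folds, the ranking would be contaminated by test-fold occurrences, which is exactly the leak the chained setup avoids.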
