Data Analysis
Previous models in my blog posts have been based on a relatively simple representation of texts as sequences of words. However, even a quick analysis of the problem suggests that multi-word expressions (e.g. "very bad" vs. "bad", or "a must" vs. "I must") can lead to better predictors of user sentiment or opinion about an item. Because of this, we will compare word n-grams vs. single words (or unigrams). As a basic setup, I propose to compare word unigrams, 3-grams, and 1-to-3-grams. The latter representation includes everything from unigrams to 3-grams, with the hope of getting the best of all of them.
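To make the three representations concrete, here is a minimal Java sketch (assuming the weka.jar of a standard WEKA 3.x distribution is on the classpath; the sample sentence is invented for illustration) that prints the 1-to-3-grams produced by weka.core.tokenizers.NGramTokenizer for a toy review fragment:

import weka.core.tokenizers.NGramTokenizer;

public class NGramDemo {
    public static void main(String[] args) throws Exception {
        String text = "This camera is very bad, a must to avoid";

        // 1-to-3-grams: every contiguous word sequence of length 1, 2 or 3
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setDelimiters("\\W");   // split on any non-word character
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);

        tokenizer.tokenize(text);
        while (tokenizer.hasMoreElements()) {
            System.out.println(tokenizer.nextElement());
        }
        // Among the output we get both "bad" and "very bad" as separate
        // features, which is exactly the distinction we want to exploit.
    }
}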
Keeping in mind that capitalization may matter in this problem ("BAD" is worse than "bad"), and that we can rely on standard punctuation (for each of the languages) because the texts are long comments (several paragraphs each), I derive the following calls to the weka.filters.unsupervised.attribute.StringToWordVector class:
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 1" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.uni.arff 
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 3 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.tri.arff 
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.unitri.arff
We follow the notation vector.uni to denote that the dataset has been vectorized and that we are using word unigrams, and so on. The calls for the Spanish collection are analogous.
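The same vectorization can also be done programmatically. The sketch below is a rough Java equivalent of the first call above (unigrams), assuming WEKA 3.x on the classpath; the file names are the ones used in the commands:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class VectorizeReviews {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("SFU_Review_Corpus.arff");

        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setDelimiters("\\W");
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(1);   // change to 3 / 1-3 for the other datasets

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setTokenizer(tokenizer);
        s2wv.setWordsToKeep(10000000);              // -W
        s2wv.setDoNotOperateOnPerClassBasis(true);  // -O
        s2wv.setInputFormat(raw);

        Instances vectorized = Filter.useFilter(raw, s2wv);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(vectorized);
        saver.setFile(new File("SFU_Review_Corpus.vector.uni.arff"));
        saver.writeBatch();
    }
}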
The most important thing in these calls is that we are no longer using the weka.core.tokenizers.WordTokenizer class. Instead, we are using weka.core.tokenizers.NGramTokenizer, which uses the options -min and -max to set the minimum and maximum size of the n-grams. There is also a major difference between both classes regarding the usage of delimiters:
- The weka.core.tokenizers.WordTokenizer class uses the legacy Java class java.util.StringTokenizer (whose use is discouraged in new code), even in the latest versions of the WEKA package (as of this writing). In StringTokenizer, the delimiters are the characters used as "spaces" to tokenize the input string: white space, punctuation marks, etc. So you have to explicitly enumerate which characters will act as "spaces" in your text.
- The weka.core.tokenizers.NGramTokenizer class uses the recommended Java String method String[] split(String regex), in which the argument (and thus the delimiters string) is a Java regular expression (regex). The text is split into tokens separated by substrings that match the regex, so you can use all the power of regexes, including character classes. In this case I am using \W, which matches any non-word character, in order to keep only alphanumeric character sequences. The small sketch after this list illustrates the difference.
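Here is a tiny plain-Java sketch (no WEKA classes involved, the sentence is invented) contrasting both tokenization styles:

import java.util.StringTokenizer;

public class DelimiterDemo {
    public static void main(String[] args) {
        String text = "It was BAD, really bad!";

        // WordTokenizer style: delimiters are literal "space" characters,
        // so every punctuation mark must be listed explicitly
        StringTokenizer st = new StringTokenizer(text, " \t\n.,;:!?'\"()");
        while (st.hasMoreTokens()) {
            System.out.print("[" + st.nextToken() + "] ");
        }
        System.out.println();
        // -> [It] [was] [BAD] [really] [bad]

        // NGramTokenizer style: the delimiter string is a regex; \W+ covers
        // all non-word characters in one compact expression
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                System.out.print("[" + token + "] ");
            }
        }
        System.out.println();
        // -> [It] [was] [BAD] [really] [bad]
    }
}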
 
After splitting the text into word n-grams (or, more properly, after representing the texts as term-weight vectors in our Vector Space Model), we may want to examine which n-grams are most predictive. As in the Language Identification post, we make use of the weka.filters.supervised.attribute.AttributeSelection class:
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.uni.arff -o SFU_Review_Corpus.vector.uni.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.tri.arff -o SFU_Review_Corpus.vector.tri.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.unitri.arff -o SFU_Review_Corpus.vector.unitri.ig0.arff
After the selection of the most predictive n-grams, we get the following statistics in the test collections:

The percentages in rows 3-6-9 measure the aggressiveness of the feature selection. Overall, both collections have comparable statistics (in the same order of magnitude). The numbers of original unigrams are quite similar, but there are fewer bigrams and trigrams in Spanish (despite it having more isolated words, i.e. unigrams). Selecting n-grams with Information Gain is a bit more aggressive in Spanish for unigrams and, possibly, bigrams, but less so for trigrams.
Adding bigrams and trigrams to the representation substantially increases the number of predictive features (by a factor of 4 to 5). However, trigrams on their own contribute only a small increment of features, so bigrams must be playing a role here. The number of features remains quite manageable, which allows us to run quick experiments.
As discussed in my previous post on setting up experiments with WEKA text classifiers and how to chain filters and classifiers, note that these are not the final features if we configure a cross-validation experiment -- we have to chain the filters (StringToWordVector and AttributeSelection) and the classifier in order to perform a valid experiment, as the features should be recomputed for each fold.
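As a reminder of how that chaining looks in practice, here is a sketch that wraps both filters inside a FilteredClassifier, so that the vocabulary and the selected n-grams are re-computed on the training portion of every fold; the classifier used here (NaiveBayesMultinomial) is just a placeholder, any WEKA classifier can be plugged in:

import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CrossValidatedExperiment {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("SFU_Review_Corpus.arff");
        raw.setClassIndex(0);   // the class is the first attribute

        // Filter 1: vectorize with 1-to-3-grams
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setDelimiters("\\W");
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setTokenizer(tokenizer);
        s2wv.setWordsToKeep(10000000);
        s2wv.setDoNotOperateOnPerClassBasis(true);

        // Filter 2: keep only n-grams with positive Information Gain
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0);
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(ranker);

        // Chain both filters and the classifier, so the features are
        // derived from the training data of each fold only
        MultiFilter chain = new MultiFilter();
        chain.setFilters(new Filter[] { s2wv, selector });
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(chain);
        fc.setClassifier(new NaiveBayesMultinomial());

        Evaluation eval = new Evaluation(raw);
        eval.crossValidateModel(fc, raw, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}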