This tutorial presents the steps for training one of the classifiers available in WEKA using java. Weka provides implementation of wide range of machine learning based classifiers. A trained classifier can be used for the classification of data in a particular domain which depends on the training set.To train a classifier we need a training set.Once the classifier is trained it can be stored in a file and can be loaded for later use (classifier serialization and deserialization).
In this article, we will provide you a weka tutorial on classsifing a tweet as either positive tweet or negative tweet(Sentiment analysis).The following points explain the steps involved in performing sentiment analysis using weka.
- Download weka 3.7.x and install it.
- In the weka installation folder you will find weka.jar. Add it to the java class path.
- The following section consists of code snippets for training and testing the Naive bayes classifier.
3.1 Reading Input Dataset from CSV FileTweets for training, annotated with their sentiment values are in CSV(Comma Separated Values) file format. Following code reads the input data from the CSV file
3.2 Tweet Preprocessing and Feature extractionOnce input is read from the file, the data (tweets in this case) needs to be preprocessed and then feature extraction is performed. For tweet preprocessing, a modified CMU POS Tagger is used.featureWords is an arrayList that consists of the feature words. If the feature word is present in the tweet, the index corresponding to that feature word is given a value of 1. All other feature words that are not in the tweet have a value 0. There are totally 6800 featute words. This means that the feature vector is a point in which each axis represents the presence or the absence of the feature word corresponding to that axis. Hence Feature vector takes a sparse form (more zeroes than non zero entries). Thus a SparseInstance is used to represent a feature vector.
Once feature extraction is done, training and testing can be performed.
3.3 Training the ClassifierIn this tutorial, NaiveBayes classifier is used. The following code snippet consists of steps involved in training in the NaiveBayes Classifier.
3.4 Testing the ClassifierThe following code snippet consists of steps involved in testing the classifier. In testing, with the training knowledge, the classifier tries to predict the class (sentiment) of the tweet.
Download the project files here.
The downloaded archive is a NetBeans project. It can be opened with NetBeans IDE. If you are not using NetBeans IDE, go to the src directory for the java code. FeatureWordsList.dat is an arraylist of feature words(ArrayList serialized to file). training.csv and testing.csv contains the datasets for training and testing.