Train a Weka Classifier in Java

2014-12-10

This tutorial presents the steps for training one of the classifiers available in Weka using Java. Weka provides implementations of a wide range of machine-learning classifiers. A trained classifier can be used to classify data in a particular domain, and that domain is determined by the training set. To train a classifier, we need a training set. Once the classifier is trained, it can be stored in a file and loaded for later use (classifier serialization and deserialization).

In this article, we walk through classifying a tweet as either positive or negative (sentiment analysis) with Weka. The following points explain the steps involved.

1. Download Weka 3.7.x and install it.
2. In the Weka installation folder you will find weka.jar. Add it to the Java classpath.
3. The following sections contain code snippets for training and testing the Naive Bayes classifier.

3.1 Reading Input Dataset from CSV File

The training tweets, annotated with their sentiment values, are stored in CSV (comma-separated values) format. The following code reads the input data from the CSV file.
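
A minimal sketch of this step, using only the Java standard library. The column layout is an assumption: each line is taken to hold the tweet text followed by the sentiment label in the last column, with no header row. The class and method names (TweetCsvReader, LabeledTweet, parseLine) are hypothetical and are not taken from the project files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** CSV reader sketch; assumes "tweet text,label" per line (label in the last column). */
class TweetCsvReader {

    /** One annotated tweet: the raw text plus its sentiment label. */
    static final class LabeledTweet {
        final String text;
        final String sentiment;

        LabeledTweet(String text, String sentiment) {
            this.text = text;
            this.sentiment = sentiment;
        }
    }

    /** Splits on the LAST comma so that commas inside the tweet text survive. */
    static LabeledTweet parseLine(String line) {
        int cut = line.lastIndexOf(',');
        if (cut < 0) {
            throw new IllegalArgumentException("No label column in: " + line);
        }
        String text = stripQuotes(line.substring(0, cut).trim());
        String sentiment = stripQuotes(line.substring(cut + 1).trim());
        return new LabeledTweet(text, sentiment);
    }

    /** Removes one pair of surrounding double quotes, if present. */
    static String stripQuotes(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    /** Reads every non-empty line of the file into a LabeledTweet. */
    static List<LabeledTweet> read(String path) throws IOException {
        List<LabeledTweet> tweets = new ArrayList<>();
        try (BufferedReader in =
                Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    tweets.add(parseLine(line));
                }
            }
        }
        return tweets;
    }

    public static void main(String[] args) throws IOException {
        for (LabeledTweet tweet : read("training.csv")) {
            System.out.println(tweet.sentiment + "\t" + tweet.text);
        }
    }
}
```

If the real files put the label first or contain a header row, parseLine and read would need to be adjusted accordingly.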


3.2 Tweet Preprocessing and Feature Extraction

Once the input has been read from the file, the data (tweets in this case) must be preprocessed before feature extraction is performed. For tweet preprocessing, a modified CMU POS Tagger is used. featureWords is an ArrayList that holds the feature words. If a feature word is present in the tweet, the index corresponding to that feature word is given the value 1; every feature word that is not in the tweet has the value 0. There are 6800 feature words in total. The feature vector is therefore a point in a 6800-dimensional space in which each axis represents the presence or absence of the corresponding feature word. Because far more entries are zero than non-zero, the feature vector is sparse, so a SparseInstance is used to represent it.
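
The presence/absence encoding described above can be sketched in plain Java as follows. This is only an illustration: the tokenizer is a naive stand-in for the modified CMU POS Tagger, the four feature words are made up, and the class name PresenceFeatures is hypothetical. The feature words are assumed to be lower-case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Computes the sparse form of a binary feature vector: the indices whose value is 1. */
class PresenceFeatures {

    /**
     * Returns, in ascending order, the indices of the feature words that occur
     * in the tweet. Every index that is not returned has an implicit value of 0.
     */
    static int[] presentIndices(String tweet, List<String> featureWords) {
        // Naive tokenizer (assumption): lower-case, split on anything but letters and apostrophes.
        Set<String> tokens =
                new HashSet<>(Arrays.asList(tweet.toLowerCase().split("[^a-z']+")));

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < featureWords.size(); i++) {
            if (tokens.contains(featureWords.get(i))) {
                hits.add(i);
            }
        }

        int[] indices = new int[hits.size()];
        for (int k = 0; k < indices.length; k++) {
            indices[k] = hits.get(k);
        }
        return indices;
    }

    public static void main(String[] args) {
        List<String> featureWords = Arrays.asList("good", "bad", "love", "hate");
        int[] indices = presentIndices("I love this phone, so good!", featureWords);
        System.out.println(Arrays.toString(indices)); // prints [0, 2]
    }
}
```

With 6800 feature words and only a handful of hits per tweet, storing just these indices is far cheaper than storing 6800 mostly-zero values. In Weka, the indices together with an equally long array of 1.0 values can be passed to the SparseInstance(double weight, double[] attValues, int[] indices, int maxNumValues) constructor.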


Once feature extraction is done, training and testing can be performed.

3.3 Training the Classifier

In this tutorial, the NaiveBayes classifier is used. The following code snippet shows the steps involved in training the NaiveBayes classifier.
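
A minimal training sketch written against the Weka 3.7 API, with weka.jar on the classpath. The four feature words, the toy training rows, the attribute name "sentimentClass", the model file name and the class name TrainSentimentSketch are all assumptions made for illustration. The last step also stores the trained model in a file, as mentioned in the introduction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.SparseInstance;

class TrainSentimentSketch {

    /** Empty dataset: one numeric 0/1 attribute per feature word plus a nominal class. */
    static Instances buildHeader(List<String> featureWords) {
        ArrayList<Attribute> attributes = new ArrayList<>();
        for (String word : featureWords) {
            attributes.add(new Attribute(word)); // numeric attribute
        }
        // Hypothetical class attribute; its name must differ from every feature word.
        attributes.add(new Attribute("sentimentClass", Arrays.asList("negative", "positive")));

        Instances header = new Instances("tweets", attributes, 0);
        header.setClassIndex(header.numAttributes() - 1);
        return header;
    }

    /** Turns the indices of the present feature words and a label into a SparseInstance. */
    static Instance toInstance(Instances header, int[] presentIndices, String label) {
        int labelIndex = header.classAttribute().indexOfValue(label);
        if (labelIndex < 0) {
            throw new IllegalArgumentException("Unknown label: " + label);
        }
        double[] values = new double[header.numAttributes()]; // all zeros
        for (int index : presentIndices) {
            values[index] = 1.0;
        }
        values[header.classIndex()] = labelIndex;

        // SparseInstance keeps only the non-zero entries internally.
        Instance instance = new SparseInstance(1.0, values);
        instance.setDataset(header);
        return instance;
    }

    /** Trains a NaiveBayes model on the given data. */
    static NaiveBayes train(Instances trainingData) throws Exception {
        NaiveBayes classifier = new NaiveBayes();
        classifier.buildClassifier(trainingData);
        return classifier;
    }

    public static void main(String[] args) throws Exception {
        // Toy data (assumption): indices refer to positions in featureWords.
        List<String> featureWords = Arrays.asList("good", "bad", "love", "hate");
        Instances data = buildHeader(featureWords);
        data.add(toInstance(data, new int[] {0, 2}, "positive"));
        data.add(toInstance(data, new int[] {2}, "positive"));
        data.add(toInstance(data, new int[] {1, 3}, "negative"));
        data.add(toInstance(data, new int[] {1}, "negative"));

        NaiveBayes classifier = train(data);

        // Store the trained model for later use (file name is an assumption).
        SerializationHelper.write("naivebayes.model", classifier);
        System.out.println(classifier);
    }
}
```

For the real dataset, buildHeader would receive the full list of feature words and one instance would be added per training tweet.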


3.4 Testing the Classifier

The following code snippet shows the steps involved in testing the classifier. During testing, the classifier uses what it learned in training to predict the class (sentiment) of each tweet.
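
A minimal testing sketch against the Weka 3.7 API. To stay self-contained it trains on a few toy rows first; the three feature words, the rows, the attribute name "sentimentClass" and the class name TestSentimentSketch are assumptions made for illustration. It predicts a label for each test instance and then summarizes the results with Weka's Evaluation class.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SparseInstance;

class TestSentimentSketch {

    /** Builds a dataset over three hypothetical feature words from rows of {good, bad, love, class}. */
    static Instances dataset(double[][] rows) {
        ArrayList<Attribute> attributes = new ArrayList<>();
        for (String word : Arrays.asList("good", "bad", "love")) {
            attributes.add(new Attribute(word));
        }
        // Class value 0 = "negative", 1 = "positive".
        attributes.add(new Attribute("sentimentClass", Arrays.asList("negative", "positive")));

        Instances data = new Instances("tweets", attributes, rows.length);
        data.setClassIndex(data.numAttributes() - 1);
        for (double[] row : rows) {
            data.add(new SparseInstance(1.0, row));
        }
        return data;
    }

    /** Returns the predicted label for one instance that belongs to a dataset. */
    static String predict(Classifier classifier, Instance instance) throws Exception {
        double predicted = classifier.classifyInstance(instance);
        return instance.classAttribute().value((int) predicted);
    }

    public static void main(String[] args) throws Exception {
        Instances train = dataset(new double[][] {
            {1, 0, 1, 1}, {0, 0, 1, 1}, {1, 0, 0, 1},
            {0, 1, 0, 0}, {0, 1, 0, 0}, {1, 1, 0, 0}
        });
        Instances test = dataset(new double[][] {
            {1, 0, 0, 1}, {0, 1, 0, 0}
        });

        // A previously stored model could be loaded instead, for example with
        // (Classifier) weka.core.SerializationHelper.read("naivebayes.model").
        Classifier classifier = new NaiveBayes();
        classifier.buildClassifier(train);

        // Predict the sentiment of every test instance.
        for (int i = 0; i < test.numInstances(); i++) {
            Instance instance = test.instance(i);
            String actual = instance.classAttribute().value((int) instance.classValue());
            System.out.println("actual=" + actual + " predicted=" + predict(classifier, instance));
        }

        // Aggregate statistics (accuracy, error rates) over the test set.
        Evaluation evaluation = new Evaluation(train);
        evaluation.evaluateModel(classifier, test);
        System.out.println(evaluation.toSummaryString());
    }
}
```

The test set must use the same attributes, in the same order, as the training set; building both through the same helper keeps them compatible.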


Download the project files here.

The downloaded archive is a NetBeans project and can be opened with the NetBeans IDE. If you are not using NetBeans, go to the src directory for the Java code. FeatureWordsList.dat holds the feature words (an ArrayList serialized to a file). training.csv and testing.csv contain the datasets for training and testing.
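
A sketch of how a serialized list such as FeatureWordsList.dat could be loaded, assuming it was written with standard Java object serialization and holds an ArrayList of String. The class name FeatureWordsFile is hypothetical, and the in-memory round trip is only there so the example runs without the project files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

class FeatureWordsFile {

    /** Deserializes an ArrayList<String> from the stream (element type is an assumption). */
    @SuppressWarnings("unchecked")
    static ArrayList<String> read(InputStream in) {
        try (ObjectInputStream objects = new ObjectInputStream(in)) {
            return (ArrayList<String>) objects.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Unexpected class in stream", e);
        }
    }

    /** Serializes the list to bytes; used here only to demonstrate a round trip. */
    static byte[] toBytes(ArrayList<String> words) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream objects = new ObjectOutputStream(buffer)) {
            objects.writeObject(words);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // For example: java FeatureWordsFile FeatureWordsList.dat
            ArrayList<String> featureWords = read(new FileInputStream(args[0]));
            System.out.println(featureWords.size() + " feature words loaded");
        } else {
            // In-memory round trip with made-up words.
            ArrayList<String> sample = new ArrayList<>(Arrays.asList("good", "bad"));
            System.out.println(read(new ByteArrayInputStream(toBytes(sample)))); // prints [good, bad]
        }
    }
}
```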
