2015-01-23
Since naive Bayes has been used successfully for email spam filtering, it seems likely that it could also be applied to SMS spam. However, relative to email spam, SMS spam poses additional challenges for automated filters. SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify whether a message is junk. The limit, combined with small mobile phone keyboards, has led many to adopt a form of SMS shorthand lingo, which further blurs the line between legitimate messages and spam. Let's see how well a simple naive Bayes classifier handles these challenges.
Step 1 – collecting data
To develop the naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. This dataset includes the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham.
Our naive Bayes classifier will take advantage of patterns in word frequency to determine whether an SMS message better fits the profile of spam or ham. While it's not inconceivable that the word "free" would appear outside of a spam SMS, a legitimate message is likely to include additional words that provide context. For instance, a ham message might state "are you free on Sunday?", whereas a spam message might use the phrase "free ringtones." The classifier will compute the probability of spam and ham given the evidence provided by all the words in the message.
Step 2 – exploring and preparing the data
The first step towards constructing our classifier involves processing the raw data for analysis. Text data are challenging to prepare because it is necessary to transform the words and sentences into a form that a computer can understand. We will transform our data into a representation known as bag-of-words, which ignores the order in which words appear and simply provides a variable indicating whether a word appears at all.


  • We'll begin by importing the CSV data using the read.csv() function and saving it to a data frame titled sms_raw:

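A sketch of this step, following the book's first edition; the file name sms_spam.csv is an assumption:

# read the SMS data into a data frame, keeping the message text as character strings
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)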

  • Using the structure function str(), we see that the sms_raw data frame includes 5,559 total SMS messages with two features: type and text. The SMS type has been coded as either ham or spam, and the text variable stores the full raw SMS message text.

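For example:

# confirm the structure: 5,559 observations of the type and text variables
str(sms_raw)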

The type variable is currently a character vector. Since this is a categorical variable, it would be better to convert it to a factor, as shown in the following code:
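A minimal sketch:

# recode type from character vector to factor
sms_raw$type <- factor(sms_raw$type)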

  • Examining the type variable with the str() and table() functions, we see that the variable has now been appropriately recoded as a factor. Additionally, we see that 747 (or about 13 percent) of SMS messages in our data were labeled spam, while the remainder were labeled ham:

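A sketch of the two checks:

str(sms_raw$type)    # Factor w/ 2 levels "ham", "spam"
table(sms_raw$type)  # counts of ham and spam messages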

  • Data preparation – processing text data for analysis

Build a corpus containing the SMS messages in the training data using the following command:
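A sketch using the tm package, which the tutorial relies on for text processing; VectorSource() feeds the messages in the text column to the corpus constructor:

library(tm)
sms_corpus <- Corpus(VectorSource(sms_raw$text))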

  • If we print() the corpus we just created, we will see that it contains documents for each of the 5,559 SMS messages in the training data:

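For example:

print(sms_corpus)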

  • View the first, second, and third SMS messages:

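A sketch using tm's inspect():

# show the contents of the first three documents in the corpus
inspect(sms_corpus[1:3])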

  • Convert all of the SMS messages to lowercase and remove any numbers:

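A sketch using tm_map(); the first edition passed tolower directly, but current tm versions require wrapping base R functions in content_transformer():

corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)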

  • Continue cleaning the corpus by removing stop words and punctuation and stripping the extra whitespace left behind, then tokenize the cleaned messages into a document-term matrix:

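A sketch of the remaining cleanup and tokenization, following the first edition's sequence:

corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())  # drop common filler words
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
sms_dtm <- DocumentTermMatrix(corpus_clean)  # rows = messages, columns = words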

  • Data preparation – creating training and test datasets

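A sketch of the raw data split; the 75/25 division at row 4,169 is the book's choice, and works without resampling because the messages arrive in essentially random order:

sms_raw_train <- sms_raw[1:4169, ]
sms_raw_test  <- sms_raw[4170:5559, ]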

  • Then the document-term matrix:

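Using the same row indices:

sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:5559, ]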

  • And finally, the corpus:

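And again for the cleaned corpus:

sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test  <- corpus_clean[4170:5559]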
  • Compare the proportion of spam in the training and test data frames:

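A sketch using prop.table(); both splits should show roughly 13 percent spam:

prop.table(table(sms_raw_train$type))
prop.table(table(sms_raw_test$type))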

  • Visualizing text data – word clouds

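A first cloud can be drawn straight from the training corpus; the wordcloud package and the min.freq value of 40 (roughly 1 percent of training messages) follow the first edition:

library(wordcloud)
wordcloud(sms_corpus_train, min.freq = 40, random.order = FALSE)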
  • Let's use R's subset() function to take a subset of the sms_raw_train data by SMS type. First, we'll create a subset where type is equal to spam:

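For example:

spam <- subset(sms_raw_train, type == "spam")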
  • Next, we'll do the same thing for the ham subset:
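And likewise:

ham <- subset(sms_raw_train, type == "ham")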

  • We use the max.words parameter to look at the 40 most common words in each of the two sets. The scale parameter allows us to adjust the maximum and minimum font size for words in the cloud. Feel free to adjust these parameters as you see fit. This is illustrated in the following code:

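A sketch; given raw text vectors, wordcloud() applies its default text preparation before plotting:

wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))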

  • Data preparation – creating indicator features for frequent words

The final step in the data preparation process is to transform the sparse matrix into a data structure that can be used to train a naive Bayes classifier.

  • Finding frequent words requires use of the findFreqTerms() function in the tm package. This function takes a document term matrix and returns a character vector containing the words appearing at least a specified number of times. For instance, the following command will display a character vector of the words appearing at least 5 times in the sms_dtm_train matrix:

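For example:

findFreqTerms(sms_dtm_train, 5)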

  • To save this list of frequent terms for use later, we'll use the Dictionary() function:

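A sketch; note that Dictionary() existed in the tm version the book was written against but was removed in later releases, where a plain character vector of terms serves the same purpose:

sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))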

  • A dictionary is a data structure allowing us to specify which words should appear in a document term matrix. To limit our training and test matrixes to only the words in the preceding dictionary, use the following commands:

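A sketch, passing the dictionary as a control option:

sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict))
sms_test  <- DocumentTermMatrix(sms_corpus_test, list(dictionary = sms_dict))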

The training and test data now include roughly 1,200 features, corresponding only to words that appear in at least five messages.
The naive Bayes classifier is typically trained on data with categorical features. This poses a problem, since the cells in the sparse matrix indicate a count of the times a word appears in a message. We should change this to a factor variable that simply indicates yes or no depending on whether the word appears at all. The following code defines a convert_counts() function to convert counts to factors:
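A sketch of the function as the text describes it:

convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)                                   # any count above zero becomes 1
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))  # recode as a No/Yes factor
  return(x)
}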

The commands to convert the training and test matrixes are as follows:
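Applying the function to each column (MARGIN = 2) of both matrices:

sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test  <- apply(sms_test, MARGIN = 2, convert_counts)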

Step 3 – training a model on the data
To build our model on the sms_train matrix, we'll use the following command:
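A sketch; the book uses the naiveBayes() function from the e1071 package:

library(e1071)
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)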

Step 4 – evaluating model performance
The predict() function is used to make the predictions. We will store these in a vector named sms_test_pred:
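For example:

sms_test_pred <- predict(sms_classifier, sms_test)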

To compare the predicted values to the actual values, we'll use the CrossTable() function in the gmodels package, which we have used previously. This time, we'll add some additional parameters to eliminate unnecessary cell proportions, and use the dnn parameter (dimension names) to relabel the rows and columns, as shown in the following code:
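A sketch; suppressing the chi-square and table proportions leaves only the row and column percentages:

library(gmodels)
CrossTable(sms_test_pred, sms_raw_test$type,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))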

Step 5 – improving model performance
  • We'll build a naive Bayes model as before, but this time set laplace = 1:

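A minimal sketch:

sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type, laplace = 1)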

  • Next, we'll make predictions:

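For example:

sms_test_pred2 <- predict(sms_classifier2, sms_test)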

  • Finally, we'll compare the predicted classes to the actual classifications using a cross tabulation:

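A sketch, using the same CrossTable() options as before:

CrossTable(sms_test_pred2, sms_raw_test$type,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c('predicted', 'actual'))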
Reference
  • Lantz, Brett. Machine Learning with R. Packt Publishing, October 25, 2013. Print ISBN-13: 978-1-78216-214-8. Web ISBN-13: 978-1-78216-215-5.





All replies
2015-1-25 15:35:17
Nice... too bad the URL doesn't work.

