Since naive Bayes has been used successfully for email spam filtering, it seems likely that it could also be applied to SMS spam. However, relative to email spam, SMS spam poses additional challenges for automated filters. SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify whether a message is junk. The limit, combined with small mobile phone keyboards, has led many to adopt a form of SMS shorthand lingo, which further blurs the line between legitimate messages and spam. Let's see how well a simple naive Bayes classifier handles these challenges.
Step 1 – collecting data
To develop the naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. This dataset includes the text of SMS messages along with a label indicating whether the message is unwanted. Junk messages are labeled spam, while legitimate messages are labeled ham.
Our naive Bayes classifier will take advantage of patterns in word frequency to determine whether an SMS message seems to better fit the profile of spam or ham. While it's not inconceivable that the word "free" would appear outside of a spam SMS, a legitimate message is likely to provide additional words supplying context. For instance, a ham message might state "are you free on Sunday?", whereas a spam message might use the phrase "free ringtones." The classifier will compute the probability of spam and ham given the evidence provided by all the words in the message.
Step 2 – exploring and preparing the data
The first step towards constructing our classifier involves processing the raw data for analysis. Text data are challenging to prepare because the words and sentences must be transformed into a form that a computer can understand. We will transform our data into a representation known as bag-of-words, which ignores the order in which words appear and simply provides a variable indicating whether each word appears at all.
- We'll begin by importing the CSV data using the read.csv() function and saving it to a data frame titled sms_raw:
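A minimal sketch of this step; the sms_spam.csv filename is an assumption, and stringsAsFactors = FALSE keeps the message text as character data:

```r
# read the CSV file; the filename is assumed for illustration
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
```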
 
- Using the structure function str(), we see that the sms_raw data frame includes 5,559 total SMS messages with two features: type and text. The SMS type has been coded as either ham or spam, and the text variable stores the full raw SMS message text.
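For example:

```r
# display the structure of the sms_raw data frame
str(sms_raw)
```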
 
- The type variable is currently a character vector. Since this is a categorical variable, it would be better to convert it to a factor, as shown in the following code:
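A one-line conversion:

```r
# recode type as a factor (categorical variable)
sms_raw$type <- factor(sms_raw$type)
```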
- Examining the type variable with the str() and table() functions, we see that the variable has now been appropriately recoded as a factor. Additionally, we see that 747 (or about 13 percent) of SMS messages in our data were labeled spam, while the remainder were labeled ham:
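Along these lines:

```r
str(sms_raw$type)
# 747 of the 5,559 messages (about 13 percent) should be labeled spam
table(sms_raw$type)
```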
 
Data preparation – processing text data for analysis
 
- Build a corpus containing the SMS messages using the following command:
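A sketch using the tm package's Corpus() and VectorSource() functions; the sms_corpus name is illustrative:

```r
library(tm)  # text mining package
# build a corpus from the character vector of message text
sms_corpus <- Corpus(VectorSource(sms_raw$text))
```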
- If we print() the corpus we just created, we will see that it contains documents for each of the 5,559 SMS messages:
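That is:

```r
print(sms_corpus)
```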
 
- View the first, second, and third SMS messages:
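One way to do this is with tm's inspect(); note that newer versions of tm may require as.character() to see the raw text:

```r
inspect(sms_corpus[1:3])
# in recent tm versions, view the raw text with:
# lapply(sms_corpus[1:3], as.character)
```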
 
- Convert all of the SMS messages to lowercase and remove any numbers:
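A sketch using tm_map(); the corpus_clean name is illustrative, and content_transformer() is needed to wrap base R functions in newer tm versions:

```r
# standardize case, then strip digits
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
```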
 
Data preparation – creating training and test datasets
 
- We'll split the raw data frame into training and test portions, then divide the document-term matrix and the corpus the same way:
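A sketch of the split, assuming a document-term matrix built from the cleaned corpus and a roughly 75/25 division (4,169 training messages, 1,390 test messages):

```r
# create the sparse document-term matrix from the cleaned corpus
sms_dtm <- DocumentTermMatrix(corpus_clean)

# split the raw data frame
sms_raw_train <- sms_raw[1:4169, ]
sms_raw_test  <- sms_raw[4170:5559, ]

# then the document-term matrix
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:5559, ]

# and finally the corpus
sms_corpus_train <- corpus_clean[1:4169]
sms_corpus_test  <- corpus_clean[4170:5559]
```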
 
- Compare the proportion of spam in the training and test data frames:
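For example, with prop.table():

```r
# both should contain roughly 13 percent spam if the split is representative
prop.table(table(sms_raw_train$type))
prop.table(table(sms_raw_test$type))
```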
 
Visualizing text data – word clouds
 
- Let's use R's subset() function to take a subset of the sms_raw_train data by SMS type. First, we'll create a subset where type is equal to spam:
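That is:

```r
spam <- subset(sms_raw_train, type == "spam")
```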
 
- Next, we'll do the same thing for the ham subset:
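```r
ham <- subset(sms_raw_train, type == "ham")
```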
 
- We use the max.words parameter to look at the 40 most common words in each of the two sets. The scale parameter allows us to adjust the maximum and minimum font size for words in the cloud. Feel free to adjust these parameters as you see fit. This is illustrated in the following code:
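A sketch using the wordcloud package; the scale values shown are illustrative:

```r
library(wordcloud)
# 40 most common words in each subset, with font sizes between 3 and 0.5
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text,  max.words = 40, scale = c(3, 0.5))
```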
 
Data preparation – creating indicator features for frequent words

The final step in the data preparation process is to transform the sparse matrix into a data structure that can be used to train a naive Bayes classifier.
 
- Finding frequent words requires the findFreqTerms() function in the tm package. This function takes a document-term matrix and returns a character vector containing the words that appear at least a specified number of times. For instance, the following command will display a character vector of the words appearing at least 5 times in the sms_dtm_train matrix:
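That command would be:

```r
findFreqTerms(sms_dtm_train, 5)
```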
 
- To save this list of frequent terms for use later, we'll use the Dictionary() function:
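A sketch; the sms_dict name is illustrative. Note that Dictionary() was removed from later versions of tm, where the plain character vector serves the same purpose:

```r
sms_dict <- Dictionary(findFreqTerms(sms_dtm_train, 5))
# in recent tm versions, simply use:
# sms_dict <- findFreqTerms(sms_dtm_train, 5)
```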
 
- A dictionary is a data structure allowing us to specify which words should appear in a document-term matrix. To limit our training and test matrices to only the words in the preceding dictionary, use the following commands:
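Along these lines:

```r
sms_train <- DocumentTermMatrix(sms_corpus_train, list(dictionary = sms_dict))
sms_test  <- DocumentTermMatrix(sms_corpus_test,  list(dictionary = sms_dict))
```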
 
The training and test data now include roughly 1,200 features, corresponding only to words appearing in at least five messages.
The naive Bayes classifier is typically trained on data with categorical features. This poses a problem, since the cells in the sparse matrix indicate a count of the times a word appears in a message. We should change this to a factor variable that simply indicates yes or no depending on whether the word appears at all. The following code defines a convert_counts() function to convert counts to factors:
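One way to write convert_counts():

```r
# convert word counts to a No/Yes factor
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
```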
We then apply convert_counts() to the columns of the training and test matrices as follows:
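```r
# MARGIN = 2 applies convert_counts() to each column (one column per word)
sms_train <- apply(sms_train, MARGIN = 2, convert_counts)
sms_test  <- apply(sms_test,  MARGIN = 2, convert_counts)
```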
Step 3 – training a model on the data
To build our model on the sms_train matrix, we'll use the following command:
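A sketch using naiveBayes() from the e1071 package, following the referenced book; other naive Bayes implementations exist in R:

```r
library(e1071)
# learn word-level conditional probabilities from the training data
sms_classifier <- naiveBayes(sms_train, sms_raw_train$type)
```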
Step 4 – evaluating model performance
The predict() function is used to make the predictions. We will store these in a vector named sms_test_pred:
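That is:

```r
sms_test_pred <- predict(sms_classifier, sms_test)
```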
To compare the predicted values to the actual values, we'll use the CrossTable() function in the gmodels package, which we have used previously. This time, we'll add some additional parameters to eliminate unnecessary cell proportions, and use the dnn parameter (dimension names) to relabel the rows and columns, as shown in the following code:
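For instance; setting prop.chisq and prop.t to FALSE suppresses the chi-square contributions and table proportions:

```r
library(gmodels)
CrossTable(sms_test_pred, sms_raw_test$type,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c("predicted", "actual"))
```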
Step 5 – improving model performance
- We'll build a naive Bayes model as before, but this time set laplace = 1:
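That is (sms_classifier2 is an illustrative name):

```r
# laplace = 1 prevents zero-count words from vetoing a classification
sms_classifier2 <- naiveBayes(sms_train, sms_raw_train$type, laplace = 1)
```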
 
- Next, we'll make predictions:
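```r
sms_test_pred2 <- predict(sms_classifier2, sms_test)
```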
 
- Finally, we'll compare the predicted classes to the actual classifications using a cross tabulation:
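```r
CrossTable(sms_test_pred2, sms_raw_test$type,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c("predicted", "actual"))
```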
 
Reference

- Lantz, Brett. Machine Learning with R. Packt Publishing, October 25, 2013. Print ISBN-13: 978-1-78216-214-8. Web ISBN-13: 978-1-78216-215-5.