Spam Classification Practice: Using Bag of Words¶
Import a dataset of spam (and not-spam, a.k.a. ham) messages, and use Multinomial Naive Bayes to classify a new message.
Import Libraries¶
In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')
Import File and Convert Labels¶
In [2]:
df = pd.read_csv('SpamCollection.csv')
In [3]:
df.head()
Out[3]:
| | label | message |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
In [4]:
# Encode the labels as integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
In [5]:
df.head()
Out[5]:
| | label | message |
---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only ... |
1 | 0 | Ok lar... Joking wif u oni... |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | 0 | U dun say so early hor... U c already then say... |
4 | 0 | Nah I don't think he goes to usf, he lives aro... |
Train, Test, Split the Data¶
In [6]:
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)
Define and Create BoW Vectoriser¶
In [7]:
vectorizer = CountVectorizer()
In [8]:
X_train_counts = vectorizer.fit_transform(X_train)
In [9]:
X_test_counts = vectorizer.transform(X_test)
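To make the fit/transform split above concrete, here is a minimal pure-Python sketch of a bag-of-words count. It is not how CountVectorizer is implemented: the whitespace tokenizer, the helper names `fit_vocabulary` and `transform_counts`, and the toy messages are all assumptions made for illustration. What it shows is that the vocabulary is learned from the training messages only, so words that appear only at test time contribute nothing to the counts.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a sorted word -> column index mapping from the training docs."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform_counts(docs, vocab):
    """Turn each doc into a row of word counts, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # Words missing from the vocabulary are silently ignored.
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

# Hypothetical toy messages, not taken from the dataset.
train_docs = ["free prize now", "call me now", "free free call"]
test_docs = ["win free cash"]

vocab = fit_vocabulary(train_docs)            # learned from training data only
train_rows = transform_counts(train_docs, vocab)
test_rows = transform_counts(test_docs, vocab)  # "win" and "cash" are unseen

print(list(vocab))
print(train_rows)
print(test_rows)
```

Fitting on the training split and only transforming the test split, as the notebook does, keeps test-set vocabulary from leaking into the model.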
Create Multinomial Naive Bayes Classifier¶
Bayes' theorem takes us from 'given that a message is spam, how probable are its features?' to 'given a message's features, how probable is it that the message is spam?'
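The reversal described above can be worked through by hand on a tiny example. The sketch below is an illustration, not the scikit-learn implementation: the four tokenized messages and the helper names are made up, and add-one (Laplace) smoothing is assumed, which matches the default `alpha=1.0` of `MultinomialNB`. It estimates P(word | class) from word counts, multiplies by the class prior, and normalises to get P(spam | words).

```python
import math
from collections import Counter

# Hypothetical, hand-made training messages (already tokenized).
spam_docs = [["free", "prize", "now"], ["free", "cash"]]
ham_docs = [["call", "me", "now"], ["see", "you", "later"]]

vocab = {w for doc in spam_docs + ham_docs for w in doc}

def word_counts(docs):
    """Return per-word counts and the total number of words for one class."""
    counts = Counter(w for doc in docs for w in doc)
    return counts, sum(counts.values())

def log_likelihood(words, counts, total, alpha=1.0):
    """log P(words | class) with add-alpha smoothing over the vocabulary."""
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
        for w in words
    )

def posterior_spam(words):
    """P(spam | words) via Bayes' theorem, computed in log space."""
    n_docs = len(spam_docs) + len(ham_docs)
    log_spam = math.log(len(spam_docs) / n_docs) + log_likelihood(words, *word_counts(spam_docs))
    log_ham = math.log(len(ham_docs) / n_docs) + log_likelihood(words, *word_counts(ham_docs))
    # Normalise the two joint scores so they sum to 1.
    top = max(log_spam, log_ham)
    e_spam, e_ham = math.exp(log_spam - top), math.exp(log_ham - top)
    return e_spam / (e_spam + e_ham)

p_spam = posterior_spam(["free", "now"])
print(round(p_spam, 3))
```

Working in log space avoids multiplying many small probabilities together, which matters once messages contain more than a handful of words.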
In [10]:
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)
Out[10]:
MultinomialNB()
In [11]:
predicted = classifier.predict(X_test_counts)
Accuracy/Classification Report¶
In [12]:
accuracy = accuracy_score(y_test, predicted)
print("Accuracy:", accuracy)
Accuracy: 0.9919282511210762
In [13]:
print(classification_report(y_test, predicted))
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       966
           1       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
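The spam row of the report can be reproduced from four counts. The counts below are not notebook output; they are inferred from the rounded figures above, as an assumption: an accuracy of 0.9919 on 1115 messages means 9 errors, and a spam precision of 1.00 with a recall of 0.94 on 149 spam messages is consistent with 140 spam caught, 9 missed, and no ham flagged as spam.

```python
# Counts inferred from the rounded report (an assumption, not notebook output).
tp = 140  # spam correctly predicted as spam
fp = 0    # ham wrongly predicted as spam
fn = 9    # spam wrongly predicted as ham
tn = 966  # ham correctly predicted as ham

precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)      # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("precision:", round(precision, 2))
print("recall:   ", round(recall, 2))
print("f1-score: ", round(f1, 2))
print("accuracy: ", accuracy)
```

Read this way, every error on this split is a spam message that slipped through as ham, which is usually the less costly mistake for a spam filter.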
Predict Classification of New Message¶
In [14]:
new_message = ["Congratulations! You've won a free vacation."]
In [15]:
new_message_counts = vectorizer.transform(new_message)
prediction = classifier.predict(new_message_counts)
print("Predicted classification for the new message:", 'Not Spam' if prediction[0] == 0 else 'Spam')
Predicted classification for the new message: Spam
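Classifying a new message always needs the same two steps, transform then predict. One way to keep them together is scikit-learn's `make_pipeline`, sketched below. The four training messages and the `classify` helper are made up so the example is self-contained; with the notebook's data you would fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data, only to keep the example self-contained.
texts = ["win free prize now", "free cash win", "see you at lunch", "call me later"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The pipeline vectorizes raw text and feeds the counts to the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

def classify(message):
    """Return a readable label for one raw message string."""
    label = model.predict([message])[0]
    return 'Spam' if label == 1 else 'Not Spam'

print(classify("free prize"))
print(classify("see you later"))
# Class probabilities, in the order given by model.classes_
print(model.predict_proba(["free prize"]))
```

Because the vectorizer lives inside the pipeline, there is no risk of predicting with counts produced by a different vocabulary than the one the classifier was trained on.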
Redo Classification Model, Using Text Preprocessing First¶
In [16]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to /Users/jimhardy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jimhardy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[16]:
True
Text Preprocessing First¶
In [17]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Lowercase, tokenize, drop punctuation and stop words, then stem
    text = text.lower()
    tokens = word_tokenize(text)
    processed_tokens = [
        stemmer.stem(token)
        for token in tokens
        if token not in string.punctuation and token not in stop_words
    ]
    return ' '.join(processed_tokens)
df['processed_message'] = df['message'].apply(preprocess_text)
View Dataset (with processed_message column)¶
In [18]:
df.head()
Out[18]:
| | label | message | processed_message |
---|---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazi .. avail bugi n great wo... |
1 | 0 | Ok lar... Joking wif u oni... | ok lar ... joke wif u oni ... |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entri 2 wkli comp win fa cup final tkt 21... |
3 | 0 | U dun say so early hor... U c already then say... | u dun say earli hor ... u c alreadi say ... |
4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah n't think goe usf live around though |
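The processed_message column still contains tokens such as `..` and `...`. A likely reason is that `token not in string.punctuation` is a substring test against one long string, so it only removes tokens that happen to appear inside it. The sketch below uses a hypothetical helper, `is_punctuation`, that checks every character instead; the token list is hand-written to mirror the rows above.

```python
import string

def is_punctuation(token):
    """True when every character of the token is a punctuation character."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Hand-written tokens mirroring the processed rows above.
tokens = ["go", "crazi", "..", "ok", "...", ",", "n't"]

# The check used in preprocess_text: a substring test on string.punctuation.
substring_check = [t for t in tokens if t not in string.punctuation]

# Per-character check: also drops multi-character runs like ".." and "...".
per_char_check = [t for t in tokens if not is_punctuation(t)]

print(substring_check)
print(per_char_check)
```

In this notebook the leftover tokens probably do not affect the model, because CountVectorizer's default token pattern only keeps runs of two or more word characters, but the stricter check gives a cleaner column to inspect.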
Train, Test, Split the Data (Using The processed_message Column)¶
In [19]:
X_train, X_test, y_train, y_test = train_test_split(df['processed_message'], df['label'], test_size=0.2, random_state=42)
In [20]:
vectorizer = CountVectorizer()
In [21]:
X_train_counts = vectorizer.fit_transform(X_train)
In [22]:
X_test_counts = vectorizer.transform(X_test)
In [23]:
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)
Out[23]:
MultinomialNB()
In [24]:
predicted = classifier.predict(X_test_counts)
In [25]:
accuracy = accuracy_score(y_test, predicted)
print("Accuracy:", accuracy)
Accuracy: 0.9874439461883409
In [26]:
print(classification_report(y_test, predicted))
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       966
           1       0.97      0.93      0.95       149

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
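The two accuracy figures are easier to compare as error counts. The short calculation below uses only numbers printed in this notebook: the two accuracies and the test-set support of 1115 messages, 966 of them ham. The "always predict ham" baseline is added here as a reference point and is not something the notebook computes.

```python
n_test = 1115                       # total support in both reports
n_ham = 966                         # ham messages in the test split
acc_raw = 0.9919282511210762        # accuracy on raw messages
acc_processed = 0.9874439461883409  # accuracy on preprocessed messages

# Convert accuracies back into whole numbers of misclassified messages.
errors_raw = round(n_test * (1 - acc_raw))
errors_processed = round(n_test * (1 - acc_processed))

# Accuracy of a trivial model that labels every message as ham.
baseline = n_ham / n_test

print("errors without preprocessing:", errors_raw)
print("errors with preprocessing:   ", errors_processed)
print("always-ham baseline accuracy:", round(baseline, 3))
```

On this single split the preprocessed model makes a few more mistakes than the raw one. With a gap this small, a different `random_state` could plausibly change the ordering, so it is best read as "no clear gain from preprocessing here" rather than a firm result.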
In [27]:
new_message = "Congratulations! You've won a free vacation."
In [28]:
# Reuse the same preprocessing that was applied to the training data
processed_text = [preprocess_text(new_message)]
In [29]:
new_message_counts = vectorizer.transform(processed_text)
prediction = classifier.predict(new_message_counts)
print("Predicted classification for the new message:", 'Not Spam' if prediction[0] == 0 else 'Spam')
Predicted classification for the new message: Spam