Spam Classification Practice: Using Bag of Words¶
Import a dataset of spam and non-spam ("ham") messages, then use Multinomial Naive Bayes to classify a new message.
Import Libraries¶
In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import warnings
Import File and Convert Labels¶
In [2]:
df = pd.read_csv('SpamCollection.csv')
In [3]:
df.head()
 | label | message |
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
In [4]:
# Encode the labels numerically: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
In [5]:
df.head()
 | label | message |
0 | 0 | Go until jurong point, crazy.. Available only ... |
1 | 0 | Ok lar... Joking wif u oni... |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | 0 | U dun say so early hor... U c already then say... |
4 | 0 | Nah I don't think he goes to usf, he lives aro... |
Train, Test, Split the Data¶
In [6]:
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)
Define and Create BoW Vectoriser¶
In [7]:
vectorizer = CountVectorizer()
In [8]:
X_train_counts = vectorizer.fit_transform(X_train)
In [9]:
X_test_counts = vectorizer.transform(X_test)
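To make the fit/transform split concrete, here is a minimal pure-Python sketch of what a bag-of-words vectoriser does. It only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary) and returns plain lists rather than a sparse matrix. The helper names and the toy messages are illustrative assumptions, not part of the notebook.

```python
import re
from collections import Counter

# Approximation of the default tokenisation: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(messages):
    """Map each distinct training token to a column index, in alphabetical order."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(messages, vocabulary):
    """Count vocabulary words in each message; words unseen during fitting are ignored."""
    rows = []
    for msg in messages:
        counts = Counter(tok for tok in tokenize(msg) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Toy data (hypothetical, not taken from SpamCollection.csv)
train_messages = ["Free entry to win", "Ok see you later", "Win a free free prize"]
vocabulary = fit_vocabulary(train_messages)
train_counts = transform(train_messages, vocabulary)

# "holiday" and "now" never appeared in training, so they contribute nothing.
test_counts = transform(["Free holiday, win now"], vocabulary)
print(vocabulary)
print(test_counts)
```

This is why the notebook calls fit_transform on the training messages but only transform on the test messages: the vocabulary is learned from the training data alone, and the test data is mapped onto those same columns.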
Create Multinomial Naive Bayes Classifier¶
Naive Bayes uses Bayes' theorem to go from "given that a message is spam, how likely are these features?" to "given these features, how likely is it that the message is spam?"
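As a small worked example of that inversion, the sketch below applies Bayes' theorem, P(spam | word) = P(word | spam) P(spam) / P(word), to a single word. The two likelihoods are made-up numbers chosen for illustration. The prior reuses the spam share of the test split (149 of 1115 messages) as a rough stand-in, since the training-set proportion is not shown in the notebook.

```python
# Prior: share of spam messages (taken from the test split as a rough stand-in).
p_spam = 149 / 1115
p_ham = 1 - p_spam

# Likelihoods: hypothetical values, NOT estimated from the dataset.
p_word_given_spam = 0.20   # chance that a spam message contains "free"
p_word_given_ham = 0.01    # chance that a ham message contains "free"

# Total probability of seeing the word at all.
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: flip the conditioning.
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'free') = {p_spam_given_word:.3f}")
```

Even though only about 13% of messages are spam, a word that is 20 times more likely in spam than in ham pushes the posterior well above one half under these assumed numbers.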
In [10]:
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)
In [11]:
predicted = classifier.predict(X_test_counts)
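For intuition about what the fitted classifier computes, the following is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, working in log space. It is an illustration under simplifying assumptions, not scikit-learn's implementation. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate a log prior and smoothed log word likelihoods for each class."""
    vocab = sorted({tok for doc in docs for tok in doc})
    model = {}
    for cls in set(labels):
        class_docs = [doc for doc, lab in zip(docs, labels) if lab == cls]
        word_counts = Counter(tok for doc in class_docs for tok in doc)
        # Smoothing adds alpha to every word count so no likelihood is zero.
        total = sum(word_counts.values()) + alpha * len(vocab)
        model[cls] = {
            "log_prior": math.log(len(class_docs) / len(docs)),
            "log_likelihood": {
                tok: math.log((word_counts[tok] + alpha) / total) for tok in vocab
            },
        }
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score; unseen words are skipped."""
    scores = {}
    for cls, params in model.items():
        scores[cls] = params["log_prior"] + sum(
            params["log_likelihood"][tok]
            for tok in doc
            if tok in params["log_likelihood"]
        )
    return max(scores, key=scores.get)

# Toy, pre-tokenised training data (hypothetical): 1 = spam, 0 = ham
docs = [
    ["free", "win", "prize"],
    ["free", "entry", "win"],
    ["ok", "see", "you"],
    ["see", "you", "later"],
    ["ok", "later"],
]
labels = [1, 1, 0, 0, 0]

model = train_multinomial_nb(docs, labels)
spam_guess = predict(model, ["win", "free", "holiday"])
ham_guess = predict(model, ["see", "you", "later"])
print(spam_guess, ham_guess)
```

The "naive" part is the sum over words: each word's log likelihood is added independently, as if words occurred independently of one another given the class.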
Accuracy/Classification Report¶
In [12]:
accuracy = accuracy_score(y_test, predicted)
print("Accuracy:", accuracy)
Accuracy: 0.9919282511210762
In [13]:
print(classification_report(y_test, predicted))
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       966
           1       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
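The per-class columns of the report can be reproduced by hand. The sketch below computes precision, recall and F1 for the positive (spam) class on a small set of made-up labels, then works out the accuracy of a trivial "always ham" baseline from the support figures in the report (966 ham, 149 spam). The function name and the toy labels are illustrative assumptions.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): 3 spam caught, 1 spam missed, 1 ham wrongly flagged
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)

# Baseline from the report's support column: predict "ham" for every message.
ham_support, spam_support = 966, 149
baseline_accuracy = ham_support / (ham_support + spam_support)
print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

Because the classes are imbalanced, the always-ham baseline already scores roughly 0.87 accuracy, so the recall for class 1 is the more informative number when judging how much spam slips through.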
Predict Classification of New Message¶
In [14]:
new_message = ["Congratulations! You've won a free vacation."]
In [15]:
new_message_counts = vectorizer.transform(new_message)
prediction = classifier.predict(new_message_counts)
print("Predicted classification for the new message:", 'Not Spam' if prediction[0] == 0 else 'Spam')
Predicted classification for the new message: Spam
Redo Classification Model, using Text PreProcessing First¶
In [16]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
nltk.download('punkt')
nltk.download('stopwords')
[nltk_data] Downloading package punkt to /Users/jimhardy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jimhardy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Text PreProcessing First¶
In [17]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Lowercase, tokenize, drop punctuation and stop words, then stem what remains
    text = text.lower()
    tokens = word_tokenize(text)
    processed_tokens = [stemmer.stem(token) for token in tokens if token not in string.punctuation and token not in stop_words]
    processed_text = ' '.join(processed_tokens)
    return processed_text
df['processed_message'] = df['message'].apply(preprocess_text)
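One detail of the filter is worth spelling out: string.punctuation is a single string, so `token not in string.punctuation` is a substring test. That explains why multi-character tokens such as ".." and "..." survive preprocessing and show up in the processed_message column. The sketch below demonstrates this with the standard library only. The stricter helper `is_all_punctuation` is a hypothetical alternative, not something the notebook uses.

```python
import string

# A single "." appears inside string.punctuation, but "..." does not,
# so only the former is caught by the substring test.
single_dot_filtered = "." in string.punctuation
ellipsis_filtered = "..." in string.punctuation

def is_all_punctuation(token):
    """Hypothetical stricter check: every character is a punctuation mark."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["ok", "lar", "...", "joke", "!", ".."]
kept_by_substring_test = [t for t in tokens if t not in string.punctuation]
kept_by_strict_test = [t for t in tokens if not is_all_punctuation(t)]

print(kept_by_substring_test)
print(kept_by_strict_test)
```

In this notebook the leftover punctuation tokens are harmless, because CountVectorizer's default tokenisation only keeps runs of two or more word characters and so discards them again at the vectorising step.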
View Dataset (with processed_message column)¶
In [18]:
df.head()
 | label | message | processed_message |
0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazi .. avail bugi n great wo... |
1 | 0 | Ok lar... Joking wif u oni... | ok lar ... joke wif u oni ... |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entri 2 wkli comp win fa cup final tkt 21... |
3 | 0 | U dun say so early hor... U c already then say... | u dun say earli hor ... u c alreadi say ... |
4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah n't think goe usf live around though |
Train, Test, Split the Data (Using The processed_message Column)¶
In [19]:
X_train, X_test, y_train, y_test = train_test_split(df['processed_message'], df['label'], test_size=0.2, random_state=42)
In [20]:
vectorizer = CountVectorizer()
In [21]:
X_train_counts = vectorizer.fit_transform(X_train)
In [22]:
X_test_counts = vectorizer.transform(X_test)
In [23]:
classifier = MultinomialNB()
classifier.fit(X_train_counts, y_train)
In [24]:
predicted = classifier.predict(X_test_counts)
In [25]:
accuracy = accuracy_score(y_test, predicted)
print("Accuracy:", accuracy)
Accuracy: 0.9874439461883409
In [26]:
print(classification_report(y_test, predicted))
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       966
           1       0.97      0.93      0.95       149

    accuracy                           0.99      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
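To compare the two runs in concrete terms, the accuracies can be converted back into counts of misclassified messages. This sketch uses only the two accuracy values printed in the notebook and the test-set size of 1115 from the classification reports. The variable names are illustrative.

```python
test_size = 1115  # total support shown in both classification reports

accuracies = {
    "raw messages": 0.9919282511210762,
    "preprocessed messages": 0.9874439461883409,
}

errors = {}
for name, accuracy in accuracies.items():
    # accuracy = correct / test_size, so rounding recovers the integer count
    correct = round(accuracy * test_size)
    errors[name] = test_size - correct
    print(f"{name}: {correct} correct, {errors[name]} misclassified")
```

On this particular split, the model trained on raw messages misclassifies 9 of the 1115 test messages and the model trained on preprocessed messages misclassifies 14. The gap is only five messages on a single random split, so it is weak evidence that stemming and stop-word removal hurt in general.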
In [27]:
new_message = "Congratulations! You've won a free vacation."
In [28]:
# Apply the same preprocessing that was used on the training data
processed_text = [preprocess_text(new_message)]
In [29]:
new_message_counts = vectorizer.transform(processed_text)
prediction = classifier.predict(new_message_counts)
print("Predicted classification for the new message:", 'Not Spam' if prediction[0] == 0 else 'Spam')
Predicted classification for the new message: Spam