Producing a Term-Document Matrix Weighted with TF-IDF
(Term Frequency-Inverse Document Frequency: how often a term appears in a given document, scaled down by how many documents in the corpus contain it. Words that appear everywhere get low weights; words distinctive to one document get high weights.)
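For intuition, here is a minimal sketch of the weighting on a hypothetical three-document toy corpus (the docs list and tfidf helper are illustrative, not part of scikit-learn). It uses the smoothed idf that scikit-learn applies by default, but skips the final L2 normalization of each document vector, so the numbers will not match TfidfVectorizer exactly.
In [ ]:
import math

# Hypothetical toy corpus, purely for illustration
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
n_docs = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)                         # how often the term appears in this document
    df = sum(term in d for d in docs)            # how many documents contain the term
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf, as in scikit-learn's default
    return tf * idf

print(tfidf("cat", docs[0]))  # ≈ 1.29: distinctive term, higher weight
print(tfidf("the", docs[0]))  # = 1.00: appears in every document, lower weight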
Import Libraries
In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Sample Pieces of Text
In [2]:
document_1 = "But there was one other thing that the grown-ups also knew, and it was this: that however small the chance might be of striking lucky, the chance is there. The chance had to be there."
document_2 = "I will not pretend I wasn't petrified. I was. But mixed in with the awful fear was a glorious feeling of excitement. Most of the really exciting things we do in our lives scare us to death. They wouldn't be exciting if they didn't."
document_3 = "A person who has good thoughts cannot ever be ugly. You can have a wonky nose and a crooked mouth and a double chin and stick-out teeth, but if you have good thoughts they will shine out of your face like sunbeams and you will always look lovely."
document_4 = "Never do anything by halves if you want to get away with it. Be outrageous. Go the whole hog. Make sure everything you do is so completely crazy it's unbelievable"
document_5 = "And above all, watch with glittering eyes the whole world around you because the greatest secrets are always hidden in the most unlikely places. Those who don't believe in magic will never find it."
In [3]:
documents = [document_1, document_2, document_3, document_4, document_5]
Process the List Using TF-IDF
In [4]:
tfidf_vectorizer = TfidfVectorizer()                      # defaults: lowercasing, smoothed idf, L2 normalization
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)  # learn the vocabulary and compute the weighted matrix
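Before reshaping the result, it can help to check its shape: fit_transform returns a sparse matrix with one row per document and one column per vocabulary term.
In [ ]:
# Sanity check: 5 documents by 119 unique terms (the same 119 terms shown in the table below)
tfidf_matrix.shape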
Produce a Term-Document Matrix
In [5]:
terms = tfidf_vectorizer.get_feature_names_out()  # the learned vocabulary, in column order
# Transpose so that terms become rows and documents become columns
tfidf_df = pd.DataFrame(tfidf_matrix.T.toarray(), index=terms,
                        columns=["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"])
tfidf_df
Out[5]:
|        | Document 1 | Document 2 | Document 3 | Document 4 | Document 5 |
|--------|------------|------------|------------|------------|------------|
| above  | 0.000000   | 0.000000   | 0.000000   | 0.000000   | 0.182704   |
| all    | 0.000000   | 0.000000   | 0.000000   | 0.000000   | 0.182704   |
| also   | 0.144571   | 0.000000   | 0.000000   | 0.000000   | 0.000000   |
| always | 0.000000   | 0.000000   | 0.110176   | 0.000000   | 0.147404   |
| and    | 0.096821   | 0.000000   | 0.365825   | 0.000000   | 0.122359   |
| ...    | ...        | ...        | ...        | ...        | ...        |
| wonky  | 0.000000   | 0.000000   | 0.136561   | 0.000000   | 0.000000   |
| world  | 0.000000   | 0.000000   | 0.000000   | 0.000000   | 0.182704   |
| wouldn | 0.000000   | 0.160672   | 0.000000   | 0.000000   | 0.000000   |
| you    | 0.000000   | 0.000000   | 0.274369   | 0.262054   | 0.122359   |
| your   | 0.000000   | 0.000000   | 0.136561   | 0.000000   | 0.000000   |
119 rows × 5 columns
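With terms as the row index, a single word's weights across all five documents can be read off with a plain row lookup, for example:
In [ ]:
# TF-IDF weight of "chance" in each document; it occurs only in Document 1
tfidf_df.loc['chance']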
Print the Words from Document 1 in Order of Importance
In [6]:
# Keep only the terms that actually occur in Document 1
filtered_tfidf_df = tfidf_df[tfidf_df['Document 1'] != 0]
# Sort so the highest TF-IDF weight comes first
sorted_tfidf_df = filtered_tfidf_df.sort_values(by='Document 1', ascending=False)
word_list = sorted_tfidf_df.index.tolist()
word_list
Out[6]:
['chance', 'there', 'the', 'that', 'was', 'be', 'one', 'ups', 'this', 'thing', 'striking', 'small', 'other', 'also', 'might', 'lucky', 'knew', 'however', 'had', 'grown', 'is', 'and', 'it', 'to', 'but', 'of']
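If the weights themselves are of interest rather than just the ranking, the same sorted frame can be sliced directly; a minimal variation on the cell above:
In [ ]:
# Top ten terms for Document 1 together with their TF-IDF weights
sorted_tfidf_df['Document 1'].head(10)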