Producing a Term-Document Matrix Weighted with TF-IDF
(Term Frequency-Inverse Document Frequency: how often a term appears in a given document, scaled down by how many documents in the corpus contain it. Words that appear everywhere get low weights; words distinctive to one document get high weights.)
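For intuition, here is a minimal sketch of the weighting on a hypothetical three-document toy corpus (the docs list and tfidf helper are illustrative, not part of scikit-learn). It uses the smoothed idf that scikit-learn applies by default, but skips the final L2 normalization of each document vector, so the numbers will not match TfidfVectorizer exactly.
In [ ]:
import math

# Hypothetical toy corpus, purely for illustration
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
n_docs = len(docs)

def tfidf(term, doc):
    tf = doc.count(term)                         # how often the term appears in this document
    df = sum(term in d for d in docs)            # how many documents contain the term
    idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf, as in scikit-learn's default
    return tf * idf

print(tfidf("cat", docs[0]))  # ≈ 1.29: distinctive term, higher weight
print(tfidf("the", docs[0]))  # = 1.00: appears in every document, lower weight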
Import Libraries
In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Sample Pieces of Text
In [2]:
document_1 = "But there was one other thing that the grown-ups also knew, and it was this: that however small the chance might be of striking lucky, the chance is there. The chance had to be there."
document_2 = "I will not pretend I wasn't petrified. I was. But mixed in with the awful fear was a glorious feeling of excitement. Most of the really exciting things we do in our lives scare us to death. They wouldn't be exciting if they didn't."
document_3 = "A person who has good thoughts cannot ever be ugly. You can have a wonky nose and a crooked mouth and a double chin and stick-out teeth, but if you have good thoughts they will shine out of your face like sunbeams and you will always look lovely."
document_4 = "Never do anything by halves if you want to get away with it. Be outrageous. Go the whole hog. Make sure everything you do is so completely crazy it's unbelievable"
document_5 = "And above all, watch with glittering eyes the whole world around you because the greatest secrets are always hidden in the most unlikely places. Those who don't believe in magic will never find it."
In [3]:
documents = [document_1, document_2, document_3, document_4, document_5]
Process the List Using TF-IDF
In [4]:
tfidf_vectorizer = TfidfVectorizer()                      # defaults: lowercasing, smoothed idf, L2 normalization
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)  # learn the vocabulary and compute the weighted matrix
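Before reshaping the result, it can help to check its shape: fit_transform returns a sparse matrix with one row per document and one column per vocabulary term.
In [ ]:
# Sanity check: 5 documents by 119 unique terms (the same 119 terms shown in the table below)
tfidf_matrix.shape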
Produce a Term-Document Matrix
In [5]:
terms = tfidf_vectorizer.get_feature_names_out()  # the learned vocabulary, in column order
# Transpose so that terms become rows and documents become columns
tfidf_df = pd.DataFrame(tfidf_matrix.T.toarray(), index=terms,
                        columns=["Document 1", "Document 2", "Document 3", "Document 4", "Document 5"])
tfidf_df
Out[5]:
|        | Document 1 | Document 2 | Document 3 | Document 4 | Document 5 |
|--------|------------|------------|------------|------------|------------|
| above  | 0.000000   | 0.000000   | 0.000000   | 0.000000   | 0.182704   |
| all    | 0.000000   | 0.000000   | 0.000000   | 0.000000   | 0.182704   |
| also   | 0.144571   | 0.000000   | 0.000000   | 0.000000   | 0.000000   |
| always | 0.000000   | 0.000000   | 0.110176   | 0.000000   | 0.147404   |
| and    | 0.096821   | 0.000000   | 0.365825   | 0.000000   | 0.122359   |
| ...    | ...        | ...        | ...        | ...        | ...        |
| wonky  | 0.000000   | 0.000000   | 0.136561   | 0.000000   | 0.000000   |
| world  | 0.000000   | 0.000000   | 0.000000   | 0.000000   | 0.182704   |
| wouldn | 0.000000   | 0.160672   | 0.000000   | 0.000000   | 0.000000   |
| you    | 0.000000   | 0.000000   | 0.274369   | 0.262054   | 0.122359   |
| your   | 0.000000   | 0.000000   | 0.136561   | 0.000000   | 0.000000   |
119 rows × 5 columns
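With terms as the row index, a single word's weights across all five documents can be read off with a plain row lookup, for example:
In [ ]:
# TF-IDF weight of "chance" in each document; it occurs only in Document 1
tfidf_df.loc['chance']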
Print the Words from Document 1 in Order of Importance
In [6]:
# Keep only the terms that actually occur in Document 1
filtered_tfidf_df = tfidf_df[tfidf_df['Document 1'] != 0]
# Sort so the highest TF-IDF weight comes first
sorted_tfidf_df = filtered_tfidf_df.sort_values(by='Document 1', ascending=False)
word_list = sorted_tfidf_df.index.tolist()
word_list
Out[6]:
['chance', 'there', 'the', 'that', 'was', 'be', 'one', 'ups', 'this', 'thing', 'striking', 'small', 'other', 'also', 'might', 'lucky', 'knew', 'however', 'had', 'grown', 'is', 'and', 'it', 'to', 'but', 'of']
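If the weights themselves are of interest rather than just the ranking, the same sorted frame can be sliced directly; a minimal variation on the cell above:
In [ ]:
# Top ten terms for Document 1 together with their TF-IDF weights
sorted_tfidf_df['Document 1'].head(10)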