
NLP: TF-IDF (Term Frequency-Inverse Document Frequency)

Convert words into numbers

Rahul S

It is advisable to go through the basics of Bag of Words before delving into TF-IDF.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic, widely used in natural language processing (NLP), information retrieval, and text mining, that reflects how important a word is to a document within a larger collection or corpus.

TF-IDF takes into account both the frequency of a word within a document (term frequency) and the rarity of the word across the entire corpus (inverse document frequency).
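Before breaking down the two components, here is a minimal sketch of TF-IDF in action using scikit-learn's TfidfVectorizer. The toy corpus is my own, and note that scikit-learn applies smoothing and L2 normalization by default, so its scores differ slightly from the plain formulas described next:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A small, made-up corpus just for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse (n_docs x n_terms) matrix

print(vectorizer.get_feature_names_out())         # the learned vocabulary
print(tfidf_matrix.toarray().round(2))            # TF-IDF weights per document
```

Words that appear in every document (like "the") end up with low weights, while words unique to one document get higher weights.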

Term Frequency (TF) measures the frequency of a term (word) within a document. It indicates how often a term appears in a document relative to the total number of terms in that document. The formula for calculating the TF score of a term is:

TF(t) = (Number of occurrences of term t in a document) / (Total number of terms in the document)

TF assigns higher weights to terms that occur more frequently within a document. It captures the local importance of a term within a specific document.
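As a rough sketch, the TF formula above can be computed directly in Python. The whitespace tokenization here is a simplifying assumption; real pipelines also normalize case, punctuation, stop words, and so on:

```python
from collections import Counter

def term_frequency(document_tokens):
    """TF(t) = (occurrences of t in the document) / (total terms in the document)."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {term: count / total for term, count in counts.items()}

# Toy document, pre-tokenized by whitespace (an assumption for illustration).
doc = "the cat sat on the mat".split()
print(term_frequency(doc))
# {'the': 0.33..., 'cat': 0.16..., 'sat': 0.16..., 'on': 0.16..., 'mat': 0.16...}
```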

Inverse Document Frequency (IDF) measures the rarity or uniqueness of a term across the entire corpus. It quantifies how much information a term provides by considering its presence in other documents. The formula for calculating the IDF score of a term is:

IDF(t) = log_e (Total number of documents / Number of documents with term t)

IDF assigns lower weights to terms that occur in many documents and higher weights to terms that occur in fewer documents. The final TF-IDF score of a term in a document is simply the product of the two: TF-IDF(t) = TF(t) * IDF(t).
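A minimal from-scratch sketch of IDF and the combined TF-IDF score, using the natural logarithm from the formula above and the same simplified whitespace tokenization as before, might look like this:

```python
import math
from collections import Counter

def inverse_document_frequency(corpus_tokens):
    """IDF(t) = log_e(total documents / documents containing t)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))          # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(document_tokens, idf):
    """TF-IDF(t) = TF(t) * IDF(t) for every term in one document."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {term: (count / total) * idf[term] for term, count in counts.items()}

# Toy corpus, pre-tokenized by whitespace (an assumption for illustration).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
idf = inverse_document_frequency(corpus)
print(tf_idf(corpus[0], idf))   # 'the' gets a low weight, 'mat' a higher one
```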
