NLP: TF-IDF (Term Frequency-Inverse Document Frequency)
It is advisable to go through the basics of Bag of Words before delving into TF-IDF.
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic commonly used in natural language processing (NLP), information retrieval, and text mining to reflect the importance of a word in a document relative to a larger collection or corpus.
TF-IDF takes into account both the frequency of a word within a document (term frequency) and the rarity of the word across the entire corpus (inverse document frequency).
Term Frequency (TF) measures the frequency of a term (word) within a document. It indicates how often a term appears in a document relative to the total number of terms in that document. The formula for calculating the TF score of a term is:
TF(t) = (Number of occurrences of term t in a document) / (Total number of terms in the document)
TF assigns higher weights to terms that occur more frequently within a document. It captures the local importance of a term within a specific document.
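The TF formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function name and the toy sentence are my own choices, and the document is assumed to be already tokenized into a list of words.

```python
from collections import Counter

def term_frequency(document):
    # document: a list of tokens.
    # Returns {term: occurrences of term / total terms in the document}.
    counts = Counter(document)
    total = len(document)
    return {term: count / total for term, count in counts.items()}

doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
# "the" occurs 2 times out of 6 tokens, so tf["the"] == 2/6
```

Note that TF alone treats every term equally: common function words like "the" get high scores, which is exactly the problem IDF addresses next.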
Inverse Document Frequency (IDF) measures the rarity or uniqueness of a term across the entire corpus. It quantifies how much information a term provides by considering its presence in other documents. The formula for calculating the IDF score of a term is:
IDF(t) = log_e (Total number of documents / Number of documents with term t)
IDF assigns lower weights to terms that occur in many documents and higher weights to terms that occur in fewer documents, capturing how distinctive a term is across the corpus.