NLP: Bag of Words

Rahul S
2 min read · Jun 7, 2023

The Bag of Words (BoW) model is a simple, widely used technique in natural language processing (NLP) for representing textual data.

It treats a document as a collection of words, disregarding grammar, word order, and context. The name “Bag of Words” reflects the idea that the model represents a text as an unordered “bag” of words: it records how many times each word occurs, but not where it occurs.

ALGORITHM:

In the Bag of Words model:

  1. Vocabulary Creation: The first step is to create a vocabulary or dictionary of unique words that appear in the corpus (collection of documents). Each unique word is assigned a unique index or identifier.
  2. Text Representation: Each document in the corpus is represented as a numerical vector, where each element of the vector corresponds to one word in the vocabulary. The value of each element is the number of times that word occurs in the document; words that do not occur get a value of 0.
  3. Vectorization: The numerical vectors are created by counting the occurrences of words in each document. The size of every vector is equal to the size of the vocabulary, regardless of how long the document is.
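
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`tokenize`, `build_vocabulary`, `vectorize`), the lowercasing and punctuation-stripping tokenizer, the alphabetical index order, and the tiny corpus are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(corpus):
    # Step 1: give each unique word a unique index (sorted so results are reproducible).
    words = sorted({word for doc in corpus for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def vectorize(document, vocabulary):
    # Steps 2 and 3: count word occurrences; the vector length equals the vocabulary size.
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector

# Hypothetical two-document corpus, used only for illustration.
corpus = ["The cat sat on the mat.", "The dog sat."]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(doc, vocabulary) for doc in corpus]

print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Note that "the" appears twice in the first document, so its element is 2, while the position of each word in the sentence is lost entirely.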

For example, consider the following two sentences:

Sentence 1: “I love cats.”
Sentence 2: “I love dogs.”
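
As a rough sketch, the two sentences can be turned into count vectors as follows. The lowercasing, the removal of punctuation, and the alphabetical ordering of the vocabulary are assumptions of this example; a different ordering would permute the vector elements without changing the idea.

```python
import re
from collections import Counter

sentences = ["I love cats.", "I love dogs."]

# Assumed preprocessing: lowercase and keep only alphabetic tokens.
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

# Vocabulary: the unique words across both sentences, sorted alphabetically.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One count vector per sentence, with one element per vocabulary word.
vectors = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['cats', 'dogs', 'i', 'love']
for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
# I love cats. [1, 0, 1, 1]
# I love dogs. [0, 1, 1, 1]
```

The two vectors differ only in the "cats" and "dogs" positions, which mirrors the single word that differs between the sentences.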
