The Bag of Words (BoW) model is a simple and widely used technique in natural language processing (NLP) for representing textual data.
It treats a document as a collection of words, disregarding grammar, word order, and context. The name “Bag of Words” reflects the idea that the model represents a text as an unordered “bag” (a multiset) of words: how often each word occurs is kept, but where it occurs is not.
In the Bag of Words model:
- Vocabulary Creation: The first step is to create a vocabulary or dictionary of unique words that appear in the corpus (collection of documents). Each unique word is assigned a unique index or identifier.
- Text Representation: For each document in the corpus, the Bag of Words model represents it as a numerical vector, where each element of the vector corresponds to a word in the vocabulary. The value of each element represents the frequency of that word in the document.
- Vectorization: The numerical vectors are created by counting the occurrences of words in each document. The size of the vector is equal to the size of the vocabulary.
For example, consider the following two sentences:
Sentence 1: “I love cats.”
Sentence 2: “I love dogs.”
In the BoW model, we create a vocabulary comprising the unique words from both sentences (with punctuation removed): [“I”, “love”, “cats”, “dogs”].
Using this vocabulary, we represent each sentence as a numerical vector:
Sentence 1 vector: [1, 1, 1, 0]
Sentence 2 vector: [1, 1, 0, 1]
The values in the vectors correspond to the frequency of each word in the respective sentence. In this case, “I” and “love” appear once in both sentences, while “cats” appears only in Sentence 1, and “dogs” appears only in Sentence 2.
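The example above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than a reference implementation: the helper names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical, and the tokenizer simply assumes that words are runs of letters, digits, or apostrophes, with case preserved.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization rule: keep runs of letters, digits, and
    # apostrophes; punctuation such as the final period is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)


def build_vocabulary(documents):
    # Assign each unique word an index in order of first appearance.
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def vectorize(doc, vocab):
    # Count occurrences and place each count at the word's vocabulary index.
    counts = Counter(tokenize(doc))
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:  # words outside the vocabulary are ignored
            vector[vocab[token]] = count
    return vector


sentences = ["I love cats.", "I love dogs."]
vocab = build_vocabulary(sentences)
vectors = [vectorize(s, vocab) for s in sentences]

print(vocab)    # {'I': 0, 'love': 1, 'cats': 2, 'dogs': 3}
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Indexing words by first appearance is only one possible convention; sorting the vocabulary alphabetically would work just as well, as long as the same mapping is used for every document.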
The Bag of Words model has several limitations.
- It doesn’t consider the semantic meaning or the order of words, which can result in the loss of valuable contextual information.
- The model can lead to high-dimensional and sparse representations when dealing with large vocabularies or large corpora, since each vector has one element per vocabulary word and most of those elements are zero for any given document.
Various enhancements, such as TF-IDF (Term Frequency-Inverse Document Frequency), can be applied to mitigate some of these limitations and improve the representation of text data.
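To make two of these points concrete, the sketch below first shows that sentences differing only in word order produce the same bag, and then computes a simple TF-IDF weighting for the two example sentences. The function name `tf_idf` is hypothetical, and the formula used here (term frequency divided by document length, times the unsmoothed `log(N / df)`) is one common textbook variant assumed for illustration; libraries often apply smoothing or normalization that gives different numbers.

```python
import math
from collections import Counter

# Word order is lost: both sentences map to the same bag of counts.
same_bag = Counter("dog bites man".split()) == Counter("man bites dog".split())
print(same_bag)  # True


def tf_idf(tokenized_docs):
    # Assumes every document is a non-empty list of tokens.
    n_docs = len(tokenized_docs)
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    # Document frequency: number of documents containing each word.
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([
            (counts[token] / len(doc)) * math.log(n_docs / df[token])
            for token in vocab
        ])
    return vocab, vectors


docs = [["I", "love", "cats"], ["I", "love", "dogs"]]
vocab, weights = tf_idf(docs)

print(vocab)  # ['I', 'cats', 'dogs', 'love']
for row in weights:
    print([round(w, 3) for w in row])
```

With this variant, “I” and “love” appear in every document, so their inverse document frequency is `log(2 / 2) = 0` and their weights vanish, while “cats” and “dogs” keep a positive weight. That is the sense in which TF-IDF emphasizes words that distinguish one document from another. It does not recover word order.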