How Does the Bag of Words Model Work in Text Mining with Python?
The Bag of Words (BOW) model is a popular representation commonly used in natural language processing applications. In this model, a text document comprising several sentences and paragraphs is represented as the bag of the words it contains, without considering syntax or word order. The main idea is to quantize each extracted key point into one of the categorical words, and then represent each document by a histogram of the categorical words. The BOW feature design builds a vocabulary out of all the words in the training set and, for each example, creates a vector of word counts for that instance. Since the vector holds a place for every word in the vocabulary, the resulting BOW matrix is sparse (mostly zeros), because most words in the vocabulary do not appear in any given example.
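As a minimal sketch of this vectorization idea, here is one way to build such count vectors with scikit-learn's CountVectorizer (an assumption on my part; the demo later in this post builds the counts by hand with collections.Counter instead):

# minimal sketch of BOW count vectors, assuming scikit-learn >= 1.0 is installed
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog sat']
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse matrix, shape (n_docs, vocab_size)
print(vectorizer.get_feature_names_out())   # the vocabulary learned from the corpus
print(bow.toarray())                        # one word-count vector per document

Note how most entries of each vector are zero: each document uses only a small fraction of the full vocabulary, which is exactly the sparsity mentioned above.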
One of its most common applications is document classification. In this case, the frequency of occurrence of each word from a dictionary is used as an attribute for learning a classifier [1]. Salton and McGill proposed a keyword classification method that sorts documents into different categories, providing a synonym dictionary for document indexing and query retrieval [2]. Furthermore, Zhang et al. presented a statistical framework for generalizing the BOW representation to image processing, inspired by text mining applications [3]. Voloshynovskiy et al. also surveyed the performance of the BOW method to give a better understanding of the technique in terms of robustness and accuracy in decision making, and validated it with experiments on real image data [4].
Once we have gathered our documents, we simply concatenate them into a single text file. We then replace punctuation such as $, %, & and @, as well as digits and specific terms such as WWW and com, with blank space. After cleaning the text file, we count the frequency of each word and rank the words by frequency. Finally, we choose the top N words from the text file, where N is a user-defined parameter. A small sketch of this cleaning step follows below.
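Here is one possible version of that cleaning step (the clean_text helper below is hypothetical; the demo that follows uses a regex tokenizer plus NLTK stop words instead):

import re

def clean_text(text):
    # replace punctuation such as $, %, & and @ with blank space
    text = re.sub(r'[$%&@]', ' ', text)
    # drop digits
    text = re.sub(r'\d+', ' ', text)
    # drop specific terms such as WWW and com
    text = re.sub(r'\b(?:WWW|com)\b', ' ', text, flags=re.IGNORECASE)
    return text

print(clean_text('Email me @ WWW example com for 50% off'))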
Here is a demo of how BOW works on text data. For the first paragraph of this blog post, we have:
from __future__ import print_function

import re
import string
from collections import Counter

import numpy as np
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


def get_user_terms_stops(user_words):
    '''Extract the token set for each user from the document.'''
    print('extracting the token set from the documents for each user ...')
    user_terms_stops = []
    # stop list: English stop words plus punctuation
    stop = stopwords.words('english') + list(string.punctuation)
    for i in range(len(user_words)):
        terms_stops = []
        # join the sentences and split on whitespace;
        # split() only returns the non-empty elements
        u = ' '.join(np.array(user_words[i])).split()
        for j in range(len(u)):
            # tokenizing:
            tokens = re.findall(r'\w+', u[j])
            # remove stop words:
            terms_stop = [word for word in tokens if word not in stop]
            terms_stops = terms_stops + terms_stop
        # stemming (computed for reference; the unstemmed tokens are
        # kept below so the counts match the original words):
        p_stemmer = PorterStemmer()
        stemmed_tokens = [p_stemmer.stem(k) for k in terms_stops]
        user_terms_stops.append(terms_stops)
    print('accomplished!')
    # flatten the per-user token lists into a single list:
    user_terms_stops = [item for sublist in user_terms_stops
                        for item in sublist]
    return user_terms_stops
if __name__ == '__main__':
    X = [['The Bag of Words model',
          'BOW, is a popular representation which is commonly used in natural language processing applications.',
          'In this model',
          'a text document including several sentences and paragraphs is represented as the bag of the words',
          'included in that document, without considering syntax and the order of words.'],
         ['The main idea is to quantize each extracted key point into one of categorical words',
          'and then represent each document by a histogram of the categorical words.'],
         ['The BOW feature design creates a vocabulary out of all the words in the training set and',
          'for each example, creates a vector of word counts pertaining to that instance.',
          'Since the vector holds a place for each word in the vocabulary',
          'the resulting BOW matrix is sparse (mostly zeros) because most words in the vocabulary do not appear in a given example.']]

    user_terms_stops = get_user_terms_stops(X)

user_terms_stops
['The', 'Bag', 'Words', 'model', 'BOW', 'popular', 'representation',
 'commonly', 'used', 'natural', 'language', 'processing', 'applications',
 'In', 'model', 'text', 'document', 'including', 'several', 'sentences',
 'paragraphs', 'represented', 'bag', 'words', 'included', 'document',
 'without', 'considering', 'syntax', 'order', 'words', 'The', 'main',
 'idea', 'quantize', 'extracted', 'key', 'point', 'one', 'categorical',
 'words', 'represent', 'document', 'histogram', 'categorical', 'words',
 'The', 'BOW', 'feature', 'design', 'creates', 'vocabulary', 'words',
 'training', 'set', 'example', 'creates', 'vector', 'word', 'counts',
 'pertaining', 'instance', 'Since', 'vector', 'holds', 'place', 'word',
 'vocabulary', 'resulting', 'BOW', 'matrix', 'sparse', 'mostly', 'zeros',
 'words', 'vocabulary', 'appear', 'given', 'example']
c = Counter(user_terms_stops)
c
Counter({'BOW': 3,
'Bag': 1,
'In': 1,
'Since': 1,
'The': 3,
'Words': 1,
'appear': 1,
'applications': 1,
'bag': 1,
'categorical': 2,
'commonly': 1,
'considering': 1,
'counts': 1,
'creates': 2,
'design': 1,
'document': 3,
'example': 2,
'extracted': 1,
'feature': 1,
'given': 1,
'histogram': 1,
'holds': 1,
'idea': 1,
'included': 1,
'including': 1,
'instance': 1,
'key': 1,
'language': 1,
'main': 1,
'matrix': 1,
'model': 2,
'mostly': 1,
'natural': 1,
'one': 1,
'order': 1,
'paragraphs': 1,
'pertaining': 1,
'place': 1,
'point': 1,
'popular': 1,
'processing': 1,
'quantize': 1,
'represent': 1,
'representation': 1,
'represented': 1,
'resulting': 1,
'sentences': 1,
'set': 1,
'several': 1,
'sparse': 1,
'syntax': 1,
'text': 1,
'training': 1,
'used': 1,
'vector': 2,
'vocabulary': 3,
'without': 1,
'word': 2,
'words': 6,
'zeros': 1})
The following code returns the N most frequent words in our documents, which can be used in many applications such as text tagging and text categorization.

top_N = 6
top_words = c.most_common(top_N)
top_words
[('words', 6),
('BOW', 3),
('vocabulary', 3),
('The', 3),
('document', 3),
('creates', 2)]
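To complete the picture, here is a minimal sketch (an extension of the demo above, not part of the original code; the names vocabulary and bow_vectors are illustrative) that turns each document into a histogram over the top-N vocabulary, which is exactly the BOW vector described at the start of this post:

# build each document's histogram over the top-N vocabulary (illustrative sketch)
vocabulary = [word for word, count in top_words]
bow_vectors = []
for doc in X:
    # re-tokenize one document at a time with the helper defined above
    tokens = get_user_terms_stops([doc])
    bow_vectors.append([tokens.count(word) for word in vocabulary])
print(bow_vectors)  # one word-count vector per document, one column per top word

Each row holds the counts of ['words', 'BOW', 'vocabulary', 'The', 'document', 'creates'] in one document, and stacking the rows gives the sparse BOW matrix discussed earlier.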
References
[1] Zellig Harris. "Distributional Structure". Word, 10(2/3):146-162, 1954.
[2] Gerard Salton and Michael J. McGill. "Introduction to Modern Information Retrieval". McGraw-Hill Book Company, New York, 1983.
[3] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. "Understanding bag-of-words model: a statistical framework". International Journal of Machine Learning and Cybernetics, 1(1-4):43-52, 2010.
[4] Sviatoslav Voloshynovskiy, Maurits Diephuis, Dimche Kostadinov, Farzad Farhadzadeh and Taras Holotyak. "On accuracy, robustness and security of bag-of-word search systems".