Text

Text is arguably the most ubiquitous non-numeric modality of data; most of the data on the internet is text. Text data generally exists as a collection of documents called a corpus, where each document is one or more sentences.

Bag of Words


Text data can be encoded into a number of different numeric representations. The most common is the Bag-of-Words (BoW) representation, which is closely related to One-Hot Encoding.

In the BoW representation, each row represents a document and each column represents a word in the vocabulary, i.e. the list of all unique words in the corpus. The value in the cell at the \(i\)th row and \(j\)th column is the number of times word \(j\) appears in document \(i\).
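For example, a corpus of two documents, "The cat sat" and "The cat sat on the mat", has the vocabulary [cat, mat, on, sat, the] (after lowercasing), and its document-term matrix is:

        cat  mat  on  sat  the
doc 1     1    0   0    1    1
doc 2     1    1   1    1    2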

Creating a Bag of Words representation for a corpus generally entails the following four steps, sketched in code after the list:

  1. Tokenization: Split each document into a sequence of tokens (words).
  2. Create Vocabulary: Create a list of all unique tokens (words) across all documents in the corpus. Words are often normalized by converting them to lowercase and removing punctuation.
  3. Create Document Vectors: Create a vector for each document in the corpus. The vector is the same length as the vocabulary, and the value at each position is the number of times the corresponding vocabulary word appears in the document.
  4. Create Document-Term Matrix: Stack the document vectors into a 2D array where each row represents a document and each column represents a word in the vocabulary. The value in the cell at the \(i\)th row and \(j\)th column is the number of times word \(j\) appears in document \(i\).
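
Below is a minimal sketch of these four steps in plain Python, applied to the toy two-document corpus above; the variable names are illustrative only.

corpus = ["The cat sat", "The cat sat on the mat"]

# 1. Tokenization: lowercase each document and split it into words
tokenized = [doc.lower().split() for doc in corpus]

# 2. Vocabulary: the sorted list of unique tokens across the corpus
vocab = sorted({token for doc in tokenized for token in doc})

# 3. Document vectors: count how often each vocabulary word occurs in each document
vectors = [[doc.count(word) for word in vocab] for doc in tokenized]

# 4. Document-term matrix: one row per document, one column per vocabulary word
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]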

The image below shows a bag-of-words representation of a corpus of two documents: the vocabulary is the list of words on the left, and the document-term matrix is the 2D array of counts on the right.

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/chat_dataset.csv')

data.head()
                               message sentiment
0           I really enjoyed the movie  positive
1                The food was terrible  negative
2   I'm not sure how I feel about this   neutral
3            The service was excellent  positive
4               I had a bad experience  negative
# 1. Tokenization: concatenate all messages into a single string, lowercase it, and split it into words
tokens = (' '.join(data['message'].values)).lower().split()

# 2. Vocabulary: Create a list of UNIQUE words in the dataset
vocab = list(set(tokens))

# 3 & 4. Document vectors / document-term matrix: create an empty DataFrame
# with one column per word in the vocab, then fill in the counts
bow = pd.DataFrame(columns=vocab)

# Go through each word in the vocab
for word in vocab:

    # For each message, count the number of times the word appears
    # (lowercase and split the message, then count exact token matches)
    bow[word] = data['message'].apply(lambda msg: msg.lower().split().count(word))

bow.head()
bow.head() displays the first five document vectors: a table with one column per vocabulary word, where each entry is the number of times that word appears in the corresponding message, so most entries are 0.

5 rows × 4291 columns
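
In practice, this matrix is rarely built by hand. The sketch below shows one common alternative using scikit-learn's CountVectorizer; it assumes scikit-learn is installed, and its default tokenizer lowercases text, strips punctuation, and drops single-character tokens, so its vocabulary will differ slightly from the one built above.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/chat_dataset.csv')

# Learn the vocabulary and count word occurrences in one step
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'])   # sparse document-term matrix

# Wrap the counts in a DataFrame with one column per vocabulary word
bow_sklearn = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())

bow_sklearn.head()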