Text is arguably the most ubiquitous non-numeric modality of data; most of the data on the internet is text. Text data generally exists as a collection of documents called a corpus, where each document is one or more sentences.
Bag of Words
Text data can be encoded into a number of different numeric representations. The most common is the Bag-of-Words (BoW) representation, which is closely related to One-Hot Encoding.
In the BoW representation, each row represents a document and each column represents a word in the vocabulary, the list of all unique words in the corpus. The value in the cell at the \(i\)th row and \(j\)th column is the number of times word \(j\) appears in document \(i\).
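For instance, for the two-document corpus below (invented purely for illustration), the matrix has one row per document and one column per vocabulary word. A minimal sketch in plain Python:

```python
# A toy corpus of two made-up documents
corpus = ["the cat sat", "the dog sat"]

# Vocabulary: all unique words, sorted for a stable column order
vocab = sorted(set(" ".join(corpus).split()))

# Document-term matrix: one row per document, one count per vocabulary word
matrix = [[doc.split().count(word) for word in vocab] for doc in corpus]

print(vocab)   # ['cat', 'dog', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
```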
Creating a Bag of Words representation for a corpus generally entails the following steps (see the sketch after the list):
Tokenization: Split each document into a sequence of tokens (words).
Create Vocabulary: Create a list of all unique tokens (words) across all documents in the corpus. Often tokens are normalized by converting them to lowercase and removing punctuation.
Create Document Vectors: Create a vector for each document in the corpus. The vector is the same length as the vocabulary. The value in each cell of the vector is the number of times the word in the corresponding column appears in the document.
Create Document-Term Matrix: Create a 2D array where each row represents a document and each column represents a word in the vocabulary. The value in the cell at \(i\)th row and \(j\)th column represents the number of times the word \(j\) appears in document \(i\).
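In practice, libraries implement this pipeline directly. A minimal sketch using scikit-learn's CountVectorizer (assuming scikit-learn is available; it is not used elsewhere in this section), which tokenizes, lowercases, builds the vocabulary, and produces the document-term matrix in one call:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The cat sat", "The dog sat"]

# fit_transform tokenizes, builds the vocabulary,
# and returns the document-term matrix as a sparse array
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'sat' 'the']
print(dtm.toarray())                       # [[1 0 1 1]
                                           #  [0 1 1 1]]
```

Note that CountVectorizer's default tokenizer lowercases text and drops punctuation and single-character tokens, one common form of the normalization mentioned in the Create Vocabulary step.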
The image below shows a bag-of-words representation of a corpus of two documents. The vocabulary is the list of words on the left; the document-term matrix is the 2D array of counts on the right.
```python
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/chat_dataset.csv')
data.head()
```
|   | message                            | sentiment |
|---|------------------------------------|-----------|
| 0 | I really enjoyed the movie         | positive  |
| 1 | The food was terrible              | negative  |
| 2 | I'm not sure how I feel about this | neutral   |
| 3 | The service was excellent          | positive  |
| 4 | I had a bad experience             | negative  |
```python
# 1. Tokenization: concatenate all messages, lowercase, and split into word tokens
tokens = ' '.join(data['message']).lower().split()

# 2. Vocabulary: the sorted list of unique words in the corpus
vocab = sorted(set(tokens))

# 3. Document vectors: count how many times each vocabulary word
#    appears in each message, tokenized the same way as above
#    (counting whole tokens, not substrings, so 'bad' does not match 'badly')
def count_words(msg):
    words = msg.lower().split()
    return [words.count(word) for word in vocab]

# 4. Document-term matrix: one row per message, one column per vocabulary word
bow = pd.DataFrame(data['message'].apply(count_words).tolist(),
                   columns=vocab, index=data.index)

bow.head()
```
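A quick way to sanity-check the result (the column name 'movie' is an assumed example word from the first message; the actual vocabulary depends on the dataset):

```python
# 'movie' appears once in the first message, so its column should start with a 1
print(bow['movie'].head())

# Since the vocabulary covers every token in the corpus,
# each row should sum to the number of tokens in its message
print(bow.sum(axis=1).head())
```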