Text

Text is arguably the most ubiquitous non-numeric modality of data; most of the data on the internet is text. Text data generally exists as a collection of documents called a corpus, where each document is one or more sentences.

Bag of Words


Text data can be encoded into a number of different numeric representations. The most common is the Bag-of-Words (BoW) representation, which is closely related to One-Hot Encoding.

In the BoW representation, each row represents a document and each column represents a word in the vocabulary, i.e. the list of all unique words in the corpus. The value in the cell at the \(i\)th row and \(j\)th column is the number of times word \(j\) appears in document \(i\).
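For example, a corpus of two documents, "The cat sat" and "The cat sat on the mat", has the vocabulary [cat, mat, on, sat, the] (after lowercasing), and its document-term matrix is:

        cat  mat  on  sat  the
doc 1     1    0   0    1    1
doc 2     1    1   1    1    2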

Creating a Bag of Words representation for a corpus generally entails the following four steps, sketched in code after the list:

  1. Tokenization: Split each document into a sequence of tokens (words).
  2. Create Vocabulary: Create a list of all unique tokens (words) across all documents in the corpus. Words are often normalized by converting them to lowercase and removing punctuation.
  3. Create Document Vectors: Create a vector for each document in the corpus. The vector is the same length as the vocabulary, and the value at each position is the number of times the corresponding vocabulary word appears in the document.
  4. Create Document-Term Matrix: Stack the document vectors into a 2D array where each row represents a document and each column represents a word in the vocabulary. The value in the cell at the \(i\)th row and \(j\)th column is the number of times word \(j\) appears in document \(i\).
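
Below is a minimal sketch of these four steps in plain Python, applied to the toy two-document corpus above; the variable names are illustrative only.

corpus = ["The cat sat", "The cat sat on the mat"]

# 1. Tokenization: lowercase each document and split it into words
tokenized = [doc.lower().split() for doc in corpus]

# 2. Vocabulary: the sorted list of unique tokens across the corpus
vocab = sorted({token for doc in tokenized for token in doc})

# 3. Document vectors: count how often each vocabulary word occurs in each document
vectors = [[doc.count(word) for word in vocab] for doc in tokenized]

# 4. Document-term matrix: one row per document, one column per vocabulary word
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]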

The image below shows a bag-of-words representation of a corpus of two documents: the vocabulary is the list of words on the left, and the document-term matrix is the 2D array of counts on the right.

import pandas as pd 

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/chat_dataset.csv')

data.head()
                               message sentiment
0           I really enjoyed the movie  positive
1                The food was terrible  negative
2   I'm not sure how I feel about this   neutral
3            The service was excellent  positive
4               I had a bad experience  negative
# 1. Tokenization: concatenate all messages into a single string, lowercase it, and split it into words
tokens = (' '.join(data['message'].values)).lower().split()

# 2. Vocabulary: Create a list of UNIQUE words in the dataset
vocab = list(set(tokens))

# 3 & 4. Document vectors / document-term matrix: create an empty DataFrame
# with one column per word in the vocab, then fill in the counts
bow = pd.DataFrame(columns=vocab)

# Go through each word in the vocab
for word in vocab:

    # For each message, count the number of times the word appears
    # (lowercase and split the message, then count exact token matches)
    bow[word] = data['message'].apply(lambda msg: msg.lower().split().count(word))

bow.head()
bow.head() displays the first five document vectors: a table with one column per vocabulary word, where each entry is the number of times that word appears in the corresponding message, so most entries are 0.

5 rows × 4291 columns
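
In practice, this matrix is rarely built by hand. The sketch below shows one common alternative using scikit-learn's CountVectorizer; it assumes scikit-learn is installed, and its default tokenizer lowercases text, strips punctuation, and drops single-character tokens, so its vocabulary will differ slightly from the one built above.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/fahadsultan/csc272/main/data/chat_dataset.csv')

# Learn the vocabulary and count word occurrences in one step
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'])   # sparse document-term matrix

# Wrap the counts in a DataFrame with one column per vocabulary word
bow_sklearn = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())

bow_sklearn.head()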