We denote matrices by bold capital letters (e.g., $\mathbf{A}$, $\mathbf{B}$) and represent them in code as pandas DataFrame objects. The expression $\mathbf{A} \in \mathbb{R}^{n \times d}$ means that the matrix $\mathbf{A}$ has $n$ rows and $d$ columns of real-valued entries. For example:
import pandas as pd
df = pd.DataFrame({'a': [1, 20, 3, 40], 'b': [50, 6, 70, 8]})
df.index = ['v1', 'v2', 'v3', 'v4']
df
| | a | b |
|---|---|---|
| v1 | 1 | 50 |
| v2 | 20 | 6 |
| v3 | 3 | 70 |
| v4 | 40 | 8 |
An (n, d) matrix can be interpreted as a collection of n vectors in d-dimensional space.
import seaborn as sns
from matplotlib import pyplot as plt

plt.style.use('dark_background')

# plot each row of df as a point in 2-D space
ax = sns.scatterplot(x='a', y='b', data=df, s=100);
ax.set(title='Scatterplot of a vs b', xlabel='a', ylabel='b');

# label each point with its row name (v1..v4)
def annotate(row):
    plt.text(x=row['a'] + 0.05, y=row['b'], s=row.name, size=20);

df.apply(annotate, axis=1);
Sometimes we want to flip the axes. When we exchange a matrix's rows and columns, the result is called its transpose. Formally, we signify a matrix $\mathbf{A}$'s transpose by $\mathbf{A}^\top$: if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for all $i$ and $j$.
In pandas, you can transpose a DataFrame with the .T attribute:
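For instance, a minimal sketch using the df defined above:

# flip rows and columns: df.T has two rows ('a', 'b') and four columns (v1..v4)
df.T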
Note that the columns of the original dataframe df are the same as the index of df.T.
Now that we know how to calculate dot products, we can begin to understand the product between an $n \times d$ matrix $\mathbf{A}$ and a $d$-dimensional vector $\mathbf{x}$.
To start off, we visualize our matrix in terms of its row vectors

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^\top \\ \mathbf{a}_2^\top \\ \vdots \\ \mathbf{a}_n^\top \end{bmatrix},$$

where each $\mathbf{a}_i^\top \in \mathbb{R}^d$ is a row vector representing the $i$-th row of the matrix $\mathbf{A}$.

The matrix–vector product $\mathbf{A}\mathbf{x}$ is simply a column vector of length $n$ whose $i$-th element is the dot product $\mathbf{a}_i^\top \mathbf{x}$:

$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}_1^\top \mathbf{x} \\ \mathbf{a}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{a}_n^\top \mathbf{x} \end{bmatrix}.$$
We can think of multiplication with a matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$ as a transformation that maps vectors from $\mathbb{R}^d$ to $\mathbb{R}^n$.
These transformations are remarkably useful. For example, we can represent rotations as multiplications by certain square matrices. Matrix–vector products also describe the key calculation involved in computing the outputs of each layer in a neural network given the outputs from the previous layer.
Note that, given a vector $\mathbf{x}$, the product $\mathbf{A}\mathbf{x}$ is defined only when the number of columns of $\mathbf{A}$ matches the dimension of $\mathbf{x}$.
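As a small sketch using the df defined earlier (the vector x below is made up purely to illustrate the mechanics):

import pandas as pd

# a d-dimensional vector whose index lines up with df's columns
x = pd.Series({'a': 1.0, 'b': 0.5})

# matrix-vector product: one dot product per row of df,
# returning a length-4 Series indexed v1..v4
df.dot(x)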
There is one thing to be careful about: recall that the formula for cosine similarity is

$$\cos(\theta) = \frac{\mathbf{u}^\top \mathbf{v}}{\lVert\mathbf{u}\rVert \, \lVert\mathbf{v}\rVert}.$$
The dot product in the numerator is equal to the cosine of the angle between the two vectors only when the vectors have already been normalized (i.e., each divided by its own norm).
The example below shows how to compute the cosine similarity between a vector (one message's unit-normalized bag-of-words representation) and every row of the matrix holding all messages.
# a random message: "I don't have an opinion on this"
msg = bow_unit.iloc[20]

# cosine similarity of this message with every message in the corpus
# (rows of bow_unit are unit-normalized, so the dot product is the cosine similarity)
msg_sim = bow_unit.dot(msg.T)
msg_sim.index = data['message']
msg_sim.sort_values(ascending=False)
message
I don't have an opinion on this 1.000000
I don't really have an opinion on this 0.984003
I have no strong opinion about this 0.971575
I have no strong opinions about this 0.971226
I have no strong opinion on this 0.969768
...
I'm not sure what to do 😕 0.204587
I'm not sure what to do next 🤷♂️ 0.201635
I'm not sure what to do next 🤔 0.200593
The food was not good 0.200295
The food was not very good 0.197220
Length: 584, dtype: float64
Once you have gotten the hang of dot products and matrix–vector products, matrix–matrix multiplication should be straightforward.
Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$.

Let $\mathbf{a}_i^\top \in \mathbb{R}^k$ denote the row vector representing the $i$-th row of $\mathbf{A}$, and let $\mathbf{b}_j \in \mathbb{R}^k$ denote the column vector taken from the $j$-th column of $\mathbf{B}$.

To form the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B} \in \mathbb{R}^{n \times m}$, we simply compute each element $c_{ij}$ as the dot product between the $i$-th row of $\mathbf{A}$ and the $j$-th column of $\mathbf{B}$, i.e. $c_{ij} = \mathbf{a}_i^\top \mathbf{b}_j$.

We can think of the matrix–matrix multiplication $\mathbf{A}\mathbf{B}$ as performing $m$ matrix–vector products (or, equivalently, $n \times m$ dot products) and stitching the results together to form an $n \times m$ matrix.
Note that, given two matrices whose rows are unit-normalized vectors, the product of the first with the transpose of the second contains the cosine similarity between every pair of rows. In particular, multiplying our unit bag-of-words matrix by its own transpose gives the similarity between every pair of messages at once.
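As a sketch (assuming the same unit-normalized bag-of-words DataFrame bow_unit used above), the similarity_matrix shown below could be computed with a single matrix–matrix product:

# (584, vocab) x (vocab, 584) -> (584, 584) matrix of pairwise cosine similarities
similarity_matrix = bow_unit.dot(bow_unit.T)
similarity_matrix.head()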
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.000000 | 0.568191 | 0.453216 | 0.565809 | 0.612787 | 0.608156 | 0.377769 | 0.631309 | 0.793969 | 0.642410 | ... | 0.516106 | 0.612624 | 0.660911 | 0.490580 | 0.845518 | 0.573549 | 0.648564 | 0.629598 | 0.594468 | 0.570452 |
1 | 0.568191 | 1.000000 | 0.508206 | 0.928580 | 0.687137 | 0.681944 | 0.423604 | 0.707906 | 0.551139 | 0.720354 | ... | 0.573755 | 0.686954 | 0.741100 | 0.546166 | 0.801864 | 0.643139 | 0.721001 | 0.705988 | 0.652357 | 0.639666 |
2 | 0.453216 | 0.508206 | 1.000000 | 0.506075 | 0.548093 | 0.707857 | 0.778316 | 0.621438 | 0.538883 | 0.648861 | ... | 0.457088 | 0.547948 | 0.650512 | 0.443105 | 0.593732 | 0.512998 | 0.574392 | 0.605515 | 0.520351 | 0.609064 |
3 | 0.565809 | 0.928580 | 0.506075 | 1.000000 | 0.685614 | 0.679085 | 0.421828 | 0.704938 | 0.548828 | 0.717334 | ... | 0.571349 | 0.684074 | 0.737993 | 0.543876 | 0.798502 | 0.640442 | 0.717978 | 0.703027 | 0.649622 | 0.638039 |
4 | 0.612787 | 0.687137 | 0.548093 | 0.685614 | 1.000000 | 0.735468 | 0.273748 | 0.559585 | 0.714100 | 0.599091 | ... | 0.926494 | 0.825530 | 0.806139 | 0.933257 | 0.802775 | 0.945544 | 0.931874 | 0.913601 | 0.928473 | 0.690556 |
5 rows × 584 columns
from matplotlib import pyplot as plt

# visualize the full 584 x 584 similarity matrix as a heatmap
plt.imshow(similarity_matrix, cmap='Greens')
plt.colorbar();
plt.title("Similarity Matrix: Each cell contains cosine \nsimilarity between two messages");
plt.xlabel("Message Index");
plt.ylabel("Message Index");
Note how:

1. the similarity matrix is symmetric, i.e. the similarity between message $i$ and message $j$ is the same as the similarity between message $j$ and message $i$; and
2. the diagonal entries are all 1, since every message has cosine similarity 1 with itself.
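As a quick numerical sanity check of the symmetry claim (a small sketch using NumPy):

import numpy as np

# the similarity matrix equals its own transpose up to floating-point error
np.allclose(similarity_matrix, similarity_matrix.T)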