Correlations

Covariance

Covariance measures how two random variables vary together. It is closely related to correlation, but it is not normalized, so its magnitude depends on the units of the two variables.

\[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n} \]

where \(x_i\) and \(y_i\) are the \(i\)-th values of the two variables, \(\mu_x\) and \(\mu_y\) are the means of the two variables, and \(n\) is the number of values.

The covariance can be positive, negative, or zero. A positive covariance indicates that the two variables tend to increase or decrease together, while a negative covariance indicates that one variable tends to increase as the other decreases. A covariance of zero indicates that there is no linear relationship between the two variables.
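As a minimal sketch of the formula above, the following computes the covariance of two small arrays by hand. The data and the helper name `population_cov` are hypothetical, chosen only for illustration. Note that the formula divides by \(n\), whereas NumPy's `np.cov` (like pandas' `.cov()`) divides by \(n - 1\) by default, so the two values differ slightly.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def population_cov(a, b):
    """Covariance as in the formula above: mean product of deviations (divide by n)."""
    return np.sum((a - a.mean()) * (b - b.mean())) / len(a)

cov_xy = population_cov(x, y)        # x and y tend to rise together -> positive
cov_x_neg_y = population_cov(x, -y)  # negating y flips the sign -> negative
sample_cov = np.cov(x, y)[0, 1]      # NumPy's default divides by n - 1 instead of n

print(cov_xy, cov_x_neg_y, sample_cov)
```

Here the sum of the products of deviations is 6, so dividing by \(n = 5\) gives 1.2 and dividing by \(n - 1 = 4\) gives 1.5.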

Correlations

Correlation is a statistical measure that describes the strength and direction of the relationship between two variables. It can be positive, negative, or zero.

Positive correlation: As one variable increases, the other variable tends to increase as well.

Negative correlation: As one variable increases, the other variable tends to decrease.

Zero correlation: There is no linear relationship between the two variables. The variables may still be related in a nonlinear way.

The correlation coefficient ranges from -1 to 1. A value of 1 indicates a perfect positive correlation, a value of -1 indicates a perfect negative correlation, and a value of 0 indicates no linear correlation.
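The three cases can be illustrated with a small sketch. The arrays below are hypothetical, and `np.corrcoef` is used here only as a convenient way to get the correlation coefficient. The last case shows that a coefficient of 0 does not rule out a relationship: `y_sq` is fully determined by `x`, but the relationship is not linear.

```python
import numpy as np

# Hypothetical toy data, symmetric around zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_up = 3 * x + 1     # exact increasing line
y_down = -3 * x + 1  # exact decreasing line
y_sq = x ** 2        # depends on x, but not linearly

r_up = np.corrcoef(x, y_up)[0, 1]      # perfect positive correlation
r_down = np.corrcoef(x, y_down)[0, 1]  # perfect negative correlation
r_sq = np.corrcoef(x, y_sq)[0, 1]      # zero correlation despite the dependence

print(r_up, r_down, r_sq)
```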

There are several methods to compute the correlation between two variables. The two most common are the Pearson correlation coefficient and the Spearman rank correlation coefficient.

Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship between two variables. It ranges from -1 to 1.

\[ r = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n} (x_i - \mu_x)^2} \sqrt{\sum_{i=1}^{n} (y_i - \mu_y)^2}} \]

where \(x_i\) and \(y_i\) are the \(i\)-th values of the two variables, \(\mu_x\) and \(\mu_y\) are the means of the two variables, and \(n\) is the number of values.
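The formula can be translated term by term into code. The sketch below uses hypothetical toy data and a hypothetical helper named `pearson_r`, and checks the result against `np.corrcoef`. Because the same normalizing factor appears in the numerator and the denominator, it cancels, which is why \(r\) is the same whether the covariance and standard deviations divide by \(n\) or by \(n - 1\).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r written directly from the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# Hypothetical toy data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # reference value from NumPy

print(r_manual, r_numpy)
```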

import pandas as pd 

url = "https://raw.githubusercontent.com/fahadsultan/csc272/main/data/elections.csv"

elections = pd.read_csv(url)

elections.head()

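# Two numeric columns of the elections data
x = elections['Popular vote']
y = elections['%']

# Pearson r = covariance / (product of standard deviations).
# .cov() and .std() both divide by n - 1 by default, so the factor cancels.
r = x.cov(y) / (x.std() * y.std())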

r
0.559061061317942

You can also compute the correlation between two columns of a DataFrame by calling the .corr() method on one column (a Series) and passing the other as an argument.

x.corr(y, method='pearson')
0.559061061317942

Spearman Correlation

The Spearman correlation coefficient measures the monotonic relationship between two variables. It ranges from -1 to 1.

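\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]

where \(d_i\) is the difference between the ranks of \(x_i\) and \(y_i\), and \(n\) is the number of values. This formula is exact only when there are no tied ranks. More generally, the Spearman coefficient is the Pearson correlation coefficient computed on the ranks of the two variables.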

x.corr(y, method='spearman')
0.7432486904455022
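As a rough sketch of how the rank-difference formula works, the example below ranks two small hypothetical Series (with no ties), applies the formula, and compares the result with the Pearson correlation of the ranks. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with no tied values.
x = pd.Series([10, 20, 30, 40, 50])
y = pd.Series([1, 4, 9, 100, 25])

rank_x = x.rank()  # ranks 1..n, smallest value gets rank 1
rank_y = y.rank()

d = rank_x - rank_y  # rank difference for each pair
n = len(x)

# Spearman coefficient from the rank-difference formula (valid without ties)
rho_formula = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Equivalent view: Pearson correlation computed on the ranks
rho_ranks = rank_x.corr(rank_y, method='pearson')

print(rho_formula, rho_ranks)
```

Only the last two values of `y` are out of order relative to `x`, so the sum of squared rank differences is 2 and both computations give 0.9. The large value 100 has no extra influence, because only its rank is used.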

.corr()

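When called on a DataFrame, the .corr() method computes the pairwise correlation between its columns and returns the result as a correlation matrix. By default, it computes the Pearson correlation coefficient; the method parameter selects another method, such as 'spearman' or 'kendall'. In pandas 2.0 and later, pass numeric_only=True if the DataFrame contains non-numeric columns.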
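A minimal sketch of the DataFrame version, using a small hypothetical DataFrame (the column names `a`, `b`, and `c` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy DataFrame with three numeric columns.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # exactly 2 * a
    'c': [5, 3, 4, 1, 2],   # tends to decrease as a increases
})

pearson_matrix = df.corr()                    # default: method='pearson'
spearman_matrix = df.corr(method='spearman')  # rank-based alternative

print(pearson_matrix)
print(spearman_matrix)
```

Each matrix is symmetric with 1s on the diagonal, since every column is perfectly correlated with itself. The entry for a given pair of columns can be looked up with `.loc`, for example `pearson_matrix.loc['a', 'c']`.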