Joint Probability

import pandas as pd
import warnings 
warnings.filterwarnings("ignore")

data = pd.read_csv("../data/Shark Tank US dataset.csv")

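# Keep the first 30 columns; .copy() avoids a SettingWithCopyWarning on the assignment below
data = data[data.columns[:30]].copy()
data['Got Deal'] = data['Got Deal'].astype(bool)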
data.head(2)

	Season Number	Startup Name	Episode Number	Pitch Number	Season Start	Season End	Original Air Date	Industry	Business Description	Company Website	...	Got Deal	Total Deal Amount	Total Deal Equity	Deal Valuation	Number of Sharks in Deal	Investment Amount Per Shark	Equity Per Shark	Royalty Deal	Advisory Shares Equity	Loan
0	1	AvaTheElephant	1	1	9-Aug-09	5-Feb-10	9-Aug-09	Health/Wellness	Ava The Elephant - Baby and Child Care	http://www.avatheelephant.com/	...	True	50000.0	55.0	90909.0	1.0	50000.0	55.0	NaN	NaN	NaN
1	1	MrTod'sPieFactory	1	2	9-Aug-09	5-Feb-10	9-Aug-09	Food and Beverage	Mr. Tod's Pie Factory - Specialty Food	http://whybake.com/	...	True	460000.0	50.0	920000.0	2.0	230000.0	25.0	NaN	NaN	NaN

2 rows × 30 columns

Joint Frequency

The joint frequency of two events is the number of times they both occur in a given number of trials.

In pandas, we can calculate the joint frequency of two events by using the crosstab function.

pd.crosstab(data['Industry'], data['Got Deal'])

Got Deal	False	True
Industry
Automotive	4	13
Business Services	21	19
Children/Education	45	78
Electronics	9	7
Fashion/Beauty	98	128
Fitness/Sports/Outdoors	48	79
Food and Beverage	116	180
Green/CleanTech	5	6
Health/Wellness	27	40
Lifestyle/Home	83	163
Liquor/Alcohol	5	5
Media/Entertainment	9	17
Pet Products	24	33
Software/Tech	31	38
Travel	6	5
Uncertain/Other	8	15

Joint Probability \(P(A, B)\)

Joint probability is the probability of two events occurring together. It is usually denoted by \(P(A, B)\), which is shorthand for \(P(A \wedge B)\) and is read as "the probability of \(A\) AND \(B\)".

Note that \(P(A, B) = P(B, A)\) since \(A \wedge B = B \wedge A\).

For example, if we are rolling two dice, one joint probability is the probability of rolling a 1 on the first die and a 2 on the second die. For two fair dice, this is \(1/36\).
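The dice example can be checked by enumerating the sample space. The sketch below assumes two fair six-sided dice, so every ordered pair of faces is equally likely; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair (first die, second die) for two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Under the fairness assumption, each of the 36 outcomes is equally likely.
p_each = Fraction(1, len(outcomes))

# Joint probability of "first die shows 1" AND "second die shows 2".
p_1_and_2 = sum(p_each for first, second in outcomes if first == 1 and second == 2)

print(p_1_and_2)  # 1/36
```

Exact fractions are used here so that the result is not blurred by floating-point rounding.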

In Data Science, we rarely know the true joint probability. Instead, we estimate the joint probability from data. We will talk more about this when we talk about Statistics.

joint_prob = pd.crosstab(data['Industry'], data['Got Deal'], normalize=True)

joint_prob

Got Deal	False	True
Industry
Automotive	0.002930	0.009524
Business Services	0.015385	0.013919
Children/Education	0.032967	0.057143
Electronics	0.006593	0.005128
Fashion/Beauty	0.071795	0.093773
Fitness/Sports/Outdoors	0.035165	0.057875
Food and Beverage	0.084982	0.131868
Green/CleanTech	0.003663	0.004396
Health/Wellness	0.019780	0.029304
Lifestyle/Home	0.060806	0.119414
Liquor/Alcohol	0.003663	0.003663
Media/Entertainment	0.006593	0.012454
Pet Products	0.017582	0.024176
Software/Tech	0.022711	0.027839
Travel	0.004396	0.003663
Uncertain/Other	0.005861	0.010989

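# Sum every cell of the table: first down each column, then across the columns.
# (The built-in sum(joint_prob) would add up the column labels False and True instead.)
joint_prob.sum().sum()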

Note that the joint probabilities over all combinations of outcomes sum to 1, since exactly one combination must occur.

As an example, let \(C\) be a coin toss and \(D\) be a die roll. The following three are all true at the same time:

\(\sum_{C, D} P(C, D) = 1\) where \(P(C, D)\) is a probability table with 12 rows (one per combination of \(C\) and \(D\)) and 3 columns: \(C, D, P(C, D)\).
\(\sum_{C} P(C) = 1\) where \(P(C)\) is a probability table with 2 rows (\(\{H, T\}\)) and 2 columns: \(C, P(C)\).
\(\sum_{D} P(D) = 1\) where \(P(D)\) is a probability table with 6 rows (\(\{1, 2, 3, 4, 5, 6\}\)) and 2 columns: \(D, P(D)\).
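The three tables above can be built explicitly to check their shapes and sums. This is a minimal sketch that assumes a fair coin and a fair die tossed independently, so each of the 12 combinations has probability 1/12; the names `joint`, `p_c` and `p_d` are hypothetical.

```python
from itertools import product

import pandas as pd

# Assumption: a fair coin C and a fair die D, so each (C, D) pair has probability 1/12.
rows = [(c, d, 1 / 12) for c, d in product(["H", "T"], range(1, 7))]
joint = pd.DataFrame(rows, columns=["C", "D", "P(C, D)"])

# Tables for P(C) and P(D), obtained by summing the joint table over the other variable.
p_c = joint.groupby("C", as_index=False)["P(C, D)"].sum().rename(columns={"P(C, D)": "P(C)"})
p_d = joint.groupby("D", as_index=False)["P(C, D)"].sum().rename(columns={"P(C, D)": "P(D)"})

print(joint.shape, p_c.shape, p_d.shape)  # (12, 3) (2, 2) (6, 2)

# Each table's probability column sums to 1 (up to floating-point rounding).
totals = (joint["P(C, D)"].sum(), p_c["P(C)"].sum(), p_d["P(D)"].sum())
print([round(t, 10) for t in totals])
```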

from matplotlib import pyplot as plt

plt.style.use('dark_background')

fig, ax = plt.subplots(figsize=(10, 6))

ax.bar(joint_prob.index, joint_prob[True], label='Got Deal', color='blue', alpha=0.5)
ax.bar(joint_prob.index, joint_prob[False], bottom=joint_prob[True], label='No Deal', color='red', alpha=0.5)

# Rotate the existing category labels instead of re-setting them
ax.tick_params(axis='x', rotation=90)
ax.set_ylabel('Probability')
ax.set_title('Probability of Getting a Deal by Industry')
ax.legend();



plt.style.use('dark_background')

fig, ax = plt.subplots(figsize=(10, 6))

x_values  = pd.Series(range(len(joint_prob.index)))
x_values1 = x_values + 0.1
x_values2 = x_values - 0.1

ax.bar(x_values1, joint_prob[True],  label='Got Deal', color='blue', alpha=0.5, width=0.2)
ax.bar(x_values2, joint_prob[False], label='No Deal', color='red', alpha=0.5, width=0.2)

ax.set_xticks(x_values)
ax.set_xticklabels(joint_prob.index, rotation=90)

ax.set_ylabel('Probability')
ax.set_title('Probability of Getting a Deal by Industry')
ax.legend();

Marginal Probability \(P(A)\)

Because most data sets are multi-dimensional, i.e. they involve multiple random variables, we often find ourselves knowing the joint probability \(P(A, B)\) of two random variables \(A\) and \(B\) without knowing \(P(A)\) or \(P(B)\) on their own. In such cases, we compute the marginal probability of one variable from the joint probability over multiple random variables.

Marginalizing is the process of summing over one or more variables (say B) to get the probability of another variable (say A). This summing takes place over the joint probability table.

\[ P(A) = \sum_{b \in \Omega_B} P(A, B=b) \]
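To make the summation concrete, here is a small sketch that applies the formula to a made-up joint distribution. The labels `a1`, `a2`, `b1`, `b2`, `b3` and all probabilities are hypothetical and chosen only so that they sum to 1.

```python
from collections import defaultdict

# Hypothetical joint distribution P(A, B); the numbers are made up and sum to 1.
joint = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.25, ("a1", "b3"): 0.05,
    ("a2", "b1"): 0.20, ("a2", "b2"): 0.15, ("a2", "b3"): 0.25,
}

# P(A = a) is the sum of P(A = a, B = b) over every value b of B.
p_a = defaultdict(float)
for (a, b), p in joint.items():
    p_a[a] += p

print({a: round(p, 2) for a, p in p_a.items()})  # {'a1': 0.4, 'a2': 0.6}
```

Summing over \(A\) instead would give \(P(B)\) in the same way.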

joint_prob

Got Deal	False	True
Industry
Automotive	0.002930	0.009524
Business Services	0.015385	0.013919
Children/Education	0.032967	0.057143
Electronics	0.006593	0.005128
Fashion/Beauty	0.071795	0.093773
Fitness/Sports/Outdoors	0.035165	0.057875
Food and Beverage	0.084982	0.131868
Green/CleanTech	0.003663	0.004396
Health/Wellness	0.019780	0.029304
Lifestyle/Home	0.060806	0.119414
Liquor/Alcohol	0.003663	0.003663
Media/Entertainment	0.006593	0.012454
Pet Products	0.017582	0.024176
Software/Tech	0.022711	0.027839
Travel	0.004396	0.003663
Uncertain/Other	0.005861	0.010989

# marginal probability of getting a deal
marginal_prob = joint_prob.sum(axis=0)
marginal_prob

Got Deal
False    0.394872
True     0.605128
dtype: float64

data['Got Deal'].value_counts(normalize=True)

True     0.605128
False    0.394872
Name: Got Deal, dtype: float64

# marginal probability of industry
marginal_prob = joint_prob.sum(axis=1)
marginal_prob

Industry
Automotive                 0.012454
Business Services          0.029304
Children/Education         0.090110
Electronics                0.011722
Fashion/Beauty             0.165568
Fitness/Sports/Outdoors    0.093040
Food and Beverage          0.216850
Green/CleanTech            0.008059
Health/Wellness            0.049084
Lifestyle/Home             0.180220
Liquor/Alcohol             0.007326
Media/Entertainment        0.019048
Pet Products               0.041758
Software/Tech              0.050549
Travel                     0.008059
Uncertain/Other            0.016850
dtype: float64

fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Joint probability of industry and deal outcome (stacked bars)
axs[0].bar(joint_prob.index, joint_prob[True], label='Got Deal', color='blue', alpha=0.5)
axs[0].bar(joint_prob.index, joint_prob[False], bottom=joint_prob[True], label='No Deal', color='red', alpha=0.5)
axs[0].tick_params(axis='x', rotation=90)
axs[0].set_ylabel('Probability')
axs[0].set_title('Probability of Getting a Deal by Industry')

# Marginal probability of getting a deal
# The index is ordered [False, True], so the first bar is 'No Deal'
marginal_prob_deal = joint_prob.sum(axis=0)
axs[1].bar([0, 1], marginal_prob_deal.values, color=['red', 'blue'], alpha=0.5)
axs[1].set_xticks([0, 1])
axs[1].set_xticklabels(['No Deal', 'Got Deal'])
axs[1].set_ylabel('Probability')
axs[1].set_title('Marginal Probability of Getting a Deal')

# Marginal probability of industry
marginal_prob_industry = joint_prob.sum(axis=1)
axs[2].bar(joint_prob.index, marginal_prob_industry, color='green', alpha=0.5)
axs[2].tick_params(axis='x', rotation=90)
axs[2].set_ylabel('Probability')
axs[2].set_title('Marginal Probability of Industry');

As we look at new concepts in probability, it is important to stay mindful of (i) what the probabilities sum to and (ii) the dimensions of the table that represents them.

The marginal probability table has one entry per value that the variable can take: 2 entries for Got Deal and 16 entries for Industry.

Both of the computed marginal probabilities add up to 1.
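These two properties can be checked directly. The sketch below uses a tiny made-up stand-in for the Shark Tank data so that it runs on its own; the `toy` records and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Shark Tank data: six made-up pitches.
toy = pd.DataFrame({
    "Industry": ["Food", "Food", "Tech", "Tech", "Travel", "Food"],
    "Got Deal": [True, False, True, True, False, True],
})

toy_joint = pd.crosstab(toy["Industry"], toy["Got Deal"], normalize=True)
toy_deal = toy_joint.sum(axis=0)      # marginal of Got Deal (summed over industries)
toy_industry = toy_joint.sum(axis=1)  # marginal of Industry (summed over deal outcomes)

# One entry per distinct value of each variable.
print(toy_deal.shape, toy_industry.shape)  # (2,) (3,)

# Each marginal sums to 1, up to floating-point rounding.
print(round(toy_deal.sum(), 10), round(toy_industry.sum(), 10))
```

In the notebook itself, the same checks would apply to `joint_prob.sum(axis=0)` and `joint_prob.sum(axis=1)`.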

Independent Random Variables

Random variables can be either independent or dependent. If two random variables are independent, then knowing the value of one tells us nothing about the probability of the values of the other.

For example, if we are rolling two dice, we can use two random variables to represent the numbers that we roll. The two random variables are independent because the value of one die does not affect the value of the other die.

If two random variables are dependent, then the value of one random variable does affect the probability of the values of the other. For example, if we are measuring the temperature and the humidity, we can use two random variables to represent them. The two random variables are dependent because the temperature affects the humidity and the humidity affects the temperature.

More formally, two random variables \(X\) and \(Y\) are independent if and only if \(P(X, Y) = P(X) \cdot P(Y)\) for every pair of values of \(X\) and \(Y\).
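The product rule can be checked cell by cell on small examples. The sketch below uses a hypothetical helper, `is_independent`, and two made-up distributions: two separate fair dice, and a single fair die read twice (so the two values always agree).

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Check P(X, Y) == P(X) * P(Y) for every pair of values, using exact fractions."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals: sum the joint over the other variable (missing pairs count as 0).
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in xs for y in ys)


# Two separate fair dice: all 36 ordered pairs are equally likely.
two_dice = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

# One fair die read twice: only the pairs (x, x) can occur.
same_die = {(x, x): Fraction(1, 6) for x in range(1, 7)}

print(is_independent(two_dice))  # True
print(is_independent(same_die))  # False
```

Exact equality is reasonable here only because the probabilities are known exactly; probabilities estimated from data will rarely satisfy the rule to the last digit.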

# Test for independence of Industry and Got Deal

prob_deal     = joint_prob.sum(axis=0)
prob_industry = joint_prob.sum(axis=1)

prob_deal

Got Deal
False    0.394872
True     0.605128
dtype: float64

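# Recompute the observed joint probabilities
joint_prob = pd.crosstab(data['Industry'], data['Got Deal'], normalize=True)

# Joint probabilities expected under independence: P(Industry) * P(Got Deal),
# stored in the 'False2' and 'True2' columns next to the observed values
joint_prob['False2'] = joint_prob.apply(lambda x: (prob_deal[False] * prob_industry[x.name]), axis=1)
joint_prob['True2'] = joint_prob.apply(lambda x: (prob_deal[True] * prob_industry[x.name]), axis=1)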

joint_prob

Got Deal	False	True	False2	True2
Industry
Automotive	0.002930	0.009524	0.004918	0.007536
Business Services	0.015385	0.013919	0.011571	0.017733
Children/Education	0.032967	0.057143	0.035582	0.054528
Electronics	0.006593	0.005128	0.004629	0.007093
Fashion/Beauty	0.071795	0.093773	0.065378	0.100190
Fitness/Sports/Outdoors	0.035165	0.057875	0.036739	0.056301
Food and Beverage	0.084982	0.131868	0.085628	0.131222
Green/CleanTech	0.003663	0.004396	0.003182	0.004876
Health/Wellness	0.019780	0.029304	0.019382	0.029702
Lifestyle/Home	0.060806	0.119414	0.071164	0.109056
Liquor/Alcohol	0.003663	0.003663	0.002893	0.004433
Media/Entertainment	0.006593	0.012454	0.007521	0.011526
Pet Products	0.017582	0.024176	0.016489	0.025269
Software/Tech	0.022711	0.027839	0.019961	0.030589
Travel	0.004396	0.003663	0.003182	0.004876
Uncertain/Other	0.005861	0.010989	0.006654	0.010196

# Round to 2 decimals so that very small differences are ignored in the comparison below
joint_prob = joint_prob.round(2)
joint_prob

Got Deal	False	True	False2	True2
Industry
Automotive	0.00	0.01	0.00	0.01
Business Services	0.02	0.01	0.01	0.02
Children/Education	0.03	0.06	0.04	0.05
Electronics	0.01	0.01	0.00	0.01
Fashion/Beauty	0.07	0.09	0.07	0.10
Fitness/Sports/Outdoors	0.04	0.06	0.04	0.06
Food and Beverage	0.08	0.13	0.09	0.13
Green/CleanTech	0.00	0.00	0.00	0.00
Health/Wellness	0.02	0.03	0.02	0.03
Lifestyle/Home	0.06	0.12	0.07	0.11
Liquor/Alcohol	0.00	0.00	0.00	0.00
Media/Entertainment	0.01	0.01	0.01	0.01
Pet Products	0.02	0.02	0.02	0.03
Software/Tech	0.02	0.03	0.02	0.03
Travel	0.00	0.00	0.00	0.00
Uncertain/Other	0.01	0.01	0.01	0.01

joint_prob[False] == joint_prob['False2']

Industry
Automotive                  True
Business Services          False
Children/Education         False
Electronics                False
Fashion/Beauty              True
Fitness/Sports/Outdoors     True
Food and Beverage          False
Green/CleanTech             True
Health/Wellness             True
Lifestyle/Home             False
Liquor/Alcohol              True
Media/Entertainment         True
Pet Products                True
Software/Tech               True
Travel                      True
Uncertain/Other             True
dtype: bool

joint_prob[True] == joint_prob['True2']

Industry
Automotive                  True
Business Services          False
Children/Education         False
Electronics                 True
Fashion/Beauty             False
Fitness/Sports/Outdoors     True
Food and Beverage           True
Green/CleanTech             True
Health/Wellness             True
Lifestyle/Home             False
Liquor/Alcohol              True
Media/Entertainment         True
Pet Products               False
Software/Tech               True
Travel                      True
Uncertain/Other             True
dtype: bool
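Comparing rounded values is a coarse check: two numbers such as 0.004 and 0.001 both round to 0.00, while 0.0149 and 0.0151 round to different values. One alternative, sketched below, is to look at the absolute gap between the observed joint table and the product of its marginals, and to compare it against an explicit tolerance. The table `observed`, its labels and the tolerance of 0.005 are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical joint table (made-up numbers that sum to 1), standing in for joint_prob.
observed = pd.DataFrame(
    [[0.10, 0.30], [0.20, 0.40]],
    index=["Food", "Tech"],
    columns=["No Deal", "Got Deal"],
)

# Joint table implied by independence: outer product of the row and column marginals.
expected = pd.DataFrame(
    np.outer(observed.sum(axis=1), observed.sum(axis=0)),
    index=observed.index,
    columns=observed.columns,
)

# Absolute gap per cell, and a tolerance-based check instead of exact equality.
gap = (observed - expected).abs()
matches = bool(np.isclose(observed, expected, atol=0.005).all())

print(gap.round(3))
print(matches)  # False: every cell differs by 0.02, which exceeds the tolerance
```

The choice of tolerance is a judgment call. Because the probabilities are estimated from a sample, some gap is expected even when the variables are truly independent.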