Bernoulli and Binomial Random Variables with Python

College Statistics with Python

Introduction

In a series of weekly articles, I will be covering some important topics of statistics with a twist.

The goal is to use Python to help us get intuition on complex concepts, empirically test theoretical proofs, or build algorithms from scratch. In this series, you will find articles covering topics such as random variables, sampling distributions, confidence intervals, significance tests, and more.

At the end of each article, you can find exercises to test your knowledge. The solutions will be shared in the article of the following week.

Articles published so far:

  • Bernoulli and Binomial Random Variables with Python
  • From Binomial to Geometric and Poisson Random Variables with Python
  • Sampling Distribution of a Sample Proportion with Python
  • Confidence Intervals with Python
  • Significance Tests with Python
  • Two-sample Inference for the Difference Between Groups with Python
  • Inference for Categorical Data
  • Advanced Regression
  • Analysis of Variance — ANOVA

As usual, the code is available on my GitHub.

Random Variables

We are going to start by defining what exactly a Random Variable (RV) is. The first important aspect to consider is that it is not a traditional variable. A random variable can take many different values with different probabilities, so we cannot solve for it as we would, for instance, in the equation y = x + 1. Instead, it makes more sense to talk about the probability of an RV being less than or greater than some value. In short, an RV maps outcomes of random processes to numbers.

The simplest example to help us think about an RV is a coin flip.


Figure 1: Coin flips help us understand the concept of an RV; source

The possible outcomes are “heads” and “tails”, which we have quantified to 1 and 0, respectively.

X = 1 if the outcome is "heads"
X = 0 if the outcome is "tails"
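As a quick sketch of this mapping (using numpy's random generator; the seed is arbitrary and only there for reproducibility), we can simulate flips of a fair coin and check that the long-run average approaches 0.5:

```python
import numpy as np

rng = np.random.default_rng(42)

# Map each flip to a number: heads -> 1, tails -> 0
flips = rng.choice([0, 1], size=10)
print(flips)

# With many flips, the sample mean approaches P(heads) = 0.5
many_flips = rng.choice([0, 1], size=100_000)
print(many_flips.mean())
```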

Bernoulli Random Variable

Random Variables can be either discrete or continuous. We will start by focusing on discrete RVs. By definition, discrete variables can only take distinct values, as in the coin-flip example above. For these variables, you can count the number of different values they can take on.

Our RV X defined above is actually a Bernoulli RV. It can take the value 1 with probability p (in the case of a fair coin, p is equal to 0.5 or 50%) and the value 0 with probability q = 1-p. Its Probability Mass Function can then be defined as:

P(X = 1) = p
P(X = 0) = q = 1 - p

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import bernoulli, binom
import seaborn as sns

Let’s start by defining a Bernoulli RV X with a success probability of p=0.3.

p = 0.3
X = bernoulli(p)

We can print the values for its Probability Mass Function on 0 and 1.

print(np.round(X.pmf(1),2))
print(np.round(X.pmf(0), 2))
0.3
0.7

To help visualize, let’s empirically arrive at the same values by drawing 100,000 samples from our variable.

X_samples = X.rvs(100000)
sns.histplot(X_samples, stat="density", discrete=True, shrink=0.2);


Figure 2: Sampling distribution of our RV X

It looks as expected; we have a 0.3 probability of success and a 0.7 probability of failure.

Let’s define the mean and variance of our RV. The mean is calculated by summing the product of the value of each outcome by the corresponding probability. The variance is the squared deviation of the value of each outcome from the mean, weighted by the respective probability. More formally:

E(X) = Σ x_i · P(X = x_i) = 1·p + 0·(1-p) = p
Var(X) = Σ (x_i - E(X))^2 · P(X = x_i) = (1-p)^2·p + (0-p)^2·(1-p) = p(1-p)
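scipy can compute these theoretical moments for us, which gives a quick sanity check of the formulas (a small sketch reusing p = 0.3 from above):

```python
from scipy.stats import bernoulli

p = 0.3
X = bernoulli(p)

# Theoretical moments straight from scipy
print(X.mean())  # p -> 0.3
print(X.var())   # p*(1-p) -> 0.21
print(X.std())   # sqrt(p*(1-p)) -> ~0.458
```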

We can compare our empirically calculated mean with the theoretical mean that we just derived. They are indeed very close, and they would become closer as the sample size increases.

print('Empirically calculated mean: {}'.format(X_samples.mean()))
print('Theoretical mean: {}'.format(p))
print('Empirically calculated standard deviation: {}'.format(X_samples.std()))
print('Theoretical standard deviation: {}'.format((p*(1-p))**(1/2)))
Empirically calculated mean: 0.30059
Theoretical mean: 0.3
Empirically calculated standard deviation: 0.45851461470709964
Theoretical standard deviation: 0.458257569495584

Binomial Random Variable

We can look at a Binomial RV as a set of Bernoulli experiments or trials. This way, our understanding of how the properties of the distribution are derived becomes significantly simpler.

Before diving into definitions, let’s start with the main conditions that need to be fulfilled to define our RV as Binomial:

  • The trials are independent;
  • Each trial can be classified as either success or failure;
  • There is a fixed number of trials;
  • The probability of success on each trial is constant.

Let’s define the RV Z as the number of successes after n trials where P(success) for each trial is p.

Let’s also define Y, a Bernoulli RV with P(Y=1)=p and P(Y=0)=1-p.

Y represents each independent trial that composes Z. We already derived both the variance and expected value of Y above.

E(Y) = p
Var(Y) = p(1-p)

Using the following property E(X+Y)=E(X)+E(Y), we can derive the expected value of our Binomial RV Z:

E(Z) = E(Y_1 + Y_2 + … + Y_n) = E(Y_1) + E(Y_2) + … + E(Y_n) = n·E(Y) = n·p

Recall that we have n independent trials, i.e., n independent copies of the RV Y being summed.

Deriving Var(Z) follows the same process, because Var(X+Y) = Var(X) + Var(Y) also holds when X and Y are independent. Then we have:

Var(Z) = Var(Y_1 + Y_2 + … + Y_n) = n·Var(Y) = n·p·(1-p)
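scipy's binom object exposes these moments directly, so we can check the derivation before running any experiment (a small sketch with n = 6 and p = 0.3, matching the values used below):

```python
from scipy.stats import binom

n, p = 6, 0.3
Z = binom(n, p)

# Theoretical moments straight from scipy
print(Z.mean())  # n*p -> 1.8
print(Z.var())   # n*p*(1-p) -> 1.26
```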

Let’s now test our theoretical understanding with an experiment.

n=6
p = 0.3
Y = bernoulli(p)

We defined our Y variable. We can construct our Z variable from this Y variable as defined above; these are the independent Bernoulli trials. Let’s assume that we have 6 independent trials.

Y_samples = [Y.rvs(1000000) for i in range(6)]
Y_samples
[array([0, 0, 0, ..., 0, 1, 1]),
array([1, 0, 1, ..., 0, 0, 0]),
array([0, 0, 1, ..., 1, 1, 0]),
array([1, 0, 0, ..., 0, 0, 0]),
array([0, 0, 0, ..., 0, 1, 0]),
array([0, 0, 0, ..., 0, 0, 0])]
Z_samples = sum(Y_samples)
print('Empirically calculated expected value: {}'.format(Z_samples.mean()))
print('Theoretical expected value: {}'.format(n*p))
Empirically calculated expected value: 1.800219
Theoretical expected value: 1.7999999999999998
print('Empirically calculated variance: {}'.format(Z_samples.var()))
print('Theoretical variance: {}'.format(n*p*(1-p)))
Empirically calculated variance: 1.2612705520390002
Theoretical variance: 1.2599999999999998

We feel better about our theoretical derivations as our experiment shows that we are indeed on the right path.

We can also plot our Binomial distribution. Remember that it is a discrete distribution.

sns.histplot(Z_samples, stat="density", discrete=True, shrink=0.3);


Figure 3: Sampling distribution of our RV Z, which is a Binomial RV.

Binomial PMF and CDF

The Binomial Probability Mass Function (PMF) can be written in the following way:

P(Z = k) = (n choose k) · p^k · (1-p)^(n-k)

It seems a bit daunting at first; let’s try to break it down into smaller interpretable pieces.

The first term is just the binomial coefficient or the number of different ways we can choose k items from n possible ones when the order does not matter, i.e., the set ABC is the same as CBA.

(n choose k) = n! / (k! · (n-k)!)

Recall that n!/(n-k)! is the number of permutations, or the number of different ways we can choose k items from n possible ones when the order matters, i.e., ABC and CBA count as two different results. k! is just the number of ways to arrange k items. For example, in the case of 3 items and 3 positions, we have the following possibilities:

ABC, ACB, BAC, BCA, CAB, CBA

Recall that 3!=3 * 2 *1= 6.
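These counts are easy to verify with Python's standard library: itertools enumerates the orderings, and math.factorial counts them.

```python
import itertools
import math

# All orderings of 3 items: 3! = 6 permutations
perms = list(itertools.permutations("ABC"))
print(len(perms))         # 6
print(math.factorial(3))  # 6

# Ordered vs. unordered selections of 2 items out of 3
print(len(list(itertools.permutations("ABC", 2))))  # 6 (order matters)
print(len(list(itertools.combinations("ABC", 2))))  # 3 (order does not)
```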

Let’s start building small functions to handle our different components, starting with a function to compute the factorial of an input argument.

def fact(n):
    x = 1
    for i in range(1, n+1):
        x *= i
    return x

fact(3)

6

Now, we can use our function fact() inside of another function that computes the binomial coefficient.

def comb(n, k):
    return fact(n) / (fact(k) * fact(n-k))

comb(6, 3)

20.0

And finally, putting everything together:

def binompmf(prob, n, k):
    return comb(n, k) * prob**k * (1-prob)**(n-k)
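To gain confidence in our implementation, we can compare it against scipy.stats.binom.pmf. This sketch uses the standard library's math.comb as a stand-in for our comb() so the snippet is self-contained; the formula is the same.

```python
import math
from scipy.stats import binom

def binompmf(prob, n, k):
    # Same formula as above, with the standard-library binomial coefficient
    return math.comb(n, k) * prob**k * (1 - prob)**(n - k)

# Our PMF should agree with scipy's for every k
for k in range(7):
    assert abs(binompmf(0.3, 6, k) - binom.pmf(k, 6, 0.3)) < 1e-10

print(binompmf(0.3, 6, 2))  # ~0.3241
```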

A useful function that you often find in statistical packages together with the PMF of a distribution is the Cumulative Distribution Function (CDF). It is nothing more than the probability that our RV takes a value less than or equal to a given value z:

P(Z ≤ z) = Σ (k = 0 to z) P(Z = k) = Σ (k = 0 to z) (n choose k) · p^k · (1-p)^(n-k)

def binomcdf(prob, n, x):
    result = 0
    for x_ in range(0, x+1):
        result += binompmf(prob, n, x_)
    return result
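The same cross-check works for the CDF against scipy.stats.binom.cdf (again self-contained, with math.comb standing in for our comb()):

```python
import math
from scipy.stats import binom

def binompmf(prob, n, k):
    return math.comb(n, k) * prob**k * (1 - prob)**(n - k)

def binomcdf(prob, n, x):
    # Running sum of the PMF up to and including x
    return sum(binompmf(prob, n, k) for k in range(x + 1))

# Should agree with scipy for every x
for x in range(7):
    assert abs(binomcdf(0.3, 6, x) - binom.cdf(x, 6, 0.3)) < 1e-10

print(binomcdf(0.3, 6, 6))  # covers the whole support -> 1.0
```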

Let’s try it out.

Figure 4: Cristiano Ronaldo’s free-kick execution can be modeled using a Binomial distribution, source

We want to assess the probability distribution that models Cristiano Ronaldo’s ability to score free-kicks. We will be using the table below, which shows Ronaldo’s free-kick record in La Liga and the Champions League:


Table 1: Ronaldo’s free-kick record, source

Ronaldo has a 0.094 probability of successfully converting a free-kick in the Champions League. Based on that, what is Ronaldo’s probability of scoring exactly 1 out of 7 free-kicks in the Champions League?

binompmf(0.094, 7, 1)

0.3639109131870316

What is the probability of scoring less than 2?

binomcdf(0.094, 7, 1)

0.8649797389430357
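We can double-check both answers with scipy (same parameters: p = 0.094 and n = 7 attempts):

```python
from scipy.stats import binom

p, n = 0.094, 7

# P(exactly 1 goal out of 7 attempts)
print(binom.pmf(1, n, p))      # ~0.3639

# P(fewer than 2 goals), i.e. P(Z <= 1)
print(binom.cdf(1, n, p))      # ~0.8650

# Complement: P(2 or more goals)
print(1 - binom.cdf(1, n, p))  # ~0.1350
```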

Exercises:

You will get the solutions in next week’s article.

1. A company produces bottles of water. In its main factory, the number of defective bottles is 5%. A quality check consists of randomly selecting and testing 1000 bottles. What are the mean and standard deviation of the number of defective bottles in these samples?

2. A wine company is running a promotion that says 1-in-4 boxes of the wine contain a surprise. Suppose that you will buy 5 boxes of this wine, and let X represent the number of surprises you can win in these boxes. Assume that these boxes represent a random sample and assume that surprises are independent between boxes. What is the probability that you win at most 1 surprise in the 5 boxes?

(Can you solve this problem in 3 different ways? Tip: by sampling, summing individual probabilities, and using the CDF)

3. A math teacher is doing an activity with her students where she gives them a 20-question multiple-choice test, and they know none of the answers. Students need to guess on every question, and each question has 5 possible choices, 1 of which is correct. What are the mean and standard deviation of the number of questions that each student gets correct?

(Can you solve this problem in 2 different ways? Tip: by sampling and using the theoretical derivation that we did above)