College Statistics with Python

Introduction

In a series of weekly articles, I will be covering some important topics of statistics with a twist. The goal is to use Python to help us get intuition on complex concepts, empirically test theoretical proofs, or build algorithms from scratch. In this series, you will find articles covering topics such as random variables, sampling distributions, confidence intervals, significance tests, and more. At the end of each article, you can find exercises to test your knowledge; the solutions will be shared in the following week’s article. Articles published so far:
As usual, the code is available on my GitHub.

Random Variables

We are going to start by defining what exactly a Random Variable (RV) is. The first important aspect to consider is that it is not a traditional variable. A random variable can take many different values with different probabilities, so we cannot solve for it the way we would in an equation such as y = x + 1. Instead, it makes more sense to talk about the probability of an RV being less than or greater than some value. In short, an RV maps outcomes of random processes to numbers.

The simplest example for us to think about an RV is a coin flip.

Figure 1: Coin flips help us understand the concept of an RV; source

The possible outcomes are “heads” and “tails”, which we have quantified as 1 and 0, respectively.

Bernoulli Random Variable

Random Variables can be either discrete or continuous. We will start by focusing on discrete RVs. By definition, discrete variables can only take distinct values, as in the coin-flip example above; for these variables, you can count the number of different values they can take on. Our RV X defined above is actually a Bernoulli RV. It takes the value 1 with probability p (in the case of a fair coin, p is equal to 0.5, or 50%) and the value 0 with probability q = 1 - p. Its Probability Mass Function can then be defined as:

P(X = k) = p^k * (1 - p)^(1 - k),  for k in {0, 1}

Let’s start by defining a Bernoulli RV X with a success probability of p = 0.3.

import numpy as np
from scipy.stats import bernoulli
import matplotlib.pyplot as plt
import seaborn as sns

p = 0.3
X = bernoulli(p)

We can print the values of its Probability Mass Function at 0 and 1.

print(np.round(X.pmf(0), 2))
print(np.round(X.pmf(1), 2))

To help visualize, let’s empirically arrive at the same values by drawing 100,000 samples from our variable.

X_samples = X.rvs(100000)
sns.histplot(X_samples, stat="density", discrete=True, shrink=0.2);

Figure 2: Sampling distribution of our RV X

It looks as expected: we have a 0.3 probability of success and a 0.7 probability of failure. Let’s now define the mean and variance of our RV.
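As a quick check of the PMF formula above, we can compute it by hand and compare it against scipy’s built-in implementation (a minimal sketch; the helper name `bernoulli_pmf` is my own, not part of any library):

```python
import numpy as np
from scipy.stats import bernoulli

def bernoulli_pmf(p, k):
    # P(X = k) = p**k * (1 - p)**(1 - k), for k in {0, 1}
    return p**k * (1 - p)**(1 - k)

p = 0.3
X = bernoulli(p)

# The hand-rolled PMF should agree with scipy's bernoulli.pmf
for k in (0, 1):
    print(k, bernoulli_pmf(p, k), X.pmf(k))
```

For k = 1 the formula collapses to p, and for k = 0 to 1 - p, which is exactly the definition of the Bernoulli RV.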
The mean is calculated by summing the product of each outcome’s value and the corresponding probability. The variance is the squared deviation of each outcome’s value from the mean, weighted by the respective probability. More formally:

E(X) = 1 * p + 0 * (1 - p) = p
Var(X) = (1 - p)^2 * p + (0 - p)^2 * (1 - p) = p * (1 - p)

We can compare our empirically calculated mean with the theoretical mean that we just derived. They are indeed very close, and they would become closer as the sample size increases.

print('Empirically calculated mean: {}'.format(X_samples.mean()))
print('Theoretical mean: {}'.format(p))

Binomial Random Variable

We can look at a Binomial RV as a set of Bernoulli experiments or trials. This way, our understanding of how the properties of the distribution are derived becomes significantly simpler. Before diving into definitions, let’s start with the main conditions that need to be fulfilled to define our RV as Binomial:

- The number of trials n is fixed in advance;
- Each trial is independent of the others;
- Each trial has only two possible outcomes, “success” and “failure”;
- The probability of success p is the same for every trial.
Let’s define the RV Z as the number of successes after n trials, where P(success) for each trial is p. Let’s also define Y, a Bernoulli RV with P(Y=1) = p and P(Y=0) = 1 - p. Y represents each independent trial that composes Z. We already derived both the variance and the expected value of Y above. Using the property E(X+Y) = E(X) + E(Y), we can derive the expected value of our Binomial RV Z:

E(Z) = E(Y_1 + Y_2 + ... + Y_n) = E(Y_1) + E(Y_2) + ... + E(Y_n) = n * p

Recall that we have n independent trials, i.e., n RVs Y being summed. When deriving Var(Z), the process is the same, because Var(X+Y) = Var(X) + Var(Y) also holds for independent RVs. Then we have:

Var(Z) = Var(Y_1) + Var(Y_2) + ... + Var(Y_n) = n * p * (1 - p)

Let’s now test our theoretical understanding with an experiment.

p = 0.3
n = 6
Y = bernoulli(p)

We defined our Y variable. We can construct our Z variable from this Y variable as defined above; these are the independent Bernoulli trials. Let’s assume that we have 6 independent trials.

Y_samples = [Y.rvs(1000000) for i in range(n)]
Z_samples = np.sum(Y_samples, axis=0)
print('Empirically calculated expected value: {}'.format(Z_samples.mean()))
print('Theoretical expected value: {}'.format(n * p))
print('Empirically calculated variance: {}'.format(Z_samples.var()))
print('Theoretical variance: {}'.format(n * p * (1 - p)))

We feel better about our theoretical derivations, as our experiment shows that we are indeed on the right path. We can also plot our Binomial distribution. Remember that it is a discrete distribution.

sns.histplot(Z_samples, stat="density", discrete=True, shrink=0.3);

Figure 3: Sampling distribution of our RV Z, which is a Binomial RV

Binomial PMF and CDF

The Binomial Probability Mass Function (PMF) can be written in the following way:

P(Z = k) = C(n, k) * p^k * (1 - p)^(n - k)

It seems a bit daunting at first; let’s try to break it down into smaller, interpretable pieces. The first term, C(n, k), is just the binomial coefficient, i.e., the number of different ways we can choose k items from n possible ones when the order does not matter (the set ABC is the same as CBA):

C(n, k) = n! / (k! * (n - k)!)

Recall that n!/(n-k)! is the number of permutations, i.e., the number of different ways we can choose k items from n possible ones when the order matters (ABC and CBA count as two different results). k! is just the number of ways to arrange k items. For example, in the case of 3 items and 3 positions, we have the following possibilities: ABC, ACB, BAC, BCA, CAB, CBA. Recall that 3! = 3 * 2 * 1 = 6.

Let’s start building small functions to handle our different components, starting with a function to compute the factorial of an input argument.

def fact(n):
    x = 1
    for i in range(1, n + 1):
        x *= i
    return x

Now, we can use our function to compute the number of combinations:

def comb(n, k):
    return fact(n) / (fact(k) * fact(n - k))

And finally, putting everything together:

def binompmf(prob, n, k):
    return comb(n, k) * prob**k * (1 - prob)**(n - k)

A useful function that you often find in statistical packages together with the PMF of a distribution is the Cumulative Distribution Function (CDF). It is nothing more than the probability that our RV takes values up to x:

def binomcdf(prob, n, x):
    return sum(binompmf(prob, n, k) for k in range(x + 1))

Let’s try it out.

Figure 4: Cristiano Ronaldo’s free-kick execution can be modeled using a Binomial distribution; source

We want to assess the probability distribution that models Cristiano Ronaldo’s ability to score free-kicks. We will be using the table below, which shows Ronaldo’s free-kick record in La Liga and the Champions League:

Table 1: Ronaldo’s free-kick record; source

Ronaldo has a 0.094 probability of successfully converting a free-kick in the Champions League. Based on that, what is Ronaldo’s probability of scoring 1 out of 7 free-kicks in the Champions League?
binompmf(0.094, 7, 1)
0.3639109131870316

What is the probability of scoring fewer than 2?

binomcdf(0.094, 7, 1)
0.8649797389430357

Exercises

You will get the solutions in next week’s article.

1. A company produces bottles of water. In its main factory, the proportion of defective bottles is 5%. A quality check consists of randomly selecting and testing 1,000 bottles. What are the mean and standard deviation of the number of defective bottles in these samples?

2. A wine company is running a promotion that says 1 in 4 boxes of the wine contain a surprise. Suppose that you will buy 5 boxes of this wine, and let X represent the number of surprises you can win in these boxes. Assume that these boxes represent a random sample and that surprises are independent between boxes. What is the probability that you win at most 1 surprise in the 5 boxes? (Can you solve this problem in 3 different ways? Tip: by sampling, by summing individual probabilities, and by using the CDF)

3. A math teacher is doing an activity with her students where she gives them a 20-question multiple-choice test, and they know none of the answers. Students need to guess on every question, and each question has 5 possible choices, 1 of which is correct. What are the mean and standard deviation of the number of questions that each student gets correct? (Can you solve this problem in 2 different ways? Tip: by sampling and by using the theoretical derivation that we did above)
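As a sanity check on the free-kick numbers above, the hand-rolled PMF and CDF can be cross-checked against scipy.stats.binom. This is a self-contained sketch: it re-declares `binompmf` and `binomcdf` using the standard library’s `math.comb` for the binomial coefficient instead of a hand-written factorial.

```python
from math import comb
from scipy.stats import binom

def binompmf(prob, n, k):
    # P(Z = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * prob**k * (1 - prob)**(n - k)

def binomcdf(prob, n, x):
    # P(Z <= x): sum the PMF over k = 0..x
    return sum(binompmf(prob, n, k) for k in range(x + 1))

p, n = 0.094, 7
print(binompmf(p, n, 1), binom.pmf(1, n, p))  # both ~0.3639
print(binomcdf(p, n, 1), binom.cdf(1, n, p))  # both ~0.8650
```

Agreement between the two pairs of values suggests the derivation of the PMF and CDF above is correct.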