# DS Intro: Probability Theory

## How does randomness work in the system?

In this article, we explore the different aspects of probability theory.

**Setup:**

A sampling strategy is said to be truly random (unbiased) if every element in the population has an equal chance of becoming part of the sample.

The branch of mathematics that deals with chance and probability is called probability theory.

If we observe a trend in a small sample, what is the chance that we will observe a similar trend in other samples or in the entire population?

Consider a set S, and let A, B, and C be subsets of S.

*Experiment:*

An experiment or trial is any procedure that can be repeated an unlimited number of times and has a well-defined set of outcomes.

Examples:

- Experiment: a blood test for viral fever. Outcomes: positive or negative.
- Experiment: writing an exam. Outcomes: pass or fail.

The set of all possible outcomes of an experiment is called the sample space. The elements of a sample space are mutually exclusive (no two outcomes can occur simultaneously) and collectively exhaustive (together they cover all possible outcomes).

The outcome of any single experiment or trial is uncertain, but the set of possible outcomes is certain.

*Event:*

An event is a set of outcomes of an experiment. This set is a subset of the sample space.
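To make the vocabulary concrete, here is a minimal sketch that models a sample space and a few events as Python sets. The die roll and the event names are assumptions chosen for illustration; they are not part of the examples above.

```python
# Hypothetical experiment: roll one six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}      # all possible outcomes

# An event is a subset of the sample space.
even = {2, 4, 6}
greater_than_four = {5, 6}

assert even <= sample_space            # subset check
assert greater_than_four <= sample_space

# Set operations build new events from existing ones.
print(even | greater_than_four)        # "even OR greater than four"
print(even & greater_than_four)        # "even AND greater than four"
print(sample_space - even)             # complement of "even"
```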

*Probability function:*

A probability function assigns a number to each event such that the number reflects the chance of the experiment resulting in that event.

Axioms of the probability function:

- P(A) ≥ 0 for every event A.
- P(S) = 1, where S is the sample space.
- If events are mutually disjoint, then the probability of their union is equal to the sum of the individual probabilities of the events.

From the last axiom, we can see that the probability of an event can be computed as the sum of the probabilities of the disjoint outcomes contained in the event.

Properties of probability:

- P(A) = 1 − P(Aᶜ): the probability of an event equals one minus the probability of its complement.
- The probability of an event A is always less than or equal to one.
- P(A∪B) = P(A) + P(B) − P(A∩B)
- The sum of the probabilities of all outcomes is equal to one.
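The axioms and properties above can be checked on a small example. The sketch below assumes a fair six-sided die (so every outcome gets probability 1/6); the helper `prob` and the events `A` and `B` are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
# Assumption: a fair die, so every outcome has probability 1/6.
outcome_prob = {o: Fraction(1, 6) for o in sample_space}

def prob(event):
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum((outcome_prob[o] for o in event), Fraction(0))

A = {2, 4, 6}
B = {4, 5, 6}

assert all(p >= 0 for p in outcome_prob.values())      # non-negativity
assert prob(sample_space) == 1                         # P(S) = 1
assert prob(A) == 1 - prob(sample_space - A)           # complement rule
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)  # union rule
print(prob(A), prob(B), prob(A | B))
```

Exact fractions are used instead of floats so that the equalities hold without rounding error.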

*Designing a probability function:*

We can think of the probability of an event as the fraction of times the event occurs when an experiment is repeated a large number of times.

Whatever numbers we assign, each must reflect the chance of the experiment resulting in that event, and the resulting probability function must satisfy the axioms listed above.
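The "fraction of times" view can be tried out with a quick simulation. This is only a sketch: the fair die, the event, and the function name `relative_frequency` are assumptions for illustration.

```python
import random

def relative_frequency(event, trials, seed=0):
    """Fraction of simulated fair-die rolls that land inside `event`."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) in event)
    return hits / trials

even = {2, 4, 6}
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(even, n))
```

As the number of trials grows, the printed fraction should settle near 1/2, the probability of an even roll.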

*Conditional probability:*

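P(A|B) is called the conditional probability of the event A given the event B. For P(B) > 0 it is defined as P(A|B) = P(A∩B) / P(B).

Rearranging gives the multiplication principle, P(A∩B) = P(A|B) P(B), and applying it repeatedly gives the chain rule. If B1, B2, …, Bn are disjoint events that together cover the sample space, summing over them gives the total probability theorem: P(A) = P(A|B1) P(B1) + P(A|B2) P(B2) + … + P(A|Bn) P(Bn).

Substituting the total probability theorem into the multiplication principle gives Bayes’ theorem: P(Bi|A) = P(A|Bi) P(Bi) / P(A), where P(A) is expanded using the total probability theorem.

Two events A and B are independent if P(A|B) = P(A) or, equivalently, P(B|A) = P(B).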
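A small numeric check can make conditional probability, total probability, Bayes’ theorem, and independence easier to follow. The sketch assumes a fair die; the events `A`, `B`, `C` and the helpers `prob` and `cond` are hypothetical names used only here.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Assumption: fair die, so all outcomes are equally likely.
    return Fraction(len(event), len(sample_space))

def cond(a, b):
    """P(a | b) = P(a ∩ b) / P(b); only meaningful when P(b) > 0."""
    return prob(a & b) / prob(b)

A = {2, 4, 6}      # even
B = {4, 5, 6}      # greater than three
C = {1, 2}         # at most two

print("P(A|B) =", cond(A, B))

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
assert cond(B, A) == cond(A, B) * prob(B) / prob(A)

# Total probability over the partition {B, not B}
not_B = sample_space - B
assert prob(A) == cond(A, B) * prob(B) + cond(A, not_B) * prob(not_B)

# A and C are independent (P(A|C) = P(A)); A and B are not.
assert cond(A, C) == prob(A)
assert cond(A, B) != prob(A)
```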

*Random Variables:*

A random variable is a function from a set of possible outcomes to the set of real numbers.

Multiple functions (random variables) are possible for the given domain (sample space).

Example: just as sine, cosine, x², or 4x¹⁰ + 256x³ are different functions on the same real-valued domain with a real-valued range, many different random variables can be defined on the same sample space.
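As a sketch of "several random variables on one sample space", the snippet below defines two functions on the outcomes of two coin tosses. The experiment and the function names are assumptions made for illustration.

```python
from itertools import product

# Sample space: all outcomes of two coin tosses, e.g. ('H', 'T').
sample_space = list(product("HT", repeat=2))

# Two different random variables defined on the same sample space.
def num_heads(outcome):
    return outcome.count("H")

def first_is_head(outcome):
    return 1 if outcome[0] == "H" else 0

for outcome in sample_space:
    print(outcome, num_heads(outcome), first_is_head(outcome))
```

Each function maps every outcome to a real number, which is all a random variable needs to do.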

*Types of Random Variables:*

- Discrete: a discrete variable is a variable whose value is obtained by counting, because it takes a countable number of values.
*ex*: the outcome of a single die, the number of children in a population.
- Continuous: a continuous variable is a variable whose value is obtained by measuring.
*ex*: the amount of rainfall, the temperature of a surface, the height of a person.

An assignment of probabilities to all possible values that a discrete random variable can take is called the distribution of the discrete random variable. It is written as P(X = *x*),

where X = *x* is an event and *x* is a value that the random variable X can take.

This function maps every event X = *x* to a value between zero and one, and it is known as the probability mass function.

Properties of the probability mass function:

- P(X = *x*) ≥ 0 for every value *x*.
- The probabilities over all values of the random variable sum to one.
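Here is a minimal sketch of a probability mass function built by counting. It assumes two fair dice and takes X to be the sum of the two faces; both choices are illustrative and not from the text above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption: two fair dice; X is the sum of the two faces.
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

# The two properties of a probability mass function.
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1

for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```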

*Bernoulli distribution:*

This distribution applies to experiments with only two outcomes.

ex: exam: fail or pass; movie: hit or flop; email: spam or not spam.

Let the sample space be {failure, success}. The Bernoulli random variable X maps these two outcomes to the numbers zero and one: X(failure) = 0 and X(success) = 1.

Let’s assume A is the event that the outcome is a success.

P(A) = P(success) = *p*

Then the probability of success is P(X = 1) = *p* and, similarly, the probability of failure is P(X = 0) = 1 − *p*, because the probabilities must sum to one.

We can write both cases compactly as P(X = *x*) = p^x (1 − p)^(1 − x) for *x* ∈ {0, 1}.

Substituting *x* = 0 or *x* = 1 verifies the compact form.

Let’s draw a sample using the scipy module: import the required modules, calculate the mean, variance, skewness, and kurtosis, and then display the Bernoulli probability mass function.
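A minimal sketch of these steps with `scipy.stats.bernoulli` could look like the following. The success probability `p = 0.3` and the sample size of 10 are assumptions chosen for illustration, and the values are printed rather than plotted to keep the example short.

```python
from scipy.stats import bernoulli

p = 0.3  # assumed success probability, chosen only for illustration

# Draw a sample of 10 Bernoulli trials (random_state makes it repeatable).
sample = bernoulli.rvs(p, size=10, random_state=0)
print("sample:", sample)

# Mean, variance, skewness and (excess) kurtosis of the distribution.
mean, var, skew, kurt = bernoulli.stats(p, moments="mvsk")
print("mean:", float(mean), "variance:", float(var))
print("skew:", float(skew), "kurtosis:", float(kurt))

# Display the PMF and compare it with the compact form p^x (1 - p)^(1 - x).
for x in (0, 1):
    compact = p**x * (1 - p) ** (1 - x)
    print(f"P(X = {x}) = {bernoulli.pmf(x, p):.3f}  (compact form: {compact:.3f})")
```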

*Binomial distribution:*

Repeat the Bernoulli trial *n* times. The trials are independent and identically distributed, because the outcome of one trial has no effect on another trial.

X: random variable indicating the number of successes in the *n* trials.

We want the probability distribution with a minimal number of parameters,

P(X = x) = ?

where *x* ∈ {0, 1, 2, 3, …, *n*}

How many different outcomes can we have if we repeat a Bernoulli trial *n* times?

Each trial has two outcomes, *success* or *failure*, so *n* trials give 2ⁿ different outcome sequences.

Suppose *k* of the trials are successes (favorable outcomes); the remaining *n − k* are failures (unfavorable outcomes).

The number of sequences with exactly *k* successes is known as *n* choose *k*: C(n, k) = n! / (k! (n − k)!).

Each of the *k* successes occurs independently with probability *p*, which contributes a factor p^k.

Similarly, each of the *n − k* failures occurs independently with probability 1 − *p*, which contributes a factor (1 − p)^(n − k).

Combining these terms gives the probability mass function: P(X = k) = C(n, k) p^k (1 − p)^(n − k).

The expression depends only on the parameters *p* and *n*, the minimal set of parameters mentioned above.
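The counting argument above can be sketched with the standard library alone. The function name `binomial_pmf` and the values `n = 5`, `p = 0.5` are assumptions for illustration; the formula used is the standard binomial one, C(n, k) p^k (1 − p)^(n − k).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.5   # assumed values, chosen only for illustration

# The "n choose k" counts over all k add up to the 2^n outcome sequences.
assert sum(comb(n, k) for k in range(n + 1)) == 2**n

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, value in enumerate(pmf):
    print(f"P(X = {k}) = {value}")
print("total:", sum(pmf))
```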

Let’s draw a sample using the scipy module: import the required modules, calculate the mean, variance, skewness, and kurtosis with a sample size of 5, and then display the binomial probability mass function.
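One way to sketch this with `scipy.stats.binom` is shown below. It reads "sample size of 5" as *n* = 5 trials, which is an assumption, as is the success probability `p = 0.5`; the PMF is printed as a table instead of being plotted.

```python
from scipy.stats import binom

n, p = 5, 0.5  # assumed: 5 trials, success probability 0.5

# Draw 10 values of X = number of successes in n trials.
sample = binom.rvs(n, p, size=10, random_state=0)
print("sample:", sample)

# Mean, variance, skewness and (excess) kurtosis.
mean, var, skew, kurt = binom.stats(n, p, moments="mvsk")
print("mean:", float(mean), "variance:", float(var))
print("skew:", float(skew), "kurtosis:", float(kurt))

# Display the probability mass function for k = 0 .. n.
for k in range(n + 1):
    print(f"P(X = {k}) = {binom.pmf(k, n, p):.4f}")
```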

*Geometric distribution:*

Repeat the Bernoulli trial as many times as needed, that is, until the event we are waiting for occurs for the first time.

Let’s draw a sample using the scipy module: import the required modules, calculate the mean, variance, skewness, and kurtosis, and then display the geometric probability mass function.
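Repeating a Bernoulli trial until the first success is modelled in SciPy by `scipy.stats.geom` (the geometric distribution), where X is the number of trials up to and including the first success. The sketch below assumes `p = 0.3` and only prints the first few PMF values, since the support is unbounded.

```python
from scipy.stats import geom

p = 0.3  # assumed success probability of each trial

# Draw 10 values of X = number of trials needed for the first success.
sample = geom.rvs(p, size=10, random_state=0)
print("sample:", sample)

# Mean, variance, skewness and (excess) kurtosis.
mean, var, skew, kurt = geom.stats(p, moments="mvsk")
print("mean:", float(mean), "variance:", float(var))
print("skew:", float(skew), "kurtosis:", float(kurt))

# Display the PMF for the first few values: P(X = k) = (1 - p)^(k - 1) * p.
for k in range(1, 8):
    print(f"P(X = {k}) = {geom.pmf(k, p):.4f}")
```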

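*Uniform distribution:*

This distribution applies to experiments or trials with equally likely outcomes.

Let’s consider a discrete random variable that takes integer values between *a* and *b*. Each of the *b − a + 1* values is equally likely, so P(X = *x*) = 1 / (b − a + 1) for *x* ∈ {a, a + 1, …, b}.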

Let’s draw a sample using the scipy module: import the required modules, calculate the mean, variance, skewness, and kurtosis, and then display the uniform probability mass function.
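For the discrete case, a sketch with `scipy.stats.randint` might look like this. The bounds `a = 1` and `b = 6` (a fair die) are assumptions; note that `randint` excludes its upper bound, so `b + 1` is passed as `high`.

```python
from scipy.stats import randint

a, b = 1, 6               # assumed bounds: integers a..b, like a fair die
low, high = a, b + 1      # scipy's randint excludes the upper bound

# Draw a sample of 10 values.
sample = randint.rvs(low, high, size=10, random_state=0)
print("sample:", sample)

# Mean, variance, skewness and (excess) kurtosis.
mean, var, skew, kurt = randint.stats(low, high, moments="mvsk")
print("mean:", float(mean), "variance:", float(var))
print("skew:", float(skew), "kurtosis:", float(kurt))

# Display the probability mass function: every value has the same mass.
for x in range(a, b + 1):
    print(f"P(X = {x}) = {randint.pmf(x, low, high):.4f}")
```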

Similarly, in the case of a continuous variable, the density is constant over the interval: f(x) = 1 / (b − a) for a ≤ x ≤ b, and zero elsewhere.

Let’s draw a sample using the scipy module: import the required modules, calculate the mean, variance, skewness, and kurtosis, and then display the uniform probability density function.
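For the continuous case, `scipy.stats.uniform` is parameterised by `loc` (the left end) and `scale` (the width), so the interval is [loc, loc + scale]. The bounds `a = 2` and `b = 5` below are assumptions for illustration, and the density is evaluated at a few points rather than plotted.

```python
from scipy.stats import uniform

a, b = 2.0, 5.0                 # assumed interval [a, b]
loc, scale = a, b - a           # scipy uses loc and scale = width

# Draw a sample of 5 values.
sample = uniform.rvs(loc=loc, scale=scale, size=5, random_state=0)
print("sample:", sample)

# Mean, variance, skewness and (excess) kurtosis.
mean, var, skew, kurt = uniform.stats(loc=loc, scale=scale, moments="mvsk")
print("mean:", float(mean), "variance:", float(var))
print("skew:", float(skew), "kurtosis:", float(kurt))

# Display the density: constant 1 / (b - a) inside [a, b], zero outside.
for x in (1.0, 2.0, 3.5, 5.0, 6.0):
    print(f"f({x}) = {uniform.pdf(x, loc=loc, scale=scale):.4f}")
```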

*Expectation and variance:*

The expected value or expectation of a discrete random variable X whose possible values are x₁, x₂, x₃, …, xₙ is denoted by E[X] and computed as E[X] = x₁ P(X = x₁) + x₂ P(X = x₂) + … + xₙ P(X = xₙ).

The variance of a random variable is defined as the expectation of the squared deviation of the random variable from its expectation.

Var(X) = E[ ( X − E[X] )² ]

This simplifies to the expectation of the squared random variable minus the square of its expectation:

Var(X) = E[X²] − (E[X])²
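Both variance formulas can be checked on a small example. The sketch assumes X is the face of a fair die; the variable names are illustrative only.

```python
from fractions import Fraction

# Assumption: X is the face of a fair die, so P(X = x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum of x * P(X = x)
expectation = sum(x * p for x, p in pmf.items())

# Var(X) = E[(X - E[X])^2]
var_definition = sum((x - expectation) ** 2 * p for x, p in pmf.items())

# Var(X) = E[X^2] - (E[X])^2
var_shortcut = sum(x**2 * p for x, p in pmf.items()) - expectation**2

assert var_definition == var_shortcut
print("E[X] =", expectation, " Var(X) =", var_definition)
```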

*Normal distribution:*

This distribution applies to continuous (real-valued) random variables. It is a continuous probability distribution with the density f(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)).

The distribution is fully specified by the parameters μ and σ².

Let’s draw a sample using the scipy module: import the required modules, calculate the mean, variance, skewness, and kurtosis, and then display the normal probability density function.
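A sketch of these steps with `scipy.stats.norm` follows. The values `mu = 1.0` and `sigma = 2.0` are assumptions chosen for illustration; `norm` takes the standard deviation (not the variance) as `scale`. The density is evaluated at a few points rather than plotted.

```python
from math import pi, sqrt

from scipy.stats import norm

mu, sigma = 1.0, 2.0   # assumed mean and standard deviation

# Draw a sample of 5 values.
sample = norm.rvs(loc=mu, scale=sigma, size=5, random_state=0)
print("sample:", sample)

# Mean, variance, skewness and (excess) kurtosis.
mean, var, skew, kurt = norm.stats(loc=mu, scale=sigma, moments="mvsk")
print("mean:", float(mean), "variance:", float(var))
print("skew:", float(skew), "kurtosis:", float(kurt))

# Display the density at a few points around the mean.
for x in (mu - 2 * sigma, mu - sigma, mu, mu + sigma, mu + 2 * sigma):
    print(f"f({x}) = {norm.pdf(x, loc=mu, scale=sigma):.4f}")

# The peak of the density is 1 / (sigma * sqrt(2 * pi)).
peak = 1 / (sigma * sqrt(2 * pi))
```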

The simplest case of a normal distribution is known as the *standard normal distribution* or *unit normal distribution*. This is the special case 𝜇=0 and 𝜎=1, and it is described by the probability density function f(x) = (1 / √(2π)) exp(−x² / 2).

**Q:** We know what a population and a sample are, but how do we select a sample out of a population?

Unbiased sample: a sample that is representative of the entire population and gives each element an equal chance of being chosen.

*Sampling strategies:*

Random sampling is generally a good choice (we discuss this further under inferential statistics).
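As a minimal sketch of simple random sampling, the standard library's `random.sample` draws without replacement and gives every element the same chance of being picked. The population of 100 labelled elements and the sample size of 10 are hypothetical.

```python
import random

# Hypothetical population: 100 elements labelled 1..100.
population = list(range(1, 101))

rng = random.Random(42)                 # seeded so the draw is repeatable
sample = rng.sample(population, k=10)   # simple random sample, no replacement
print(sample)
```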

References: Wikipedia, Data Science Stack Exchange, Data Science Course by IITM professors K Mitesh, Pratyush, scikit-learn documentation.