# DS Intro: Inferential statistics

## why do we compute statistics?

In this article, we explore the different aspects of inferential statistics.

**Setup:**

why do we compute statistics?

- to describe the sample
- to estimate population parameters
- to test hypotheses

We have used **descriptive statistics** to describe the sample, which is a subset of the population, and we also discussed different sampling strategies.

*Estimate population parameters:*

Well, we are not interested in estimation methods that work only for a particular dataset; instead, we are looking for mathematical relations that hold true in general.

This generality requires us to make two key assumptions:

- The values of interest of the elements in the population are independent random variables with a common distribution.
- Each element of the population has an equal chance of being selected in any sample.

If we treat a statistic itself as a random variable, then:

- We can use the sample mean, the sample standard deviation, or a sample proportion as that random variable.
- We can also plot the probability distribution of, say, the sample mean.

For given population parameters (𝜇, 𝜎), we can compute the distribution of a sample statistic.

Later, we take sample data and try to estimate the population parameters (𝜇, 𝜎).

## compute E[X’]

Let’s compute E[X’] for given population parameters 𝜇, 𝜎 and a sample size *n*.

The sample mean is X’ = (X₁ + X₂ + … + Xₙ)/n. By the linearity of the expected value:

E[X’] = E[(X₁ + X₂ + … + Xₙ)/n]

Taking the common factor 1/n out of the expectation:

E[X’] = (1/n) · E[X₁ + X₂ + … + Xₙ]

Because we are using random sampling, X₁, X₂, X₃, …, Xₙ are independent, and the expectation of a sum of random variables equals the sum of the expectations of the individual random variables:

E[X’] = (1/n) · (E[X₁] + E[X₂] + … + E[Xₙ])

Each element of the population is equally likely to appear in any sample, which means the expectation of any item in any sample equals the expectation of any item in the population, E[Xᵢ] = 𝜇:

E[X’] = (1/n) · (𝜇 + 𝜇 + … + 𝜇)

Simplifying the above expression:

E[X’] = (1/n) · n𝜇 = 𝜇

Looking closely at the result, it holds in full generality: it is independent of both 𝜎 and the sample size n.

X’ is an unbiased estimator of 𝜇.
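This result is easy to check empirically. Below is a minimal sketch (all numbers are illustrative, not from the article): we draw many samples from a synthetic population and confirm that the average of the sample means sits close to 𝜇.

```python
import random
import statistics

random.seed(0)

# Illustrative population with mean mu = 50 and sd sigma = 10.
mu, sigma = 50, 10
population = [random.gauss(mu, sigma) for _ in range(100_000)]

n = 30               # sample size
num_samples = 5_000  # number of repeated samples

# Mean of each random sample.
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(num_samples)]

# The average of the sample means should be very close to mu,
# illustrating that the sample mean is an unbiased estimator.
print(statistics.mean(sample_means))
```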

## compute Var(X’)

Let’s compute Var(X’) for given population parameters 𝜇, 𝜎 and a sample size *n*.

It is desirable for an estimator to have variance close to zero, so that its value does not fluctuate much from one sample to another.

1. Variance measures how far the values are, on average, from the mean. It is defined as the expectation of the squared difference between the variable and its mean:

Var(X) = E[(X − E[X])²] … (expression 1)

Expanding the above expression:

Var(X) = E[X² − 2X·E[X] + (E[X])²]

Applying the expectation to the individual terms:

Var(X) = E[X²] − 2E[X]·E[X] + (E[X])²

Rewriting the above expression:

Var(X) = E[X²] − (E[X])²

2. The variance of a constant times a variable equals the square of the constant times the variance of the variable:

Var(aX) = a² · Var(X) … (expression 2)

Let’s consider a variable X and a constant a, and substitute aX into expression 1:

Var(aX) = E[(aX − E[aX])²]

Here a is a constant, so E[aX] = a·E[X], and we can write:

Var(aX) = E[(aX − a·E[X])²] = E[a²(X − E[X])²] = a² · E[(X − E[X])²]

Rewriting in terms of the variance of the variable:

Var(aX) = a² · Var(X)

3. The variance of the sum of two independent random variables equals the sum of their variances:

Var(X + Y) = Var(X) + Var(Y) … (expression 3)

Let’s consider a sum of two random variables and substitute it into expression 1:

Var(X + Y) = E[((X + Y) − E[X + Y])²] = E[((X − E[X]) + (Y − E[Y]))²]

Expanding the above expression:

Var(X + Y) = E[(X − E[X])²] + 2·E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²]

Because X and Y are independent, the cross term factorizes and vanishes: E[(X − E[X])(Y − E[Y])] = E[X − E[X]] · E[Y − E[Y]] = 0.

Considering expression 1, the remaining terms are exactly the variances:

Var(X + Y) = Var(X) + Var(Y)

4. Let X₁, X₂, X₃, …, Xₙ be the values of the individual elements of a sample. The variance of the sample mean can be written as:

Var(X’) = Var((X₁ + X₂ + … + Xₙ)/n)

From expression 2, with a = 1/n:

Var(X’) = (1/n²) · Var(X₁ + X₂ + … + Xₙ)

From expression 3, because the random variables are independent:

Var(X’) = (1/n²) · (Var(X₁) + Var(X₂) + … + Var(Xₙ))

The variance of each individual element equals 𝜎², because each element comes from the same population distribution, characterized by mean 𝜇 and variance 𝜎². There are n such terms:

Var(X’) = (1/n²) · n𝜎² = 𝜎²/n

This gives the variance of the sample mean. Its standard deviation is:

sd(X’) = 𝜎/√n

From the above expressions we can observe that X’ is an unbiased estimator of 𝜇 and varies from sample to sample with standard deviation 𝜎/√n.

We care about the standard deviation rather than the variance because the units of the *standard deviation* and the *mean* are the same, whereas the variance is in squared units.

When the sample size *n* is larger, the likelihood of the *sample mean* being far away from 𝜇 is lower; the distribution concentrates around the mean, with a lower *standard deviation*.

In conclusion, for given population parameters (𝜇, 𝜎) and a sample of size *n*, the expectation and variance of the sample mean are E[X’] = 𝜇 and Var(X’) = 𝜎²/n.
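The 𝜎²/n relation can also be checked by simulation. A minimal sketch (parameter values are illustrative): draw many samples of size n and compare the observed standard deviation of the sample means to 𝜎/√n.

```python
import random
import statistics

random.seed(1)

mu, sigma, n = 0, 4, 25   # illustrative population parameters
num_samples = 20_000

# Mean of each of many repeated samples of size n.
sample_means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
                for _ in range(num_samples)]

observed_sd = statistics.stdev(sample_means)
theoretical_sd = sigma / n ** 0.5   # sigma / sqrt(n)
print(observed_sd, theoretical_sd)
```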

## Central Limit Theorem

We know about the mean and the variance (or standard deviation), which describe the center and spread of the data, but so far we have no idea about the probability distribution itself. Let’s discuss this!

The theorem states: if X₁, X₂, X₃, …, Xₙ are random samples from a population with mean 𝜇 and standard deviation 𝜎, then the sum X₁ + X₂ + X₃ + … + Xₙ converges to the normal distribution N(n𝜇, 𝜎√n), with mean n𝜇 and standard deviation 𝜎√n, as *n* tends to infinity.

**z-score:** The transformation of *x* to *z* is called the *z*-score: z = (x − 𝜇)/𝜎. It is the number of standard deviations a value is away from the mean.

The purpose of the central limit theorem is to let us compute the likelihood of a given sample mean.

We can see that for identically distributed, independent samples, the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

The probability distribution can be visualized by simulation: take a population of 1 million elements and draw 10,000 samples for each of the sample sizes 1, 3, 5, 10 and 50, then plot the distribution of the sample means.

With sample size *50* we can observe that the probability distribution is smooth and bell-shaped, peaking around the sample mean *0.5*. We can also observe that as *n* grows large (tending to infinity), the distribution converges to the normal distribution.
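A sketch of that experiment in plain Python, assuming a uniform population on [0, 1] (whose mean is 0.5, matching the peak described above):

```python
import random
import statistics

random.seed(42)

# Population of 1 million values, uniform on [0, 1] (mean 0.5).
population = [random.random() for _ in range(1_000_000)]
num_samples = 10_000

for n in (1, 3, 5, 10, 50):
    # 10,000 sample means for each sample size n.
    means = [statistics.mean(random.choices(population, k=n))
             for _ in range(num_samples)]
    # As n grows, the sample-mean distribution concentrates around 0.5
    # with standard deviation sigma / sqrt(n) -- the bell shape of the CLT.
    print(n, statistics.mean(means), statistics.stdev(means))
```

Plotting a histogram of `means` for each n (for example with matplotlib) would reproduce the bell-shaped curves described above.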

## Chi Squared Distribution

The *chi-square distribution* (also *chi-squared* or *χ²-distribution*) with *k* degrees of freedom is the distribution of a sum of the squares of *k* independent standard normal random variables.

The mean of χ²(k) is k and the variance of χ²(k) is 2k.

We can observe that as the degrees of freedom increase, the distribution tends towards the normal distribution.

We see only positive values on the x-axis because squaring maps the negative values onto the positive axis as well.

The very sharp rise near zero is due to the fact that a squared value is smaller than its absolute value whenever the value lies between −1 and 1, so much of the standard normal’s mass piles up near zero.

As k increases (that is, as we sum more squared terms), the distribution shifts to the right and its shape approaches the normal distribution.
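These properties can be checked with a small simulation (a sketch; k and the number of draws are arbitrary choices): summing the squares of k standard normal draws produces a χ²(k) variable whose simulated mean should be near k and variance near 2k.

```python
import random
import statistics

random.seed(7)

def chi2_sample(k):
    """One draw from chi-squared(k): sum of squares of k standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))

k = 5
draws = [chi2_sample(k) for _ in range(50_000)]

# Simulated mean should be close to k, variance close to 2k.
print(statistics.mean(draws), statistics.variance(draws))
```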

## Point and Interval Estimators

An *estimator* is a statistic computed from a given sample. The value of the estimator, called an estimate, is used to infer a population parameter.

A *point estimator* is a statistic used to estimate the value of an unknown parameter of a population. It uses sample data to calculate a single statistic that is the best estimate of the unknown population parameter.

An *interval estimator* uses sample data to calculate an interval of possible values of an unknown population parameter. The interval is chosen so that the parameter falls within it with 95% or higher probability; this is also known as the confidence interval. The confidence interval indicates how reliable an estimate is, and it is calculated from the observed data. The endpoints of the interval are referred to as the upper and lower confidence limits.

Properties of a good estimator:

- Unbiased estimator
- Consistent estimator
- Efficient estimator (w.r.t loss function)

**Point estimator of a population mean:** the sample mean is the point estimator of the population mean.

**Point estimator of a population proportion:** the sample proportion is the point estimator of the population proportion.

**Point estimator of a population variance:** the sample variance is the point estimator of the population variance.

*Interval estimator of a population mean:*

- with known population variance
- with unknown population variance
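As a sketch of the known-variance case (all numbers are illustrative assumptions): the 95% interval is X’ ± z₍α/2₎ · 𝜎/√n, computed here with the standard library’s NormalDist.

```python
import random
import statistics
from statistics import NormalDist

random.seed(3)

# Illustrative sample: n = 36 measurements, known population sigma = 7.
sigma, n = 7, 36
sample = [random.gauss(150, sigma) for _ in range(n)]
x_bar = statistics.mean(sample)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, ~1.96
margin = z_crit * sigma / n ** 0.5
ci = (x_bar - margin, x_bar + margin)         # 95% confidence interval
print(ci)

# With unknown sigma, use the sample standard deviation and the t critical
# value with n - 1 degrees of freedom instead of z_crit and sigma.
```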

Effect of sample size, variance, and significance level 𝛼:

If n is large, we should expect the sample mean to be closer to the population mean, so even small differences should lead to the null hypothesis being rejected.

If 𝜎 is large, we should expect a lot of deviation in the sample mean from one sample to another, so small differences should not lead to the null hypothesis being rejected.

The lower the 𝛼, the stricter the requirement for rejecting the null hypothesis.

Let’s assume a fictional company, AKet (Anonymousket), which sells snacks and claims that each packet contains 150 grams of snacks (as mentioned on the label). We are sceptical of this claim and believe that, on average, a packet does not contain 150 grams of snacks. How will we prove our claim?

Known: 𝜇 = 150 grams, which is the null hypothesis.

Our claim is that 𝜇 is not equal to 150 grams, which is the alternative hypothesis.

The goal is to show that the null hypothesis is not true.

How do we test our hypothesis?

- Take a sample of n packets.
- Compute the mean weight and the standard deviation of the weights.

Suppose the mean of the packets is 147.2 grams. Can we conclude the company is cheating?

Not always! We know that the sample mean itself follows a distribution.

**Case 1: Population variance is known**

Based on the past data we know the standard deviation is 7 grams.

Generate random sample data using the Python random module and compute the z-test value:

- Import the required modules.
- Write down the given values: the claimed mean 𝜇 = 150 grams and the known 𝜎 = 7 grams.
- Treat the generated packet weights as the sample data and compute the sample mean.
- Compute the z statistic, z = (X’ − 𝜇)/(𝜎/√n).
- Choose 𝛼 as 0.01, 0.05 or 0.1.
- Compute the z critical value and the p-value, and compare.

The data says that we do not have enough evidence to reject the null hypothesis.
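Putting the steps together, a minimal sketch of the z-test (the sample is synthetic, and the sample size, seed and 𝛼 are assumptions, so the exact numbers will differ from the article’s):

```python
import random
import statistics
from statistics import NormalDist

random.seed(0)

# Null hypothesis: mu0 = 150 g; known population sigma = 7 g.
mu0, sigma, n = 150, 7, 10

# Synthetic packet weights standing in for the article's generated data.
weights = [random.gauss(147.2, sigma) for _ in range(n)]

x_bar = statistics.mean(weights)
z = (x_bar - mu0) / (sigma / n ** 0.5)        # z test statistic

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value

if abs(z) > z_crit:                           # equivalently, p_value < alpha
    print("reject the null hypothesis")
else:
    print("not enough evidence to reject the null hypothesis")
```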

**Case 2: Population variance is unknown (use the t-score)**

Generate random sample data using the Python random module and compute the t-test value:

- Import the required modules.
- Write down the given values: the claimed mean 𝜇 = 150 grams.
- Treat the generated packet weights as the sample data and compute the sample mean and the sample standard deviation s.
- Compute the t statistic, t = (X’ − 𝜇)/(s/√n).
- Choose 𝛼 as 0.01, 0.05 or 0.1.
- Compute the t critical value and the p-value with n − 1 degrees of freedom, and compare.

The data says that we do not have enough evidence to reject the null hypothesis.
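A sketch of the t-test version (synthetic data again; SciPy’s `stats.t` supplies the t distribution, and the sample size, seed and 𝛼 are assumptions):

```python
import random
import statistics
from scipy import stats

random.seed(0)

# Null hypothesis: mu0 = 150 g; the population sigma is unknown.
mu0, n = 150, 10
weights = [random.gauss(147.2, 7) for _ in range(n)]

x_bar = statistics.mean(weights)
s = statistics.stdev(weights)            # sample standard deviation
t_stat = (x_bar - mu0) / (s / n ** 0.5)  # t test statistic

alpha = 0.05
df = n - 1                               # degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

if abs(t_stat) > t_crit:
    print("reject the null hypothesis")
else:
    print("not enough evidence to reject the null hypothesis")
```

The same statistic and p-value can be obtained in one call with `stats.ttest_1samp(weights, mu0)`.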

Let’s assume a fictional company, AKet (Anonymousket), which manufactures ball point pens and claims that the average radius of the ball in the ball point pen is 0.8 mm. We purchased these pens and want to verify their claim. How do we verify it?

Known: 𝜇 = 0.8 mm, which is the null hypothesis.

Our claim is that 𝜇 is not equal to 0.8 mm, which is the alternative hypothesis.

The goal is to show that the null hypothesis is not true.

How do we test our hypothesis?

- Take a sample of n ball point pens.
- Compute the mean radius.

Suppose the mean of the sample is 0.82 mm. Can we conclude the pens do not meet the requirement?

Not always! We know that the sample mean itself follows a distribution.

**Case 1: Population variance is known**

Based on past data, the standard deviation is 0.08 mm.

Generate random sample data using the Python random module and compute the z-test value:

- Import the required modules.
- Write down the given values: the claimed mean 𝜇 = 0.8 mm and the known 𝜎 = 0.08 mm.
- Treat the generated ball radii as the sample data and compute the sample mean.
- Compute the z statistic, z = (X’ − 𝜇)/(𝜎/√n).
- Choose 𝛼 as 0.01, 0.05 or 0.1.
- Compute the z critical value and the p-value, and compare.

The data says that we do not have enough evidence to reject the null hypothesis.
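The z-test for the pen example follows the same sketch (synthetic sample; sample size, seed and 𝛼 are assumptions):

```python
import random
import statistics
from statistics import NormalDist

random.seed(0)

# Null hypothesis: mu0 = 0.8 mm; known population sigma = 0.08 mm.
mu0, sigma, n = 0.8, 0.08, 30

# Synthetic ball radii standing in for the article's generated data.
radii = [random.gauss(0.82, sigma) for _ in range(n)]

x_bar = statistics.mean(radii)
z = (x_bar - mu0) / (sigma / n ** 0.5)        # z test statistic

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value

if abs(z) > z_crit:
    print("reject the null hypothesis")
else:
    print("not enough evidence to reject the null hypothesis")
```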

**Case 2: Population variance is unknown (use the t-score)**

Generate random sample data using the Python random module and compute the t-test value:

- Import the required modules.
- Write down the given values: the claimed mean 𝜇 = 0.8 mm.
- Treat the generated ball radii as the sample data and compute the sample mean and the sample standard deviation s.
- Compute the t statistic, t = (X’ − 𝜇)/(s/√n).
- Choose 𝛼 as 0.01, 0.05 or 0.1.
- Compute the t critical value and the p-value with n − 1 degrees of freedom, and compare.

The data says that we do not have enough evidence to reject the null hypothesis.
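And the t-test version for the pens (synthetic data; SciPy’s `stats.t` is assumed available, and the sample size, seed and 𝛼 are assumptions):

```python
import random
import statistics
from scipy import stats

random.seed(0)

# Null hypothesis: mu0 = 0.8 mm; the population sigma is unknown.
mu0, n = 0.8, 30
radii = [random.gauss(0.82, 0.08) for _ in range(n)]

x_bar = statistics.mean(radii)
s = statistics.stdev(radii)              # sample standard deviation
t_stat = (x_bar - mu0) / (s / n ** 0.5)  # t test statistic

alpha = 0.05
df = n - 1                               # degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

if abs(t_stat) > t_crit:
    print("reject the null hypothesis")
else:
    print("not enough evidence to reject the null hypothesis")
```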

updated May-02–2023: minor changes, link updated and sd value changed to 0.08

Reference: Wikipedia, Data Science Stack Exchange, Data Science Course by IITM professors K Mitesh, Pratyush, scikit-learn documentation.