Statistics is the science of collecting, describing and drawing inferences from data.
A population is the total collection of all objects that we are interested in studying.
A sample is the subset of the population that we actually study in order to draw inferences about the population.
We are typically interested in estimating some parameter of the population.
The corresponding quantity estimated from a sample is called a statistic.
A statistic is any numerical property of a sample that is used as an estimate of the corresponding parameter of the population.
Under descriptive statistics, we look closely at the following topics:
- Different types of data.
- Different types of plots.
- Measures of centrality and spread.
Types of Data:
This is the first topic under descriptive statistics. Different types of data can be broadly classified as shown in the figure below.
Let’s have a look at the Qualitative Data:
Qualitative or Categorical attributes are those which describe the object under consideration using a finite set of discrete classes.
Examples: colors, sizes, ratings, patterns.
- Nominal attributes are those qualitative attributes in which there is no natural ordering in the values that an attribute can take. Examples: colors, patterns.
- Ordinal attributes are those qualitative attributes in which there is a natural ordering in the values that an attribute can take. Examples: sizes, ratings.
Quantitative attributes are those which have numerical values, and which are used to count or measure certain properties of a population.
- Discrete attributes are those quantitative attributes which can take on only a countable set of separate numerical values (typically integers).
- Continuous attributes are those quantitative attributes which can take on any value in an interval, including fractional values (real numbers).
Types of plots:
This is the second topic under descriptive statistics. In the previous topic we saw the different types of data; in this topic we look at the different types of plots. The number of times a value appears in the data is called its frequency. Graphs drawn using these frequencies are called frequency graphs (histograms, bar charts, pie charts, relative frequency charts, grouped charts, etc.). They are useful for:
- Identifying the discriminatory features.
- Analyzing output scores/responses.
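The idea of frequency described above can be sketched in a few lines of plain Python. The list of values here is made up for illustration and is not taken from any dataset used in this article.

```python
from collections import Counter

# Hypothetical observations of a categorical attribute (illustrative only).
methods = ["Transit", "Radial Velocity", "Transit", "Imaging", "Transit", "Radial Velocity"]

# Frequency: the number of times each value appears in the data.
frequency = Counter(methods)

# Relative frequency: each count divided by the total number of observations.
total = sum(frequency.values())
relative_frequency = {value: count / total for value, count in frequency.items()}

# Print a small frequency table, most frequent value first.
for value, count in frequency.most_common():
    print(f"{value:16s} {count}  {relative_frequency[value]:.2f}")
```

A bar chart or pie chart is simply a drawing of such a table, with one bar or slice per value.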
We use the planets dataset from the seaborn library.
Load the planets dataset.
Plot a histogram of the mass data using seaborn.
To plot categorical values, we can use seaborn's bar plot. For this we group the planets dataset by its method column, which describes the method used to identify each planet.
Since we are interested in a Series rather than a DataFrame, the figure below captures the per-method counts as a Series.
With the above Series, we draw the bar plot.
A plot between two variables lets us draw out information about how they relate. We consider the planets dataset and plot method against year to understand which methods have been most productive over the years.
We can observe that the Radial Velocity and Transit methods contributed the most, and that the largest number of planets were identified around 2010.
To observe how a continuous variable is distributed across the levels of a categorical variable, we use the penguins dataset from seaborn and look at species against body mass, as shown below.
Load the penguins dataset.
The graph below shows that the Gentoo species has a higher body mass than the other species.
If we want to look at the data conditioned on another variable, for example the sex column (male or female) in this dataset, let's see how we can differentiate the data.
We have observed that a swarm plot shows the distribution by drawing every data point. If we are interested in the shape of the distribution rather than in the individual points, we can use a violin plot, which also conveys some of the information given by a box plot.
Let's compare the data based on the sex column, for which we split the data as shown below.
Here the inner argument takes the value "quartile", which draws the quartile lines in the graph as shown above.
The relationship between variables can be shown using line plots. We use the healthexp dataset from seaborn to draw line plots, as shown below.
Load the healthexp dataset as shown below.
We draw a line plot of life expectancy against spending in dollars, and we use country as the hue to draw conclusions per country.
From the above graph we can see that Japan's life expectancy is high compared to the other countries, while its spending in dollars is moderate.
Scatter plots are used to reveal the relationship between variables/features.
They can be used for two discrete variables, two continuous variables, or one continuous and one discrete variable, but not for qualitative variables.
- Identify correlated features.
We use the tips dataset from seaborn and draw a scatter plot to examine the tips received.
Load the tips dataset.
We can observe that as the total bill increases, the tip also increases. The hue shows whether the meal was lunch or dinner, and we can observe that dinner tips are higher in value than lunch tips.
Measures of centrality and spread:
This is the third topic under descriptive statistics. In the previous topics we explored the different types of data and the different ways to express data in charts or plots. In this topic, we discuss the measures of centrality and spread.
Q: Why do we need measures of centrality and spread?
Plots work well for summarizing small and medium-sized data, but for large data we rely on summary statistics, especially for quantitative data:
- Measures of centrality (mean, median, mode)
- Percentiles (quartiles, quintiles, deciles)
- Measures of spread (range, IQR, variance, standard deviation)
First, we look at the measures of centrality: what they are and what characteristics they have.
Q: What are the different measures of centrality?
Given n data points, let's measure their centrality.
Mean: the mean measures centrality as the sum of all the data points divided by the total number of data points, that is, (x1 + x2 + ... + xn) / n.
For a sample the mean is denoted x̄ (x-bar), whereas for a population it is denoted μ.
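As a minimal sketch of the definition of the mean, with a made-up sample (the function name `mean` and the data are chosen only for illustration):

```python
def mean(values):
    """Sum of all data points divided by the number of data points."""
    return sum(values) / len(values)

sample = [2, 4, 4, 5, 10]   # hypothetical sample
x_bar = mean(sample)        # sample mean: 25 / 5
print(x_bar)                # 5.0
```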
Median: the median is the value that appears at the center of the data when the data is sorted.
Case 1: if n is odd, the median is the element at the center of the sorted data, i.e. the ((n + 1) / 2)-th element.
Case 2: if n is even, the median is the mean of the two middle elements, i.e. the (n / 2)-th and (n / 2 + 1)-th elements.
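The two cases for the median can be sketched as follows. This is an illustrative implementation with made-up inputs; Python's standard library also offers `statistics.median`, which handles both cases.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Case 1: n is odd, take the single central element.
        return ordered[mid]
    # Case 2: n is even, average the two central elements.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))      # 5
print(median([7, 1, 5, 3]))   # 4.0
```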
Mode: the mode is the most frequently occurring element in the data.
Case 1: single mode, only one value is most frequent.
Case 2: multiple modes, more than one value is most frequent.
Case 3: no mode, all values appear exactly once.
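The three cases for the mode can be sketched with a small helper. The name `modes` is hypothetical and the function assumes non-empty input; it returns an empty list when no value repeats, to match case 3 above.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Case 3: every value appears exactly once, so there is no mode.
        return []
    # Cases 1 and 2: one or several values share the highest count.
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 2, 3]))       # [2]
print(modes([1, 1, 2, 2, 3]))    # [1, 2]
print(modes([1, 2, 3]))          # []
```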
Q: What are some characteristics of these measures?
Deviation: The deviation of a point from the mean is defined as the difference between the point and the mean.
The sum of all the deviations from the mean is zero.
The physical interpretation of the mean is the center of gravity.
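The claim that deviations from the mean sum to zero can be checked on a small made-up example. The data is chosen so that the mean is a whole number; with arbitrary floating-point data the sum may differ from zero by a tiny rounding error.

```python
data = [2, 4, 4, 5, 10]                  # hypothetical data
mean = sum(data) / len(data)             # 5.0

# Deviation of each point from the mean.
deviations = [x - mean for x in data]    # [-3.0, -1.0, -1.0, 0.0, 5.0]

# Positive and negative deviations balance out, like weights around a center of gravity.
print(sum(deviations))                   # 0.0
```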
Sensitivity to outliers: we define an outlier as any point which is far away from the other points in the data.
The mean is very sensitive to outliers, whereas the median is not. The mode is also not sensitive to outliers unless the mode itself is an outlier.
To reduce this sensitivity we can use a trimmed mean, which is computed after dropping the k most extreme elements from each end of the sorted data.
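A rough sketch of the trimmed mean, and of how a single outlier moves the plain mean, might look like this. The function name `trimmed_mean` and the data are illustrative assumptions.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large: no data would remain")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [2, 4, 4, 5, 1000]            # hypothetical data with one outlier

plain_mean = sum(data) / len(data)   # pulled far up by the outlier
middle = sorted(data)[len(data) // 2]  # median for odd n, unaffected by the outlier
trimmed = trimmed_mean(data, 1)      # mean of [4, 4, 5]

print(plain_mean)   # 203.0
print(middle)       # 4
print(trimmed)
```

Here the plain mean is far from every typical value, while the median and the trimmed mean stay close to the bulk of the data.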
Next, we discuss percentiles and their common special cases. The p-th percentile is the value below which roughly p percent of the data falls.
- Quartiles: Quartiles divide the data into four equal parts.
- Quintiles: Quintiles divide the data into five equal parts.
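Quartiles and quintiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8+). The sketch below uses made-up data and the "inclusive" method; other interpolation conventions exist and give slightly different cut points.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data

# n=4 returns the 3 cut points that split the data into four equal parts.
quartiles = quantiles(data, n=4, method="inclusive")

# n=5 returns the 4 cut points that split the data into five equal parts.
quintiles = quantiles(data, n=5, method="inclusive")

print(quartiles)   # [3.0, 5.0, 7.0]
print(quintiles)
```

The middle quartile is the median, which is also the 50th percentile.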
Next, we discuss the measures of spread.
When all the values in the data are very close to the mean and median, the data has low variability. Similarly, if some values are far away from the mean and median, the data has high variability.
Measures of centrality don't give any information about the spread or variability of the data.
Range: the range is the maximum value in the data minus the minimum value. It gives us information about the spread/variability of the data.
However, when the data contains outliers, the range may not reflect the true spread. Like the mean, the range is very sensitive to outliers.
Interquartile range (IQR): It is the difference between the 75th percentile and the 25th percentile of the given data.
Because the IQR depends only on the 25th and 75th percentiles, extreme values in the tails do not affect it. We can therefore consider the IQR not sensitive to outliers.
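The difference in outlier sensitivity between the range and the IQR can be illustrated on made-up data. The helper names below are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is one convention among several.

```python
from statistics import quantiles

def value_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

def iqr(values):
    """75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]           # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # same data, last value replaced by an outlier

print(value_range(clean), value_range(with_outlier))   # 8 899
print(iqr(clean), iqr(with_outlier))                   # 4.0 4.0
```

One extreme value changes the range dramatically but leaves the IQR unchanged in this example.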
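We say that x is an outlier if it lies far below the 25th percentile, x < Q1 - 1.5 × IQR, or, in the case of the 75th percentile, far above it, x > Q3 + 1.5 × IQR. Here Q1 and Q3 are the 25th and 75th percentiles, and the factor 1.5 is the commonly used convention.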
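An IQR-based outlier check can be sketched as below. It assumes the widely used 1.5 × IQR fences around the 25th and 75th percentiles; the factor, the function name `find_outliers`, and the data are assumptions made for illustration.

```python
from statistics import quantiles

def find_outliers(values, factor=1.5):
    """Return values below Q1 - factor*IQR or above Q3 + factor*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1                  # the IQR
    lower = q1 - factor * spread      # lower fence
    upper = q3 + factor * spread      # upper fence
    return [x for x in values if x < lower or x > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 900]  # hypothetical data
print(find_outliers(data))            # [900]
```

For this data Q1 = 3 and Q3 = 7, so the fences are -3 and 13, and only 900 falls outside them.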
The sum of deviations from the mean doesn't give us any information about the spread of the data,
because it is always zero: the positive deviations cancel the negative ones.
The solution is to average the squares of the deviations from the mean. This average is known as the variance.
We square the deviations rather than taking their absolute values because the square function has more convenient properties: its derivative is easier to work with than that of the absolute value, and it magnifies the contribution of outliers.
The drawback is that variance is not measured in the same units as the data; its units are the square of the input units.
To measure spread in the same units as the input, we take the square root of the variance. This square root is called the standard deviation.
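A minimal sketch of variance and standard deviation follows, using made-up data. It divides by n, i.e. it averages the squared deviations as described above; the sample variance used for estimation divides by n - 1 instead, which is not shown here.

```python
import math

def variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical data with mean 5

print(variance(data))                # 4.0
print(standard_deviation(data))      # 2.0
```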
In manufacturing, the goal is to maintain zero or very minimal variance in the products. Examples: the radius of a tyre, the length of a dumbbell, the size of a smartphone and its parts.
Standardizing the data:
In a given dataset, different features may have very different ranges, which can affect an algorithm trained on the data. To avoid this, we standardize the data.
Instead of expressing distances or differences between data points in absolute values, we express them in units of standard deviation: each value x is replaced by z = (x - mean) / standard deviation.
After standardizing, the data always has zero mean and unit variance.
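Standardization can be sketched in plain Python as below. The function name `standardize` and the data are illustrative; the sketch uses the divide-by-n standard deviation and assumes the data is not constant.

```python
import math

def standardize(values):
    """Replace each value x by (x - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std == 0:
        raise ValueError("cannot standardize constant data")
    return [(x - mean) / std for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical feature values
z = standardize(data)

z_mean = sum(z) / len(z)
z_variance = sum((v - z_mean) ** 2 for v in z) / len(z)

print(z)            # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(z_mean)       # 0.0
print(z_variance)   # 1.0
```

Each standardized value says how many standard deviations the original value lies above or below the mean, so features with different ranges become comparable.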
Box plots are used for visualizing spread of the data, median of the data and outliers in the data.
Let's consider data that lies between -1 and 1, with outliers identified at -3 and 3, as shown in the figure below, which describes the different parts of the box plot.
We load the planets dataset using seaborn and then call the pandas describe function, which gives the min, max, 25th percentile, 50th percentile, 75th percentile, mean and standard deviation of the data.
Load the planets dataset.
Use the pandas describe function.
The figure below shows the min, max, median, 25th percentile and 75th percentile for the year column of the planets dataset.
We can see the same information using a box plot, as shown below.
Reference: Wikipedia, Data Science Stack Exchange, Data Science Course by IITM professors K Mitesh, Pratyush, scikit-learn documentation.