## Introduction

Statistics and probability form the central point of data analysis, manipulation, and formatting. Methods like mean, mode, median, standard deviation, etc. are used in almost all data science problems. In this article, we will discuss the most important and widely used methods of statistics and probability for data science.

## Examples and Importance of data

Data is a collection of information from various sources. For example, details of a student, employee like name, age, salary, marks, etc. are all data. Data can be factual or derived from other information. For example, based on what a user orders on a food delivery app frequently, we can analyze his taste and preferences.

There are two types of data: Quantitative and Qualitative.

### 1. Quantitative data

Quantitative data can be counted. It is to the point and conclusive. For example, if we have average marks of two students as 89.7% and 84.3%, we can conclude that the former person is a better performer in academics. Quantitative data is self-descriptive. It is easy to plot a graph with quantitative data as the values are fixed. Quantitative information is obtained through surveys, experiments, tests, market reports, metrics, etc.

Quantitative data is further classified as discrete and continuous. Continuous values are those that can be broken down into smaller parts—for example, the speed of a train, your height, weight, and age.

Discrete values are integers and have a limit. For example, how many people enrolled for a particular course, the amount of money you gave for buying a new laptop etc.

### 2. Qualitative data

Qualitative data is unstructured (or semi-structured) and non-statistical. You cannot measure or plot such data using graphs straightaway. Such data can be classified using its attributes, labels, properties and different identifiers—for example, shopping preferences of a person. Qualitative data often answers the question ‘why.’ Qualitative data is used for developing a hypothesis based on initial observations and understanding. This type of data can be obtained using interviews, texts, documents, images, video transcripts, audios, notes, and so on.

Qualitative data can be further divided into two – ordinal and nominal.

**Nominal:**Data that is not in any particular order and is used for labeling. For example, the gender or hair color of a person.**Ordinal:**Ordinal data follow a particular order. For example, a survey form where the answer is categorized as good, very good, excellent in order of preference.

Check out our article on Data Analysis Techniques to know the various techniques to analyze quantitative and qualitative data.

So, which type of data are we going to discuss today?: Qualitative or quantitative?

Let us find out if you haven’t done that already!

## What is Statistics?

Statistics is the science of collecting vast amounts of numerical data for analysis, interpretation, and presentation of data for further use. Statistics help to obtain the right data, apply the best methods for analysis, and present the data in the most understandable and accurate form. Statistics is widely used in data science for analysis, interpretation, presentation, and making decisions and predictions.

## Jargons in statistics

The most common terms used in statistics are:

- Population: It is the entire pool of data or observations that can be made. The population can be a whole dataset of people, places, drugs, etc.
- Parameter: It is a numerical property of a population. For example, mean, median.
- Sample: Sample is a subset taken from a population.
- Statistic: A value that can be computed from the data that has known parameters.
- Random variable: Assignment of a number to all the possible outcomes of a random experiment.
- Range: Range is the difference between the largest and the smallest values in a set of data. For example, if the dataset has {1,2,3,4,5,6,7} as its values, the range is 7-1 = 6
- Mode: Mode is the most common value occurring in a list.
- Check out more statistics terms in this comprehensive glossary.

## Types of Statistics

There are two types of statistics:

### 1. Descriptive statistics

Descriptive statistics are used to determine the basic features of data. You can get a summary of the sample and the measures. Every quantitative data analysis starts with descriptive statistics methods that show just what the data contains and simplify vast amounts of data through summary like average, mean, etc.

Descriptive analysis can be done in the following ways:

#### 1.1 Univariate analysis

As the name says, it involves examining one variable at a time. We can look at the following characteristics of a single variable –

**Distribution:**Distribution is the range of values for a variable, i.e., how the variable is distributed. For example, to represent the distribution of income among various age groups, we can divide them into different categories as per the range of values.

Age | Percentage |

Under 30 | 25% |

31-55 years | 45% |

56-70 years | 20% |

71+ | 10% |

The most common type of distribution (as shown above) is a frequency distribution. A frequency distribution can be represented graphically using a bar chart or histogram.

#### 1.2 Central tendency

Central tendency is the measure of central value in the range of benefits. The most common types of central tendency measures are – mean, median, and mode.

**Mean:** mean is the average value of all the values in the distribution. For example, the marks of a student in various subjects are 80, 82, 85, 89, 93. Thus, the average or mean value will be the (sum of all the values/number of values) = (80 + 82 + 85 + 89 + 93)/5 = 85.8

**Median:** The median is the value found at the exact center of the entire distribution. For this, we have to sort the values in order:

80, 82, 85, 89, 93

The middle value is = 85.

Since we had five values, it was easy to find the middle. What if we had 6? Say, 80, 82, 85, 89, 93, 98

Now, the middle values are 85 and 89. The median has to be estimated. A common method to do this is to find the average. In this case, 87.

**Mode:** Mode is the most frequently occurring value in the distribution. Let us say, there are a set of values {20, 25, 40, 40, 30, 20, 20, 35, 40, 45, 50, 30, 20}

We see that the value 20 occurs four times, which is the maximum out of all. So, the mode is 20. Sometimes, there might be more than one mode value in a given sample.

**Dispersion:** It is the spread of values around the central tendency. There are two ways to measure dispersion – range and standard deviation

**Range:** Range is the difference between the highest and the lowest value. For example, in {80, 82, 85, 89, 93}, the range is 93-80 = 13.

**Standard deviation:** The problem with range is that if there is an outlier, the range value can change drastically. In the above example, what if the lowest value was 20 while all others in the distribution were above 80. That would radically change the range value. To avoid such a problem, we use the standard deviation.

How to calculate standard deviation –

Step 1 – First, calculate the mean (average) of the values. In our case, the mean is 85.8.

Step 2 – Calculate the distance of each value from the average value:

80 – 85.8 = -5.8

82 – 85.8 = -3.8

85 – 85.8 = -0.8

89 – 85.8 = +3.2

93 – 85.8 = +7.2

Step 3 – Square the differences

-5.8 * -5.8 = 33.64

-3.8 * -3.8 = 14.44

-0.8 * -0.8 = 0.64

3.2 * 3.2 = 10.24

7.2 * 7.2 = 51.84

Step 4 – Calculate the sum of squares:

33.64 + 14.44 + 0.64 + 10.24 + 51.84 = 110.8

Step 5 – Divide this by the (number of values – 1). In our case, it is 4.

110.8/4 = 27.7

The above value is called the variance.

Step 6 – Calculate the standard deviation, which is the square root of the variance.

Standard deviation gives out a better spread of data or variation in data than range.

#### 1.3 Correlation and Correlation matrix (Bivariate Analysis)

Correlation is one of the most widely used statistics for data science. It gives a single number as output that explains the degree of relationship between two variables.

Here is an example to understand correlation fully:

I have a small dataset which contains details of the number of hours of activity and happiness quotient of people. Let us say we have a theory that more active people are happier. The activity can be of any type – exercising, pursuing a hobby like cooking, reading, etc. For simplicity, we will use the word ‘activity’ for all activities that involve a person to be physically and mentally active.

Person | Activity hours | Happiness quotient |

1 | 1 | 3 |

2 | 3 | 5 |

3 | 4 | 5 |

4 | 5.5 | 6 |

5 | 2 | 3 |

6 | 3 | 3.5 |

7 | 6 | 6 |

8 | 7.5 | 7 |

9 | 10 | 9 |

10 | 0 | 1 |

11 | 2.5 | 3 |

12 | 3.5 | 4 |

13 | 8.5 | 8 |

14 | 2 | 3.5 |

15 | 4 | 5.6 |

Let us first create a histogram for each and then find the other descriptive analysis parameters like mean, median, standard deviation, range etc.

From the above, we can infer that the maximum number of people fall in the range of 0-4 hours of activity and happiness quotient of 1-4.

Now, let us find some numerical data –

Variable = Activity hours | Variable = Happiness Quotient | |

Mean | 5 | 4.84 |

Median | 3.5 | 5 |

Variance | 7.988095238 | 4.605428571 |

Standard Deviation | 2.826321857 | 2.146026228 |

Range | 10 | 8 |

Note that these values can be easily calculated using Microsoft Excel.

We can now plot a graph as well to determine whether the relationship between both the variables is positive or negative:

We can see that the relationship is positive, i.e., people who spend a higher number of hours on various activities are happier. This proves our earlier theory with facts and data. The fact that the graph is moving in the upward direction indicates the same. This is precisely what correlation also tells us.

Let us now calculate the correlation. It seems a bit tedious to calculate, and the formula is big – but all you have to do is remember just the formula and not how to derive it. So, pretty good!

where x and y are the variables, and n is the number of items in the data. In our case, n = 15.

Now it’s time to calculate and fill the data. The hard work is worth it! (You can do most of it using Microsoft Excel). In our data, here are the following values:

Now, let us apply all the values in the formula,

r = (15*384.9-62.5*72.6)/√(15*372.25-62.52)*(15*415.86-72.62)

= (5773.5-4537.5)/ √(5583.75-3906.25)*(6237.9-5270.76)

= (1236)/1273.726

= 0.9704

Well, the correlation value for our data is 0.97, which is a perfect positive relationship between the two variables. In general, the value of the coefficient is between -1.0 and +1.0, where the signs depict the negative and positive relationships, respectively. For the above example, we have calculated the Pearson Product Moment Correlation.

##### Correlation matrix

While we have taken an example that has two variables, real-world data will have many more variables to consider. If we have to find the relationship between each of them, we will have many possible pairs. In general, the number of correlations is n(n-1)/2, where n is the number of variables. So, if we have five variables, we will have 5*4/2 = 10 correlations. To represent all the correlations, we use a table: this table is called the correlation matrix. Here is an example of how a correlation matrix looks like:

We can see that this is in the shape of a half triangle. The other half will be an exact mirror image of this, so we show only half a triangle of the matrix. A correlation matrix is always symmetric. So, let’s say you wanted to know the correlation value of C6 and C1; you just see C6 and the corresponding C1 value, i.e., 0.236567! Pretty cool, huh?

#### 1.4 Entropy, information gain, and confusion matrix

Entropy is the measure of impurity in a dataset. For example, if all the items in a group are of the same type, the impurity will be zero. If items are of a different kind, the impurity will be non-zero, i.e., the degree of randomness will be high when all the items are of various types. Refer to our article on the decision tree in machine learning, for example, on Entropy.

Information gain measures how much information we can get from a particular variable or feature in the data. For example, if we want to identify how exercise affects a person’s health, and if the data contains a variable like ‘height,’ that variable will have low information gain as height is not a parameter to determine the effect of exercise on a person’s health.

The confusion matrix helps us examine the performance of a predictive model in classification problems. It is a tabular representation of the predicted vs. actual values to determine the accuracy of a model.

Confusion matrix = (TP + TN)/(TP+TN+FP+FN)

where TP = True Positive, TN = True Negative, FP = False Positive and FN = False Negative

A simple example to understand TP, TN, FP, FN:

Let us say we are given data about 20 students, of which seven are eligible for voter id, and 13 are not. We put this data into a model that predicts the outcome to be eight eligible and 12 non-eligible. With this, we can determine the accuracy of the model:

number of samples = 20 | PREDICTED yes | PREDICTED no |

ACTUAL yes | 7 (TP) | 1 (FN) |

ACTUAL no | 2 (FP) | 10 (TN) |

TP represents the column where the actual and predicted both are yes, i.e., the actual and the predicted values both are yes.

TN represents values where actual and predicted values both are no.

FP indicates that something is not true (i.e., the students are not eligible), but the model predicted that it is true (i.e., the model predicted that they are eligible). FP is also called a type I error.

FN indicates that something true is deemed as false by the model: The students are eligible, but the model predicted that they are not. FN is also called a Type II error.

### 2. Inferential statistics

In this type of statistical analysis, we are trying to see what is directly seen from the data obtained, i.e., make inferences from the observations made through descriptive analysis. For example, if you want to compare the average marks obtained by students of two different colleges, you can use the t-test. There are many methods for performing inferential statistics, most of them based on the General Linear Model.

#### 2.1 Paradigms for inference

There are several paradigms for inference, which are not mutually exclusive; instead, they work best under one paradigm but have interesting analysis in others. The four most common paradigms are –

- Classical (frequentist) inference: Here, datasets similar to the one in question are said to be produced through repeated sampling of population distribution. Some examples are p-value, confidence interval
- Bayesian inference: The Bayesian approach makes statistical propositions based on posterior beliefs. Degrees of belief are described using probability. Examples – interval estimation (credible interval) and model comparison using Bayes factors
- Likelihood-based inference: This approach uses a likelihood function known as maximum likelihood estimation.
- AIC (Akaike Information Criterion) based inference: It is a measure of the relative quality of the statistical models, i.e., a pair (S, P), where S is the set of possible observations, and P is the set of probability distributions for a given dataset.

#### 2.2 Interval estimation

Interval estimation is the use of an interval or range of values that are used to estimate a population parameter. The estimated value occurs between a lower and upper confidence limit. Let us say; you want to monitor your electricity bills. You get the bill amount to be somewhere between 2000 and 3000Rs. This is called an interval, where you don’t specify an exact value; instead, you specify a range or interval. Based on this, you can predict that the electricity bill for next month could also lie in the same interval.

To statistically determine this, there are two essential terms we need to learn –

- Confidence interval: The measure of confidence that the interval estimate contains the population mean, i.e., uncertainty associated with a sample estimate
- The margin of error: It is the highest difference between the point estimate (the best estimate of a population parameter) and the actual value of the population parameter.

#### Maximum likelihood estimation

In this method, parameters of a probability distribution are estimated by maximizing a likelihood function by selecting parameter values that make the observed data as the most probable. The point in the parameter space that maximizes the likelihood function is the ‘maximum likelihood estimate.’

#### How is inferential statistics done?

There are several tests for inferential statistics. It is not a straightforward task to compute statistical inference, and the detailed calculations are beyond the scope of this article – each of the below methods needs a piece of its own for explanation and understanding (coming up soon…!)

- The t-test: The t-test or Turing test determines if the means of two groups are statistically different from each other. By this, we can know whether the probability of the results that we obtained are by chance or are they repeatable for an entire population.
- Dummy variables: It is a numerical value that represents the subgroups of the sample in consideration. A dummy variable is used in regression analysis and enables us to represent multiple groups using a single equation.
- General Linear Model: GLM is the foundation for tests like the t-test, regression analysis, factor analysis, discriminant function analysis, etc. A linear model is generalized and thus useful in representing the relationship between variables, through the general equation,

y = b0 + bx + e, where y is the set of outcome variables, x is the set of covariables, b is the set of coefficients (one for each x), and b0 is the set of intercept values.

- Analysis of Covariance: ANCOVA is used to analyze the differences in the mean values of the dependent variables. These dependent variables are affected by controlled independent variables while taking into consideration the influence of uncontrolled independent variables. For example, how the eating patterns of customers will change if a restaurant is moved to a new location.
- Regression-discontinuity analysis: The goal of this method is to find the effectiveness of a solution. For example, a person claims that a controlling diet can reduce weight, a second person says that exercise is the most effective way to reduce weight, a third opinion is that taking weight-loss pills helps in faster weight loss. To find the outcome variable, we need to identify and assign participants that fit into each of these criteria (treatment groups), and the outcome is measured later.
- Regression point displacement analysis: It is a simple quasi-experimental strategy for community-based research. The design is such that it enhances the single program unit situation with the performance of larger units of comparison. The comparison condition is modeled from a heterogeneous set of communities.
- Hypothesis testing: In this inferential statistics technique, statisticians accept or reject a hypothesis based on whether there is enough evidence in the sample distribution to generalize the hypothesis for the entire population. This can lead to one of the following:
- Null hypothesis: The result obtained is the same as what was assumed before
- Alternate hypothesis: The result is different from the assumption made before

An example of this can be a medicine that is available in the market for treating cough. If ten people take the medication on one day and get cured on the 5th day, then we can say that the probability that a person gets cured with the medicine in 5 days is high. However, if we take the data of 100 people and test the same on them, the results may vary. If those 100 also produce the same results, then we reach the null hypothesis, else if the majority of the 100 people are not cured with the medicine in 5 days, then we reach an alternate hypothesis.

## Sampling

Sampling is a process in which a part of data is taken from the entire population for analysis.

There are five types of sampling –

- Random: in this type of sampling, all the members of the sample (or group) have equal chances of being selected.
- Systematic: In this type, every nth record from the sample is chosen to be a part of the sample. For example, out of numbers from 1-100 in a sample, every 5th number is selected as part of the sample, i.e., 5, 10, 15, etc.
- Convenience: This type of sampling involves taking samples from the part of the population that is closest or first to reach. It is also called grab sampling or availability sampling.
- Cluster: Samples are split into naturally occurring clusters. Each group should have heterogeneous members, and there should be homogeneity between groups. From these clusters, samples are selected using random sampling.
- Stratified: In this process, homogeneous subgroups are combined before sampling. Samples are then selected from the subgroups using random or systematic sampling methods.

## What is Probability?

Probability is the measure of how likely it is for an event to occur. For example, if the weather is cloudy, the likelihood (or probability) of rain is high, if you roll a dice, the probability of getting the number ‘6’ is 1/6, the probability of getting a ‘Queen’ from a pack of cards is 4/52. Similarly, the probability of getting a ‘queen of hearts’ from the pack is 1/52.

In general, Probability P(event) = desired outcome/number of outcomes.

Most problems in the real world do not have a definite yes or no solution, i.e., the likelihood or probability of one event is more than other – rather than a definitive 1 or 0. That’s why probability is so significant. Probability is one of the most critical topics of statistics.

### Probability Jargons

- Event: The event is a combination of various outcomes. For example, if the weather is cloudy, there can be two possible events – rain or no rain.
- Experiment: It is a planned operation performed under controlled conditions. For example, getting Q of hearts on two successive picks from face cards is an experiment.
- Chance experiment: If the result of an experiment is not predetermined, such an experiment is called a chance experiment.
- Random experiment: It is an experiment or observation that can be repeated n number of times given the same set of conditions. The outcome of a random experiment is never certain.
- Outcome: The result of an experiment is called an outcome.
- Sample space: Sample space is the set of all the possible outcomes. Usually, the letter S is used to indicate sample space. If there are two outcomes of tossing a coin head (H) and tail (T), then the sample space can be represented as S = {H, T}.
- Likelihood: It is a measure of the extent to which the sample supports particular values of a parameter. Read more about likelihood vs. probability.
- Independent events: If the probability of one event occurring doesn’t affect the probability of the other event in any way, such events are said to be independent events.
- Factor: A function that takes some number of random variables as an argument and produces output as a real number. Factors enable us to represent distributions in higher dimensional spaces.

## Types of Probability

There are three main types of probabilities that form the foundation for machine learning. We will discuss these below –

### 1. Marginal probability

Marginal probability indicates the probability of an event for one random variable, amongst other random variables, where the outcome is independent of other random variables. The simplest way to say this is:

P(A) = X for all outcomes of B

The probability of occurrence of one event in the presence of a subset of or all outcomes from other random variables is called marginal distribution. In other words, the marginal probability distribution is the sum or union of all the probabilities of the second variable (B):

P(A) = X or P(A=X) = sum (P (A=X, B=∑Bi))

### 2. Conditional probability

It is the probability of an event A occurring provided another event B has occurred. The occurrence of the other event B is not certain, but not zero. A conditional probability distribution is the probability of one variable to one or more random variables.

Conditional probability = P(A given B) = P(A|B)

P(A given B) can be calculated using the formula:

P(A given B) = P(A and B)/P(B),

where P(A and B) is the joint probability of A & B.

#### Bayes Theorem

Bayes’s theorem is based on conditional probability. It describes the probability of an event A based on the prior knowledge of conditions related to the event. The Bayesian theorem is also an approach for statistical inference or inferential analysis that we have discussed previously in this article and is key to Bayesian statistics.

P(A|B) = P(B|A)*P(A)/P(B)

where, P(A|B) = probability of A given B (conditional probability),

P(A) & P(B) = probability of A & B respectively,

P(B|A) = likelihood of B occurring given A is true

### 3. Joint probability

A joint probability is the probability of events A & B and is represented as P(A and B). The joint probability of two or more events or random variables is called the joint probability distribution.

A joint probability is calculated as the product of the probability of event A given probability of event B and the probability of event B:

P(A and B) = P(A U B) = P(A given B) * P(B)

This calculation is also referred to as product rule. Do you recollect what is P(A given B)? Yes, that’s the conditional probability as we saw in the previous section.

Remember that P(A U B) = P(B U A), i.e. joint probability is symmetrical.

## Probability Distribution Functions

- Probability density function (Continuous probability distribution): It is a measure that specifies the probability of a random variable occurring between a specified range rather than a single value. The exact probability is found using the integral of the range of values of that variable. When plotted on the graph, the graph looks like a bell curve, and the peak of the curve represents the model.
- Probability Mass function (Discrete probability distribution): It is a statistical measure to predict the possible outcome of a discrete random variable giving a single outcome value—for example, the price of stock or number of sales, etc. The values, when plotted on the graph, are shown as individual values, which should be non-negative and add to make 1.

- Normal (Gaussian) distribution: It is a type of continuous probability distribution and holds much importance in statistics. They are used to represent real-valued random variables for which distributions are unknown. In a standard distribution curve, also called a bell curve, the data is centered around the center of the curve, and the mean, mode, and median values are all the same. The graph is symmetrical about its center.
- Central limit theorem: This is a crucial concept in probability and statistics theory as it means that the methods of probability and statistics that work for normal distributions also work for many other types of distributions. CLT establishes that even if the original variables are not generally distributed if we add independent random variables, their properly normalized sum bends as a normal distribution curve. The power of CLT is seen when the samples are high in number.

Mean

## Conclusion

We have discussed in detail the various statistics and probability methods used for data science. These are simple but vast and powerful. While doing analysis, you need not remember the exact formulae – you can find that anywhere on the web – what’s important is to know when to apply which formula. The methods we saw are just the beginning of the oceans of ways and algorithms that are used in data science and will create a strong base for your further exploration of the subject. It takes experience and time to grasp all the concepts, and even if you didn’t get any, don’t worry – you will surely get it with practice.