What is a probability distribution in statistics?
Before discussing the concept of a probability distribution, it is important to understand the meaning of statistics, the idea behind using statistics, and the types of statistics.
What is the meaning of statistics and its role in machine learning?
Statistics is the science that deals with methodologies to collect, organize, review, analyze, and draw conclusions from data. It is used in many disciplines like marketing, business, healthcare, telecom, etc.
Types of Data and Scale of Measurement
The shape of a distribution of the data is described by measures of central tendency (mean, median, and mode) and measures of variability/dispersion (range, variance, and standard deviation).
Standard deviation measures the dispersion of a set of data from its mean. It is the square root of the variance and is denoted by σ (the variance is σ²).
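As a small illustration of these measures, the sketch below computes them with Python's standard `statistics` module. The sample values are hypothetical and chosen only so the numbers are easy to follow, and the population forms of variance and standard deviation are used as an assumption.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of variability/dispersion
data_range = max(data) - min(data)
variance = statistics.pvariance(data)  # population variance (sigma squared)
std_dev = statistics.pstdev(data)      # population standard deviation (sigma)

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by n − 1 instead of n.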
Let us discuss the different types of probability distributions:
The probability density function plays an important role in many probability distributions, so we need to understand this concept first.
Probability Density Function
A Probability Distribution is a mathematical function through which the probability of occurrence of different possible outcomes in an experiment can be calculated.
The equation describing a continuous probability distribution is called a probability density function (PDF).
It has the following properties:
1. The graph of a probability density function (PDF) is continuous over its range.
2. The area bounded by the curve over its whole range, say between (a) and (b), is always equal to 1. The area over any sub-interval gives the probability that the variable falls in that interval.
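The area property can be checked numerically. The sketch below is only an illustration: it assumes the standard normal density as the example PDF and uses a simple trapezoidal rule; the helper names `normal_pdf` and `area_under` are hypothetical.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def area_under(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

# Almost all of the probability mass lies within 8 standard deviations of the mean.
total_area = area_under(normal_pdf, -8.0, 8.0)
# Area over a sub-interval: probability of falling within one standard deviation.
one_sigma = area_under(normal_pdf, -1.0, 1.0)

print("total area:", round(total_area, 6))
print("P(-1 <= X <= 1):", round(one_sigma, 4))
```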
Normal Distribution (Gaussian distribution)
The normal distribution is a probability distribution that associates a normal random variable (X) with the cumulative probability.
It depends upon two factors: the mean (µ) and the standard deviation (σ).
The normal distribution is characterized by the following features:
The normal distribution is symmetric: its mean, median, and mode are all the same.
A distribution that is skewed to either the left or the right is called asymmetric.
There are several methods to check for skewness in a dataset, e.g., a boxplot or a KDE plot. Strong skewness often goes together with outliers in the longer tail. To limit the effect of this on later analysis, we can use normalization, which brings the scale of the dataset down into a specific range without changing the nature of the data (the shape of the distribution stays the same); in this manner, the numerical dispersion of the dataset comes down.
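Besides plots, skewness can be measured with a number. The sketch below is a minimal illustration, assuming the moment-based (Fisher-Pearson) coefficient of skewness and min-max scaling as the normalization; the data and the helper names `skewness` and `min_max_scale` are hypothetical. It shows that rescaling lowers the dispersion but leaves the skewness coefficient itself unchanged.

```python
import statistics

def skewness(values):
    """Moment-based (Fisher-Pearson) coefficient of skewness, population form."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical data: one large value creates a long right tail.
data = [1, 2, 2, 3, 3, 3, 4, 4, 20]
scaled = min_max_scale(data)

print("skewness before:", skewness(data))
print("skewness after: ", skewness(scaled))
print("std dev before: ", statistics.pstdev(data))
print("std dev after:  ", statistics.pstdev(scaled))
```

A positive coefficient indicates a longer right tail, a negative one a longer left tail.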
Standard Normal Distribution
A normal distribution with µ = 0 and σ = 1 is called the standard normal distribution. Any normal distribution can be converted to it using the Z statistic, z = (x − µ) / σ, which shifts and rescales the entire graph (data).
How does standardization differ from normalization?
Normalization means scaling the values of a feature down to the range between 0 and 1. Example: the min-max scaler.
Standardization means converting all the values of a feature so that they have mean (µ) = 0 and standard deviation (σ) = 1. If the feature is normally distributed, the result follows the standard normal distribution.
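The difference is easiest to see side by side. The following sketch applies both transformations to the same feature in plain Python; the height values and the function names `normalize` and `standardize` are hypothetical, and the population standard deviation is used as an assumption.

```python
import statistics

def normalize(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: shift to mean 0 and rescale to standard deviation 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values (heights in cm).
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0]

normalized = normalize(heights_cm)
standardized = standardize(heights_cm)

print("normalized:  ", normalized)
print("standardized:", standardized)
```

Normalized values are bounded by 0 and 1, while standardized values are unbounded and centered on 0.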
But one question arises here: Why do we need to convert normal distribution to standard normal distribution?
The answer is that when performing a test such as the Z-test on a sampling distribution, we need to look up values in a statistical table such as the Z-table, in which all the values have been generated using the standard normal distribution. Other tests, such as the t-test and F-test, have tables built from their own standardized distributions.
Standardizing the data is therefore what lets us use these tables, so we convert the normal distribution into the standard normal distribution before reading off probabilities. The condition for using the Z-table is that we should know the population parameters, in particular the population standard deviation (σ).
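A table lookup can be reproduced in code. The sketch below assumes hypothetical population parameters (µ = 100, σ = 15) and uses `statistics.NormalDist` from the Python standard library in place of a printed Z-table.

```python
from statistics import NormalDist

# Hypothetical, assumed-known population parameters.
mu, sigma = 100.0, 15.0
x = 130.0

# Convert the observation to a z-score.
z = (x - mu) / sigma

# Cumulative probability from the standard normal distribution,
# which is the value a Z-table would list for this z.
p_below = NormalDist().cdf(z)

# The same probability computed directly on the original scale.
p_direct = NormalDist(mu, sigma).cdf(x)

print("z =", z)
print("P(X <= x) via z:", round(p_below, 4))
print("P(X <= x) direct:", round(p_direct, 4))
```

Both routes give the same probability, which is why a single table for the standard normal distribution is enough for every normal distribution.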
Student’s T Distribution
It is symmetrical about zero, bell-shaped, but more spread out than the normal distribution.
Using the t-test, we can compare the means of two samples.
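One common way to compare two samples is Welch's form of the two-sample t statistic. The text does not say which variant is meant, so that choice is an assumption here, and the data and the function name `welch_t` are hypothetical. The sketch only computes the statistic; turning it into a p-value requires the t-distribution with the appropriate degrees of freedom.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Two-sample t statistic that does not assume equal variances (Welch)."""
    mean_a = statistics.fmean(sample_a)
    mean_b = statistics.fmean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(sample_b)
    std_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_error

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
group_b = [4.4, 4.6, 4.9, 4.3, 4.8, 4.5]

t_stat = welch_t(group_a, group_b)
print("t statistic:", t_stat)
```

A statistic far from zero suggests that the two sample means differ by more than their sampling variability would explain.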
Conditions for Student T-Test
Sample size less than 15:
Use t-test if the data are close to normal. If the data are non-normal or outliers are present, do not use t-procedures.
Sample size at least 15:
T-test can be used except in the presence of outliers or strong skewness
T-test can be used even for skewed distributions when the sample is large (greater than or equal to 30).
The larger the sample size, the more closely the distribution of the sample means approaches normality and the more closely the sample standard deviation (s) approaches the population standard deviation (σ).
As the degrees of freedom increase, the t-distribution tends towards the standard normal distribution.
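This convergence can be seen by comparing the two density functions directly. The sketch below is illustrative: it evaluates the textbook density of the t-distribution with `math.gamma` and compares it with the standard normal density; the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# The gap at the peak shrinks as the degrees of freedom grow.
gaps = []
for df in (1, 5, 30, 100):
    gap = abs(t_pdf(0.0, df) - std_normal_pdf(0.0))
    gaps.append(gap)
    print("df =", df, "gap at x = 0:", gap)

# With few degrees of freedom the tails are heavier than the normal tails.
print("t tail:", t_pdf(3.0, 2), "normal tail:", std_normal_pdf(3.0))
```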
Chi Squared Test:
1. It tells how closely the distribution of a categorical variable matches an expected distribution (goodness of fit).
2. It also checks whether two categorical variables are independent of each other or not (test of independence).
3. It is based on frequencies and is independent of parameters like the mean and standard deviation.
Goodness of Fit
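A goodness-of-fit check compares observed frequencies with expected ones. The sketch below computes the usual chi-squared statistic, the sum of (observed − expected)² / expected over all categories. The die-roll counts and the function name `chi_square_statistic` are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 60 rolls of a die; a fair die expects 10 per face.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

stat = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

# 11.07 is the tabulated 5% critical value for 5 degrees of freedom.
critical_value = 11.07
print("chi-squared statistic:", stat)
print("reject fairness at the 5% level:", stat > critical_value)
```

Only frequencies enter the calculation, which matches the point that the test does not depend on parameters such as the mean or standard deviation.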
The binomial distribution is a kind of discrete probability distribution. It is used when each trial of an experiment has exactly two possible outcomes and the trial is repeated a fixed number of times; for example, tossing a coin gives two outcomes. These outcomes are labeled as “head” and “tail.”
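For the coin example, the probability of seeing k heads in n tosses follows the standard binomial formula C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates it; the values n = 10 and p = 0.5 and the function name `binomial_pmf` are assumptions made for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Assumed setup: 10 tosses of a fair coin.
n, p = 10, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(probs):
    print("P(heads =", k, ") =", round(prob, 4))
```

The probabilities over all possible counts from 0 to n add up to 1.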
The Bernoulli distribution is a type of discrete probability distribution. It considers a random experiment that has only two outcomes, 1 ("success") and 0 ("failure"), with complementary probabilities p and 1−p respectively.
For example, a single toss of a coin gives either “1” (success, a head) or “0” (failure, a tail).
P(Success) = p, P(Failure) = 1 − p
Let X = 1 on success and X = 0 on failure.
Then the probability distribution function is given as:
P(X = x) = p^x (1 − p)^(1 − x), for x = 0 or 1
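With X = 1 for success and X = 0 for failure, the single-trial distribution can be written compactly as p^x (1 − p)^(1 − x). The sketch below evaluates that expression; the value p = 0.3 and the function name `bernoulli_pmf` are hypothetical.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a single trial: p when x is 1, 1 - p when x is 0."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    return p ** x * (1 - p) ** (1 - x)

# Assumed success probability, for illustration only.
p = 0.3
p_success = bernoulli_pmf(1, p)
p_failure = bernoulli_pmf(0, p)

print("P(X = 1) =", p_success)
print("P(X = 0) =", p_failure)
```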
The Poisson distribution is used to find the probability of a given number of events occurring in a certain period.
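The description of counting events in a fixed period matches the Poisson distribution; the source line does not name it, so treating it as Poisson is an assumption. Under that assumption, the sketch below evaluates the standard formula e^(−λ) · λ^k / k!, where λ is the average number of events per period. The rate of 3 events per period and the function name `poisson_pmf` are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events in a period when the average is lam events."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Assumed average rate: 3 events per period.
lam = 3.0
for k in range(6):
    print("P(events =", k, ") =", round(poisson_pmf(k, lam), 4))

# Probability of at most 2 events in the period.
p_at_most_two = sum(poisson_pmf(k, lam) for k in range(3))
print("P(events <= 2) =", round(p_at_most_two, 4))
```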
Looking forward to valuable suggestions from all of you.
Thank you for reading.
Happy learning !!!