A Complete Guide to Probability and Statistics for Data Science from Zero to Hero [Part 1]

#statistics #probability

Sanket Kangle Sept 19 2020 · 7 min read

Hey you! Welcome!

  • Have you forgotten all the stats you learned in high school?
  • Have you ever felt that statistics is not your cup of tea?
  • You want to pursue data science and don’t know where to start?
  • You never learned statistics but want to learn?
  • Want to know probability and statistics from scratch?
    If your answer to any of the questions above is yes, then this post is for you.

    Table of contents [Part 1]

  • Random Variable
  • Population, Sample and Sampling Error
  • Central Tendency and its Measures
  • Dispersion and its Measures
  • All About Probability
    Random Variable

    A random variable is a numerical outcome of an experiment. It is also called a random quantity, aleatory variable, or stochastic variable. The set of all possible outcomes of a particular experiment is called the sample space, and a random variable assigns a number to each of those outcomes. A random variable must be measurable so that probabilities can be assigned to its potential values. The outcome of an experiment depends on uncertain variables in the environment.

    Image from author

    There are two types of random variables:

    Discrete Random Variable (DRV):
    A discrete random variable can take only distinct, countable values, typically whole numbers, e.g. the number of students in a class (15, 16, 17) or the number of babies in a household.

    Continuous Random Variable (CRV):
    As the name suggests, a CRV can take any value within a range, e.g. height (160, 160.2, 160.26), temperature, or weight.
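    The difference between the two types can be seen in a small simulation. This is only a sketch: the die and the 150 to 190 cm height range are assumptions chosen for illustration, not values from the article.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Discrete random variable: a die roll can only be one of six whole numbers.
die_roll = random.randint(1, 6)

# Continuous random variable: a height can be any real number in a range.
# The 150-190 cm bounds are made up for this illustration.
height_cm = random.uniform(150.0, 190.0)

print(f"die roll (discrete): {die_roll}")
print(f"height in cm (continuous): {height_cm:.2f}")
```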

    Population, Sample and Sampling Error


    A population is the entire set of data that we wish to draw conclusions about. Its size is denoted by “N”. A parameter is a measure that describes the whole population.


    A sample is a subset of the population. Its size is denoted by “n”. In general, it is impractical to collect data for the entire population, so we rely on samples, which are small, manageable, and cost-effective, and which should be representative of the whole population. A statistic is a measure that describes the sample.

      credit: icon used made by humans-2 from flaticon.com

    Sampling Error

    Sampling error is the difference between a population parameter and the corresponding sample statistic. It arises because a randomly selected sample never represents the entire population perfectly. In general, the aim is to generalize the findings from the sample to the entire population, so we need the sampling error to be low. One way of doing that is to increase the sample size; another is to select the sample in such a way that it represents the population well.
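    A rough way to see the effect of sample size is to simulate it. The sketch below assumes a synthetic population of 100,000 heights (made-up numbers, not real data) and a hypothetical helper, mean_abs_sampling_error, that averages the gap between the sample mean and the population mean over repeated random samples.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic heights in cm (not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
mu = statistics.fmean(population)  # population parameter (the true mean)


def mean_abs_sampling_error(n, trials=200):
    """Average |sample mean - population mean| over repeated samples of size n."""
    errors = []
    for _ in range(trials):
        sample = random.sample(population, n)
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)


small_sample_error = mean_abs_sampling_error(10)
large_sample_error = mean_abs_sampling_error(1000)

print(f"average sampling error, n = 10:   {small_sample_error:.3f}")
print(f"average sampling error, n = 1000: {large_sample_error:.3f}")
```

    On a typical run the error for n = 1000 comes out far smaller than for n = 10, which is the effect of increasing the sample size described above.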

    Central Tendency and its Measures

    A central tendency is a central or typical value for a probability distribution. It is a summary statistic. The following are the most general measures of central tendency:


    The mean is the average of all the elements of the dataset.

    The mean is easily affected by outliers.


    The median is the middle element of a sorted dataset. If the dataset has an even number of elements, the median is the average of the two middle elements. The median is more resistant to outliers than the mean.


    The most frequent entry in the dataset is the mode. This is the only central tendency measure that can be used in the case of nominal and qualitative data.

    A natural question is: if we have the mean, why do we need the median or the mode? Let me answer this with an example.

    dataset A = [1, 3, 5, 5, 6, 100]
    The mean of dataset A is 20, but is it the best representative of our dataset? Of course not. The number 100, which lies far away from all the other elements of the dataset (such a value is called an outlier), skews the mean. Here the median is 5 and the mode is 5, and both represent the central tendency of the dataset better than the mean.

    Depending on the nature of the dataset, one should decide which quantity can describe the central tendency in a better way.
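    The three measures for dataset A can be checked with Python's built-in statistics module. This is a minimal sketch; the variable names are chosen for illustration.

```python
import statistics

dataset_a = [1, 3, 5, 5, 6, 100]

mean_a = statistics.mean(dataset_a)      # sum of elements / number of elements
median_a = statistics.median(dataset_a)  # average of the two middle elements (5 and 5)
mode_a = statistics.mode(dataset_a)      # most frequent element

print(mean_a)    # 20
print(median_a)  # 5.0
print(mode_a)    # 5
```

    Only the mean is pulled towards the outlier 100; the median and the mode stay with the bulk of the data.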

    Dispersion and its Measures

    Dispersion is the spread of the data: the extent to which a distribution is stretched or squeezed. Measures of central tendency alone are not enough to understand the nature of a dataset, hence we need dispersion measures. Together, central tendency and dispersion measures paint a good picture of the dataset.

    The following are the most common dispersion measures:


    The range is nothing but the difference between the maximum value and minimum value. It tells us within what range the whole dataset lies. The range is very easy and straightforward to calculate but at the same time, it is too sensitive to outliers.

    R = max(dataset) - min(dataset)


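    Variance is a measure of the spread of the data. It is the expectation of the squared deviation of a random variable from its mean. For a sample, it is calculated as

    s² = Σ(x_i − x-bar)² / (n − 1)

    Now, let us try to understand by intuition, without going into the mathematics too much, why we divide by “n-1” instead of “n”.
    The ideal way to calculate the variance from the sample dataset would be to measure the deviations from the population mean ‘mu’:

    Σ(x_i − mu)² / n

    But we do not know ‘mu’, hence we use ‘x-bar’ instead. ‘x-bar’ is calculated using only the sample dataset, so it sits at the center of the sample points, and in the real world

    Σ(x_i − x-bar)² ≤ Σ(x_i − mu)²

    Dividing the left-hand side by “n” would therefore underestimate the spread on average. Hence we use “n-1” in the denominator to calculate an unbiased estimate of the variance from the sample.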
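    The effect of the denominator can also be checked by simulation. The sketch below is an illustration under assumptions of its own: the population is a fair six-sided die (so the true variance is known), and 5,000 samples of size 5 are drawn. statistics.pvariance divides by n and statistics.variance divides by n - 1.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

faces = [1, 2, 3, 4, 5, 6]
true_variance = statistics.pvariance(faces)  # 35/12, about 2.92

n, trials = 5, 5_000
divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = [random.choice(faces) for _ in range(n)]
    divide_by_n.append(statistics.pvariance(sample))         # denominator n
    divide_by_n_minus_1.append(statistics.variance(sample))  # denominator n - 1

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(f"true variance:              {true_variance:.3f}")
print(f"average when dividing by n:   {avg_n:.3f}")
print(f"average when dividing by n-1: {avg_n_minus_1:.3f}")
```

    Averaged over many samples, the “n” version settles noticeably below the true variance, while the “n-1” version lands close to it.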

    Standard Deviation:

    The unit of variance is the square of the unit of the data (because the deviations are squared), so its value is not very intuitive. Hence we take its square root, which is defined as the standard deviation.

  • If the standard deviation is small, the data has little spread (i.e., the majority of points fall very near the mean).
  • If standard deviation = 0, there is no spread. This only happens when all data items are the same value.
  • The standard deviation is significantly affected by outliers and skewed distributions.
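    The dispersion measures can be computed for the earlier dataset A = [1, 3, 5, 5, 6, 100] with the statistics module. This is a minimal sketch; the second and third lists are made up here to illustrate the effect of the outlier and the zero-spread case.

```python
import statistics

dataset_a = [1, 3, 5, 5, 6, 100]

data_range = max(dataset_a) - min(dataset_a)      # max - min
sample_variance = statistics.variance(dataset_a)  # divides by n - 1
sample_std = statistics.stdev(dataset_a)          # square root of the sample variance

# Same data without the outlier 100: the spread collapses.
std_without_outlier = statistics.stdev([1, 3, 5, 5, 6])

# All items identical: no spread at all.
constant_std = statistics.pstdev([4, 4, 4, 4])

print(data_range)                  # 99
print(round(sample_variance, 1))   # 1539.2
print(round(sample_std, 2))
print(std_without_outlier)         # 2.0
print(constant_std)                # 0.0
```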
    All About Probability

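    Probability is a numerical description of how likely an event is to occur. When all outcomes are equally likely, it is the ratio of the number of ways the event can occur to the total number of possible outcomes:

    P(event) = number of favorable outcomes / total number of possible outcomes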

    Intuition becomes clear with examples. Let’s see an example of a fair coin toss.

    Image from author

    Let’s see an example of a die:

    Image from author

    Total possible outcomes, formally called sample space(S) = {1, 2, 3, 4, 5, 6}
    total number(count) of possible outcomes = n(S) = 6
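    For a fair die, where every outcome is equally likely, probabilities can be computed directly by counting. The sketch below uses a hypothetical helper named probability and exact fractions; the events chosen are just examples.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # S for a single roll of a fair die


def probability(event):
    """P(event) for equally likely outcomes: favorable outcomes / n(S)."""
    return Fraction(len(event & sample_space), len(sample_space))


p_three = probability({3})           # rolling a 3
p_even = probability({2, 4, 6})      # rolling an even number
p_anything = probability(sample_space)

print(p_three)     # 1/6
print(p_even)      # 1/2
print(p_anything)  # 1
```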

    A Venn diagram is also useful for visualizing probability, as shown in the image below.

    Image from author

    We can visualize the following results from the image above.

    From our discussion so far, here are some properties of probability:

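  • The probability of an event always lies between 0 and 1, i.e. 0 ≤ P(X) ≤ 1.
  • For a sample space S = {x_1, x_2, … , x_n}, the probabilities of all outcomes add up to 1: P(x_1) + P(x_2) + … + P(x_n) = 1.
  • Event A and Event C in the above image are called mutually exclusive events: they cannot occur together, so P(A and C) = 0.
  • Addition rule of probability: P(A or B) = P(A) + P(B) - P(A and B). In strictly mathematical notation, the same is written as P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
  • These set notations are pronounced as follows: A ∪ B is “A union B” (A or B), and A ∩ B is “A intersection B” (A and B).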
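    The addition rule can be verified on a single die roll with Python sets. The events below (even, greater_than_3, odd) are examples invented for this sketch and are not the events A, B, C from the image.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}


def probability(event):
    """P(event) for equally likely outcomes on one die roll."""
    return Fraction(len(event), len(sample_space))


even = {2, 4, 6}
greater_than_3 = {4, 5, 6}
odd = {1, 3, 5}

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = probability(even | greater_than_3)
p_by_rule = (probability(even) + probability(greater_than_3)
             - probability(even & greater_than_3))
print(p_union, p_by_rule)  # 2/3 2/3

# Mutually exclusive events share no outcomes, so P(A and B) = 0
# and the rule reduces to P(A or B) = P(A) + P(B).
p_even_and_odd = probability(even & odd)
p_even_or_odd = probability(even | odd)
print(p_even_and_odd, p_even_or_odd)  # 0 1
```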

    Multiplication rule of probability: the probability that two independent events A and C both occur is the product of their individual probabilities.

    P(A and C) = P(A) · P(C) (this is true only when A and C are independent events; here A and C stand for any two independent events, not the mutually exclusive pair from the image)

    E.g. the probability of getting a 1 on the first roll and a 6 on the second roll of a die is P(1) · P(6) = (1/6) · (1/6) = 1/36.
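    The 1/36 figure can be confirmed by listing every outcome of two consecutive rolls. This is a small sketch that assumes a fair die and independent rolls.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
two_rolls = list(product(faces, repeat=2))  # all 36 equally likely (first, second) pairs

# Count the outcomes where the first roll is 1 and the second roll is 6.
favorable = [(first, second) for first, second in two_rolls
             if first == 1 and second == 6]
p_joint = Fraction(len(favorable), len(two_rolls))

# Multiplication rule for independent events.
p_product = Fraction(1, 6) * Fraction(1, 6)

print(p_joint)    # 1/36
print(p_product)  # 1/36
```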

  • Complement rule: P(A’) = 1 - P(A), where A’ is the event that A does not occur.
    Bayes theorem

    In simple words, Bayes theorem tells us how new information can change our perception of the likelihood of an event.

    More technically, it tells us how often B occurs when event A is true [P(B|A)], when we know:
    how often an event A occurs when event B is true [P(A|B)]
    how likely A is on its own [P(A)]
    how likely B is on its own [P(B)]
    with the following formula:

    P(B|A) = P(A|B) · P(B) / P(A)

    Confused enough? Don’t worry, just hang in there; everything will be crystal clear in just a few minutes.

    Let’s go back to the simple words, where we said:

    Bayes theorem tells us how new information can change our perception of the likelihood of an event.

    It is easy to understand Bayes theorem with an example. Let’s take an example of the Corona pandemic (Note: all numbers in this example are chosen for ease of calculation and are not true statistics of the actual pandemic). Assume the corona test is 99% accurate (it correctly identifies 99% of infected people and 99% of healthy people) and you have tested positive. Now, what is the probability that you have corona? It is .99 or 99%, right?

    Now let’s assume you read a new study that claims that 1% of the total population has corona. Does this new information change the probability that you have corona, given that you tested positive?

    For ease of calculation, let’s assume a total of 10,000 tests are done randomly (Note: in reality, tests are not done randomly; they are done according to a doctor’s suggestion, based on symptoms and contact with corona patients).

    Image from author

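    The probability of actually being corona positive, given that the test result is positive, can be calculated as follows: out of 10,000 people, 100 (1%) have corona and 99 of them test positive, while 99 of the 9,900 healthy people (1%) also test positive.

    P(corona | positive) = true positives / (true positives + false positives)

    This comes to (99)/(99+99) = .50 or 50%. See, just due to the new information, the likelihood dropped from 99% to 50%. This is basically what Bayes theorem tells us with that scary formula above. The same result comes from the formula as well, check it yourself!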
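    Both routes to the 50% figure, counting people and applying the formula, can be sketched in a few lines. The sketch assumes, as the example does, that “99% accurate” means the test catches 99% of infected people and wrongly flags 1% of healthy people; the variable names are illustrative.

```python
prevalence = 0.01           # P(corona): 1% of the population is infected
sensitivity = 0.99          # P(positive | corona)
false_positive_rate = 0.01  # P(positive | no corona), assumed from "99% accurate"

# Route 1: count people among 10,000 random tests.
tested = 10_000
infected = tested * prevalence                   # about 100 people
healthy = tested - infected                      # about 9,900 people
true_positives = infected * sensitivity          # about 99 people
false_positives = healthy * false_positive_rate  # about 99 people
posterior_from_counts = true_positives / (true_positives + false_positives)

# Route 2: Bayes theorem, P(corona | positive) = P(positive | corona) * P(corona) / P(positive)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
posterior_from_formula = sensitivity * prevalence / p_positive

print(f"{posterior_from_counts:.2f}")   # 0.50
print(f"{posterior_from_formula:.2f}")  # 0.50
```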


    See, I told you that you would understand the intuition in a couple of minutes, didn’t I?

    Permutations and combinations

    A combination is a way of selecting “r” objects from “n” distinct objects (r ≤ n) where the order of selection does not matter. A permutation is a selection in which the order, or arrangement, of the selected objects does matter.

    When I say the total number of ways I can select “r” balls from “n” distinct balls, I am talking about combinations, and when I say the total number of ways I can arrange “r” balls from “n” distinct balls, I am talking about permutations. Not clear yet? Look at this example: if I have 3 (n) balls A, B, C and I select all 3 (r) balls from them, then

    combinations: {ABC}

    permutations: {ABC, ACB, BAC, BCA, CAB, CBA}

    In general, for cases like the above,

    #(permutations) = r! × #(combinations)

    In mathematical notation,

    nPr = n! / (n − r)!
    nCr = n! / (r! (n − r)!)

    P: Permutations
    C: Combinations

    (The proofs of these formulas are out of the scope of this post. We can arrive at them by simply counting the possible arrangements, give it a try!)
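    The ball example and the relation between permutations and combinations can be checked with the standard library. This sketch assumes Python 3.8 or later, where math.perm and math.comb are available; the second example (n = 5, r = 2) is made up for illustration.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

balls = "ABC"
r = 3

# Enumerate the selections explicitly for the 3-ball example.
all_combinations = ["".join(c) for c in combinations(balls, r)]
all_permutations = ["".join(p) for p in permutations(balls, r)]
print(all_combinations)  # ['ABC']
print(all_permutations)  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

# Count without enumerating: choose or arrange 2 balls out of 5.
n_balls, r_chosen = 5, 2
n_perm = perm(n_balls, r_chosen)  # n! / (n - r)!
n_comb = comb(n_balls, r_chosen)  # n! / (r! (n - r)!)
print(n_perm, n_comb)             # 20 10

# #(permutations) = r! x #(combinations)
print(n_perm == factorial(r_chosen) * n_comb)  # True
```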

    As soon as I finish Part 2, I will add a link here.

    Thanks for reading the article! Wanna connect with me?
    Here is a link to my Linkedin Profile
