5.1 Statistics
Detailed Theory: Statistics
1. Introduction to Statistics
1.1 What is Statistics?
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data.
Two Main Branches:
Descriptive Statistics: Methods for summarizing and describing data (mean, median, graphs, etc.)
Inferential Statistics: Methods for making predictions or inferences about a population based on sample data
1.2 Basic Terminology
a) Data
Information collected for analysis.
Types of Data:
Qualitative Data: Descriptive/non-numerical data (colors, gender, yes/no)
Quantitative Data: Numerical data that can be measured
Discrete Data: Countable values (number of students, cars)
Continuous Data: Measurable values (height, weight, temperature)
b) Population vs Sample
Population: Complete set of all items/individuals of interest
Sample: A subset of the population selected for study
c) Variable
A characteristic that can take different values.
Example: In a study of students: age, height, marks are variables
d) Parameter vs Statistic
Parameter: Numerical measure describing a population characteristic (denoted by Greek letters: $\mu$, $\sigma$)
Statistic: Numerical measure describing a sample characteristic (denoted by Roman letters: $\bar{x}$, $s$)
2. Data Collection and Organization
2.1 Methods of Data Collection
Direct Observation: Watching and recording
Experiments: Controlled conditions
Surveys/Questionnaires: Asking questions
Interviews: Face-to-face questioning
Secondary Data: Using existing data
2.2 Frequency Distribution
A table showing how often each value or range of values occurs.
a) For Ungrouped Data
Example: Test scores: 5, 7, 8, 5, 9, 7, 5, 8, 7, 7
Frequency Table:
| Score | Frequency |
|---|---|
| 5 | 3 |
| 7 | 4 |
| 8 | 2 |
| 9 | 1 |
| Total | 10 |
b) For Grouped Data
When data has many different values, we group them into classes.
Example: Heights of 50 students (in cm)
| Height (cm) | Frequency |
|---|---|
| 150-155 | 5 |
| 155-160 | 10 |
| 160-165 | 15 |
| 165-170 | 12 |
| 170-175 | 8 |
| Total | 50 |
2.3 Types of Frequency
1. Absolute Frequency: Simple count (denoted by f)
2. Relative Frequency: Proportion or percentage
$$\text{Relative Frequency} = \frac{\text{Frequency of class}}{\text{Total frequency}}$$
3. Cumulative Frequency: Running total of frequencies
Less than type: Cumulative frequency up to upper limit of each class
More than type: Cumulative frequency from lower limit of each class
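As a minimal sketch, the three kinds of frequency can be computed for the ungrouped test scores from section 2.2 using only Python's standard library. The variable names are illustrative, and the cumulative column is a plain running total.

```python
from collections import Counter
from itertools import accumulate

scores = [5, 7, 8, 5, 9, 7, 5, 8, 7, 7]

# Absolute frequency: count of each distinct value, in ascending order
counts = Counter(scores)
values = sorted(counts)
absolute = [counts[v] for v in values]

# Relative frequency: each count divided by the total frequency
total = sum(absolute)
relative = [f / total for f in absolute]

# Cumulative frequency: running total of the counts
cumulative = list(accumulate(absolute))

for v, f, rf, cf in zip(values, absolute, relative, cumulative):
    print(f"{v}: f={f}, relative={rf:.1f}, cumulative={cf}")
```

The last cumulative value always equals the total frequency, which is a quick check on the table.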
2.4 Class Interval Details
Class Limits: Lower and upper bounds of a class
Class Boundaries: True limits (for continuous data)
If classes are 150-154, 155-159, etc., boundaries are 149.5-154.5, 154.5-159.5
Class Width: Difference between upper and lower boundaries
Class Mark (Midpoint): Average of class limits
$$\text{Class Mark} = \frac{\text{Lower limit} + \text{Upper limit}}{2}$$
3. Measures of Central Tendency
These are single values that represent the center of a data set.
3.1 Mean (Average)
a) Arithmetic Mean for Ungrouped Data
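For $n$ values $x_1, x_2, \ldots, x_n$:
$$\text{Mean} = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example: Find mean of: 5, 8, 12, 15, 10
$$\bar{x} = \frac{5 + 8 + 12 + 15 + 10}{5} = \frac{50}{5} = 10$$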
b) Arithmetic Mean for Grouped Data
For grouped data with frequencies:
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
where $x_i$ = class mark, $f_i$ = frequency of $i$-th class
Example: Find mean from:
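| Class | Frequency ($f_i$) | Class mark ($x_i$) | $f_i x_i$ |
|---|---|---|---|
| 0-10 | 5 | 5 | 25 |
| 10-20 | 8 | 15 | 120 |
| 20-30 | 12 | 25 | 300 |
| 30-40 | 5 | 35 | 175 |
| Total | 30 | | 620 |

$$\bar{x} = \frac{620}{30} = 20.67$$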
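The grouped-mean calculation above can be sketched in a few lines of Python. The function name and the `(lower, upper, frequency)` tuple layout are assumptions made for this illustration.

```python
# (lower limit, upper limit, frequency) for each class in the example
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 5)]

def grouped_mean(classes):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class mark."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f

mean = grouped_mean(classes)
print(round(mean, 2))  # 20.67
```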
c) Assumed Mean Method (Shortcut)
For large numbers, use:
$$\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i}$$
where $A$ = assumed mean, $d_i = x_i - A$
d) Step Deviation Method
When class intervals are equal:
$$\bar{x} = A + h \times \frac{\sum f_i u_i}{\sum f_i}$$
where $h$ = class width, $u_i = \frac{x_i - A}{h}$
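A short sketch can confirm that the assumed mean and step deviation methods give the same result as the direct method. It reuses the class marks and frequencies from the grouped-mean example; the choice of A = 25 is arbitrary and made only for illustration.

```python
marks = [5, 15, 25, 35]   # class marks x_i
freqs = [5, 8, 12, 5]     # frequencies f_i
A = 25                    # assumed mean (an arbitrary, convenient class mark)
h = 10                    # common class width

n = sum(freqs)

# Assumed mean method: d_i = x_i - A
d = [x - A for x in marks]
mean_assumed = A + sum(f * di for f, di in zip(freqs, d)) / n

# Step deviation method: u_i = (x_i - A) / h
u = [(x - A) / h for x in marks]
mean_step = A + h * sum(f * ui for f, ui in zip(freqs, u)) / n

# Direct method for comparison
mean_direct = sum(f * x for f, x in zip(freqs, marks)) / n

print(round(mean_assumed, 2), round(mean_step, 2), round(mean_direct, 2))
```

All three values agree because the shortcuts only shift and rescale the class marks before averaging.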
3.2 Median
The middle value when data is arranged in order.
a) For Ungrouped Data
Step 1: Arrange data in ascending order
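Step 2: If $n$ is odd: Median = $\left(\frac{n+1}{2}\right)$-th term
If $n$ is even: Median = $\frac{\left(\frac{n}{2}\right)\text{-th term} + \left(\frac{n}{2}+1\right)\text{-th term}}{2}$
Example 1 (odd): 3, 7, 1, 9, 5 → Arrange: 1, 3, 5, 7, 9
$n = 5$ (odd), Median = $\left(\frac{5+1}{2}\right)$-th term = 3rd term = 5
Example 2 (even): 4, 8, 2, 6 → Arrange: 2, 4, 6, 8
$n = 4$ (even), Median = $\frac{\text{2nd term} + \text{3rd term}}{2} = \frac{4 + 6}{2} = 5$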
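The two-step rule can be written as a small function. This is a sketch that assumes a non-empty list; the standard library's `statistics.median` does the same job.

```python
def median(data):
    """Median of ungrouped data: middle term for odd n,
    average of the two middle terms for even n."""
    ordered = sorted(data)          # Step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]         # ((n+1)/2)-th term is index n//2 (0-based)
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 7, 1, 9, 5]))  # 5
print(median([4, 8, 2, 6]))     # 5.0
```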
b) For Grouped Data
For grouped data with cumulative frequency:
$$\text{Median} = L + \left(\frac{\frac{N}{2} - F}{f}\right) \times h$$
where:
L = lower boundary of median class
N = total frequency
F = cumulative frequency before median class
f = frequency of median class
h = class width
Median Class: First class with cumulative frequency ≥ $\frac{N}{2}$
Example: Find median from:
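| Class | Frequency ($f$) | Cumulative frequency (CF) |
|---|---|---|
| 0-10 | 5 | 5 |
| 10-20 | 8 | 13 |
| 20-30 | 12 | 25 |
| 30-40 | 5 | 30 |
| Total | 30 | |

$N = 30$, $\frac{N}{2} = 15$
Median class is 20-30 (first with CF ≥ 15)
$L = 20$, $F = 13$, $f = 12$, $h = 10$
$$\text{Median} = 20 + \left(\frac{15 - 13}{12}\right) \times 10 = 20 + \left(\frac{2}{12}\right) \times 10$$
$$= 20 + \frac{20}{12} = 20 + 1.67 = 21.67$$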
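The grouped median steps (find N/2, locate the median class, apply the formula) can be sketched as follows. The function name and argument layout are assumptions for this illustration, and the classes are taken to be given in ascending order with their true boundaries.

```python
from itertools import accumulate

def grouped_median(boundaries, freqs):
    """Median of grouped data: L + ((N/2 - F) / f) * h.

    boundaries: list of (lower, upper) class boundaries, ascending
    freqs: frequency of each class
    """
    N = sum(freqs)
    cum = list(accumulate(freqs))
    # Median class: first class whose cumulative frequency is >= N/2
    k = next(i for i, cf in enumerate(cum) if cf >= N / 2)
    L, upper = boundaries[k]
    F = cum[k - 1] if k > 0 else 0   # cumulative frequency before the median class
    f = freqs[k]
    h = upper - L
    return L + (N / 2 - F) / f * h

boundaries = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]
print(round(grouped_median(boundaries, freqs), 2))  # 21.67
```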
3.3 Mode
The value that occurs most frequently.
a) For Ungrouped Data
Simply the most frequent value.
Example: 3, 5, 7, 5, 2, 5, 9 → Mode = 5 (appears 3 times)
Note: Data can have no mode, one mode (unimodal), or multiple modes (bimodal, multimodal)
b) For Grouped Data
For grouped data:
$$\text{Mode} = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$$
where:
L = lower boundary of modal class
f1 = frequency of modal class
f0 = frequency of class before modal class
f2 = frequency of class after modal class
h = class width
Modal Class: Class with highest frequency
Example: Find mode from:
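| Class | Frequency |
|---|---|
| 0-10 | 5 |
| 10-20 | 8 |
| 20-30 | 12 |
| 30-40 | 5 |
| 40-50 | 3 |

Modal class is 20-30 (highest frequency 12)
$L = 20$, $f_1 = 12$, $f_0 = 8$, $f_2 = 5$, $h = 10$
$$\text{Mode} = 20 + \left(\frac{12 - 8}{2 \times 12 - 8 - 5}\right) \times 10 = 20 + \left(\frac{4}{24 - 13}\right) \times 10$$
$$= 20 + \left(\frac{4}{11}\right) \times 10 = 20 + \frac{40}{11} = 20 + 3.64 = 23.64$$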
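A minimal sketch of the grouped mode formula is shown below. Treating a missing neighbouring class as frequency 0 and picking the first class when several share the highest frequency are assumptions of this sketch, not rules stated in the text; the formula is also undefined when the denominator is zero.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: L + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    k = freqs.index(max(freqs))                       # modal class (first one, if tied)
    L, upper = boundaries[k]
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                 # class before; 0 if none (assumption)
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0    # class after; 0 if none (assumption)
    h = upper - L
    return L + (f1 - f0) / (2 * f1 - f0 - f2) * h

boundaries = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 5, 3]
mode = grouped_mode(boundaries, freqs)
print(round(mode, 2))  # 23.64
```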
3.4 Relationship between Mean, Median, Mode
For moderately skewed distributions:
$$\text{Mode} = 3 \times \text{Median} - 2 \times \text{Mean}$$
This is called the empirical relationship.
4. Measures of Dispersion (Variability)
These measure how spread out the data is.
4.1 Range
Simplest measure of dispersion:
$$\text{Range} = \text{Maximum value} - \text{Minimum value}$$
Limitation: Affected by extreme values, doesn't consider all data
4.2 Mean Deviation
Average of absolute deviations from central value.
a) Mean Deviation about Mean
For ungrouped data:
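$$\text{MD}(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n}$$
For grouped data:
$$\text{MD}(\bar{x}) = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i}$$
b) Mean Deviation about Median
For ungrouped data:
$$\text{MD}(\text{Med}) = \frac{\sum |x_i - \text{Median}|}{n}$$
For grouped data:
$$\text{MD}(\text{Med}) = \frac{\sum f_i |x_i - \text{Median}|}{\sum f_i}$$
c) Mean Deviation about Mode
For ungrouped data:
$$\text{MD}(\text{Mode}) = \frac{\sum |x_i - \text{Mode}|}{n}$$
For grouped data:
$$\text{MD}(\text{Mode}) = \frac{\sum f_i |x_i - \text{Mode}|}{\sum f_i}$$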
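All six formulas share one pattern: average the absolute deviations from a chosen central value, weighting by frequency for grouped data. The sketch below illustrates this with the data 5, 8, 12, 15, 10 and with the class marks and frequencies from the grouped-mean example; the function names are assumptions for this illustration.

```python
from statistics import mean, median

def mean_deviation(data, center):
    """Average absolute deviation of ungrouped data from a central value."""
    return sum(abs(x - center) for x in data) / len(data)

def grouped_mean_deviation(marks, freqs, center):
    """Frequency-weighted average absolute deviation for grouped data."""
    return sum(f * abs(x - center) for x, f in zip(marks, freqs)) / sum(freqs)

data = [5, 8, 12, 15, 10]
md_mean = mean_deviation(data, mean(data))
md_median = mean_deviation(data, median(data))

marks = [5, 15, 25, 35]
freqs = [5, 8, 12, 5]
grouped_center = sum(f * x for x, f in zip(marks, freqs)) / sum(freqs)
md_grouped = grouped_mean_deviation(marks, freqs, grouped_center)

print(md_mean, md_median, round(md_grouped, 2))
```

For this ungrouped data the mean and median are both 10, so the two mean deviations coincide.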
4.3 Variance and Standard Deviation
Most important measures of dispersion.
a) Variance ($\sigma^2$ or $s^2$)
Average of squared deviations from mean.
For ungrouped data:
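Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
For grouped data:
Population Variance: $\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i}$
Sample Variance: $s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{\left(\sum f_i\right) - 1}$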
b) Standard Deviation (σ or s)
Square root of variance. More interpretable as it has same units as data.
For ungrouped data:
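Population SD: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample SD: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
For grouped data:
Population SD: $\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{\sum f_i}}$
Sample SD: $s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\left(\sum f_i\right) - 1}}$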
c) Shortcut Formulas for Variance
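Direct Method: $\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2$
Step Deviation Method: $\sigma^2 = h^2 \times \left[\frac{\sum f_i u_i^2}{\sum f_i} - \left(\frac{\sum f_i u_i}{\sum f_i}\right)^2\right]$
where $u_i = \frac{x_i - A}{h}$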
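The difference between the population and sample formulas, and the equivalence of the shortcut, can be checked with Python's `statistics` module. The data set 5, 8, 12, 15, 10 is reused here purely for illustration.

```python
import math
import statistics

data = [5, 8, 12, 15, 10]

# Population formulas: divide by N
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Sample formulas: divide by n - 1
samp_var = statistics.variance(data)
samp_sd = statistics.stdev(data)

# Shortcut (direct method) for the population variance:
# mean of the squares minus the square of the mean
n = len(data)
shortcut = sum(x * x for x in data) / n - (sum(data) / n) ** 2

print(pop_var, samp_var)
print(math.isclose(shortcut, pop_var))  # True
```

The sum of squared deviations is 58 in both cases; only the divisor (5 or 4) changes, so the sample variance is always the larger of the two.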
4.4 Coefficient of Variation (CV)
Relative measure of dispersion, expressed as percentage:
$$\text{CV} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%$$
Used to compare variability of different data sets.
Lower CV means less variability relative to mean.
Example: Compare two data sets:
Set A: Mean = 50, SD = 5 → CV = $\frac{5}{50} \times 100\% = 10\%$
Set B: Mean = 100, SD = 15 → CV = $\frac{15}{100} \times 100\% = 15\%$
Set A is more consistent (lower CV).
4.5 Quartiles and Interquartile Range (IQR)
a) Quartiles
Divide data into four equal parts:
Q1 (First Quartile): 25th percentile
Q2 (Second Quartile): 50th percentile (same as median)
Q3 (Third Quartile): 75th percentile
b) For Ungrouped Data
To find $Q_1$: Value at position $\frac{n+1}{4}$
To find $Q_3$: Value at position $\frac{3(n+1)}{4}$
c) For Grouped Data
Similar to median formula:
$$Q_k = L + \left(\frac{\frac{kN}{4} - F}{f}\right) \times h$$
where $k = 1, 2, 3$ for $Q_1$, $Q_2$, $Q_3$
d) Interquartile Range (IQR)
$$\text{IQR} = Q_3 - Q_1$$
Measures spread of middle 50% of data.
e) Quartile Deviation (Semi-IQR)
$$\text{QD} = \frac{Q_3 - Q_1}{2}$$
f) Coefficient of Quartile Deviation
$$\text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
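The ungrouped quartile positions and the derived measures can be sketched as below. The text does not say what to do when a position is fractional; linear interpolation between neighbouring values is used here as an assumed convention. The data set is the one from the ungrouped mode example.

```python
def value_at_position(ordered, pos):
    """Value at a 1-based position in sorted data.
    Fractional positions are linearly interpolated (an assumed convention)."""
    lower = int(pos)                 # whole part of the position
    frac = pos - lower               # fractional part
    if lower < 1:
        return ordered[0]
    if lower >= len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def quartile_summary(data):
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    iqr = q3 - q1
    return {
        "Q1": q1,
        "Q3": q3,
        "IQR": iqr,
        "QD": iqr / 2,
        "coefficient_of_QD": iqr / (q3 + q1),
    }

summary = quartile_summary([3, 5, 7, 5, 2, 5, 9])
print(summary)
```

With seven values the positions are 2 and 6, so Q1 = 3 and Q3 = 7 in the sorted list 2, 3, 5, 5, 5, 7, 9.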
5. Graphical Representation of Data
5.1 Bar Graph
For categorical/discrete data. Bars with gaps between them.
Types:
Simple Bar Graph: One variable
Multiple Bar Graph: Compare multiple variables
Component Bar Graph: Shows parts of whole
5.2 Histogram
For continuous grouped data. Bars without gaps.
Area of bars represents frequency.
Key Points:
Classes must be continuous
If class intervals are unequal, adjust heights
5.3 Frequency Polygon
Line graph connecting midpoints of tops of histogram bars.
To draw: Plot points (class mark, frequency) and connect them.
5.4 Ogive (Cumulative Frequency Curve)
Graph of cumulative frequency.
Types:
Less than Ogive: Plot upper limits vs cumulative frequency (rising curve)
More than Ogive: Plot lower limits vs cumulative frequency (falling curve)
Median from Ogive: The x-coordinate of the point where the less than and more than ogives intersect gives the median
5.5 Pie Chart (Circle Graph)
Shows proportions as sectors of a circle.
Angle for each category: $\text{Angle} = \frac{\text{Frequency}}{\text{Total frequency}} \times 360^\circ$
5.6 Box Plot (Box-and-Whisker Plot)
Shows five-number summary: Minimum, Q1, Median, Q3, Maximum
Construction:
Draw box from Q1 to Q3
Draw line inside box at median
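Draw whiskers to the minimum and maximum values (or, when outliers are shown separately, to the most extreme data points within 1.5×IQR of the box, plotting any points beyond that individually)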
6. Correlation and Regression
6.1 Correlation
Measures strength and direction of linear relationship between two variables.
a) Types of Correlation
Positive Correlation: Both variables increase together
Negative Correlation: One increases, other decreases
No Correlation: No relationship
Perfect Correlation: All points lie on straight line
b) Karl Pearson's Correlation Coefficient (r)
Measures linear correlation:
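$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}}$$
Shortcut formula:
$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Properties of r:
$-1 \le r \le 1$
$r = 1$: Perfect positive correlation
$r = -1$: Perfect negative correlation
$r = 0$: No linear correlation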
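Both forms of Pearson's coefficient can be sketched directly from the formulas. The function names are assumptions for this illustration, and the five (x, y) pairs are a small sample data set.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_shortcut(xs, ys):
    """Same coefficient from raw sums (the shortcut formula)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
print(round(pearson_r(xs, ys), 3))           # 0.853
print(round(pearson_r_shortcut(xs, ys), 3))  # 0.853
```

The shortcut avoids computing the means first, which is why it is preferred for hand calculation.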
c) Spearman's Rank Correlation Coefficient (ρ)
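For ranked data or monotonic relationships that need not be linear:
$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ = difference in ranks
If ties exist: Give tied values the average of the ranks they span, and add a correction term of $\frac{m(m^2 - 1)}{12}$ to $\sum d_i^2$ for each group of $m$ tied values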
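A minimal sketch of the rank correlation formula, assuming the ranks are already assigned and contain no ties. The two judges' rankings are hypothetical data invented for this illustration.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho from two lists of ranks, assuming no tied ranks."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks given by two judges to five contestants
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
rho = spearman_rho(judge_a, judge_b)
print(round(rho, 2))  # 0.8
```

Here the rank differences are -1, 1, -1, 1, 0, so the sum of squared differences is 4 and ρ = 1 − 24/120.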
6.2 Regression
Finds relationship to predict one variable from another.
a) Regression Lines
Line of regression of y on x: Predicts y from x
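$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$$
Line of regression of x on y: Predicts x from y
$$x - \bar{x} = r\frac{\sigma_x}{\sigma_y}(y - \bar{y})$$
b) Regression Coefficients
Regression coefficient of y on x: $b_{yx} = r\frac{\sigma_y}{\sigma_x}$
Regression coefficient of x on y: $b_{xy} = r\frac{\sigma_x}{\sigma_y}$
Note: $b_{yx} \times b_{xy} = r^2$
c) Angle Between Regression Lines
If $\theta$ is angle between two regression lines:
$$\tan\theta = \frac{1 - r^2}{|r|} \cdot \frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}$$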
Special Cases:
If r=0: Lines are perpendicular
If r=±1: Lines coincide (angle = 0)
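The regression coefficients and the relation between them can be checked numerically. This sketch computes each coefficient as a ratio of deviation sums, which is algebraically the same as r times the ratio of standard deviations; the data set and the helper name `predict_y` are assumptions for this illustration.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)
b_yx = sxy / sxx   # equals r * sigma_y / sigma_x
b_xy = sxy / syy   # equals r * sigma_x / sigma_y

def predict_y(x):
    """Regression line of y on x: y - mean_y = b_yx * (x - mean_x)."""
    return mean_y + b_yx * (x - mean_x)

print(round(b_yx, 3), round(b_xy, 3))
print(math.isclose(b_yx * b_xy, r ** 2))  # True
```

Both regression lines pass through the point of means, so `predict_y(mean_x)` returns `mean_y`.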
7. Probability Basics (for Statistics)
7.1 Basic Concepts
Probability: Measure of likelihood of an event (0 to 1)
Sample Space (S): Set of all possible outcomes
Event (E): Subset of sample space
7.2 Probability Formulas
Classical Probability: $P(E) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}$
Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
For mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
Complement Rule: $P(A') = 1 - P(A)$
Multiplication Rule: For independent events: $P(A \cap B) = P(A) \times P(B)$
Conditional Probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
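These rules can be verified on a small sample space using Python sets and exact fractions. The die-rolling events A, B and C below are illustrative choices, not taken from the text.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (illustrative example)
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # event: even number
B = {4, 5, 6}        # event: greater than 3
C = {1, 2}           # event: at most 2

def P(event):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event), len(S))

# Addition rule
assert P(A | B) == P(A) + P(B) - P(A & B)
# Complement rule
assert P(S - A) == 1 - P(A)
# Conditional probability: P(A given B) = P(A and B) / P(B)
p_a_given_b = P(A & B) / P(B)
print(p_a_given_b)  # 2/3
# Multiplication rule holds for A and C, so they are independent
independent = P(A & C) == P(A) * P(C)
print(independent)  # True
```

Using `Fraction` keeps every probability exact, so the equalities can be tested without rounding error.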
8. Random Variables and Probability Distributions
8.1 Random Variable
Variable whose values depend on outcomes of random experiment.
Discrete Random Variable: Countable values
Continuous Random Variable: Measurable values
8.2 Probability Distribution
For discrete random variable $X$ with values $x_1, x_2, \ldots$ and probabilities $p_1, p_2, \ldots$:
Conditions: $0 \le p_i \le 1$ and $\sum p_i = 1$
8.3 Mean (Expected Value) of Discrete Random Variable
$$\mu = E(X) = \sum x_i p_i$$
8.4 Variance of Discrete Random Variable
$$\sigma^2 = E(X^2) - [E(X)]^2 = \sum x_i^2 p_i - \left(\sum x_i p_i\right)^2$$
8.5 Standard Deviation
$$\sigma = \sqrt{\text{Variance}}$$
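As a small worked sketch, take X to be the number of heads in two tosses of a fair coin (an example chosen for illustration). The code checks the two conditions and then applies the mean, variance and standard deviation formulas.

```python
import math

values = [0, 1, 2]              # number of heads in two fair coin tosses
probs = [0.25, 0.5, 0.25]

# Check the two conditions for a valid probability distribution
assert all(0 <= p <= 1 for p in probs)
assert math.isclose(sum(probs), 1.0)

mean = sum(x * p for x, p in zip(values, probs))                       # E(X)
variance = sum(x * x * p for x, p in zip(values, probs)) - mean ** 2   # E(X^2) - [E(X)]^2
sd = math.sqrt(variance)

print(mean, variance, round(sd, 4))  # 1.0 0.5 0.7071
```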
8.6 Binomial Distribution
For experiments with:
Fixed number of trials (n)
Two outcomes (success/failure)
Constant probability of success (p)
Independent trials
Probability Mass Function:
$P(X = r) = \binom{n}{r} p^r (1 - p)^{n - r}$ for $r = 0, 1, 2, \ldots, n$
where $\binom{n}{r} = \frac{n!}{r!(n - r)!}$
Mean: $\mu = np$
Variance: $\sigma^2 = np(1 - p)$
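The probability mass function can be sketched with `math.comb`, and the mean and variance formulas checked against the distribution itself. The values n = 4 and p = 0.5 are chosen only for illustration.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for a binomial distribution with n trials and success probability p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 4, 0.5   # illustrative: four tosses of a fair coin
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
print(pmf)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]

# Mean and variance computed from the distribution agree with np and np(1 - p)
mean = sum(r * pr for r, pr in enumerate(pmf))
variance = sum(r * r * pr for r, pr in enumerate(pmf)) - mean ** 2
print(mean, variance)  # 2.0 1.0
```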
8.7 Normal Distribution
Most important continuous distribution (bell curve).
Properties:
Bell-shaped, symmetric about mean
Mean = median = mode
Total area under curve = 1
Standard Normal Distribution: Mean = 0, SD = 1
Z-score: $z = \frac{x - \mu}{\sigma}$
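A z-score converts a raw value into a number of standard deviations from the mean, after which the standard normal curve gives the area to its left. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the exam marks are a hypothetical example.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical example: exam marks with mean 60 and standard deviation 10
z = z_score(75, 60, 10)
print(z)  # 1.5

# Area under the standard normal curve to the left of z
standard_normal = NormalDist(mu=0, sigma=1)
print(round(standard_normal.cdf(z), 4))  # 0.9332
```

So under these assumed values, a mark of 75 lies above roughly 93% of the distribution.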
9. Solved Examples
Example 1: Find Mean, Median, Mode
Data: 12, 15, 18, 12, 20, 15, 12, 25, 18
Solution:
Mean: $\bar{x} = \frac{12 + 15 + 18 + 12 + 20 + 15 + 12 + 25 + 18}{9} = \frac{147}{9} = 16.33$
Median: Arrange: 12, 12, 12, 15, 15, 18, 18, 20, 25
$n = 9$ (odd), Median = $\left(\frac{9+1}{2}\right)$-th term = 5th term = 15
Mode: 12 (appears 3 times, most frequent)
Example 2: Grouped Data Calculations
Given:
| Class | Frequency |
|---|---|
| 0-10 | 5 |
| 10-20 | 8 |
| 20-30 | 12 |
| 30-40 | 7 |
| 40-50 | 3 |
Find mean, median, mode, standard deviation.
Solution:
First prepare table:
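| Class | $f$ | $x$ | $fx$ | $fx^2$ | CF |
|---|---|---|---|---|---|
| 0-10 | 5 | 5 | 25 | 125 | 5 |
| 10-20 | 8 | 15 | 120 | 1800 | 13 |
| 20-30 | 12 | 25 | 300 | 7500 | 25 |
| 30-40 | 7 | 35 | 245 | 8575 | 32 |
| 40-50 | 3 | 45 | 135 | 6075 | 35 |
| Total | 35 | | 825 | 24075 | |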
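Mean: $\bar{x} = \frac{825}{35} = 23.57$
Median: $N = 35$, $\frac{N}{2} = 17.5$
Median class: 20-30 (CF reaches 25 at this class)
$L = 20$, $F = 13$, $f = 12$, $h = 10$
Median = $20 + \left(\frac{17.5 - 13}{12}\right) \times 10 = 20 + \frac{45}{12} = 23.75$
Mode: Modal class: 20-30 (highest $f = 12$)
$L = 20$, $f_1 = 12$, $f_0 = 8$, $f_2 = 7$, $h = 10$
Mode = $20 + \left(\frac{12 - 8}{24 - 8 - 7}\right) \times 10 = 20 + \frac{4}{9} \times 10 = 24.44$
Variance: $\sigma^2 = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2$
$= \frac{24075}{35} - \left(\frac{825}{35}\right)^2 = 687.86 - 555.61 \approx 132.2$
Standard Deviation: $\sigma = \sqrt{132.2} \approx 11.5$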
Example 3: Correlation Calculation
Find correlation coefficient for:
| $x$ | $y$ |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 4 | 4 |
| 5 | 6 |
Solution:
Prepare table:
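| $x$ | $y$ | $xy$ | $x^2$ | $y^2$ |
|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 4 |
| 2 | 4 | 8 | 4 | 16 |
| 3 | 5 | 15 | 9 | 25 |
| 4 | 4 | 16 | 16 | 16 |
| 5 | 6 | 30 | 25 | 36 |
| $\sum x = 15$ | $\sum y = 21$ | $\sum xy = 71$ | $\sum x^2 = 55$ | $\sum y^2 = 97$ |

$n = 5$, $\sum x = 15$, $\sum y = 21$, $\sum xy = 71$, $\sum x^2 = 55$, $\sum y^2 = 97$
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
$$= \frac{5 \times 71 - 15 \times 21}{\sqrt{\left[5 \times 55 - 15^2\right]\left[5 \times 97 - 21^2\right]}}$$
$$= \frac{355 - 315}{\sqrt{[275 - 225][485 - 441]}} = \frac{40}{\sqrt{50 \times 44}} = \frac{40}{\sqrt{2200}} = \frac{40}{46.90} = 0.853$$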
Strong positive correlation.
10. Important Formulas Summary
10.1 Measures of Central Tendency
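Mean (ungrouped): $\bar{x} = \frac{\sum x_i}{n}$
Mean (grouped): $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Median (grouped): $L + \left(\frac{\frac{N}{2} - F}{f}\right) \times h$
Mode (grouped): $L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times h$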
Empirical Relation: Mode = 3×Median - 2×Mean
10.2 Measures of Dispersion
Range: Max - Min
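Variance: $\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2$
Standard Deviation: $\sigma = \sqrt{\text{Variance}}$
Coefficient of Variation: $\text{CV} = \frac{\sigma}{\bar{x}} \times 100\%$
Quartiles: $Q_k = L + \left(\frac{\frac{kN}{4} - F}{f}\right) \times h$
IQR: $Q_3 - Q_1$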
10.3 Correlation and Regression
Correlation Coefficient:
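$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Regression Line (y on x): $y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x})$
Regression Coefficients: $b_{yx} = r\frac{\sigma_y}{\sigma_x}$, $b_{xy} = r\frac{\sigma_x}{\sigma_y}$
Relation: $b_{yx} \times b_{xy} = r^2$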
10.4 Probability Distributions
Binomial: $P(X = r) = \binom{n}{r} p^r (1 - p)^{n - r}$
Mean of Binomial: $np$
Variance of Binomial: $np(1 - p)$
11. Exam Tips and Common Mistakes
11.1 Common Mistakes to Avoid
Using wrong formula for grouped vs ungrouped data
Confusing population vs sample formulas (divide by n vs n-1 for variance)
Forgetting to arrange data before finding median
Incorrect class boundaries for grouped data
Misinterpreting correlation coefficient (correlation ≠ causation)
11.2 Problem-Solving Strategy
Identify data type: Ungrouped or grouped? Discrete or continuous?
Choose correct formulas based on what's asked
Create tables for organized calculations (especially for grouped data)
Show all steps clearly
Include units in final answer
11.3 Quick Checks
Mean, median, mode relationship: For symmetric data: Mean = Median = Mode
Standard deviation: Always non-negative
Correlation coefficient: Between -1 and 1
Probability: Between 0 and 1
Variance formulas: Population: divide by N, Sample: divide by n-1