Statistics & Data SL

IB Mathematics: Applications & Interpretation · Topic 4: Statistics & Probability

JMaths

www.jmaths.xyz

1

Measures of Central Tendency

Mean

\( \bar{x} = \Sigma x / n \)

For grouped data: \( \bar{x} = \Sigma fx / \Sigma f \) (use midpoints)

Median

Middle value when data is ordered.

If \( n \) even: average of the two middle values

Mode

Most frequently occurring value. A data set can have no mode, one mode, or multiple modes.

▶

When to use which: Mean is affected by outliers. Median is better for skewed data. Mode is useful for categorical data. The IB often asks you to justify which measure is most appropriate.
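
As a rough illustration of why the median is preferred when outliers are present, the sketch below compares the three measures on a made-up data set (the values in `scores` are hypothetical, not from the notes) with and without one extreme value.

```python
from statistics import mean, median, multimode

# Hypothetical data: nine typical values plus one outlier (95).
scores = [12, 14, 14, 15, 16, 17, 18, 19, 20, 95]

mean_with = mean(scores)        # pulled upward by the outlier
median_with = median(scores)    # average of the two middle values (n is even)
modes = multimode(scores)       # list of every most-frequent value

# Same data with the outlier removed, for comparison.
trimmed = scores[:-1]
mean_without = mean(trimmed)
median_without = median(trimmed)

print("with outlier:   ", mean_with, median_with, modes)
print("without outlier:", round(mean_without, 2), median_without)
```

Removing the single outlier changes the mean by several units but moves the median only slightly, which is the kind of justification the IB tends to ask for.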

Modal class (grouped data)

The class interval with the highest frequency.

For grouped data you give a modal class (e.g. \( 20 \leq x < 30 \)), not a single modal value.

2

Measures of Spread

Range

\( \text{Range} = \max - \min \)

Interquartile range (IQR)

\( \text{IQR} = Q_3 - Q_1 \)

Middle 50% of data. Not affected by outliers.

Standard deviation & variance

Use GDC to calculate. Measures how spread out data is from the mean.

For IB AI, quote the population standard deviation \( \sigma_x \) (divisor \( n \)) from 1-Var Stats — NOT \( S_x \) (the sample SD, divisor \( n-1 \)). All three GDCs display both symbols; read \( \sigma_x \).

Variance \( = \sigma_x^{2} \) (square of the standard deviation). The GDC gives \( \sigma_x \); square it for variance.
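
To see the difference between the two standard deviations a GDC reports, here is a minimal sketch using Python's `statistics` module on a hypothetical list `data`. `pstdev` uses divisor \( n \) (matching \( \sigma_x \)) and `stdev` uses divisor \( n-1 \) (matching \( S_x \)).

```python
from statistics import pstdev, pvariance, stdev

# Hypothetical data set (not from the notes).
data = [4, 7, 7, 9, 13]

sigma_x = pstdev(data)      # population SD, divisor n   -> the value to quote
s_x = stdev(data)           # sample SD, divisor n - 1   -> not the IB AI value
variance = pvariance(data)  # population variance, equal to sigma_x squared

print(round(sigma_x, 3), round(s_x, 3), variance)
```

For the same data the sample SD is always the larger of the two, so a quick size check can help confirm which value has been read off.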

▶

Outlier test: A value is an outlier if it is more than \( 1.5 \times \) IQR below \( Q_1 \) or above \( Q_3 \). E.g. outlier if value \( < Q_1 - 1.5 \times \text{IQR} \) or value \( > Q_3 + 1.5 \times \text{IQR} \).
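
The outlier test can be sketched as two small functions. The names `outlier_fences` and `is_outlier` and the quartile values are illustrative assumptions; in an exam the quartiles would come from the GDC.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if value lies strictly below the lower fence or above the upper fence."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper


# Hypothetical quartiles, as if read from 1-Var Stats.
q1, q3 = 12, 20
lower, upper = outlier_fences(q1, q3)

print(lower, upper)
print(is_outlier(35, q1, q3), is_outlier(32, q1, q3))
```

A value exactly on a fence is not counted as an outlier here, because the test uses strict inequalities.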

Quartiles & percentiles (discrete data)

\( Q_1 \) (lower quartile) is the 25th percentile; \( Q_3 \) (upper quartile) is the 75th percentile.

Read \( Q_1 \) and \( Q_3 \) straight from the GDC 1-Var Stats output. By-hand methods can differ, so quote the GDC values in IB exams. On a cumulative frequency curve, the \( k \)th percentile is read at a cumulative frequency of \( (k/100) \times n \).

TI-84 Plus CE — 1-Var Stats

Enter data in [STAT] → Edit (L1, optionally L2 for frequencies)
[STAT] → CALC → 1: 1-Var Stats → L1 (,L2 if freq)
Gives: \( \bar{x}, \Sigma x, \sigma_x, S_x, n, \) minX, \( Q_1, \) Med, \( Q_3, \) maxX
Read \( \sigma_x \) for the IB standard deviation, not \( S_x \).

TI-Nspire CX II — 1-Var Stats

Lists & Spreadsheet → enter data in column A (freq in column B if needed)
[Menu] → Statistics → Stat Calculations → One-Variable Statistics
Gives: \( \bar{x}, \sigma_x, S_x, n, Q_1, \) median, \( Q_3, \) etc.
Read \( \sigma_x \) for the IB standard deviation, not \( S_x \).

Casio fx-CG50 — 1-Var Stats

[MENU] → Statistics → enter data in List 1 (freq in List 2)
[CALC] (F2) → 1-VAR; in SET (F6) set 1Var Freq to List 2 if using frequencies
Gives: \( \bar{x}, \sigma_x, S_x, n, \) minX, \( Q_1, \) Med, \( Q_3, \) maxX
Read \( \sigma_x \) for the IB standard deviation, not \( S_x \).

3

Data Displays

Know how to read and draw: box plots, histograms, cumulative frequency curves, and frequency tables.

Box plot (5-number summary)

min, \( Q_1 \), median, \( Q_3 \), max

Box = IQR; whiskers to min/max (or to fence if showing outliers)

Cumulative frequency

Plot cumulative frequency (vertical axis) against the upper boundary of each class (horizontal axis).

Read off: median at \( n/2 \), \( Q_1 \) at \( n/4 \), \( Q_3 \) at \( 3n/4 \); the \( k \)th percentile at \( (k/100)\,n \)
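
The read-off positions can be sketched in code. The class boundaries and frequencies below are hypothetical, and `read_off` is an illustrative helper that joins the plotted points with straight lines; a hand-drawn smooth curve will give slightly different readings.

```python
from itertools import accumulate

# Hypothetical classes 10 <= x < 20, 20 <= x < 30, 30 <= x < 40.
boundaries = [10, 20, 30, 40]           # lowest boundary, then each upper boundary
frequencies = [5, 12, 8]

# Cumulative frequency is 0 at the lowest boundary, then a running total.
cumulative = [0] + list(accumulate(frequencies))
n = cumulative[-1]


def read_off(target, xs, cfs):
    """Linearly interpolate the x-value at which cumulative frequency equals target."""
    for x0, x1, c0, c1 in zip(xs, xs[1:], cfs, cfs[1:]):
        if c1 > c0 and c0 <= target <= c1:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the cumulative frequency range")


q1 = read_off(n / 4, boundaries, cumulative)
med = read_off(n / 2, boundaries, cumulative)
q3 = read_off(3 * n / 4, boundaries, cumulative)

print(cumulative, n)
print(round(q1, 2), round(med, 2), round(q3, 2))
```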

▶

Histogram vs bar chart: Histograms show continuous data with no gaps between bars. The area of each bar represents frequency. Bar charts show discrete or categorical data with gaps.

4

Sampling & Data Types

Population vs sample

Population = every item of interest. Sample = a subset that is actually studied.

Random sample: every member of the population is equally likely to be chosen.

Discrete vs continuous data

Discrete = counts (e.g. number of cars). Continuous = measured (e.g. height, time).

Continuous data can take any value in an interval; discrete data takes separate fixed values.

Sampling methods

Simple random: every member is equally likely to be chosen.
Systematic: every \( k \)th item from an ordered list.
Stratified: a random sample from each group, in proportion to the group's size.
Quota: a fixed number from each group, chosen non-randomly.
Convenience: whoever is easiest to reach.
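
The first three sampling methods can be sketched with Python's `random` module. The population of 100 numbered members, the two strata names and the sample size of 10 are all hypothetical choices made for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result each run

population = list(range(1, 101))  # hypothetical: members numbered 1 to 100
sample_size = 10

# Simple random: choose 10 distinct members, each equally likely.
simple = random.sample(population, sample_size)

# Systematic: random starting point, then every k-th member.
k = len(population) // sample_size
start = random.randrange(k)
systematic = population[start::k]

# Stratified: sample each group in proportion to its size (60% and 40% here).
strata = {"year_12": population[:60], "year_13": population[60:]}
stratified = []
for group in strata.values():
    share = round(sample_size * len(group) / len(population))
    stratified.extend(random.sample(group, share))

print(simple)
print(systematic)
print(stratified)
```

Quota and convenience sampling are not shown because they involve no random selection.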

▶

Bias: a sample is biased if it is not representative of the population (e.g. self-selected or convenience samples). Be ready to name a sampling method and give one reason a sample may be biased.

5

Effect of Constant Changes on Data

Add a constant \( k \) to every value

Mean, median, quartiles all increase by \( k \).

Range, IQR, standard deviation are unchanged (spread does not move when everything shifts).

Multiply every value by \( k \)

Mean, median and quartiles all multiply by \( k \). Range, IQR and SD multiply by \( |k| \).

Variance multiplies by \( k^{2} \).

Worked Example

A data set has mean \( 12 \) and standard deviation \( 3 \). Each value is multiplied by \( 2 \), then \( 5 \) is added. Find the new mean and standard deviation.

New mean \( = 2(12) + 5 = 29 \)

SD scales by the multiplier only (adding a constant has no effect): new SD \( = 2 \times 3 = 6 \)

Answer: mean \( = 29 \), standard deviation \( = 6 \)
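
The worked example can be checked numerically. The list `data` below is a hypothetical data set chosen only because it has mean 12 and population standard deviation 3; any other set with those statistics would behave the same way.

```python
from statistics import mean, pstdev

# Hypothetical data with mean 12 and population SD 3.
data = [9, 15, 9, 15]
multiplier, shift = 2, 5

# Multiply every value by 2, then add 5.
transformed = [multiplier * x + shift for x in data]

old_mean, old_sd = mean(data), pstdev(data)
new_mean, new_sd = mean(transformed), pstdev(transformed)

print(old_mean, old_sd)
print(new_mean, new_sd)
```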

6

Frequency Tables & Grouped Data

For grouped data, use the midpoint of each class to estimate the mean: midpoint = (lower + upper) / 2.

Worked Example

A grouped frequency table shows: 10–20 (freq 5), 20–30 (freq 12), 30–40 (freq 8). Estimate the mean.

Midpoints: 15, 25, 35

\( \Sigma fx = 5(15) + 12(25) + 8(35) = 75 + 300 + 280 = 655 \)

\( \Sigma f = 5 + 12 + 8 = 25 \)

\( \bar{x} = 655 / 25 = 26.2 \)

Answer: Estimated mean = 26.2
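
The same estimate can be reproduced in a few lines. This is a sketch of the midpoint method using the table from the example; the variable names are illustrative.

```python
# (lower boundary, upper boundary, frequency) for each class in the example.
classes = [(10, 20, 5), (20, 30, 12), (30, 40, 8)]

# Midpoint of each class = (lower + upper) / 2.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]

# Sum of f*x using midpoints, and total frequency.
sum_fx = sum(freq * mid for (_, _, freq), mid in zip(classes, midpoints))
sum_f = sum(freq for _, _, freq in classes)

estimated_mean = sum_fx / sum_f

print(midpoints, sum_fx, sum_f, estimated_mean)
```

On a GDC the equivalent step is entering the midpoints in one list and the frequencies in another before running 1-Var Stats.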

✗

Common error: Forgetting to use midpoints for grouped data. The value that represents the class "10–20" is NOT 10 or 20; it is the midpoint, 15. Also check whether the class means \( 10 \leq x < 20 \) or 10–19, because the midpoint changes.

7

Correlation & Regression

The Pearson correlation coefficient (\( r \)) measures the strength and direction of a linear relationship between two variables.

Interpreting \( r \)

\( r = 1 \): perfect positive linear
\( r = -1 \): perfect negative linear
\( r = 0 \): no linear relationship

\( |r| \geq 0.75 \): strong
\( 0.5 \leq |r| < 0.75 \): moderate
\( |r| < 0.5 \): weak

\( r^2 \) (coefficient of determination)

The proportion of the variation in \( y \) explained by the linear model.

E.g. \( r^2 = 0.88 \) → 88% of the variation in \( y \) is explained by the linear relationship with \( x \).

Regression line (\( y \) on \( x \))

\( y = ax + b \)

Use to predict \( y \) from \( x \). The line passes through \( (\bar{x}, \bar{y}) \).

Meaning of \( a \) and \( b \) in context: \( a \) = gradient = the change in \( y \) for each \( 1 \)-unit increase in \( x \). \( b \) = \( y \)-intercept = the predicted \( y \) when \( x = 0 \) (often not meaningful if \( x = 0 \) lies outside or is irrelevant to the data).
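
In the exam these values come from the GDC, but a sketch of what the calculator computes may help with interpretation. The function below uses the standard least-squares formulas \( a = S_{xy}/S_{xx} \), \( b = \bar{y} - a\bar{x} \) and \( r = S_{xy}/\sqrt{S_{xx}S_{yy}} \), which are not stated in these notes; the function name and the data are hypothetical.

```python
from math import sqrt


def linear_regression(xs, ys):
    """Return (a, b, r) for the least-squares line y = ax + b of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx                 # gradient
    b = y_bar - a * x_bar         # intercept, so the line passes through the mean point
    r = sxy / sqrt(sxx * syy)     # Pearson correlation coefficient
    return a, b, r


# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 6, 9, 10]

a, b, r = linear_regression(xs, ys)
print(round(a, 3), round(b, 3), round(r, 3), round(r ** 2, 3))
```

Because \( b \) is defined as \( \bar{y} - a\bar{x} \), substituting \( x = \bar{x} \) into the line always returns \( \bar{y} \).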

TI-84 Plus CE — LinReg

Enter data: [STAT] → Edit (x in L1, y in L2)
[STAT] → CALC → 4: LinReg(ax+b) L1, L2
Gives: \( a \) (gradient), \( b \) (intercept), \( r, r^2 \)
(If \( r \) not shown: [2nd][CATALOG] → DiagnosticOn)

TI-Nspire CX II — LinReg

Enter data in Lists & Spreadsheet (x in col A, y in col B)
[Menu] → Statistics → Stat Calculations → Linear Regression (mx+b)
Gives: \( m, b, r, r^2 \)

Casio fx-CG50 — LinReg

[MENU] → Statistics → enter x in List 1, y in List 2
[CALC] (F2) → REG → X (linear) → ax+b
Gives: \( a, b, r, r^2 \)

Worked Example

GDC gives the regression line \( y = 2.3x + 4.1 \) with \( r = 0.94 \) for data where \( 5 \leq x \leq 30 \). Predict \( y \) when \( x = 20 \) and comment on reliability.

\( y = 2.3(20) + 4.1 = 50.1 \)

\( r = 0.94 \) shows a strong positive linear correlation.

\( x = 20 \) is within the data range (interpolation), so the prediction is reliable.

Answer: \( y = 50.1 \); reliable because strong correlation and interpolation.
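
The prediction and the range check from this example can be sketched together. The function name `predict` is illustrative; its default values are the gradient, intercept and data range given in the question.

```python
def predict(x, a=2.3, b=4.1, x_min=5, x_max=30):
    """Return the predicted y and whether x is an interpolation or extrapolation."""
    y = a * x + b
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


y, kind = predict(20)
print(round(y, 1), kind)

# A value outside 5 <= x <= 30 is flagged so the prediction is treated with caution.
y_far, kind_far = predict(40)
print(round(y_far, 1), kind_far)
```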

8

Interpolation vs Extrapolation

Interpolation

Predicting within the data range.

Generally reliable if the correlation is strong (\( |r| \) close to 1).

Extrapolation

Predicting outside the data range.

Less reliable — the relationship may not hold.

✗

Common error: Using the regression line to extrapolate far beyond the data and stating the prediction is reliable. Always check whether the \( x \)-value is within the original data range.

✗

Common error: Confusing correlation with causation. A strong \( r \) does NOT mean one variable causes the other. There may be a third variable or coincidence.

9

Exam Reminders

▶

Describing distributions: Comment on shape (symmetric, positively/negatively skewed), centre (mean or median), and spread (range, IQR, or standard deviation).

▶

Comparing data sets: Compare a measure of centre AND a measure of spread. E.g. "Group A has a higher mean (25.3 vs 19.1) but a larger standard deviation (4.2 vs 2.8), so it is more spread out."

▶

Formula booklet: The formulae for mean, standard deviation, and \( r \) are given but you should use your GDC for calculations. The formulae help you understand what the statistics measure.