AP Stats Home

Describing the Distribution of a Quantitative Variable

When analyzing a single quantitative variable, it’s crucial to describe its overall pattern and any striking deviations from that pattern. We often use a visual display like a histogram, stemplot, or boxplot (which can be found in Representing a Quantitative Variable with Graphs) to aid in this description. A common mnemonic for a complete description is SOCS: Shape, Outliers, Center, and Spread.

Shape

The shape of a distribution describes its overall form.

Symmetry/Skewness:
- Symmetric: The left and right sides of the graph are approximately mirror images. The mean and median will be close. A special case of a symmetric distribution is the The Normal Distribution.
- Skewed Right (Positively Skewed): The tail extends further to the right. The mean will typically be greater than the median.
- Skewed Left (Negatively Skewed): The tail extends further to the left. The mean will typically be less than the median.
Modality:
- Unimodal: Has one distinct peak (mode).
- Bimodal: Has two distinct peaks.
- Multimodal: Has more than two distinct peaks.
- Uniform: No distinct peaks; all values appear with roughly the same frequency.

Outliers

Outliers are individual values that fall outside the overall pattern of the distribution. They can be due to natural variability, measurement errors, or data entry mistakes. When describing a distribution, you should always mention any apparent outliers and investigate them further.

Identifying Outliers

1.5 IQR Rule: A common method to formally identify potential outliers is the 1.5 IQR rule. A value is considered a potential outlier if it falls below $ Q_1 - 1.5 \times IQR $ or above $ Q_3 + 1.5 \times IQR $ . Here, $ Q_1 $ is the first quartile, $ Q_3 $ is the third quartile, and $ IQR = Q_3 - Q_1 $ .

Center

The center of a distribution provides a typical or central value.

Mean ( $ \bar{x} $ or $ \mu $ ): The arithmetic average of all the values. It is sensitive to outliers and skewness. $$ \bar{x} = \frac{\sum x_i}{n} $$ where $ x_i $ are the individual data points and $ n $ is the number of data points.
Median (M): The midpoint of the ordered data. Half the observations are smaller, and half are larger. It is resistant to outliers and skewness.

For more on calculating these values, refer to Summary Statistics for a Quantitative Variable.

Spread

The spread (or variability) describes how much the data values vary from each other.

Range: The difference between the maximum and minimum values ( $ Max - Min $ ). It is highly affected by outliers.
Interquartile Range (IQR): The range of the middle 50% of the data ( $ IQR = Q_3 - Q_1 $ ). It is resistant to outliers.
Standard Deviation ( $ s $ or $ \sigma $ ): A measure of the typical distance of the observations from the mean. It is sensitive to outliers. $$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} $$ where $ x_i $ are the individual data points, $ \bar{x} $ is the sample mean, and $ n $ is the sample size.

For a summary of resistance to outliers:

Statistic	Resistant to Outliers?
Mean	No
Median	Yes
Range	No
Interquartile Range (IQR)	Yes
Standard Deviation	No

Choosing between mean/standard deviation vs. median/IQR depends on the shape:

Use mean and standard deviation for reasonably symmetric distributions without outliers.
Use median and IQR for skewed distributions or those with significant outliers.

Remember to always describe a quantitative variable in terms of its SOCS!

Describing the Distribution of a Quantitative Variable

Carson West

AP Stats Home

Describing the Distribution of a Quantitative Variable

Shape

Outliers

Identifying Outliers

Center

Spread