Quick Summary
This guide will equip you to analyze the relationship between two categorical variables. You will learn to organize data in two-way tables, calculate various types of relative frequencies (joint, marginal, and conditional), and create powerful visualizations like segmented and side-by-side bar charts. Ultimately, you will be able to use these tools to determine whether there is a statistical association between the two variables.
Key Concepts
Analyzing two categorical variables involves moving beyond single-variable descriptions to explore how they interact. We use specific tables, calculations, and graphs to uncover and describe these relationships.
1. Organizing Data: The Two-Way Table
Data for two categorical variables are best organized in a two-way table (also called a contingency table).
Structure: One variable's categories form the rows, and the other's form the columns. A common convention places the variable we suspect influences the other (the explanatory variable) in the columns and the outcome (the response variable) in the rows, but either orientation is valid. In the example table below, the explanatory variable (grade level) is in the rows.
Joint Frequencies: The counts inside the main body of the table are joint frequencies, representing the number of individuals that fall into a specific category for both variables.
Marginal Frequencies: The totals for each row and column are the marginal frequencies. They represent the total counts for each category of a single variable, ignoring the other.
Grand Total: The sum of all row totals (or all column totals) is the grand total, representing the total number of individuals in the dataset.
Example Table: A study surveyed 200 high school students about their primary mode of transportation to school and their grade level.
| Grade Level | Bus | Car | Walk/Bike | Total |
|---|---|---|---|---|
| Underclass (9/10) | 60 | 30 | 20 | 110 |
| Upperclass (11/12) | 20 | 50 | 20 | 90 |
| Total | 80 | 80 | 40 | 200 |
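As a rough illustration, the marginal frequencies and grand total in the table can be recomputed from the joint frequencies alone. The sketch below assumes the table is stored as a nested dictionary; the names `counts`, `row_totals`, `col_totals`, and `grand_total` are illustrative and not part of the original guide.

```python
# Joint frequencies from the example table
# (rows: grade level, columns: transportation mode).
counts = {
    "Underclass": {"Bus": 60, "Car": 30, "Walk/Bike": 20},
    "Upperclass": {"Bus": 20, "Car": 50, "Walk/Bike": 20},
}

# Marginal frequencies for grade level: one total per row.
row_totals = {grade: sum(modes.values()) for grade, modes in counts.items()}

# Marginal frequencies for transportation: one total per column.
col_totals = {}
for modes in counts.values():
    for mode, n in modes.items():
        col_totals[mode] = col_totals.get(mode, 0) + n

# Grand total: summing the row totals or the column totals gives the same number.
grand_total = sum(row_totals.values())

print(row_totals)   # {'Underclass': 110, 'Upperclass': 90}
print(col_totals)   # {'Bus': 80, 'Car': 80, 'Walk/Bike': 40}
print(grand_total)  # 200
```

The row totals and column totals match the "Total" row and column of the table, which is a quick way to check a table for arithmetic errors.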
2. Calculating Frequencies and Proportions
From the counts in a two-way table, we can calculate different types of relative frequencies (proportions or percentages) to make meaningful comparisons.
Joint Relative Frequency: Answers the question, "What proportion of the total individuals have both characteristics?"
Formula: $\text{joint relative frequency} = \dfrac{\text{joint frequency}}{\text{grand total}}$
Example: The joint relative frequency of students who are Underclassmen AND ride the bus is 60 / 200 = 0.30 or 30%.
Marginal Relative Frequency: Describes the distribution of a single variable. Answers, "What proportion of the total individuals fall into this one category?"
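Formula: $\text{marginal relative frequency} = \dfrac{\text{marginal frequency (row or column total)}}{\text{grand total}}$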
Example: The marginal relative frequency of students who are Upperclassmen is 90 / 200 = 0.45 or 45%.
Conditional Relative Frequency: This is the most important calculation for identifying associations. It answers, "Given an individual is in one category, what is the probability they are in another?" It restricts the focus to a specific row or column.
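Formula: $\text{conditional relative frequency} = \dfrac{\text{joint frequency}}{\text{row or column total for the condition}}$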
Example (Conditioning on Grade Level): What proportion of Underclassmen ride the bus? Here, the condition is "Underclassmen," so we only look at that row.
- Calculation: $60 / 110 \approx 0.545$, or 54.5%.
Example (Conditioning on Transportation): What proportion of bus riders are Upperclassmen? Here, the condition is "bus riders," so we only look at that column.
- Calculation: 20 / 80 = 0.25 or 25%.
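The three kinds of relative frequency differ only in which count goes in the numerator and which total goes in the denominator. A minimal Python sketch, assuming the example table is stored as a nested dictionary (the function names here are hypothetical helpers, not standard library calls), reproduces the four worked examples above:

```python
# Joint frequencies from the example table (rows: grade level, columns: transportation).
counts = {
    "Underclass": {"Bus": 60, "Car": 30, "Walk/Bike": 20},
    "Upperclass": {"Bus": 20, "Car": 50, "Walk/Bike": 20},
}
grand_total = sum(sum(row.values()) for row in counts.values())

def joint_rel(grade, mode):
    """Joint relative frequency: cell count over the grand total."""
    return counts[grade][mode] / grand_total

def marginal_rel_grade(grade):
    """Marginal relative frequency of a grade level: row total over the grand total."""
    return sum(counts[grade].values()) / grand_total

def cond_given_grade(grade, mode):
    """Conditional relative frequency of `mode`, restricted to one grade level (row)."""
    return counts[grade][mode] / sum(counts[grade].values())

def cond_given_mode(mode, grade):
    """Conditional relative frequency of `grade`, restricted to one mode (column)."""
    col_total = sum(row[mode] for row in counts.values())
    return counts[grade][mode] / col_total

print(joint_rel("Underclass", "Bus"))                   # 0.3
print(marginal_rel_grade("Upperclass"))                 # 0.45
print(round(cond_given_grade("Underclass", "Bus"), 3))  # 0.545
print(cond_given_mode("Bus", "Upperclass"))             # 0.25
```

Note that `cond_given_grade` and `cond_given_mode` share the same numerator for a given cell; only the denominator (the condition) changes.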
3. Visualizing Data for Two Categorical Variables
Graphs help us see the relationships that calculations reveal.
Side-by-Side Bar Chart: Displays bars for the response variable categories next to each other, grouped by the explanatory variable categories. This makes it easy to compare the heights of the bars (the frequencies or relative frequencies) across groups.
[Image: A side-by-side bar chart showing three groups (Bus, Car, Walk/Bike). Each group has two bars next to each other, one for Underclass and one for Upperclass. The y-axis is labeled 'Percentage' or 'Count'.]
Segmented (or Stacked) Bar Chart: Each bar represents a category of the explanatory variable. The bar is divided into segments, with the length of each segment corresponding to the conditional relative frequency of the response variable's categories. All bars have a total height of 100%. This is excellent for comparing conditional distributions.
[Image: A segmented bar chart with three bars (Bus, Car, Walk/Bike). Each bar is 100% tall and colored in two segments representing the percentage of Underclass and Upperclass students within that transportation mode.]
Mosaic Plot: A modified segmented bar chart where the width of each bar is proportional to the marginal frequency of that category of the explanatory variable. This graph conveys information about both the conditional and marginal distributions simultaneously.
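To make the construction of these graphs concrete, here is a rough text-only sketch of a segmented bar chart with one bar per transportation mode, as in the image description above. It assumes a fixed bar width of 40 characters for 100%; `WIDTH`, `segmented_bar`, and `mosaic_width` are illustrative names, and a real chart would normally be drawn with graphing software rather than characters.

```python
# Joint frequencies from the example table (rows: grade level, columns: transportation).
counts = {
    "Underclass": {"Bus": 60, "Car": 30, "Walk/Bike": 20},
    "Upperclass": {"Bus": 20, "Car": 50, "Walk/Bike": 20},
}
grand_total = sum(sum(row.values()) for row in counts.values())
WIDTH = 40  # number of characters that represents 100% of a bar

def segmented_bar(mode):
    """One 100% bar for a transportation mode: '#' = Underclass, '.' = Upperclass."""
    col_total = sum(row[mode] for row in counts.values())
    under = round(WIDTH * counts["Underclass"][mode] / col_total)
    return "#" * under + "." * (WIDTH - under)

def mosaic_width(mode):
    """Relative bar width in a mosaic plot: the marginal relative frequency of the mode."""
    return sum(row[mode] for row in counts.values()) / grand_total

for mode in ("Bus", "Car", "Walk/Bike"):
    print(f"{mode:<10}|{segmented_bar(mode)}| width share: {mosaic_width(mode):.2f}")
```

Every bar has the same length because each one shows a conditional distribution that sums to 100%; the "width share" column is the extra information a mosaic plot would encode.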
4. Describing Association
The primary goal is to determine if the two variables are related.
Association: An association exists between two categorical variables if knowing the value of one variable helps you predict the value of the other. In other words, the conditional distributions of the response variable are different across the categories of the explanatory variable.
Independence: Two variables are independent if there is no association between them. The conditional distributions of the response variable are the same (or very similar) across all categories of the explanatory variable.
How to Check for Association:
Choose one variable as the explanatory variable (the one you are conditioning on).
Calculate the conditional relative frequencies of the response variable for each category of the explanatory variable.
Compare these conditional distributions.
If they are noticeably different: There is evidence of an association.
If they are the same (or very close): There is no evidence of an association, which is consistent with the variables being independent.
Example: Let's check for an association between Grade Level and Transportation. We'll find the conditional distribution of transportation for each grade level.
Underclass:
Bus: $60/110 \approx 54.5\%$
Car: $30/110 \approx 27.3\%$
Walk/Bike: $20/110 \approx 18.2\%$
Upperclass:
Bus: $20/90 \approx 22.2\%$
Car: $50/90 \approx 55.6\%$
Walk/Bike: $20/90 \approx 22.2\%$
Since the distribution of transportation modes for Underclassmen (54.5%, 27.3%, 18.2%) is very different from that of Upperclassmen (22.2%, 55.6%, 22.2%), we can conclude there is an association between grade level and mode of transportation.
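The comparison above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and `largest_gap` is an informal summary added here for convenience, not a formal test and not a cutoff rule from the guide. Deciding whether a gap is "noticeably different" remains a judgment made in context.

```python
# Joint frequencies from the example table (rows: grade level, columns: transportation).
counts = {
    "Underclass": {"Bus": 60, "Car": 30, "Walk/Bike": 20},
    "Upperclass": {"Bus": 20, "Car": 50, "Walk/Bike": 20},
}

def conditional_distribution(row):
    """Conditional distribution of the response within one explanatory category."""
    total = sum(row.values())
    return {category: n / total for category, n in row.items()}

# One conditional distribution of transportation per grade level.
dists = {grade: conditional_distribution(row) for grade, row in counts.items()}

def largest_gap(dists):
    """Largest difference between groups in the proportion for any single category."""
    categories = next(iter(dists.values())).keys()
    return max(
        max(d[c] for d in dists.values()) - min(d[c] for d in dists.values())
        for c in categories
    )

for grade, dist in dists.items():
    print(grade, {c: round(100 * p, 1) for c, p in dist.items()})
# Underclass {'Bus': 54.5, 'Car': 27.3, 'Walk/Bike': 18.2}
# Upperclass {'Bus': 22.2, 'Car': 55.6, 'Walk/Bike': 22.2}

print(round(largest_gap(dists), 3))  # 0.323 (the Bus proportions differ the most)
```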
Key Vocabulary
Two-Way Table: A table that displays the counts (or frequencies) of individuals falling into each combination of categories for two categorical variables.
Marginal Distribution: The distribution of values of one of the categorical variables in a two-way table of counts, without regard to the values of the other variable. It is found in the "margins" (totals) of the table.
Joint Relative Frequency: The proportion of individuals that have a specific characteristic for one variable AND a specific characteristic for another variable. Calculated as $\dfrac{\text{joint frequency}}{\text{grand total}}$.
Conditional Relative Frequency: The proportion of individuals with a specific characteristic, restricted to a smaller group (the "condition"). Calculated as $\dfrac{\text{joint frequency}}{\text{total for the condition (row or column total)}}$.
Association: A relationship between two variables where the value of one variable can help predict the value of the other. For categorical variables, this means the conditional distributions are different.
Independence: The opposite of association. Two variables are independent if the conditional distribution of one variable is the same across all categories of the other variable.
Segmented Bar Chart: A graph used to compare the distribution of a categorical variable in several groups. Each bar represents a group, and the bar is divided into segments proportional to the percent of individuals in each category.
Calculator Tech (TI-84)
While most calculations for this topic are simple arithmetic, you can use the TI-84 to store two-way tables in a matrix. This is a foundational skill for the Chi-Squared Test for Independence (Unit 9).
Entering a Two-Way Table into a Matrix:
Let's enter the transportation data (excluding totals) into a 2x3 matrix.
Access the Matrix Menu: Press 2nd -> [MATRIX].
Edit a Matrix: Use the arrow keys to move to the EDIT menu at the top. Select a matrix, for example, 1: [A]. Press ENTER.
Define Dimensions: First, enter the dimensions of your table (rows x columns). For our example, it's a 2x3 table. Enter 2 ENTER 3 ENTER.
Enter Data: The calculator will display a 2x3 grid. Enter the counts from the table, pressing ENTER after each one. The calculator fills the matrix row by row: 60 ENTER 30 ENTER 20 ENTER 20 ENTER 50 ENTER 20 ENTER.
Quit to Home Screen: Once all data is entered, press 2nd -> MODE [QUIT] to return to the home screen. The matrix is now stored.
You can view the matrix by selecting it from the MATRIX -> NAMES menu and pressing ENTER.
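For readers who want to see what "row by row" means, the following Python sketch mimics the fill order with a plain list of lists. It is only an analogy: `entries` and `matrix_a` are illustrative names, and nothing here runs on the calculator.

```python
# Counts in the order they would be keyed in: first row left to right, then second row.
entries = [60, 30, 20, 20, 50, 20]
rows, cols = 2, 3

# Slice the flat sequence into rows of `cols` values each (row-by-row fill).
matrix_a = [entries[r * cols:(r + 1) * cols] for r in range(rows)]

print(matrix_a)  # [[60, 30, 20], [20, 50, 20]]
```

The first inner list matches the Underclass row of the table and the second matches the Upperclass row, with totals left out.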
How to Show Work on the FRQ
On Free Response Questions, you must clearly communicate your reasoning when determining if an association exists. Simply stating "yes" or "no" will earn no credit. Use the following template to structure your response.
FRQ Template: Justifying Association
Calculate Relevant Conditional Proportions: State which conditional proportions you are calculating. Condition on the explanatory variable.
- Sentence Starter: "To investigate the relationship between [Explanatory Variable] and [Response Variable], I will calculate the conditional proportions of [Response Variable] for each category of [Explanatory Variable]."
Compare the Proportions: Compare the calculated proportions using explicit, numerical evidence and comparison words (e.g., "greater than," "less than," "approximately equal to"). Compare at least two pairs of values.
- Sentence Starter: "The proportion of [Category A of Explanatory] that [Outcome 1 of Response] is (___ / ___ = ___%), which is [greater than/less than/similar to] the proportion of [Category B of Explanatory] that [Outcome 1 of Response] (___ / ___ = ___%)."
Continue comparing for other outcomes if necessary.
State a Conclusion in Context: Make a definitive conclusion about association based on your comparison. Link your reasoning directly to the numbers you compared.
- Sentence Starter: "Because the conditional proportions of [Response Variable] are [different/similar] across the categories of [Explanatory Variable], there is [evidence of an association / no evidence of an association] between [Explanatory Variable] and [Response Variable] for this group of [subjects]."
Practice Problems
Problem 1:
A random sample of 400 adults was asked about their highest level of education and whether they smoke cigarettes. The results are summarized in the table below.
| Education Level | Smoker | Non-smoker | Total |
|---|---|---|---|
| High School or Less | 60 | 140 | 200 |
| Some College | 30 | 90 | 120 |
| Bachelor's or Higher | 10 | 70 | 80 |
| Total | 100 | 300 | 400 |
(a) What proportion of adults in the sample are smokers with a Bachelor's degree or higher?
(b) What proportion of adults in the sample who have a Bachelor's degree or higher are smokers?
(c) Create a segmented bar chart to display the relationship between education level and smoking status.
Solution:
(a) This asks for a joint relative frequency. We are looking for the proportion of the total sample that fits both criteria.
Calculation: (Number of smokers with Bachelor's or higher) / (Grand Total) = 10 / 400 = 0.025.
Answer: 2.5% of adults in the sample are smokers with a Bachelor's degree or higher.
(b) This asks for a conditional relative frequency. The condition is "have a Bachelor's degree or higher." Our denominator will be the total number of people in that category.
Calculation: (Number of smokers with Bachelor's or higher) / (Total with Bachelor's or higher) = 10 / 80 = 0.125.
Answer: 12.5% of adults in the sample who have a Bachelor's degree or higher are smokers.
(c) To create a segmented bar chart, we first need the conditional proportions of smokers and non-smokers for each education level.
High School or Less:
Smoker: 60/200 = 30%
Non-smoker: 140/200 = 70%
Some College:
Smoker: 30/120 = 25%
Non-smoker: 90/120 = 75%
Bachelor's or Higher:
Smoker: 10/80 = 12.5%
Non-smoker: 70/80 = 87.5%
The chart would have three bars, one for each education level. Each bar would be 100% tall, with the bottom segment representing the percentage of smokers and the top segment representing non-smokers.
[Image: A segmented bar chart with three bars: "HS or Less", "Some College", "Bachelor's+". The "HS or Less" bar is 30% smoker/70% non-smoker. The "Some College" bar is 25% smoker/75% non-smoker. The "Bachelor's+" bar is 12.5% smoker/87.5% non-smoker.]
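The arithmetic in parts (a) through (c) can be double-checked with a short script. This is a sketch that assumes the Problem 1 table is stored as a nested dictionary; the variable names (`smoking`, `part_a`, `part_b`, `segments`) are illustrative.

```python
# Joint frequencies from the Problem 1 table (rows: education level, columns: smoking status).
smoking = {
    "High School or Less": {"Smoker": 60, "Non-smoker": 140},
    "Some College": {"Smoker": 30, "Non-smoker": 90},
    "Bachelor's or Higher": {"Smoker": 10, "Non-smoker": 70},
}
grand_total = sum(sum(row.values()) for row in smoking.values())

# (a) Joint relative frequency: the denominator is the grand total.
part_a = smoking["Bachelor's or Higher"]["Smoker"] / grand_total

# (b) Conditional relative frequency: the denominator is the Bachelor's row total.
bachelors_total = sum(smoking["Bachelor's or Higher"].values())
part_b = smoking["Bachelor's or Higher"]["Smoker"] / bachelors_total

# (c) Segment heights (in percent) for each bar of the segmented bar chart.
segments = {
    level: {status: 100 * n / sum(row.values()) for status, n in row.items()}
    for level, row in smoking.items()
}

print(part_a)  # 0.025
print(part_b)  # 0.125
for level, heights in segments.items():
    print(level, heights)
```

Each bar's two segment heights add to 100, which is the defining feature of a segmented bar chart.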
Problem 2:
Using the data from Problem 1, is there an association between education level and smoking status for the adults in this sample? Use statistical evidence to support your answer.
Solution:
(Applying the FRQ Template)
Calculate Relevant Conditional Proportions: To investigate the relationship between education level and smoking status, I will calculate the conditional proportion of smokers for each education level.
P(Smoker | High School or Less) = 60 / 200 = 0.30 or 30%.
P(Smoker | Some College) = 30 / 120 = 0.25 or 25%.
P(Smoker | Bachelor's or Higher) = 10 / 80 = 0.125 or 12.5%.
Compare the Proportions: The proportion of adults with a High School education or less who smoke is 30%. This is greater than the proportion of adults with Some College who smoke (25%), and much greater than the proportion of adults with a Bachelor's degree or higher who smoke (12.5%).
State a Conclusion in Context: Because the conditional proportions of smokers are different across the three education levels (30%, 25%, and 12.5%), there is evidence of an association between education level and smoking status for the adults in this sample. Specifically, as education level increases, the proportion of smokers tends to decrease.
Common Mistakes to Avoid
Using the Wrong Denominator: This is the most frequent error. When asked for a conditional proportion ("What percent of Group A are Category X?"), the denominator MUST be the total for Group A, not the grand total. Always identify the condition first.
Comparing Raw Counts Instead of Proportions: Do not conclude there is an association just because the raw counts are different. Group sizes are often unequal. You MUST compare conditional relative frequencies (proportions or percentages) to make a fair comparison. For example, in Problem 1, there are more smokers in the "High School" group (60) than the "Some College" group (30), but this is misleading because the "High School" group is larger. The proportions (30% vs. 25%) tell the real story.
Vague Comparisons: Statements like "The numbers are different" are not sufficient for an FRQ. You must explicitly state which numbers you are comparing and use comparative language (e.g., "30% is greater than 12.5%").
Forgetting Context: Always relate your conclusion back to the variables in the problem. Don't just say "there is an association." Say "there is an association between education level and smoking status."
Confusing Association with Causation: Finding an association does not mean that one variable causes the other. In our example, a higher education level is associated with less smoking, but we cannot conclude that getting more education causes people to stop smoking. There may be other lurking variables (like income or health awareness) that are related to both.
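The first two mistakes in this list can be seen side by side in a small sketch using the Problem 1 data. The names `smoker_rate`, `raw_ratio`, `wrong`, and `right` are illustrative and chosen only for this example.

```python
# Joint frequencies from the Problem 1 table (rows: education level, columns: smoking status).
smoking = {
    "High School or Less": {"Smoker": 60, "Non-smoker": 140},
    "Some College": {"Smoker": 30, "Non-smoker": 90},
    "Bachelor's or Higher": {"Smoker": 10, "Non-smoker": 70},
}
grand_total = sum(sum(row.values()) for row in smoking.values())

def smoker_rate(level):
    """Conditional proportion of smokers within one education level."""
    row = smoking[level]
    return row["Smoker"] / sum(row.values())

# Mistake: comparing raw counts. 60 smokers vs. 30 smokers looks like "twice as many".
raw_ratio = smoking["High School or Less"]["Smoker"] / smoking["Some College"]["Smoker"]
print(raw_ratio)  # 2.0

# Fair comparison: conditional proportions account for the unequal group sizes.
print(smoker_rate("High School or Less"), smoker_rate("Some College"))  # 0.3 0.25

# Mistake: wrong denominator. Dividing by the grand total gives a joint relative
# frequency, not the proportion of Bachelor's-or-higher adults who smoke.
wrong = smoking["Bachelor's or Higher"]["Smoker"] / grand_total
right = smoker_rate("Bachelor's or Higher")
print(wrong, right)  # 0.025 0.125
```

The raw counts suggest a twofold difference, while the proportions differ by only five percentage points; the group sizes (200 vs. 120) account for the rest.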