Data Compression | AP CSP Unit 2 Study Guide

Getting Started

Every time you stream a video, send a photo, or download a program, you are using data that has been shrunk to travel more efficiently. The massive amounts of digital information we create and share would be impossible to manage without a way to make files smaller. Data compression provides the techniques to reduce file sizes, making it faster to send information over the internet and possible to store more of it on our devices.

What You Should Be Able to Do

Explain why data compression is necessary for managing digital information.
Compare the processes and outcomes of lossless and lossy compression.
Determine whether a lossless or a lossy approach is better for a given file type, such as a text document versus a streaming video.
Describe how a heuristic can be used to decide which data to discard in lossy compression.

Key Concepts & Application

The Core Idea

Data compression is the process of encoding information using fewer bits than the original representation. Think of it like sending a text message. You might type "c u l8r" instead of "I will see you later." The core meaning is preserved, but you've used far fewer characters. In computing, this means representing the same (or nearly the same) data with less storage space.

This reduction is achieved by finding and reducing patterns or redundancies. For example, in an image of a blue sky, instead of storing "blue pixel, blue pixel, blue pixel..." a thousand times, a compression algorithm might store "1000 blue pixels." This is the fundamental principle: identify patterns and create a shorthand to represent them, resulting in a smaller file.

There are two primary strategies for compression, each with a critical trade-off between file size and perfect accuracy.

Logic & Application

The choice between compression methods depends entirely on what the data will be used for. Is it essential that the data can be restored perfectly, or is "close enough" acceptable?

Feature	Lossless Compression	Lossy Compression
Data Recovery	The original data can be perfectly reconstructed.	Some data is permanently lost; the original cannot be recovered.
File Size	Moderate reduction.	Significant reduction.
When to Use	When every bit of data is critical.	When some loss of quality is acceptable for a smaller file.
Example Files	Text documents (.txt, .docx), computer programs (.exe), some image formats (PNG, GIF).	Images (JPEG), audio (MP3), video (MPEG).

Lossless Compression in Action

Lossless compression works by finding patterns. A simple method is run-length encoding, where sequences of repeated data are stored as a single value and a count.

Consider a simple black-and-white image represented by a string of pixels, where W is a white pixel and B is a black pixel.

Uncompressed Data:

WWWWWWBBBBW BBBBBBW WW

An algorithm can scan this data and replace the "runs" of identical characters with a count.

Compressed Representation:

6W4B1W6B2W

When the file is opened, the program reads 6W and reconstructs the six white pixels perfectly. No data is lost. This is ideal for text, where changing a single character could alter the meaning of an entire document.

Lossy Compression & Heuristics

Lossy compression achieves much smaller file sizes by permanently deleting data. But how does it decide what to delete? It often uses a heuristic, which is a problem-solving approach to find a satisfactory solution when finding a perfect one is impractical.

In this context, the heuristic is a set of rules used to discard information that humans are unlikely to notice.

For a JPEG image: The algorithm might group pixels into blocks. The heuristic analyzes the colors in a block and, if it finds several very similar shades of blue, it might change them all to a single average shade of blue. The human eye likely won't perceive the difference, but the data required to store that block is reduced.
For an MP3 audio file: The heuristic might remove sounds that are too high or too low for most humans to hear, or it might remove a very quiet sound that occurs at the same time as a very loud one (a phenomenon called "auditory masking").

The "quality" setting when you save a JPEG (e.g., 90% or 60%) adjusts how aggressively the heuristic discards data. A lower quality setting means a more aggressive heuristic and a smaller file, but the data loss may become noticeable.

Tracing & Analysis

Logic Trace: Lossless Compression

Let's trace the compression of the string AAABBCDDDD using a simple run-length encoding algorithm.

Input:AAABBCDDDD
The algorithm reads the first character, A. It counts one A.
It reads the next character, A. It is the same. The count for A is now 2.
It reads the next character, A. It is the same. The count for A is now 3.
It reads the next character, B. This is different. The algorithm outputs the count and character for the previous run: 3A. It resets to count the new character, B. The count for B is 1.
It reads the next character, B. It is the same. The count for B is now 2.
It reads the next character, C. This is different. The algorithm outputs 2B and resets to count C. The count for C is 1.
It reads the next character, D. This is different. The algorithm outputs 1C and resets to count D. The count for D is 1.
It continues counting Ds until the end of the string. The final count for D is 4.
Final Output:3A2B1C4D

Societal Impact

Data compression is a foundational technology of the modern internet. Without it, the digital world would be drastically slower and more expensive.

Accessibility: Compression enables high-quality video and audio streaming even on slower internet connections, making information and entertainment more accessible globally.
Efficiency: It reduces storage costs for large companies (like Google Photos or Dropbox) and individuals. It also lessens the load on network infrastructure, allowing more data to flow through the internet's "pipes."
Innovation: Entire industries, from streaming services like Netflix and Spotify to cloud computing and social media, are built upon the ability to efficiently store and transmit massive compressed data files.

Key Terminology & Logic

Lossless Compression: A compression method that allows for perfect reconstruction of the original data.
Lossy Compression: A compression method that results in permanent data loss to achieve a smaller file size.
Heuristic: An approach to problem-solving that produces a solution that is "good enough" when an exact or optimal solution is too difficult or slow to find.

Core Concepts & Terminology

Data Compression: The process of encoding information to use fewer bits than the original representation. Its purpose is to reduce file sizes for more efficient storage and transmission.
Lossless Compression: A class of data compression algorithms that allows the original data to be perfectly reconstructed from the compressed data. No information is lost in the process.
Lossy Compression: A class of data compression algorithms that involves discarding some data to represent the content with fewer bits. This process is irreversible, as the original data cannot be fully recovered.
Heuristic: A problem-solving technique that is sufficient for finding an immediate, short-term, or approximate solution to a problem. In lossy compression, a heuristic is the set of rules used to determine which data can be discarded without significantly impacting perceived quality.
Core Concept (The Trade-off): The fundamental choice in data compression is between file size and quality. Lossless compression prioritizes perfect quality at the cost of a larger file size. Lossy compression prioritizes a smaller file size at the cost of permanently losing some data quality.

Core Skill Check

Logic Tracing: A lossless algorithm encodes WWBBW as 2W2B1W. What is the output for the string BBBBWGGGG?
Debugging: A developer decided to use lossy compression to store user passwords to save database space. Identify the critical logical error in this plan.
Application: Describe a real-world file or data stream where lossy compression is appropriate and one where only lossless compression should be used.

Common Misconceptions & Clarifications

"Compression always makes files smaller."
- Clarification: For data with no discernible patterns (like a file of random numbers), a lossless compression algorithm can actually increase the file size due to the overhead of its own encoding information.
"Lossy and lossless are just different quality settings."
- Clarification: They are fundamentally different approaches. Lossless preserves data perfectly; lossy discards it. You cannot "turn up the quality" on a lossy file to get the original data back.
"Saving a JPEG file over and over again doesn't matter."
- Clarification: Each time a JPEG is saved, a new lossy compression cycle occurs. This leads to "generational degradation," where the image quality gets progressively worse with each save.
"A heuristic is a fancy word for guessing."
- Clarification: A heuristic is not a random guess. It is an educated, strategic approach based on known principles (like the limits of human perception) to find a good, non-optimal solution quickly.

Summary

Data compression is a critical technique for managing the vast amount of digital information in our world by reducing file sizes. The two main strategies are lossless and lossy compression. Lossless methods, vital for text and data where accuracy is paramount, find patterns to represent data more concisely while allowing for perfect reconstruction. Lossy methods, essential for media like images and video, use heuristics to intelligently discard data that humans are unlikely to miss, achieving significantly smaller files at the cost of some quality. Understanding the trade-off between these two approaches is key to building efficient and effective digital systems.

Data Compression - AP Computer Science Principles Study Guide