「Algebraic ECCs」: Lec1 Basics of ECCs
Topics Covered:
- Basic problem in coding theory
- Code and codeword
- Hamming distance and minimum distance
- Rate
- Hamming bound on the trade-off between rate and distance
Introduction
This course is named “Algebraic Error Correcting Codes”; the title has two parts:
- Error Correcting Codes (ECCs): ECCs are a fundamental tool in communication, storage, complexity theory, algorithm design, cryptography, pseudorandomness, and more.
- Algebraic: algebraic techniques are a fundamental tool for designing ECCs.
Basic Problem in Coding Theory
Let’s start with the basic problem in coding theory:
We encode a message $x$ of length $k$ into a codeword $c$ of length $n$, which adds some redundancy when $n>k$. When we transmit or store this codeword, something bad might happen, e.g. some bits might get corrupted. Thus, we are left with a corrupted codeword $\tilde{c}$.
The goal is to find (something about) the message $x$ given the corrupted codeword $\tilde{c}$.
This has many applications, including communication and storage:
Application 1: Communication
- Suppose Alice has a message $x\in \{0,1\}^{k}$ and wants to send it to Bob.
- In order to add some redundancy, she encodes the message into a codeword $c$ and sends the codeword through a noisy channel.
- After Alice passes $c$ through the noisy channel, Bob receives $\tilde{c}$ and tries to figure out the original message $x$ that Alice intended to send.
Application 2: Storage
- Suppose $x$ is a file we want to store.
- Instead of storing $x$ directly, we encode it as a codeword $c$, introducing redundancy.
- If $c$ is stored on a CD or in a RAID array, something bad, like a fire, might happen.
- However, the owner still wants to recover the original file $x$.
Based on the above two applications, we summarize four things we care about in a coding scheme:
Things We Care About (Take 1):
1. We should be able to handle SOMETHING BAD.
2. We should be able to recover WHAT WE WANT TO KNOW about $x$.
3. We want to MINIMIZE OVERHEAD.
   Jumping ahead, the relevant quantity is the rate $k/n\in (0, 1]$, which should be as big as possible.
4. We want to do all this EFFICIENTLY.
The natural question is: what is the best trade-off between 1-4?
The trade-off depends on how we model things:
- What is something bad?
- What exactly do we want to know about $x$?
All of $x$, or just the first bit of $x$?
- What counts as efficient?
This lecture gives one way of answering these questions; there are many other legitimate ways.
Now, let’s give some formal definitions for a code.
Code
Let $\Sigma$ be any finite set and let $n>0$ be a positive integer. A code of length $n$ over the alphabet $\Sigma$ is a subset $\mathcal{C}\subseteq \Sigma^n$; its elements are called codewords.
So far, this is not a very interesting definition. A descriptive example for this definition is provided below.
Example 1:
Another example is slightly more interesting.
Example 2:
The following is a set of 7 vectors, which forms a binary code of length 4 over $\Sigma=\{0, 1\}$.
We can relate this code to the communication between Alice and Bob mentioned earlier.
Thus, the code $\mathcal{C}$ is the image of this map, i.e. $\mathcal{C}=\mathrm{Im(ENC)}$. In other words, $\mathcal{C}$ is the set of all codewords that can be obtained using this encoding map.
Example 2 can actually be used to correct one erasure (something bad).
An erasure means we know which bit got erased, but we don’t know its original value. Based on the definition of the encoding map, the missing bit in the example shown must be 1.
Furthermore, it can also be used to detect one error (something bad).
An error means we know one bit may have been changed, but we don’t know which one.
To sum up, we say that the code in Example 2 can correct one erasure or detect one error. But it cannot correct one error.
Let’s see a code that can correct one error.
Example 3:
Given this encoding map, we can define a code $\mathcal{C}=\mathrm{Im(ENC)}$. Hence, $\mathcal{C}\subseteq\{0,1\}^7$ is a binary code of length 7.
This code can be visualized nicely as three overlapping circles.
We place the message bits $x_1, x_2, x_3, x_4$ in the middle regions, and the remaining spaces in the circles correspond to the parity bits, each being the sum (mod 2) of the other bits in its circle. This ensures that the bits in each circle sum to 0.
With these parity bits, this code can correct one error.
With these circles in mind, we can solve the following puzzle more easily.
In a legit codeword, the bits in each of the three circles should sum to 0. Here, both the green and red circles are messed up while the blue circle is correct. Therefore, $\tilde{c}_3$ must be the bit that was flipped.
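To make the circle picture concrete, here is a minimal Python sketch in the spirit of Example 3. Since the figure is not reproduced here, the exact assignment of bits to circles (the `CIRCLES` table, with $x_4$ in the triple intersection) is an assumption; any layout where each position lies in a distinct subset of circles behaves the same way.

```python
# A minimal sketch of a three-circle code in the spirit of Example 3.
# Assumed layout (the figure is not reproduced here): x4 sits in the
# triple intersection, x1, x2, x3 in the pairwise intersections, and
# p1, p2, p3 are parity bits making each circle sum to 0 (mod 2).

CIRCLES = [
    (0, 1, 3, 4),  # circle 1: x1, x2, x4, p1
    (0, 2, 3, 5),  # circle 2: x1, x3, x4, p2
    (1, 2, 3, 6),  # circle 3: x2, x3, x4, p3
]

def encode(x):
    """Encode 4 message bits as (x1, x2, x3, x4, p1, p2, p3)."""
    x1, x2, x3, x4 = x
    return [x1, x2, x3, x4, x1 ^ x2 ^ x4, x1 ^ x3 ^ x4, x2 ^ x3 ^ x4]

def correct_one_error(c):
    """Flip the unique position lying in exactly the failing circles."""
    failing = {i for i, circ in enumerate(CIRCLES)
               if sum(c[j] for j in circ) % 2 == 1}
    c = c[:]
    if failing:
        for pos in range(7):
            if {i for i, circ in enumerate(CIRCLES) if pos in circ} == failing:
                c[pos] ^= 1  # the corrupted bit is pinned down uniquely
                break
    return c

c = encode([1, 0, 1, 1])
c_tilde = c[:]
c_tilde[2] ^= 1                      # introduce one error
assert correct_one_error(c_tilde) == c
```

Note that the decoding rule is exactly the puzzle’s reasoning: the set of failing circles pins down the corrupted position.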
But this circle-based solution seems pretty ad hoc. To make it less so, we’ll introduce more definitions that formalize the solution and flesh out the four things we care about for ECCs mentioned before.
Hamming Distance
We first give the definition of the Hamming distance of two vectors $x, y\in \Sigma^n$: the number of positions in which they differ, $\Delta(x, y) = |\{i : x_i\ne y_i\}|$.
Then we can define the relative Hamming distance, the Hamming distance normalized by $n$: $\delta(x, y) = \Delta(x, y)/n$, the fraction of positions in which the two vectors differ.
Another useful definition is the minimum distance of a code, the minimum Hamming distance over all distinct pairs of codewords: $d(\mathcal{C}) = \min_{c\ne c'\in \mathcal{C}}\Delta(c, c')$. Jumping ahead, we will see that codes with large minimum distance can be used to correct errors.
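These three definitions translate directly into code; the following minimal Python sketch (the function names are mine, not from the lecture) computes all of them for a code given as a list of equal-length vectors.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Delta(x, y): number of positions in which x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def relative_distance(x, y):
    """delta(x, y): Hamming distance normalized by the length n."""
    return hamming_distance(x, y) / len(x)

def minimum_distance(code):
    """Minimum Hamming distance over all distinct pairs of codewords."""
    return min(hamming_distance(c1, c2) for c1, c2 in combinations(code, 2))

print(minimum_distance(["000", "110", "011"]))  # 2
```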
Claim:
The code in Example 3 has minimum distance 3.
If this claim is true, it explains why that code can correct one error, via the triangle inequality for the Hamming distance.
The minimum distance being 3 means that the Hamming distance $\Delta(c,c')\ge 3$ for every distinct pair $c, c'\in \mathcal{C}$.
As depicted in the following figure, suppose the corrupted codeword we received is $\tilde{c}$ and the correct codeword is $c$; let $c'\in \mathcal{C}$ be any other codeword.
We know that $\tilde{c}$ has exactly one error, so the Hamming distance $\Delta(\tilde{c}, c)=1$. The triangle inequality
$$
3\le \Delta(c,c')\le \Delta(c, \tilde{c}) + \Delta(\tilde{c}, c')
$$
then tells us that the Hamming distance $\Delta(\tilde{c}, c')\ge 2$ for every codeword $c'$ other than the correct codeword $c$.
Thus, the correct codeword $c\in \mathcal{C}$ is uniquely defined by “the one that is closest to $\tilde{c}$”.
Minimum Distance: Proxy for Robustness
The point of this discussion is that the minimum distance is a reasonable proxy for robustness.
To be more specific, the code in Example 2 had minimum distance 2 and could correct 1 erasure or detect 1 error.
In Example 3, the code had minimum distance 3 and could correct 1 error.
More generally, a code with minimum distance $d$ can:
- correct $\le d-1$ erasures
- detect $\le d-1$ errors
- correct $\lfloor \frac{d-1}2\rfloor$ errors
For points 1 and 3, the (inefficient) algorithm is “if you see $\tilde{c}$, return the $c\in \mathcal{C}$ that’s closest to $\tilde{c}$.”
For point 2, the (inefficient) algorithm is “if $\tilde{c}\notin \mathcal{C}$, say that something went wrong.”
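As a sketch, here are these two brute-force rules in Python (the erasure case is sketched after the discussion below). They enumerate all of $\mathcal{C}$, so they are exponential in $k$; they are meant only to make the rules precise, not to be efficient.

```python
def correct(code, c_tilde):
    """Nearest-codeword rule (points 1 and 3): return the codeword
    closest to c_tilde; unique if <= floor((d-1)/2) errors occurred."""
    dist = lambda x, y: sum(a != b for a, b in zip(x, y))
    return min(code, key=lambda c: dist(c, c_tilde))

def detect(code, c_tilde):
    """Detection rule (point 2): report an error whenever the received
    word is not a codeword; sound if <= d-1 errors occurred."""
    return c_tilde in code  # False means "something went wrong"
```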
To understand why the algorithm works, we look at the following picture:
Red points represent the codewords $c, c'\in \mathcal{C}$.
Orange circles correspond to the Hamming balls of radius $\lfloor \frac{d-1}{2} \rfloor$ centered at the codewords. These Hamming balls are disjoint because the minimum distance is $d$.
Green circles correspond to the Hamming balls of radius $d-1$ centered at the codewords. These Hamming balls are not disjoint, but each contains exactly one codeword.
The Hamming balls help explain the (inefficient) algorithms we mentioned before.
Correct $\lfloor \frac{d-1}2\rfloor$ errors:
If $c$ is the “correct” codeword (the left red point) and $\le \lfloor \frac{d-1}{2}\rfloor$ errors are introduced, we may end up with $\tilde{c}_1$ (the purple point) inside its orange circle. Since the orange circles are disjoint, the algorithm can recover the codeword $c$ from $\tilde{c}_1$.
Detect $d-1$ errors:
If $\le d-1$ errors are introduced, we may end up with $\tilde{c}_2$ (the pink point) inside a green circle. Now it’s possible that $\tilde{c}_2$ came from $c$ or that it came from $c'$; we can’t tell which. However, since each green circle contains exactly one codeword, every other codeword must live outside that ball, so we can at least tell that something went wrong.
Correct $d-1$ erasures:
If $\le d-1$ erasures are introduced in $\tilde{c}$, suppose the candidate codewords are the correct codeword $c$ and some other $c'$. Since the Hamming distance between $c$ and $c'$ is at least $d$, and at most $d-1$ positions were erased, there must be at least one non-erased position where $c$ and $c'$ differ. At that position $\tilde{c}$ agrees with $c$ but not with $c'$, which resolves the ambiguity and identifies the correct codeword $c$.
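This argument is again a brute-force search in code form; in the following sketch the convention of marking an erasure with `None` is mine, and the assertion mirrors the uniqueness argument above.

```python
def correct_erasures(code, c_tilde):
    """Return the unique codeword agreeing with c_tilde on every
    non-erased position; here an erasure is marked with None."""
    consistent = [c for c in code
                  if all(b is None or a == b for a, b in zip(c, c_tilde))]
    # With <= d-1 erasures, the argument above guarantees uniqueness.
    assert len(consistent) == 1
    return consistent[0]
```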
Returning to the things we care about, we can now clarify the first two:
Things We Care About (Take 2):
1. We should be able to handle $\lfloor \frac{d-1}{2}\rfloor$ worst-case errors or $d-1$ worst-case erasures.
2. We should be able to recover all of $x$ (aka correct the errors or erasures).
3. We want to MINIMIZE OVERHEAD.
   Jumping ahead, the relevant quantity is the rate $k/n\in (0, 1]$, which should be as big as possible.
4. We want to do all this EFFICIENTLY.
Moreover, the first two items can be combined into a single parameter, the minimum distance $d$, which we want to be as large as possible.
Rate
Moving on to point 3, what do we mean by “overhead”?
The rate of a code $\mathcal{C}\subseteq \Sigma^n$ is defined as
$$
\mathcal{R} = \frac{k}{n} = \frac{\log_{|\Sigma|}|\mathcal{C}|}{n}.
$$
This definition is in line with the encoding operation $\mathrm{ENC}: \Sigma^k\to \Sigma^n$,
which encodes a message $x$ of length $k$ over $\Sigma$ into a codeword $c\in \mathcal{C}$.
This map assigns each possible message to a distinct codeword, so we have $|\Sigma|^k=|\mathcal{C}|$, i.e. $k=\log_{|\Sigma|}|\mathcal{C}|$.
The rate captures the notion of overhead, so minimizing the overhead means maximizing the rate.
If $\mathcal{R}$ is close to $1$, there is not much overhead; if $\mathcal{R}$ is close to $0$, there is a lot. We want the rate to be as close to 1 as possible.
Now we can denote a code by these parameters: we say $\mathcal{C}$ is an $(n, k, d)_q$ code when it has block length $n$, message length $k$, minimum distance $d$, and alphabet size $q=|\Sigma|$.
After clarifying the things we care about, a question arises:
What is the best trade-off between rate and distance?
Hamming Bound: Rate vs. Distance
The Hamming bound gives one bound on the trade-off between rate and distance. It establishes limitations on how good this trade-off can be.
Let’s return to the picture we had before, with $|\mathcal{C}|$ disjoint Hamming balls of radius $\lfloor \frac{d-1}{2}\rfloor$ in the space $\Sigma^n$.
The idea of the Hamming bound is: there can’t be too many of them, or they wouldn’t all fit in the space $\Sigma^n$.
To be more precise, we define the Hamming ball of radius $e$ centered at $z\in \Sigma^n$ as the set $\{x\in \Sigma^n : \Delta(x, z)\le e\}$.
Say that $|\Sigma|=q$; then we can express the volume (i.e. the number of points) of a $q$-ary Hamming ball of radius $e$ as
$$
\mathrm{Vol}_q(e, n) = 1 + \binom{n}{1}\cdot (q-1) + \binom{n}{2}\cdot (q-1)^2 + \dots + \binom{n}{e}(q-1)^e
$$
where the term $1$ counts the center itself (e.g. $\vec{0}$ for the ball centered at the origin), the term $\binom{n}{1}\cdot (q-1)$ counts all elements of $\Sigma^n$ of weight 1, and so on up to weight $e$. Note that the volume is the same regardless of the center.
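As a quick Python sketch (using the standard-library `math.comb`), this volume is a one-liner:

```python
from math import comb

def vol(q, e, n):
    """Volume of a q-ary Hamming ball of radius e in Sigma^n."""
    return sum(comb(n, w) * (q - 1) ** w for w in range(e + 1))

print(vol(2, 1, 7))  # 8 -- used for Example 3 below
```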
Armed with these definitions, we can formalize the bound.
If a code $\mathcal{C}\subseteq \Sigma^n$ has distance $d$ and message length $k$, where $|\Sigma|=q$, we have
$$
|\mathcal{C}|\cdot \mathrm{Vol}_q\left(\left\lfloor \frac{d-1}{2}\right \rfloor, n\right) \le q^n
$$
The LHS is the total volume taken up by the disjoint Hamming balls, while the RHS is the volume of the whole space $\Sigma^n$.
Taking logs base $q$ of both sides, it can be written as
$$
\log_q |\mathcal{C}|+\log_q \left(\mathrm{Vol}_q\left(\left\lfloor \frac{d-1}{2}\right \rfloor, n\right)\right)\le n
$$
Since $k = \log_q|\mathcal{C}|$, dividing by $n$ gives a bound on the rate in terms of the distance:
$$
\mathcal{R} = \frac{k}{n} \le 1 - \frac{1}{n}\log_q \left(\mathrm{Vol}_q\left(\left\lfloor \frac{d-1}{2}\right\rfloor, n\right)\right).
$$
Back to Example 3, which was a $(n=7, k=4, d=3)_{q=2}$ code.
We have $\lfloor\frac{d-1}{2}\rfloor=1$, $\mathrm{Vol}_2(1, 7) = 1 + \binom{7}{1}\cdot 1 = 8$.
So we have rate bounded by $\frac k n \le 1 - \frac{\log_2(8)}{n} = 1 - \frac 3 7 = \frac 4 7$.
In fact, the rate $\frac k n= \frac 4 7$, so in this case the Hamming bound is tight!
This means the code achieves the best possible trade-off between minimum distance and rate.
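We can sanity-check this numerically; the short Python sketch below verifies that the $(7, 4, 3)_2$ parameters satisfy the Hamming bound with equality:

```python
from math import comb, log2

n, k, d, q = 7, 4, 3, 2
radius = (d - 1) // 2                                               # = 1
volume = sum(comb(n, w) * (q - 1) ** w for w in range(radius + 1))  # = 8
print(q ** k * volume == q ** n)  # True: the 16 balls exactly tile {0,1}^7
print(k / n, "<=", 1 - log2(volume) / n)                # 4/7 <= 4/7
```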
When the Hamming bound is tight, we say the code is perfect.
Example 3 (which is perfect) is a special case of something called a Hamming Code.