vel’s blog
vel’s blog
ABX of Lossless versus MP3 - Part 1 - Introduction
Friday, 17 July 2009
The word codec stands for ‘coder-decoder' and is a computer program that can compress and decompress digital audio to and from a certain audio file format. There are two main types of codecs that are differentiated by their compression quality - lossy and lossless. The difference between them lies in the way they encode audio from an uncompressed source (such as an audio CD).
Lossy codecs, such as MP3, AAC and WMA are currently the more popular ones due to their small file sizes. You see, what a lossy codec does is it takes the original audio data and compresses it in such a way that when it is decompressed again it differs from the original. An illusion of the original data is created by using the algorithm specific to that codec. Several examples of how lossy codecs cut down on so much data are outlined below:
1)Human ears have difficulty hearing a soft note after a loud one. Lossy codecs take advantage of this by discarding the quiet noises after loud ones.
2)Humans also have trouble hearing above 20,000Hz so a method of saving space is to simply cut everything off above a certain level (depending on how little the required bit rate is).
The result is a smaller file than the original data, but with a significant loss in quality depending on the bit rate. The bit rate is how many kilobits of data the file may use per second of audio (kbps) and MP3s are generally between the range of 128kbps and 320kbps. Obviously, more data per second means higher quality with greater detail and more retention of the original uncompressed sound, while on the lower spectrum you get a horrible sounding hollow sound that makes you cringe. Furthermore, there are different encoders for each codec as well, differing in their abilities of encoding at higher and lower bit rates. LAME (one of the most popular MP3 encoders) is an example of an encoder that works well with high bit rates but is said to struggle with lower bit rates. As far as codecs go, MP3 is actually fairly old, originating in the early 1990’s, and AAC, Ogg Vorbis and WMA Pro have surpassed it in technical terms. The final negative aspect of lossy codecs is a phenomenon known as generation loss. This means that any time the file is converted between codecs and bit rates, or simply resampled due to editing of some sort, more and more data will be lost. This means that having a collection of lossy music means you have less control over what you can do with it without losing quality.
Now, don’t get me wrong, lossy codecs have their place. For purposes of streaming internet radio, or VoIP, lossy codecs go a huge way to enable easy communication without incredibly fast and powerful mobile internet and devices. Also, filling your 1-8 Gb iPod with lossless music won’t get you very far. The question, however, is how much worse really are they when compared to the more robust lossless codecs? Will the future of increased internet speed and data storage remove the need for lossy codecs completely?
There are multiple lossless codecs but the two that are most popular are FLAC (.flac) and Apple Lossless (.m4a). Lossless codecs work by replicating the exact original data from the compressed data, like a .rar/.zip file. Here is how lossless audio is encoded based on the Monkey’s Audio lossless audio compressor website (found here):
Step 1 - Conversion to X/Y
Audio contains two channels of sound, left and right. There is a high correlation between the two channels, that is, when one goes up, the other usually does as well. Lossless audio encoders make use of this in the following way:
X = mean of the left and right values
Y = difference between the two channels
Example 1:
Left channel = 14600
Right channel = 15000
X = 14800
Y = 400
Example 2:
Left Channel = 5300
Right channel = 15200
X = 10250
Y = 9900
We can see in example 1 we have removed two digits, saving space. In example two, because the difference between the channels is quite high, we have only saved one digit.
Step 2 - Predictors
The smallest possible number of values should remain after this step for both the X and Y arrays while still being decompressible. This will have the effect of removing redundancy and the method of doing so is what separates one algorithm from another.
The way in which this works is that the next value is predicted based on the previous values. Thus, a residual value is each time formed based on the sample and predicted value, as seen below:
residual = sample - predictor(Last Sample(s))
and then to get back the original data when decoding, we use:
sample = residual + predictor(Last Sample(s))
The predictor relies on prediction equations (filters) to determine what the next predicted value will be. The simplest filter is called a delta filter and is simply that the predictor is the last sample.
Example 1:
Sample string: 5 8 2 1 3 8 2
Residuals: 3 -6 -1 2 5 -6 (8 - 5 = 3, 2 - 8 = -6, 1 - 2 = -1, etc)
Decoding this is easy knowing the first value is 5.
5 + 3 = 8 - 6 = 2 - 1 = 1, etc.
In this case we didn’t compress the sample string because of the single digits, but in a more complicated example we can see how this would work.
Example 2:
Sample string: 10312 10371 10395 10483 10593
Residuals: 59 24 88 110
Once again, knowing the initial value is 10312 we just add the residuals to obtain the samples.
10312 + 59 = 10371 + 24 = 10395, etc.
Usually, several passes are done through filters and more complicated equations are used. As you can see from example 2, it is much more efficient to encode 59, 24, 88 and 110 than the actual sample strings. Depending on the type of music, we can see that smoother transitions between samples (i.e. less sudden loud sounds) means lower residuals which means less data to encode leading to smaller file sizes. File sizes will be discussed in the next post for this topic.
Step 3 - Encoding of Data/Rice Coding
The final compressed string of numbers must be coded into binary to be stored. Digital audio commonly found on CD’s is 16-bit, meaning that each sample can have a range of 0-65536. I am assuming this would correspond to the frequency produced as humans can hear from 20-20,000Hz. The problem with storing numbers as binary is that you cannot just push them together without breaks in between. For example, the string of numbers 14 3 8 23 7 21 can be read as separate values, whereas 143823721 cannot be differentiated. One method of coding numbers in binary would involve coding for the exact number in binary and then putting in a break. This break would have to fall outside the range of possible values the numbers can possess to ensure an actual value is not mistaken for a break and vice versa. Another method would be coding each number one after the other regardless of how big it is in the same format, i.e. 140308230721 and then read two at a time the whole way through. The problem with this is that when you have such a large range as in music, coding for a lower-mid range number would waste a lot of space with unnecessary zeros. So, a method called “rice coding” is used in lossless audio encoders to fit the numbers together while differentiating the 20 from the 34000.
Now, it took me awhile to understand this as I don’t have much coding experience and I had a friend who does explain it to me, so hopefully I have not made too many mistakes... Essentially what happens is the compressed X and Y values from step 2 are looked at by the encoder and a ‘k’ value is generated based on the average size of the numbers. This k value is the number of bits the numbers take up in terms of decimal places. So, 8 for example is 1000 (4 bit) in binary, so k = 4. If we make our example 3 bit I will try to explain how this works.
Example 1:
We have a string of numbers needing encoding, 2 6 3 7 8 10 20 32 (keeping it small for simplicity). The encoder goes through and decides that the most efficient way of encoding these numbers is using a k value of 3, meaning 3 bits of data. This would probably be because half the numbers in this string can be encoded with 3 bits or less. Now, a quick primer on binary, read from right to left with the following positions corresponding to numbers:
0 0 0 0 0 0 0 0
128 64 32 16 8 4 2 1
This goes on and on depending on the number of bits (the above example is 8 bit). To make a number you simply put a 1 in any space you want to use. For instance, 10010110 = 150 (128 + 16 + 4 + 2).
So, returning to our string of numbers... The encoder first puts a 0 or 1 for negative/positive, this occurs always. Next, a number of zeros is added corresponding to the amount of overflow. Overflow is when a number will not fit into ‘k’ bit, such as the 20 which would be 10100 (5 bit) in binary. In this case, we take the overflow and convert it to numbers. For 20, we have two overflow of 10. 10 is 2 in binary so we would add two zeros to overflow. For 32 we have 100 to the left of the 3 bit which in binary is 4, so we would add four zeros. Next, we have a terminating 1 to show we are done with specifying the overflow quantity. Then, finally, we have the 3 bit that k has specified.
To show this visually...
1/0 00 1 000
+/- Overflow Termination k (3 bits in this example)
So let’s go through and encode our string of numbers.
2 = 11010
6 = 11110
3 = 11011
7 = 11111
8 = 101000
10 = 101010
20 = 1001100
32 = 100001000
Putting it together we have 110101111011011111111010001010101001100100001000. Pretty, isn’t it? I realize this example has the same amount of digits as coding the numbers in 6 bit back to back (48 digits), but it is just meant to showcase how it’s done. Assumably, after exiting the second step, the numbers will be large and variable enough that this method would save a significant amount of space. Note: I tried k values of 2 and 4 and these gave total digits of 51 for both.
The k value is apparently also adjusted as the encoder goes along to always be as efficient as possible. This occurs every 16-128 values. As you can see above, if k is too low, extras zeros are inserted into the overflow which wastes space. Likewise, if it is too high then too many zeros are inserted into the k part of the string. It does not say but I assume there is some way of telling the encoder the k has been changed to a new number.
The above description is based on information primarily from the “Monkey’s Audio” website (see references below) but was very similar to information from a separate source. I am sure different codecs use slightly different methods of encoding the information but for the most part what I have written is the steps that are followed.
I apologize for the length of this post. Originally I planned on making one post with my whole experiment and the introduction was just meant to be around 500 words to introduce the topic. When researching I got carried away and intrigued by how the encoder works so decided to describe what I found. Hopefully it is understandable, believe me, if you look elsewhere it is much more complicated and I would have never understood anything in step 3 had it not been for my friend Glen.
With all that out of the way, what will I be doing in this experiment? What I plan to do is to take some lossless samples that are meant to showcase the differences between FLAC and MP3 and I will convert them to varying bit rates and do an ABX test to see where I stop being able to tell the difference. The next post will be much shorter and will detail materials and methods that will be used in this little experiment as well as a short description of what an ABX test is.

Like what you’ve read here?
Why not discuss it on the vel.co.nz IRC channel?

References:
http://www.lossless-audio.com/theory.htm - Good information on prediction in lossless audio.
http://www.monkeysaudio.com/theory.html - This website provided almost all the lossless audio encoding information in this post.
http://mp3.radified.com/lossless.htm - Some information on lossy codecs and how they work.
Wikipedia - General information on codecs and such.
Copyright 2009 by www.vel.co.nz
Can I tell the difference between MP3 and Lossless quality audio samples?
Part 1 - Introduction to codecs, lossy vs lossless and the encoding process for lossless audio.