Audio Compression

Introduction

When I bought my 30GB iPod, I had to decide which audio compression scheme to use for encoding my 300 to 400 CDs. As I'm quite critical about sound quality, I wanted to be sure to choose the right one.

Searching the web shows that audible differences between encodings have already been reported a number of times (e.g. see Recordstorereview, Roberto's Pages, iPodlounge, AAC versus MP3 LAME). Because their observations differ a lot (and because I'm quite stubborn), I decided to conduct my own experiment. I'll explain my experiment and findings on this page. If you're only interested in the outcome, go to the overview table.

For more detailed background info about audio encodings, you could have a look at:

Source material and test equipment

First of all, I had to choose some source material that would expose the differences between encodings. For this, I've chosen the following music:

The music samples contain quite a lot of transient behaviour (harpsichord, guitars, castanets), are of a different nature (acoustic pop versus electronic pop versus classical music), have a different ambience (close-miked recordings versus an orchestral hall with a lot of room reverberation), and also have a different timbre.

I've converted all these tracks to the following formats, using iTunes and LAME:

  • AIFF: 1411 (uncompressed CD audio, as a reference file)
  • AAC iTunes: 96, 128, 160, 192, 224, 320.
  • MP3 iTunes: 96, 128, 128VBR, 160VBR, 192VBR, 320VBR.
  • MP3 LAME: 96, 128, 128VBR, 160VBR, 192VBR, 320VBR (using the LAME -h -v -V0 -b <bitrate> setting).

I've also checked with other music, but less extensively.

My main reference track described on this website is the Tori Amos track. In my experiment I've double-checked my findings with all the other tracks. I've encoded a 3 second 128 kbit/sec VBR MP3 file of this track, to give you an impression of what I'm talking about (for copyright reasons I keep the sample short and low quality).

I've listened to these tracks via different equipment, just to be sure I was not limited by a specific weak part of the audio chain. I've used the iPod (of course) through its headphone output, as well as through the line output via its docking station. I've used an Etymotic ER-4P, Sennheiser HD 497, Beyerdynamics 311, iPod earbuds, Sony MDR-E888LP earbuds (for a very short while), and AKG 340 electrostatic. I've also burned a CD with a decoded version of all the files, so that I could also listen to the differences on any common CD player, in my case a TEAC VDRS-10 CD player, a self-made DAC, my Audio Innovations Classic 25 Tube amp, and self-built Audio Note E speakers.

Learn about encoder artifacts

First of all, I listened to all the different tracks to get an idea about the differences. How can these differences be characterized, and how prominently do they show up at different bit rates?

Secondly, I've subtracted the encoded file from the original file, to get an idea of what the encoding leaves out. The procedure works as follows. First, I copied an audio file from a CD to my Desktop. Then I encoded this file three times, with the three different quality options provided by Quicktime, and once more at the same bitrate using iTunes. Next, I converted all these files back to AIFF files using iTunes. Finally, I used Sound Studio to subtract two different sound files, by first inverting the phase of one of the two, and then mixing the two files with a special paste command called "Paste Mixed...".
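The "invert the phase, then mix" step can be sketched in a few lines of Python. This is only an illustration of the principle, not what Sound Studio does internally: the function name is made up, the samples are assumed to be 16-bit integers that are already decoded and time-aligned, and clipping on overflow is an assumption as well.

```python
def subtract_signals(original, decoded, sample_min=-32768, sample_max=32767):
    """Return original - decoded, computed as 'invert one file, then mix'."""
    if len(original) != len(decoded):
        raise ValueError("both signals must have the same number of samples")
    # Step 1: invert the phase of one of the two files.
    inverted = [-s for s in decoded]
    # Step 2: mix (sum) the two files sample by sample.
    mixed = [a + b for a, b in zip(original, inverted)]
    # Assumption: overflow is clipped to the 16-bit range.
    return [max(sample_min, min(sample_max, s)) for s in mixed]


original = [0, 1000, -2000, 3000]
decoded = [0, 990, -2010, 3000]
print(subtract_signals(original, decoded))  # [0, 10, 10, 0]
```

Whatever is left after the subtraction is what the encoder changed; identical files give pure silence.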

You can find some of the results here (for bandwidth and disk space reasons, I've encoded the files as 128 kbit/sec VBR MP3s, so they are less accurate than the AIFF comparisons, but they give a good idea):

I haven't scaled the volume or anything like that, so the differences are absolute differences, w.r.t. the volume of the original files. As you can hear, the differences are quite revealing. Psycho-acoustic models of course say that we cannot hear most of this stuff, because it's being masked, so it's not only the magnitude, but also the type of noise that matters (as explained by JohnV in a thread on 3ivx):

  • Remember, that when you substract the original from the encoded, you get the total quantization noise.

    Total quantization noise = audible quantization noise + inaudible (masked) quantization noise.

    In sound quality when talking about lossy audio what counts is the amount and *type* of the audible quantization noise. Codecs try to maximize inaudible quantization noise and minimize audible quantization noise. When you listen to the total quantization noise, you of course hear also the supposed to be masked quantization noise. The problem is that you can't say 100% precisely what would be masked and what's not. Automatically assuming that the "quieter is better", may not always be accurate, because the type and position of the noise counts a lot too.

    Thus, it's generally accepted fact that the substraction method is not very accurate especially if you are comparing total noises of different codecs using about same bitrates.

This is illustrated by the following experiment, subtracting an AAC 320 kbit/sec and an AAC 224 kbit/sec encoding:

This example shows that the difference between bit rates is not only a matter of magnitude: a "different type" of encoding is used at other bit rates.
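If you want to put a number on the magnitude of such a difference signal, one option is its RMS level in dB relative to full scale. The sketch below assumes 16-bit integer samples; the function name and the two example residuals are made up for illustration. Keep the warning from the quote above in mind: a lower number only says the total noise is quieter, not that it is less audible.

```python
import math


def rms_dbfs(samples, full_scale=32768):
    """RMS level of integer PCM samples in dB relative to full scale."""
    if not samples:
        raise ValueError("need at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / (full_scale * full_scale))


# Hypothetical residuals, not measured data.
quiet_residual = [33, -33, 33, -33]
loud_residual = [328, -328, 328, -328]
print(round(rms_dbfs(quiet_residual), 1))  # -59.9
print(round(rms_dbfs(loud_residual), 1))   # -40.0
```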

Another experiment I did was to re-encode a 224 kbit/sec AAC file to a 160 kbit/sec AAC file, and compare it to a 160 kbit/sec AAC file that has been encoded directly from an AIFF file. This also results in differences when subtracting the two files. Hence this shows that encoding from an AIFF source is different from re-encoding from an AAC file.

In this way you can also find out that iTunes AAC encoding uses either the Best or Better encoding of Quicktime, and NOT the Good encoding. iTunes uses Quicktime for performing AAC encoding. Quicktime Pro 6.3 users can choose between different quality settings, Good - Better - Best (File->Export->Movie to MPEG-4->Options... -> Audio Tab), as shown in the following screen dump:

As mentioned by somebody on 3ivxforum, it is very likely that iTunes uses the "Better" setting:

  • I've been having a very fruitful e-mail swap with Apple's head AAC developer,  Mr. Stanley Kuo. Here's his clarification on this specific subject:

    iTunes does use the SAME encoder as QT, and iTunes does give the user the ability to change the encoder's mode. iTunes uses the "Better" quality mode (not the fastest) which is optimized to perform best with 16bit source material (ie. CD source). "Best" quality mode is targetted at 24bit source material (eg. DVD source) and there should be NO discernible difference in quality for CD source between these two modes of the encoder. Generally it's just a waste of CPU to use best quality for 16 bit source, but there's no harm done by doing this of course.

It is therefore questionable to use programs like AAChoo for the sole purpose of improved sound quality. It also shows that some listening results, like those shown on RecordStoreReview, are quite debatable, because they claim an audible difference, or don't accurately describe the exact encoding environment, which might have been different from yours.

You can also find out that Quicktime 6.2, 6.3 and 6.4 all have different AAC encodings; for example, see the subtraction of:

  • 224 kbit/sec encoded in QT 6.4 - decoded to AIFF in QT 6.4
  • 224 kbit/sec encoded in QT 6.3 - decoded to AIFF in QT 6.4

In the following screen dump:

Performing subtraction of MP3 encoded files didn't work, because there seems to be a slight time offset in the MP3 file, and I was not patient enough to find out the size of the offset.
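For the patient reader, such an offset could in principle be estimated by sliding one file against the other and looking for the best match (a brute-force cross-correlation). The following is only a sketch and not something used for this page: the function name and the search range are assumptions, only positive delays are searched, and on real audio you would run it on a short excerpt to keep it fast.

```python
def estimate_offset(reference, delayed, max_lag):
    """Return the lag (0..max_lag samples) at which `delayed` best matches `reference`."""
    # Use the same window length for every lag so the scores are comparable.
    window = min(len(reference), len(delayed) - max_lag)
    if window <= 0:
        raise ValueError("signals are too short for this max_lag")
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlation score: sum of products of the aligned samples.
        score = sum(reference[i] * delayed[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


reference = [0, 5, -3, 8, -7, 2, 6, -4]
delayed = [0, 0, 0] + reference  # same signal, starting 3 samples late
print(estimate_offset(reference, delayed, max_lag=5))  # 3
```

Once the lag is known, the delayed file can be trimmed by that many samples before doing the subtraction.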

Anyway, after these experiments I got a better feeling for what I should focus on to describe the differences, which I would like to summarize (for the Tori Amos track) as follows:

  • The velocity and timbre of the voice are sharper (for better encodings).
  • The transients of the harpsichord are better defined (for better encodings).
  • The decay and timbre of the harpsichord are not disturbed by the attack of new harpsichord notes (for better encodings).
  • The timbre of the harpsichord is 'thin' (for better encodings).
  • There is better separation of the voice from the harpsichord (for better encodings).
  • There is better pace and rhythm, and less fuzziness (for better encodings).
  • The timbre of the harpsichord is 'fat' (for worse encodings).
  • The overall sound is more 'granular' (for worse encodings).

Please keep in mind that these observations are personal, and performed under less than ideal conditions for a scientifically justified conclusion (i.e., a double blind test), although I used the iPod or CD player shuffle function a lot to add a "fair amount" of blindness to the tests.

Especially in the beginning of my tests I guessed wrong a couple of times. This improved a lot over time, which means there is a learning effect. I still make a mistake every now and then, which might have to do with fatigue, because doing such a test intensively takes quite some energy.
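To judge whether a number of correct guesses in such a shuffled test is better than chance, you can compute how likely that score would be by pure guessing. This is a minimal sketch using a binomial model with two equally likely choices per trial; the function name and the example score of 12 out of 16 are made up and are not results from my own listening.

```python
from math import comb


def chance_of_at_least(correct, trials, p_guess=0.5):
    """Probability of getting `correct` or more right out of `trials` by pure guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical score: 12 correct identifications in 16 shuffled trials.
print(round(chance_of_at_least(12, 16), 3))  # 0.038
```

The smaller this probability, the less likely it is that the score was just luck.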

My iTunes MP3 impressions

The 96 kbit/sec (file size 508KB for 42 seconds) is a joke. The artifacts are huge, as if the harpsichord is constantly pushed under water (heavy tremolo). This encoding is not useful for music.

The 128 kbit/sec (file size 676KB for 42 seconds) is also very bad. The artifacts are still huge, and the transients of the harpsichord very clearly interfere with the voice.

The 128 kbit/sec VBR (file size 760KB for 42 seconds) is a bit better, though still very fuzzy. Although the artifacts are still very, very big, the transients of the harpsichord don't interfere with the voice that heavily anymore. For less demanding music, this encoding is on the edge of usable.

The 160 kbit/sec VBR still contains a significant amount of gurgling. Not usable.

The 192 kbit/sec VBR (file size 1MB for 42 seconds) is just acceptable. There are still artifacts, which are especially obvious with background sounds (decay of harpsichord tones). The transients are still not convincing, and the voice seems to be merged with the harpsichord. The tapping of Tori's feet (on the pedals) results in loud "bassy" plops (which cannot be heard in the AIFF file). The timbre of the harpsichord is "fat".

The 320 kbit/sec VBR (file size 1.6MB for 42 seconds) is quite acceptable, but still shows some artifacts on the transients of the harpsichord, and loud plops due to the tapping of the feet/pedals.
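The file sizes quoted above can be checked against the nominal bit rates with some simple arithmetic. The sketch below assumes that 1 KB is 1000 bytes, that 1 kbit is 1000 bits, and that the excerpt is exactly 42 seconds; the function names are made up. The few KB of difference with the reported sizes is presumably file header and tag overhead, but that is a guess.

```python
def cbr_size_bytes(bitrate_kbit, seconds):
    """Size of the audio data at a constant bit rate (1 kbit = 1000 bits)."""
    return bitrate_kbit * 1000 * seconds / 8


def average_bitrate_kbit(size_bytes, seconds):
    """Average bit rate implied by a file size and a duration."""
    return size_bytes * 8 / seconds / 1000


print(cbr_size_bytes(96, 42))    # 504000.0 bytes, close to the reported 508KB
print(cbr_size_bytes(128, 42))   # 672000.0 bytes, close to the reported 676KB
# The "128 kbit/sec VBR" file of 760KB averages noticeably more than 128:
print(round(average_bitrate_kbit(760_000, 42)))  # 145
```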

As you can guess, (iTunes) MP3 encodings are not my cup of tea. There is also a bug in the playback of MP3 files on the iPod. Hans Erik Hazelhorst informed me about severe gurgling in some situations (as is clear from this piano sample). It only shows up on the iPod, not when played via iTunes on the computer. Another sample I've tried, Talula by Tori Amos, even shows severe pops and clicks during playback on the iPod. This does not depend on the encoder: it happens with both LAME and iTunes. I've also tried the tracks on a 4G iPod, and these are artifact free!!! That basically means the older iPod models are badly supported when it comes to decoder updates, which I think is a VERY bad thing. The AAC files are free of these severe artifacts. Even transcoding these MP3 files to AAC files makes them more acceptable, even though a lot of quality is lost when encoding twice.

My LAME MP3 impressions

The 96 kbit/sec is very different from the iTunes one. Treble is filtered, making the artifacts less prominent, though the whole sounds quite dull. It sounds more acceptable than the iTunes MP3 version, though it still sounds bad overall.

The 128 kbit/sec has its treble back. Overall, it has less gurgling/tremolo than the iTunes version. Although the artifacts are quite obvious, it is getting towards listenable.

The 128 kbit/sec VBR has the same file size as the LAME 192 kbit/sec VBR file, and sounds indistinguishable from it. I suspect the "VBR" of LAME to be VERY variable, so it is not really fair to compare. I therefore excluded it from the table.

The 160 kbit/sec is better, and comparable to the 160 kbit/sec VBR iTunes one. Though it shows fewer artifacts than the 128 kbit/sec AAC, it lacks accuracy.

The 192 kbit/sec VBR is comparable to the iTunes 192 kbit/sec MP3.

The 320 kbit/sec VBR is also comparable to the iTunes 320 kbit/sec MP3.

My iTunes AAC impressions

The 96 kbit/sec (file size 556KB for 42 seconds) shows large artifacts on the harpsichord, and is unstable (tremolo). The timbre in the upper frequency range is coloured.

The 128 kbit/sec (file size 724KB for 42 seconds) is grainy, the dynamics are not very good, the voice is merged into the rest of the music, and the music sounds "busy" or "stressed".

The 160 kbit/sec (file size 892KB for 42 seconds) already sounds quite acceptable. There is still a level of grain on the harpsichord, and the voice is a bit less dynamic than in the 192 kbit/sec version, but I could imagine that this encoding will work fine for less demanding music.

The 192 kbit/sec (file size 1MB for 42 seconds) is OK. The transients are still a bit fuzzy, the voice is still merged into the music, and the harpsichord is a bit fat compared to the 224 kbit/sec.

The 224 kbit/sec (file size 1.1MB for 42 seconds) is quite good. This is the first encoding where the harpsichord sounds "thin" like in the original, the voice is separated from the harpsichord, the tonal balance seems to be OK, and the transients of the harpsichord are fierce. There are still minor audible differences compared to the original AIFF file, especially w.r.t. the transients of the harpsichord and the strength of the treble of the voice.

The 320 kbit/sec (file size 1.6MB for 42 seconds) is almost the same as the AIFF. The AIFF seems to sound a fraction more 'peaceful' and 'thin'. The AAC has a bit of 'grain' over the file. This is more obvious on the iPod; the same track listened to via iTunes is less grainy.

A comparison I cannot make is between iTunes Music Store files and my own encoded files. Because I live in Europe, I cannot buy from the iTunes store. Long live the record companies, with their local protective regulations.

There are also some tests around that compare the Quicktime AAC encoder with others, like the comparison by Roberto Amorim.

There is no VBR option for AAC files. Ivan Dimkovic sheds some light on this in his message on Hydrogenaudio:

  • AAC is always variable bit rate with following rules:

    1. Maximum number of bits per one frame is in range from 0 to 6144, multiplied by the number of channels

    2. If encoder uses bit reservoir of exactly specified and defined size (by formulae in the standard), approx 10000 bits for 128 kbps, 44.1 kHz, Stereo - in that case, encoded stream is CBR and it follows the ISO 13818-7/14496-3 buffer guidelines.

    3. If encoder uses bit reservoir of 0 bits (no bit reservoir), in that case files are perfect CBR and this is used only in LD (Low Delay) AAC to minimize pre-buffering delay when decoding of streamed content. Not recommended in any way except for the said application (two-way low delay communication)..

    4. If encoder uses much bigger bit reservoir than one defined by the standard in that case it is ABR

    5. If encoder does not care about bit reservoir and encodes only according to the psychoacoustic rules it is VBR


    Usual methods are 2 and 5 and in some cases 4

For more background on AAC (and other encoding technologies), see the Fraunhofer pages.
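To get a feeling for the numbers in rule 1 of the quote above, the per-frame limit can be converted into bit rates. This sketch assumes the usual AAC frame length of 1024 samples per channel, which is not stated in the quote; the function names are made up.

```python
SAMPLES_PER_FRAME = 1024           # assumption: standard AAC frame length
MAX_BITS_PER_CHANNEL_FRAME = 6144  # from rule 1 in the quote


def frames_per_second(sample_rate_hz):
    return sample_rate_hz / SAMPLES_PER_FRAME


def average_bits_per_frame(bitrate_bps, sample_rate_hz):
    """Bits available per frame, on average, at a given overall bit rate."""
    return bitrate_bps / frames_per_second(sample_rate_hz)


def peak_bitrate_bps(channels, sample_rate_hz):
    """Bit rate if every frame used the maximum number of bits allowed."""
    return MAX_BITS_PER_CHANNEL_FRAME * channels * frames_per_second(sample_rate_hz)


print(round(average_bits_per_frame(128_000, 44_100)))  # 2972
print(round(peak_bitrate_bps(2, 44_100)))              # 529200
```

So at 128 kbit/sec, 44.1 kHz stereo, a frame gets just under 3000 bits on average, while a single frame may use up to 12288 bits. That headroom is what the bit reservoir draws on.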

MP3 and AAC compared

As you might expect from my notes, I prefer AAC over MP3. Although in some cases MP3 might sound a bit "warmer", it lacks the accuracy provided by AAC.

I've compared AAC and MP3, by defining playlists on my iPod containing files which have about the same quality (which is a personal opinion for sure), so that I could quickly compare them. I came to the following classification, represented graphically as follows:

It can be concluded from this figure that I agree with Apple's statement about AAC being generally better than MP3.

Conclusions

Although my experiment contains personal elements, I hope it is useful for you as well. It might help you in setting up experiments for selecting your favourite encoding type and bit rate.

For me it resulted in selecting AAC 224 kbit/sec as my default format, as this is the encoding which provides good sound quality and still results in acceptable file sizes (at least for my 30GB iPod). Although AAC 224 kbit/sec still shows some minor artifacts, you have to compare it with the original, using critical source material, in order to recognize them, though for the Tori Amos track I can recognize them immediately. The AAC 320 kbit/sec encoding results in much larger files for just a little bit more quality, which was my reason not to select it. For less critical material, I use the 160 kbit/sec AAC encoding, just to gain some space on my iPod. For critical material (piano, lots of cymbals, music close to hard rock with continuously distorted guitars), 160 kbit/sec results in obvious flanging or tremolo.
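The trade-off between bit rate and space can be made concrete with a small calculation. This is a rough sketch: it assumes 1 GB is 10^9 bytes and that the whole 30GB is available for music, which in practice it is not; the function name is made up.

```python
def hours_of_music(capacity_gb, bitrate_kbit):
    """Hours of audio that fit in `capacity_gb` (1 GB = 10**9 bytes) at a given bit rate."""
    capacity_bits = capacity_gb * 10**9 * 8
    seconds = capacity_bits / (bitrate_kbit * 1000)
    return seconds / 3600


for rate in (160, 224, 320):
    print(rate, round(hours_of_music(30, rate)))
# 160 417
# 224 298
# 320 208
```

At 224 kbit/sec that is roughly 300 hours, against about 200 hours at 320 kbit/sec.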

© Copyright 2003, Marc Heijligers