PA8W Amateur Radio

Wil, PA8W,  E-mail:           

Basic information on speech

When working SSB, all the information we try to transfer to the other side is contained in the spoken word.
So, if we want to transmit the best possible SSB audio, we have to understand exactly how this information is embedded in human speech.

For proper analysis, we will have to look at speech audio in two important domains:

Dynamic properties,
and how our equipment handles them.

Spectral properties,
and again, how our equipment handles them.

The envelope:

The above graph shows the dynamic properties of normal speech over a few seconds. It is obvious that the power varies enormously over time. Unmodified, natural speech shows level variations of more than 25dB. The highest peaks are the most challenging for our transmitter; they will easily drive the power stage beyond the limits of linear operation, producing heavy distortion (and splatter). The following audio samples illustrate this:

sound/badaudio1.mp3    sound/badaudio2.mp3
sound/badaudio3.mp3    sound/badaudio4.mp3

Now what causes the loss of intelligibility when a voice signal is clipping?
Well, remember that when we speak, air flows past our vocal folds, causing them to vibrate.
This vibration is not sinusoidal; it contains lots of harmonics above the fundamental frequency. This wide spectrum of the fundamental plus its harmonics arrives in our mouth, which acts as a cavity resonator, tuned by the tongue, the lips, and the position of the jaws. This tunable resonator can emphasize or reduce certain harmonics in the spectrum, and it is therefore the mechanism behind our sonants (vowels) like the A, E, O, U etc. The relative level of the available harmonics determines whether we hear an O or an E, or another sonant. This is a highly simplified explanation, but good enough to make the following point: a sonant is identified by the relationship between its harmonics.
However, when we overload a transmitter stage so that the peaks in the audio signal are clipped, new harmonics of the clipped frequency are generated, masking the information that was contained in the original harmonic relationship.
This also means that the lower the clipped frequency, the more artificial harmonics fall inside the audio bandwidth, and the more damage is done. And since the lower frequencies in our natural voice are the most powerful, this poses a serious threat to our intelligibility!
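
The effect of clipping can be illustrated with a small Python sketch. It is not a model of any real transmitter: the sample rate, the 100Hz test tone, the clip level of 0.5 and all function names are assumptions chosen for illustration only. It clips a pure sine wave and measures how much energy shows up at frequencies that were not present before.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def tone(freq_hz, seconds=1.0):
    """Unit-amplitude sine wave as a list of samples."""
    count = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(count)]

def hard_clip(samples, limit):
    """Cut off everything beyond +/- limit, like an overdriven stage."""
    return [max(-limit, min(limit, s)) for s in samples]

def amplitude_at(samples, freq_hz):
    """Amplitude of a single frequency component (one-bin DFT)."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n

clean = tone(100)                # pure 100Hz tone, no harmonics
clipped = hard_clip(clean, 0.5)  # same tone with its peaks cut off

# Compare the fundamental and the first few odd harmonics.
for freq in (100, 300, 500, 700):
    print(freq, round(amplitude_at(clean, freq), 3),
          round(amplitude_at(clipped, freq), 3))
```

The clean tone only has energy at 100Hz, while the clipped tone also shows components at 300Hz, 500Hz and so on. Symmetrical clipping like this produces odd harmonics, and for a 100Hz tone many of them land inside an SSB passband that runs up to about 2800Hz. For a clipped 1000Hz tone the first odd harmonic is at 3000Hz, already outside that passband.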

A way to deal with this threat is limiting the signal: a fast-acting device reduces gain quickly when overload is imminent, holding back the excessive peaks before they can do any damage.
There are several different technical approaches to accomplish that.
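
As a rough illustration of one such approach (an assumption for this sketch, not a description of any particular rig), a limiter can track the signal envelope and turn the gain down just enough to keep the output under a ceiling. The function name `limit` and the parameters `ceiling` and `release` are hypothetical.

```python
def limit(samples, ceiling=1.0, release=0.999):
    """Simple peak limiter: instant attack, slow exponential release.

    The envelope jumps up immediately with each peak and decays slowly
    afterwards, so the gain changes smoothly instead of chopping the
    waveform the way clipping does.
    """
    envelope = 0.0
    out = []
    for s in samples:
        # Follow peaks at once, let the envelope fall back gradually.
        envelope = max(abs(s), envelope * release)
        # Reduce gain only when the envelope exceeds the ceiling.
        gain = 1.0 if envelope <= ceiling else ceiling / envelope
        out.append(s * gain)
    return out

burst = [0.2, 0.5, 2.0, 1.5, -3.0, 0.4]
print(limit(burst))
```

Because the envelope is never smaller than the current sample, the output can never exceed the ceiling. Samples that arrive before any overload pass through unchanged.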

A second technique, very close to limiting, is compressing the audio signal: a device increases gain for the lower-level signals and reduces gain for the higher signal levels, thus smoothing the dynamics and squeezing the signal into a much smaller dynamic window. This is of great value to our radio connection with its small dynamic range; the loud parts are held down to levels which the power stage can handle without distortion, and the low-level parts of our speech are amplified above the noisy background of the connection.
In this way, one could safely reduce the needed dynamic window to 12dB, or even 6dB for tough DX work. 
Otherwise, we would need about 25dB of headroom to be sure that the total dynamics of our speech is audible at the other side.
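
The arithmetic behind squeezing 25dB into 12dB or 6dB can be sketched as a static compression curve. This is a simplified illustration: it assumes levels are expressed in dB relative to the speech peak (0dB = peak) and that the compression is linear in dB over the whole range. The function name and parameters are hypothetical.

```python
def compress_db(level_db, window_in_db=25.0, window_out_db=12.0):
    """Map a level in dB (0 = peak) into a narrower dynamic window.

    The peak stays at 0dB; everything below it is pulled up towards
    the peak by the compression ratio window_in_db / window_out_db.
    """
    ratio = window_in_db / window_out_db
    return level_db / ratio

# The quietest part of 25dB-wide speech ends up 12dB below the peak,
# so it has effectively been lifted by 13dB.
print(round(compress_db(-25.0), 1))
# For tough DX work, the same speech squeezed into a 6dB window.
print(round(compress_db(-25.0, window_out_db=6.0), 1))
```

Under these assumptions, a 12dB window corresponds to a compression ratio of roughly 2:1 and a 6dB window to roughly 4:1.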
OM Wim, PA0WV, did some research that really shows the importance of compression:
He digitally recorded about 30 seconds of uncompressed speech (alpha, bravo, and so on until zulu, then 1234567890, in a quick sequence without pauses). Then he calculated the maximum power peak and the average power.
It turned out that the average power was between 1% and 4% of the peak value!
A good compressor (often called a processor), if well adjusted, will immensely improve your station's readability!
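
To put those percentages on the dB scale used elsewhere on this page, here is a small sketch. The function names are made up for illustration, and the short sample list is only a toy signal, not PA0WV's recording.

```python
import math

def power_ratio_db(ratio):
    """Convert a power ratio (e.g. average/peak) to decibels."""
    return 10 * math.log10(ratio)

def peak_to_average_db(samples):
    """Peak-to-average power ratio of a list of audio samples, in dB."""
    powers = [s * s for s in samples]
    peak = max(powers)
    average = sum(powers) / len(powers)
    return 10 * math.log10(peak / average)

print(round(power_ratio_db(0.01), 1))  # average power at 1% of peak
print(round(power_ratio_db(0.04), 1))  # average power at 4% of peak
print(round(peak_to_average_db([1.0, 0.0, 0.0, 0.0]), 1))  # toy signal
```

An average of 1% to 4% of the peak power means the average sits roughly 14 to 20dB below the peaks, so without compression the transmitter spends most of its time far below the power it is capable of.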

The frequency spectrum:

The above graph shows how the energy in our natural speech is distributed over the frequency spectrum. You see that lots of power is needed for the lower frequencies, up to 700Hz. This part of the spectrum is the most energy-consuming for your transmitter, and will easily overdrive the power stage, as you can hear in the audio samples above.
In fact, if the above curve is fed unmodified to your transmitter, your transmitter's limiter will be activated mainly by this lower region, pulling down the more important higher frequencies as well!
Or, when your limiter is not adjusted properly, the lower frequencies will clip quite happily, polluting your audio with artificial harmonics.

If we investigate which frequencies are the most important ones for intelligibility, we will find that higher frequencies show increasing importance!
A man's voice fundamental frequency normally is between 60 and 120Hz. A woman's voice is pitched about one octave higher, 120 to 240Hz.
The fundamental frequency and its first few harmonics, up to about 250Hz for a man's voice, carry barely any intelligibility at all.

The part above 1000Hz contains most of the speech intelligibility, but there the natural energy level drops by around 10dB, and drops even more towards the 2800Hz mark. The 1000Hz-2800Hz region is of the utmost importance for the readability of speech in an SSB chain.

So, could the low region be considered useless?
Yes and no; cut off everything below 700Hz and your speech will still be perfectly readable.
But the lower region does add some flavour to the cake.
Without the lower region, speech sounds metallic, crispy, unnatural, and really unpleasant, like these samples:

  sound/highpitched.mp3   sound/highpitched2.mp3

When we only slightly reduce the low region, however, speech will maintain enough body to sound natural and pleasant, and be very easy to copy, even over longer periods of time. The way to go is a reduced low region (approx. -10dB) and a lifted high end (approx. +6dB), for two good reasons:
1. It will waste less transmitter energy and reduce the risk of overload.
2. It will emphasize speech intelligibility.
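
To get a feeling for what these dB figures mean, the sketch below converts them to plain factors and models the equalisation as a simple curve. The corner frequencies of 700Hz and 1000Hz and the straight-line transition between them are assumptions made for this example only; the right curve depends on your own voice and equipment. All function names are hypothetical.

```python
def db_to_power(gain_db):
    """Power factor corresponding to a gain in dB."""
    return 10 ** (gain_db / 10)

def db_to_amplitude(gain_db):
    """Voltage (amplitude) factor corresponding to a gain in dB."""
    return 10 ** (gain_db / 20)

def eq_gain_db(freq_hz, low_corner_hz=700.0, high_corner_hz=1000.0,
               low_gain_db=-10.0, high_gain_db=6.0):
    """Assumed EQ curve: cut below the low corner, boost above the
    high corner, straight-line transition (in dB) in between."""
    if freq_hz <= low_corner_hz:
        return low_gain_db
    if freq_hz >= high_corner_hz:
        return high_gain_db
    fraction = (freq_hz - low_corner_hz) / (high_corner_hz - low_corner_hz)
    return low_gain_db + fraction * (high_gain_db - low_gain_db)

print(db_to_power(-10.0))                # power left in the low region
print(round(db_to_amplitude(6.0), 2))    # voltage factor for the high end
for freq in (300, 850, 2000):
    print(freq, eq_gain_db(freq))
```

A cut of 10dB leaves only a tenth of the original power in the low region, while a 6dB lift roughly doubles the audio voltage of the high end.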

Overall, your intelligibility and your dynamics are served best if your audio contains no dominant frequencies and no notches. In other words, in average speech, all frequencies from 300Hz to 2800Hz should show about equal energy density. This is not very easy to judge by the untrained ear, but it can simply be tested with a spectrogram (see the other items in the audio menu).

Note that "energy uniformly and evenly distributed over the entire SSB bandwidth" will in most cases mean something entirely different from a flat frequency response of the audio chain. An honest, flat frequency response is what you need to reproduce HiFi music, so that everything from the bass guitar up to the triangle finds its place in the spectrum without any colouration.
(Achieving that at very high sound levels is what I do for a living...)

Hams, on the contrary, try to communicate through a lousy environment, so that's a totally different game.
Which frequency response is needed for a uniform energy distribution depends very much on your voice, your microphone technique, the mike itself, etc.

So there's no off-the-shelf solution, but that's where it really gets interesting, isn't it?