Composers' Desktop Project
AMPLITUDE the loudness level of a sound (i.e., of a sample or a frequency component)
A waveform is usually represented digitally by a graph of its instantaneous amplitude against time. The gradual change of the instantaneous amplitude, from sample to sample, maps out the waveform of the sound. If we now take the average sample height over a largish group of samples, we can trace out how the loudness, or amplitude, of the sound changes over time. Note that we have to take the absolute value, ignoring the minus signs (all minuses become plus), or else the average will always be close to zero.
Amplitude refers to the overall energy level of the waveform: the perceived loudness of a sound, the power driving a loudspeaker, the volume of playback. It is often represented graphically by locating the amplitude value for each digital sample on a vertical scale, such as from -32768 (minimum, virtual silence) to +32767 (maximum, full volume, 'normalised' sound). This scale represents the dynamic range available for 16-bit samples and is used by Csound and the CDP 4.5 System.
CDP Release 5 will handle a wide variety of soundfile sample types and formats. Regarding amplitude, 24-bit soundfiles increase the dynamic range to -8388608 to +8388607. However, in Release 5 all types of amplitude measure will be converted internally to a -1.0 to +1.0 range. This is the range with which the user will work. Most internal calculation in the CDP software has always been done in this way as it is more accurate, and it allows the software to handle sound from different file formats in an identical way. The floating point format option gives the highest precision, handles streaming, and doesn't clip overmodulated signals, which means that they can be normalised without distortion occurring.
Another common measure of amplitude is Decibels ('dB'). This relates to the 'intensity' of the sound and measures the ratio of two values. Because intensity is proportional to the square of the amplitude, the Decibel range is logarithmic: every change of 6dB halves or doubles the amplitude. This range, used both in the analogue studio and in computer software, goes from -96dB (minimum, virtual silence) to 0.0 dB (maximum, full volume, 'normalised' sound). This Decibel range can be used in CDP's MIX program, as can expression of relative amplitude levels in terms of a gain_factor. Thus in SUBMIX MIX you could state the relative level of a sound as -6dB or 0.5.
Somewhat confusingly, the Decibel range is also described as going from 0dB (virtual silence) to 120dB, the threshold of pain. The relative levels of different types of sound, such as the rustle of leaves, conversational tones, or airplanes, are often given in this way, as on p. 66 of David Sonnenschein's excellent book, Sound Design.
Changing amplitude levels can also be done with a multiplier. This is called a gain_factor. Thus 1.0 makes no change, while 0.5 halves the amplitude and is therefore equal to -6dB. The Chart of Gain dB Correlations maps out the relationships between the gain_factor, the Decibel and the 32767 measurement scales. The Music Calculator in Sound Loom does these gain/dB calculations. You can also access this facility in the music section of COLUMNS in Soundshaper or on the command line.
A detailed discussion of these matters is given in Computer Music by Dodge & Jerse, pp. 19-25, to which the above summary is indebted.
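The gain_factor/Decibel correspondence described above can be sketched in a few lines of Python (the function names here are illustrative, not part of any CDP program):

```python
import math

def gain_to_db(gain):
    """Convert an amplitude gain_factor to Decibels."""
    return 20.0 * math.log10(gain)

def db_to_gain(db):
    """Convert Decibels back to an amplitude gain_factor."""
    return 10.0 ** (db / 20.0)

print(round(gain_to_db(0.5), 2))  # -6.02: halving the amplitude is about -6dB
print(db_to_gain(0.0))            # 1.0: 0dB means no change
print(round(db_to_gain(-96), 5))  # 0.00002: near the bottom of the 16-bit range
```

This is the same arithmetic the Music Calculator performs when relating the gain_factor and Decibel scales.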
This is the amplitude/time representation of a waveform in the time domain: the shape made by a bouncing ball sound.
Analysis files operate in the spectral domain, in which amplitude (vertical axis) is plotted for each frequency (horizontal axis) component. Here one frame of the same bouncing ball sound is captured in the FFT analysis data stream, showing an amplitude + frequency graph. It can also be represented in several other forms: a bar graph, a sonogram, a spectrogram, or a 3-D 'mountain' graph.
What is relevant here is to realise that altering the amplitude of frequency components affects the timbre, the tonal qualities of the sound, depending on which frequencies are made louder. This effect is special to the spectral domain. In the time domain, where the amplitude of each sample is involved, altering amplitude simply makes the whole sound louder or softer, without changing its tonal characteristics.
CDP's FOCUS EXAG effect plays with the amplitude of frequency components and therefore has a direct effect on the timbral quality of the sound.
ANALYSIS format conversion of sample to analysis data
ANALYSIS converts from the Time Domain to the Spectral Domain, i.e., from a time/amplitude to a frequency/amplitude representation of the sound. The analysis data is stored in a completely different way from a soundfile, and these files are referred to as analysis files. The Fast Fourier Transform (FFT) is used to carry out this conversion, and the Inverse FFT to convert back to a soundfile.
The detailed workings of these processes do not need to be known by the composer, only that certain sound transformation processes require analysis files as inputs. For the technically minded and for a greater ability to fine-tune the analysis procedure, see the section below on Analysis Settings.
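For the technically minded, the principle of the conversion can be illustrated with a naive discrete Fourier transform. The real Phase Vocoder uses the much faster FFT algorithm, with overlapping windows; this sketch only shows the round trip from time domain to spectral domain and back:

```python
import cmath
import math

def dft(samples):
    """Naive DFT: time-domain samples -> spectral-domain channels."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) / n
            for k in range(n)]

def idft(spectrum):
    """Inverse transform: spectral-domain channels -> time-domain samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real
            for t in range(n)]

# A toy 'soundfile': one cycle of a sine wave in 8 samples.
wave = [math.sin(2 * math.pi * t / 8) for t in range(8)]
restored = idft(dft(wave))
print(all(abs(a - b) < 1e-9 for a, b in zip(wave, restored)))  # True
```

The forward transform reports the amplitude present in each frequency channel; the inverse transform reconstructs the original samples from that data.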
ANALYSIS SETTINGS optimise the conversion process
Normally, a sound is represented as a series of samples. Each sample has two data items: an amplitude and a time. This is the 'time domain' and changes made in this domain can, therefore, adjust the amplitude or the time data items.
The 'spectral domain' also has two data items: frequency and amplitude, and this is derived from the sample data by means of a nifty mathematical manoeuvre known as a Fast Fourier Transform (FFT). Thus it is directly concerned with which frequencies are present in any given slice of time (they are always changing) and how loud each of them is. Note that time is not represented explicitly here: time is implied by the regular stream of frames.
There are three ways in which we can fine-tune this FFT analysis process, and in considering what they are, we get a reasonably clear picture of what the analysis is doing.
- FFT size The FFT analysis works on incoming samples in groups. FFT size is the number of samples in each group used to make each analysis frame. The larger the group, the better the frequency resolution. However, a larger number of samples also means a longer chunk of time, so time resolution is coarsened. Thus there is a frequency/time tradeoff. Higher values for FFT size improve pitch tracking, and therefore transposition accuracy.
- FFT Overlap The analysis overlaps frames so that it produces smoother results. The overlap is given in number of samples from the start of the previous frame, so smaller values mean more overlap. The FFT size divided by the number of overlap samples gives the overlap factor, which is normally kept at 4. E.g., 1024 ÷ 256 = 4, 2048 ÷ 512 = 4. Making the overlap tighter than this will increase the CPU load. The main reason for doing so would be to improve quality in extreme transpositions.
- Window size This is a multiple of FFT size. Normally set to 2, it can be set to 1 in order to reduce latency. This will, however, reduce frequency resolution and increase time resolution, but it also means more frames per analysis, more data and a higher CPU load.
3 Rules of Thumb provide guidelines for adjusting the above settings:
- Audio quality Audio quality is enhanced with larger FFT sizes and small overlaps.
- Latency FFT size * Window size is the main factor in determining latency, higher values causing more latency. E.g., 1024 * 2 = 2048 (samples in an analysis frame), 2048 * 4 = 8192 (samples in an analysis frame). Use low values for low latency.
- CPU load The analysis rate is the number of frames to process per second. It is calculated by dividing the number of overlap samples into the sample rate: e.g., 44100 ÷ 256 = 172 frames per second. The higher the rate, the higher the CPU load.
- most situations: 1024 - 256 - 2 (FFT size - overlap - window size; current default for low CPU load, but not lowest latency)
- high resolution pitch transposition: 2048 - 320 - 1, especially if transposing up more than 7 semitones.
- drum sounds: 512 - 64 - 1 (Smearing on drum sounds with the phase vocoder is a classic problem. This very small window, low latency setting gives the best results.)
- good quality pitch transposition while retaining a lower latency: 1024 - 160 - 1, a good high quality setting, e.g., for transposition up an octave without increasing latency too much.
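As a rough sketch of how these figures relate, the quantities discussed above can be computed directly from the three settings (the helper below is illustrative, not a CDP utility):

```python
def analysis_stats(sample_rate, fft_size, overlap, window_factor):
    """Derive the quantities discussed above from the three settings."""
    return {
        # frequency resolution: width of each analysis channel in Hz
        "channel_width_hz": sample_rate / fft_size,
        # overlap factor, normally kept at around 4
        "overlap_factor": fft_size // overlap,
        # analysis rate: frames per second, the main CPU-load factor
        "frames_per_second": sample_rate / overlap,
        # samples per analysis frame, the main latency factor
        "frame_samples": fft_size * window_factor,
    }

# The current default setting, 1024 - 256 - 2, at a 44100 sample rate:
stats = analysis_stats(44100, 1024, 256, 2)
print(stats["overlap_factor"])     # 4
print(stats["frames_per_second"])  # 172.265625
print(stats["frame_samples"])      # 2048
```

Plugging in the other recommended settings shows the tradeoffs at a glance: the drum setting 512 - 64 - 1 gives a very small frame (512 samples) but a high analysis rate (over 689 frames per second).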
ATTACK TRANSIENTS the angle of amplitude increase at the beginning, the 'onset' of a sound
The way a sound begins is one of its most important features. When we imagine beating a drum, or starting a violin string vibrating by moving the bow across it, or blowing into a trumpet with that extra umph to get the note sounding, we realise that the act of producing the sound differs according to the physical nature of the instrument. In fact, it is in hearing this start to the sound that we can recognise what instrument is being played. If we take away the beginning or reverse the sound, the nature of the instrument (whatever it was that made that sound) becomes surprisingly ambiguous.
This beginning is known as the attack of a sound, and it has both amplitude (loudness curve) and frequency (timbral) components. Striking or blowing an instrument with great force increases the amplitude of many frequency components, especially the higher frequencies. This makes the sound brighter, like the piercing, clarion call of a trumpet. This would be a 'sharp' attack with a fast attack 'transient', meaning that the slope of the amplitude rises steeply, such as from zero to its full volume in 0.1 sec. A slow attack has an amplitude that rises more slowly, such as from zero to its full volume in 2 sec.
Trumpets are known for their sharp attacks, strings for their gentle attacks, while instruments such as the clarinet or saxophone are really good at doing either. Computer sound editors alter the attack transient by drawing an amplitude slope and adjusting the amplitude values to stay within this slope.
The overall amplitude shape of a sound is called its envelope.
BAND width of a group of frequencies
If we think of the whole range of frequencies that make up a sound, a band of frequencies will be a section, a 'ribbon', as it were, of adjacent frequencies lying between a low frequency limit and a high frequency limit.
Using computer tools, we can isolate bands of frequency, either retaining them while removing what is above and/or below, or removing them, a process known as filtering. Altering significant chunks of frequency components has a major effect on the sound.
Another type of frequency band is known as a formant, fixed frequency bands of relatively high amplitude which more than anything else define the timbral character of a sound.
BINARY machine-readable encoding
Files can be binary or text in format. If 'text', they can be written and edited in a normal text editor. If 'binary', they have been encoded in a way that can only be read by the computer. Most CDP file formats are text, but some are binary, such as sound, analysis and formant files; others can be both, such as envelope and transposition files.
Another meaning for 'binary' is more generic, namely the binary number system that is base-2 rather than base-10. The binary number system lies at the heart of how a computer operates, because all numbers can be expressed in terms of 0s and 1s: OFF or ON.
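A quick illustration in Python, tying base 2 back to the 16-bit sample range quoted under AMPLITUDE:

```python
# Base 2 at work: the 16-bit sample range comes straight from
# the binary number system.
bits = 16
print(2 ** (bits - 1))     # 32768, hence the signed range -32768 .. +32767
print(bin(180))            # '0b10110100': 180 written in 0s and 1s
print(int("10110100", 2))  # 180: and converted back again
```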
BLUR result of averaging data, smoothing over features of the original
When water gets on writing done with water-soluble ink, such as that used by an ink-jet printer, the ink begins to dissolve and the edges of the letters begin to blur. At first it is hard to read, but you can still make out the letters. But when the dissolving ink from one letter begins to overlap and mix with the ink from another letter, it becomes more difficult to read the text. It's a bit like what comes out when you try to speak without moving your lips. Clear precise diction becomes a kind of verbal porridge as the sharp contrasts between consonants, vowels, pitch levels and amplitudes become smeared together.
It is very exciting that this kind of effect can also be achieved with sounds. In the spectral domain, the two components of the analysis, the amplitude and the frequency, are calculated for, e.g., 1024 vertical bands in the layer cake of the sound, for every, e.g., 100th part of a second (this unit is called a 'frame'; usually several frames overlap to form a 'window' in order to ensure a smooth result). We can therefore picture 100 windows of amplitude & frequency information for every second of sound. Blurring effects can be achieved in Spectral Transformer by accumulating the data of several windows, or by removing some of the frequency components (such as those with lesser amplitude). The ACCUMULATOR and TRACE employ these two methods, respectively.
The overall effect is to smooth out differences by reducing the differences from one window to the next: amplitudes (of the frequencies) are more consistent, meaning that the tonal characteristics are spread over a greater period of time, and the frequency content, even when it becomes more complex, changes more gradually. The result is softer-edged, gentler, more slowly changing sounds, but full of (changing) timbral interest. The ACCUMULATOR adds a glissando effect as well, further blurring with pitch bends.
The blurring effect reduces the recognisability of the original sound, making it more abstract. Thus it is a route towards original sounds with a flowing, sonorous character. These may be used for 'ambient', 'chill-out' music and to create flowing, abstract sonic imagery.
BREAKPOINT file containing time-contour instructions
For music to be supple and engaging, it needs to change over time. That is to say, time-contours need to be implemented. These specify specific values at specific time points, such as:

time  parameter_value
0.0   0
0.5   6
0.73  2.1
1.0   4.5

The 'value' refers to the numerical value assigned to a given parameter. Time is given in seconds.
In much software today these time-contours, usually referred to as 'automation', are entered graphically in real time. Although this is quick and intuitive, there is some limit on the degree of precision that can be achieved when contours are created in this way: precise time points and precise values are hard to set. Although a text breakpoint file is actually being created behind the scenes, the user cannot access it.
The breakpoint file mechanism currently used by CDP software allows for direct entry with a text editor. Thus the file can be created, stored and accessed via a text editor, and all values can be absolutely precise to several decimal places. This can be particularly important when specific time shapes are being designed as part of the formal structure of a composition.
CDP also provides graphic text editors for creating breakpoint files, and Richard Dobson's BRKEDIT enables you to make use of exponential and logarithmic curves as well as straight lines. CDP does not, however, have real-time automation facilities at this point. Our Spectral Transformer plugin for Cakewalk's Project 5 does operate in real-time, so we shall have facilities of this nature in our forthcoming software.
A detailed explanation of how to create and use breakpoint files can be found in CDP File Formats.
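As an illustration of the format, the sketch below parses breakpoint text and interpolates between the time points. It is not CDP's own reader, and it assumes plain linear interpolation without the 'e' curve extension:

```python
def parse_breakpoints(text):
    """Read 'time value' pairs, one per line, as in a CDP breakpoint file."""
    points = []
    for line in text.splitlines():
        if line.strip():
            t, v = line.split()
            points.append((float(t), float(v)))
    return points

def value_at(points, time):
    """Linear interpolation between breakpoints."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= time <= t1:
            return v0 + (v1 - v0) * (time - t0) / (t1 - t0)
    return points[-1][1]

bp = parse_breakpoints("0.0 0\n0.5 6\n0.73 2.1\n1.0 4.5")
print(value_at(bp, 0.25))  # 3.0: halfway between the first two breakpoints
```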
CHANNELS the frequency bands or 'bins' into which the sound is analysed, i.e., frequency resolution
'Channels' in this context has a specific technical meaning. Normally, we think of 2 or 4+ channels on a tape or in a digital sound file, each of which is eventually routed to a different loudspeaker. In the spectral domain, 'channels' means the number (and size) of the frequency bands into which the sound is analysed. Normally around 85 Hz apart, more and smaller bands mean higher frequency resolution, fewer and larger bands mean lower frequency resolution. The FFT analysis examines each of these bands, these channels, these 'bins' to see what is in it: i.e., which frequency/frequencies and its/their amplitude(s). Some of the channels may be empty.
Which frequencies are present in the various frequency bands of an analysis frame determines the tonal quality of the sound at that moment. Each frame covers only a tiny fraction of the length of the sound, and the contents of successive frames of analysis data are constantly changing.
This contour profile of the data in a frame is called the spectral envelope, and the overall, constantly changing set of profiles for the whole sound, can be referred to as its 'timbral envelope'.
COMMAND LINE a call to an executable program via a 'command line interpreter'
Some computer operating systems have programs known as command line interpreters. This means that the user can call (access and run) a named program from within the interpreter. These are text-based, such as MS-DOS for the PC's Disk-Operating-System (DOS). Essential for low-level computer maintenance because they access computer internals independently of the graphic 'Windows' environment, they also provide useful facilities for composers.
CDP's software began in a UNIX-like environment (created on the Atari ST by Martin Atkins) and was therefore based on the use of a command line interpreter. This made it possible for CDP programmers to create a great many sound processing programs without having to worry about graphic features. It also meant that the software was highly portable. There are other advantages in having a command line mechanism, as discussed below.
The instructions for a command line are called its 'usage'. A typical usage includes:
- the program's name
- a mode (if there is one)
- and any parameters (with flags, if used)
program_name mode infile outfile parameters
Here are the usage and an example command line for CDP's DISTORT REPEAT:
distort repeat infile outfile multiplier -ccyclecount [-sskipcycles]
distort repeat soundin soundout 2 -c2
Commenting on this command line, we see that:
- There are no modes, so there is nothing between the program name and the infile (modes are usually given as numbers: 1, 2 etc.).
- The multiplier parameter has no flag, so we see just the value in the command line.
- The cyclecount parameter does have a flag (-c) so its value on the command line is preceded by -c.
- Nothing is given for the parameter skipcycles. This is because it is an optional parameter, indicated by the square brackets [...]. Optional parameters are in fact used by the software, but they have Default values. If you are happy with the default value, you don't need to enter anything. In this case, skipcycles means the number of wavecycles to skip before starting to process the sound. The Default is 0 (i.e., none), so when the parameter is omitted, processing begins at the start of the sound.
A program's usage is displayed when you enter only the program name on the command line. In the example above, entering just DISTORT will display all the sub-programs of that Group of programs (Wavecycle Distortion). Entering DISTORT REPEAT will display the usage for that particular program.
The command line mechanism with DOSKEY installed (PC) gives you a command history. This enables you to use the UP-ARROW to return to previous commands. Thus you can process a sound, play it, delete it and return to its command line (as previously filled out by you), alter a parameter or two and re-run it, all done more quickly than in a graphic interface.
The command line mechanism is, therefore, simple to understand and to use, though programs with large numbers of parameters can be confusing. Creating a batch file for that program, with its full usage in a 'rem' statement as a guide, can be helpful.
Furthermore, the batch file mechanism can be used to create your own library of sound processing sequences. The Sound-Builder Templates make use of this mechanism. These include generic batch files written so that you only have to alter the input soundfile name in order to run the entire sequence with a different sound. Similarly, you could create a batch file to run the same program with, e.g., 10 different parameter settings, so that you could quickly make all the different versions and select the one you like best. Although 'ancient history' in terms of computing, batch file libraries are often developed by CDP's most experienced users.
This text-entry method is also helpful for diagnostic purposes, to check on the operation of a program that appears to be failing, independently of the graphic user interface. This can help to pinpoint where the problem lies: i.e., in the program itself or in the graphic user interface.
Finally, the text entry nature of batch files makes them a beneficial option for those with impaired sight.
CDP software through Release 5 maintains its command line core for all the reasons mentioned above.
DECAY data values reducing over time
Decay refers to a gradual lowering of a set of values, usually amplitude. If a sound ends with the amplitude values descending from full to zero over 2 or 3 seconds, it is said to decay slowly. When applied to other parameters, the effect differs according to the function of the parameter. With FOCUS ACCUMULATE, decay controls reverberation: when the value is higher, the reverberation time increases; when it is at zero, there is no reverberation at all.
DIGITAL NOISE artefacts related to the sample rate
Digital noise is an artefact created by the digital sound sampling process. If there are (only) 22050 samples per second, each with its own amplitude value, then it is possible to have fairly large changes in amplitude level from sample to sample, and if the actual sound is changing during the time of this sample, that change is not captured by the sampling process. When these relatively sudden changes are eventually translated into the power driving the loudspeaker cones, the lack of smoothness in the motion of the cones introduces a noise factor (random frequencies).
This is why higher sampling rates have been sought: 44100, 96000, and even 192000 samples per second. These capture the moment-to-moment changes in the sound more smoothly, resulting in less information loss, cleaner loudspeaker movement and higher fidelity in the output sound.
The potential presence of digital noise means that applying digital gain needs to be done with care. The reason is simple: if the amplitude level is jacked up indiscriminately, the changes from sample to sample may also be increased to the point where digital noise is created by the irregularities produced.
ENVELOPE amplitude contour
The word 'envelope' is used in music to describe the profile of a sound, like the peaks and troughs of a mountain range. The amplitude envelope describes how the loudness changes with time: e.g., the amplitude envelope would have 'peaks' where the sound became loudest. The frequency envelope of the spectrum of a sound describes which frequencies in the sound are most prominent. For example, a sound described as 'bright' would have a peak in the higher frequencies of its spectrum.
In the time domain, time runs horizontally, and amplitude vertically. Thus each vertical bar will potentially be at a different height. When we connect up the tops of all these vertical bars, we get the waveform shape, as displayed by sound editors which act on the stream of time/amplitude samples. We get an image of the amplitude contour of the sound when we average a number of samples. We have to decide at what scale to measure the loudness: we can look at every sample, or we can take an average (absolute) value over a number of samples. If the scale is too small (too few samples), we will not see the changing loudness of the sound, but only the changing shape of the waveform. If it is too big (too many samples), we might miss fine details such as tremolando effects. In some cases (like a tremolando within a crescendo) we might want to ignore the tremolando and just see the crescendo, so the choice of envelope window size (i.e., how many samples) is important. The CDP envelope programs use a default window size (which you can change) which avoids the window being too small. The range of this parameter is 5ms to the length of the soundfile.
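The windowed absolute-average idea described above can be sketched as follows (an illustrative helper, not the CDP envelope code itself):

```python
import math

def amplitude_envelope(samples, window):
    """Average of absolute sample values over successive windows.

    Taking abs() is essential: without it the positive and negative
    halves of the waveform cancel and every average sits near zero.
    """
    env = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        env.append(sum(abs(s) for s in chunk) / len(chunk))
    return env

# A 440 Hz tone fading out linearly over 0.1 sec at 44100 samples/sec.
sr = 44100
tone = [(1 - t / 4410) * math.sin(2 * math.pi * 440 * t / sr)
        for t in range(4410)]
env = amplitude_envelope(tone, 441)  # 10 ms windows: 10 envelope points
print(all(a > b for a, b in zip(env, env[1:])))  # True: the envelope decays
```

With a 10 ms window the waveform detail is averaged away and only the fade-out, the amplitude contour, remains; a much smaller window would begin to trace the waveform itself instead.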
The same comments apply to the spectral envelope. For example, the spectrum of a clarinet playing Ab will only have energy at the frequency of Ab and at (some of) the harmonic frequencies of Ab. All the other channels (of which there are very many more!) will show (approximately) zero amplitude. When we describe the spectral envelope, we're only really interested in how the heights of the channels containing the harmonics change as we go up the spectrum, so we have to choose a scale of operation that shows us this. Within the CDP software, the extraction of formants (peak shapes in the spectral envelope) takes care of this scaling for you. This is the -f (frequency-wise) / -p (pitch-wise) formant extraction option discussed in FORMANTS GET.
EXPONENTIAL CURVE increasingly faster increase or slower decrease
The CDP software makes use of both exponential curves and of their inverse, logarithmic curves. They can be inserted into breakpoint files using the 'extended format' available in BRKEDIT. The exponential function appears in ENVEL DOVETAIL to enable a more supple type of fade in and fade out, and a logarithmic interpolation option is available in STRANGE SHIFT.
The rise of exponential curves starts slowly and then speeds up. This occurs because the underlying y value is doubled at each subsequent linear increase of x:

- 2^x = y
- 2^0 = 1
- 2^1 = 2
- 2^2 = 4
- 2^3 = 8
- 2^4 = 16

Thus, as the exponent increases by 1, the value on the y axis doubles and the curve produced rises with increasing rapidity.

Similarly, the descent of exponential curves (with negative exponents) starts quickly and slows down: with every decrease of the exponent x by 1, the value of y halves. The downward movement along the y axis slows down as the difference between the y values decreases. The y value diminishes infinitely, never reaching the x axis.

- 2^x = y
- 2^0 = 1
- 2^-1 = 0.500
- 2^-2 = 0.250
- 2^-3 = 0.125
- 2^-4 = 0.063
The terminology 'starts slowly and then speeds up', etc., is used because, in this musical context, the x axis represents time.
Curves of this nature are used in the CDP software in various ways to produce more subtle time-varying results. They can be used in:
- Envelope shapes between breakpoint times (our 'extended format'). This is most easily done with the BRKEDIT graphic breakpoint editor, but it can also be done by hand in text files by preceding a second value with an 'e'. This means that the interpolation from the first to the second value will be an exponential curve. An example is given in the createfile section of CDP Files & Codes.
- In the ENVEL DOVETAIL program, Mode 1 enables you to select ('flag') a linear (0) or an exponential (1) fade-in and/or fade-out. In the exponential form, the fade-in starts slowly and rises increasingly quickly, and the fade-out starts slowly and falls increasingly quickly. Besides being a more supple movement, it also seems to work better in starting and ending the sound at zero amplitude.
- In ENVEL DOVETAIL Mode 2, fade-in and/or fade-out can be doubly exponential, i.e., faster than exponential.
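The shape of such a fade can be sketched as below. The exact curve used by ENVEL DOVETAIL is internal to the program; this mapping is only an illustration of 'starts slowly, rises increasingly quickly', scaled so that the gain still travels from 0 to 1:

```python
def exponential_fade_in(n_steps, curve=4.0):
    """Gain values that start near zero and rise increasingly quickly.

    Linear steps x in [0, 1] are mapped to
    (2**(curve*x) - 1) / (2**curve - 1),
    which travels from exactly 0 to exactly 1.
    """
    return [(2 ** (curve * i / (n_steps - 1)) - 1) / (2 ** curve - 1)
            for i in range(n_steps)]

fade = exponential_fade_in(5)
print([round(v, 3) for v in fade])  # [0.0, 0.067, 0.2, 0.467, 1.0]
```

Reversing the list gives the corresponding fade-out; raising the curve value makes the shape more extreme, in the spirit of Mode 2's 'doubly exponential' option.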
FFT Fast Fourier Transform
The FFT analysis is the wonderful mathematical process used by the Phase Vocoder; it translates amplitude/time sample data into amplitude/frequency data. It does this for a whole series of frequency bands called channels, for as many tiny time segments as it takes to work through the sound from start to finish. These time segments are called windows. Implied in the fact that the time segments move through the sound is the notion of 'phase', which locates you in the sound: taking into account previous states, you are here now, you were there then.
The result of the FFT analysis is an 'analysis file', which contains a huge amount of frequency/amplitude data about the sound, covering as it does every channel in every window (plus a frame overlap factor to ensure smoothness when the sound is reconstructed by the inverse FFT).
The FFT analysis is what creates the spectral domain and makes possible amazing sonic transformations; the Spectral Transformer effects are good examples of how powerful these processes can be.
FILE TYPES varying sample types and formats for CDP files
CDP Release 5 will not only handle 16-bit and 24-bit soundfiles, but also other forms of sound data, including (as shown in the Usage of COPYSFX):
Basically, the system will by default reflect the input format in the output: in all cases, the output soundfile will have the same format as the infile unless told to do otherwise, such as converting between .wav and .aif, etc. The conversions are made with COPYSFX, the usage of which gives the full list of formats.
- Sample types:
(.wav only: .aiff and .aifc written as type 4)
- 16-bit integer (shorts)
- 32-bit integer (longs)
- 32-bit floating-point
- 24-bit integer 'packed'
- 20-bit integers in 24-bits
- 24-bit integers in 32-bits
- Output formats:
- standard soundfile: .wav, .aif, .afc, .aifc
- generic WAVE_EX (no speaker assignments)
- WAVE_EX mono/stereo/quad (LF, RF, LR, RR): number of infile channels must match
- WAVE_EX quad surround (L, C, R, S): infile must be quad
- WAVE_EX 5.1 format surround: infile must be 6-channel
- WAVE_EX Ambisonic B-format (W,X,Y,Z): infile must be quad
Besides the two basic binary formats for soundfiles and analysis files, CDP software's amazing flexibility is rooted in the many different kinds of text file that can be created as inputs to specific functions. Most of CDP's 50+ file formats are text files that can be hand-written, edited and stored.
Full details on all CDP file formats are summarised in CDP Files & Codes. There is also a version using frames in which a full index is contained in a scrollable lower panel. It is recommended that you place a shortcut to the frame document on your Desktop for easy reference. This is filesfrm.htm and can be found in the top level of the CDP HTML folder. It can also be accessed via CDP's main index to the documentation, ccdpndex.htm.
FILTER to remove part of
In general terms, filtering is removing part of the contents of something, like straining fruit through a cheesecloth to make a jelly.
The musical use of the term relates to the removal of part of the frequency content of a sound. Removal:
- above a given frequency = lo-pass (those above are removed, those below pass through)
- below a given frequency = hi-pass (those below are removed, those above pass through)
- within a pair of frequencies retained = band-pass (those above the upper limit and below the lower limit are removed, while those within the limits pass through). There can be numerous bands-to-keep specified, sometimes enabling the user to tune the sound to a chord.
- within a pair of frequencies rejected = band-reject (those within are removed, and those above the upper limit and below the lower limit pass through). This leaves a hole in the middle of the sound, so it is also called a notch-filter. There can be numerous bands-to-reject, creating a comb-like effect.
- multiple spectral peaks and troughs are produced by a comb filter, which combines a signal with a delayed version of itself, creating periodic reinforcements and cancellations in the frequency domain.
Another aspect of filters is a boost factor, creating resonant frequencies. The degree to which this resonance is focused on specific frequencies depends on how sharply the adjacent frequencies fall away in amplitude. The slope of this amplitude reduction is known as 'Q'. If the amplitude falls away quickly, one tends to hear a focused pitch in the retained portion of the sound; if more slowly, the filter is fuzzier and one hears more of the original sound: the 'skirt' of the filter is wider and encompasses more of the neighbouring frequencies.
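The comb filter mentioned above can be sketched in a few lines (a feed-forward comb; the delay length and tone frequencies here are chosen for the example, not taken from any CDP program):

```python
import math

def comb_filter(samples, delay, gain=1.0):
    """Feed-forward comb: mix the signal with a delayed copy of itself."""
    out = []
    for n, x in enumerate(samples):
        delayed = samples[n - delay] if n >= delay else 0.0
        out.append(x + gain * delayed)
    return out

sr = 8000
# An 8-sample delay at 8000 Hz puts spectral peaks at multiples of
# 1000 Hz and troughs half-way between them (500 Hz, 1500 Hz, ...).
boost = [math.sin(2 * math.pi * 1000 * n / sr) for n in range(64)]
notch = [math.sin(2 * math.pi * 500 * n / sr) for n in range(64)]
print(round(max(comb_filter(boost, 8)[8:]), 2))                  # 2.0: reinforced
print(round(max(abs(v) for v in comb_filter(notch, 8)[8:]), 6))  # 0.0: cancelled
```

A tone whose period exactly matches the delay arrives in phase with its delayed copy and is reinforced, while a tone half-way between the peaks arrives in anti-phase and is cancelled: the comb of peaks and troughs.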
FLAG command line component to signal the presence of a parameter
'Flags' are sometimes used in command lines to make it clear to the software which parameter is being accessed. They take the form of single letters preceded by a minus sign, such as -d. The value for that parameter follows immediately without an intervening space, such as -d2.5.
FORMANT a fixed-position resonant frequency region
Formants are amazing. They are what makes speech comprehensible, regardless of who is speaking (e.g., a man or a woman) and regardless of what pitch (if any) they use. In the human body, formants are created by the variously shaped resonant cavities of the head. Each shape matches particular wavelengths, thus setting up specific resonances, i.e., vibrations. Some of these resonant cavities are fixed in size, allowing us to distinguish between individual speakers or singers, but many of them can be varied by changing the shape of our vocal tract: altering the size of the mouth, the position and shape of the tongue, the relative opening or closing of the back of the throat or of the nasal passages, the raising or lowering of the larynx.
Humans' ability to change these formant resonances is what enables us to speak, and what distinguishes the human voice from other musical instruments which, on the whole, have only fixed formants.
The REPITCH TRANSPOSEF effect provides a way to transpose while retaining formant information. This is essential in transposing vocal sources if both the sense (particularly the vowel content) and the human-ness of the source are to be preserved. It is less critical in the transposition of instrumental sounds, for which REPITCH TRANSPOSE appears to be robust and effective.
FRAME unit of analysis
The analysis frame is derived from a group of samples in the original sound. This group of samples, e.g., 882, represents a tiny time-slice of the original, e.g., 0.02 (2 100ths of a) second at a sample rate of 44100 per second. To ensure smoothness in analysis and resynthesis, frames are overlapped by a specified number of samples.
The overall analysis usually contains millions of bytes of data from all these frames. This is the analysis data upon which the spectral Transformation processes operate, and it is only recently that computers have been fast enough to handle all this work in real-time without the help of specialist outboard DSP hardware, although that can help as well, as in the very powerful KYMA System.
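The arithmetic of frames and overlap can be illustrated with a small Python sketch (the 882-sample frame and 50% overlap here are example values taken from the entry above, not CDP defaults):

```python
# Rough arithmetic for analysis frames: 882 samples per frame at a sample
# rate of 44100 per second = 0.02 sec per time-slice, with each new frame
# starting a 'hop' of samples after the previous one to give the overlap.

def frame_count(total_samples, frame_size, hop):
    """Number of full frames when each frame starts 'hop' samples after the last."""
    if total_samples < frame_size:
        return 0
    return 1 + (total_samples - frame_size) // hop

print(882 / 44100)                     # duration of one frame: 0.02 sec
print(frame_count(44100, 882, 441))    # one second of sound, 50% overlap: 99 frames
```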
FREQUENCY number of cycles of a waveform per second (= Hertz - Hz)
Frequency means how many times a second a given (periodic) waveform repeats (oscillates). Frequency is measured in hertz. Below about 16 Hz, these repetitions are heard as separate clicks, but above ca 16 Hz we begin to hear steady tones.
All but the most artificial of sounds contain many frequencies, also called partials. Sounds need not have any repeating waveform (e.g., noise). Steady sounds that are not clearly pitched may contain many frequency components, often in inharmonic relationship. Even the waveform of a steady pitched sound can usually be broken down into a number of different smaller repeating shapes. Each of these has a different frequency. The frequency of the whole shape is known as the fundamental, and usually (but not always) determines the pitch we hear. The frequencies of the smaller shapes (always whole number multiples of the fundamental in these steady pitched sounds) are also important. These frequencies in 'integer relationship', together with the fundamental, are known as the harmonics of the sound, a subset of all the partials in the sound.
GAIN changes to the loudness of a sound, louder or softer
Gain is the process of increasing or reducing the amplitude of a sound. This is done quite simply by multiplying the numbers by which the amplitude is represented in the computer, usually a range between -32768 and +32767 for 16-bit samples. Thus, if a given amplitude is 10000 (roughly 1/3rd max), multiplying it by a gain factor of 2.5 will bring the amplitude to 25000. The use of gain needs to be balanced by an appreciation of digital noise.
As mentioned in the entry on amplitude, CDP Release 5 will handle not only 16-bit files, but also other soundfile formats, including 24-bit soundfiles, enabling a much wider dynamic range: -8388608 to +8388607. But from the user's point of view, it will work exclusively within a -1.0 to +1.0 range. A value of 0.5 will therefore be ½ max, and 0.2 will be 1/5th max. Multiplying 0.2 times a gain_factor of 2.5 will bring it to 0.5.
An important feature of the 32-bit floating-point sample type is that it does not clip the signal when it overmodulates. Therefore, when you later apply a gain reduction to normalise it or bring it below a normalised signal, it will do this without incurring distortion. The highest precision and safety is therefore achieved in this format. Conversion to other formats can then be done later with COPYSFX.
SNDINFO MAXSAMP returns the maximum amplitude value of a soundfile and specifies which gain_factor will bring it to full amplitude. Detailed information about this is given in the maxsamp entry in CDP Files & Codes.
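The relationship between peak level and gain_factor in the -1.0 to +1.0 range can be sketched in Python (an illustrative fragment, not the SNDINFO MAXSAMP program itself):

```python
# Gain and normalisation sketch in the -1.0 to +1.0 range: the gain_factor
# that brings a sound to full amplitude is 1 / (largest absolute sample).

def apply_gain(samples, gain_factor):
    return [s * gain_factor for s in samples]

def normalising_gain(samples):
    peak = max(abs(s) for s in samples)
    return 1.0 / peak

sound = [0.2, -0.5, 0.1]       # peak is 0.5, i.e. half of full amplitude
g = normalising_gain(sound)    # 2.0
print(apply_gain(sound, g))    # [0.4, -1.0, 0.2] -- now 'normalised'
```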
GRAIN a very tiny fragment of sound
Sound in the digital domain can be cut into very tiny fragments. For practical purposes, i.e., to avoid clicks, they need to be enveloped, in the sense that they are DOVETAILED with an amplitude line or curve rising from and returning to 0 at the beginning and end of the fragment. The smallest feasible grain is usually considered to be about 12.5ms (0.0125 sec.): 551 samples at a sample rate of 44100.
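Enveloping a grain can be shown with a short Python sketch; the linear ramp here is the simplest possible dovetail (real granulators often use smoother, e.g. cosine, curves):

```python
# Dovetailing a grain: a linear ramp up from 0 and back down to 0,
# applied to the raw samples to avoid clicks at the grain boundaries.

def dovetail(grain, ramp):
    """Apply a linear fade-in/fade-out of 'ramp' samples to a grain."""
    out = list(grain)
    n = len(out)
    for i in range(min(ramp, n)):
        scale = i / ramp
        out[i] *= scale            # fade in
        out[n - 1 - i] *= scale    # fade out
    return out

print(dovetail([1.0, 1.0, 1.0, 1.0, 1.0], 2))   # [0.0, 0.5, 1.0, 0.5, 0.0]
```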
One approach, pioneered by Barry Truax, synthesises grains digitally in real time and builds huge, dramatic flows of grains, often diffused over multiple loudspeaker systems with complex panning algorithms. One advantage of this method is that the timbral constituents of the sound can also be altered in real time.
Another approach, used in the CDP System, breaks up existing (sampled) soundfiles into grains. Pioneered by the Groupe de Researche Musicale (GRM) as 'brassage' (mashing together), this 'granulation' technique enables the composer to work with pre-existing sound material. Trevor Wishart's program for CDP is appropriately called BRASSAGE, and in its graphic form developed by Richard Dobson, GRAINMILL. This software moves through the sound from beginning to end, with many (time-varying) ways to affect the density, timestretch, pitch etc. of the grains. Timbral variety is limited by the nature of the input sound, so when timbral change is important, it is useful to construct a complex source with SPLICE or MORPH before granulation.
Certain naturally occurring sounds are perceived to be intrinsically grainy, e.g., vocal rolled 'rrr' sounds, or very low sounds on a bass clarinet. In these sounds we seem to be able to hear a rapid sequence of very short events. The CDP program GRAIN allows these grains to be distinguished, counted and manipulated in various ways. One interesting feature of the GRAIN programs is that they enable a sequence of events to be run in reverse without reversing the events themselves, provided that the grain program can distinguish those individual events as grains.
HARMONIC integer relationship among partials
A harmonic is a term used for a partial which is an integer multiple of a real or implied fundamental (the fundamental is the predominant pitch perceived by the listener). Anything vibrating produces a complex of oscillations, and when these synchronise in integer relationship, they lock together aurally and are perceived as a single, focused, timbrally coloured pitch. For example, if the fundamental is A-220 Hz (A below Middle-C), harmonics will be 2 x 220 = 440 Hz, 3 x 220 = 660 Hz, 4 x 220 = 880 Hz, 5 x 220 = 1100 Hz, etc. Subharmonics go the other way, i.e., below the fundamental.
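In code, the harmonic series is simple integer multiplication (an illustrative Python fragment):

```python
# The harmonic series of A-220: integer multiples of the fundamental.
fundamental = 220.0
harmonics = [n * fundamental for n in range(1, 6)]
print(harmonics)   # [220.0, 440.0, 660.0, 880.0, 1100.0]
```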
HERTZ (Hz) units used to measure the number of oscillations per second
One full oscillation is generally represented as starting at zero amplitude, rising to its maximum (speaker cone forward), falling through the zero point to its minimum (speaker cone backward) and back to the starting point. This is also called a 'cycle', and hertz (abbreviated to 'Hz') is a measure of the number of cycles per second, that is to say, it is a measure of frequency.
A full oscillation need not begin at and return to zero amplitude. Starting locations along the waveform after the zero point mean that the phase is altered.
To say that a sound is oscillating at 1000 Hz therefore means that 1000 full oscillations are taking place in one second, that 1 full oscillation takes 1 millisecond. Regular waveforms are termed 'periodic', the 'period' being the length of time a full oscillation occupies. Frequency and period are therefore the inverse of one another. Another way to look at these matters is to consider the physical length of the waveform at a given frequency.
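These inverse relationships are easily sketched in Python; the 344 m/s figure for the speed of sound in air is an approximation:

```python
# Frequency and period are inverses; the physical wavelength follows from
# the speed of sound (approx. 344 m/s in air at room temperature).

def period(freq_hz):
    return 1.0 / freq_hz

def wavelength(freq_hz, speed=344.0):
    return speed / freq_hz

print(period(1000.0))       # 0.001 sec -- 1 millisecond, as in the text
print(wavelength(344.0))    # 1.0 metre
```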
INHARMONIC a partial not in sync with the fundamental
Inharmonic partials are those which are not integer multiples of the fundamental. If the fundamental were to be A-220 (A below Middle-C) and this were multiplied by 2.01, the resulting partial would be 442.2 Hz, just slightly higher than the octave 440 Hz. This discrepancy means that the fundamental and this partial (which is slightly more than twice as fast) do not end at the same time: the second oscillation of the partial's waveform starts a little before the fundamental, thus overlapping the start of the next oscillation of the fundamental. This overlap puts the two oscillations out of sync and they begin to be heard separately rather than as a single colored pitch. There are various levels of inharmonicity:
- just slightly out and the sound is slightly denser and richer
- a little more out and we begin to hear distinct, separate pitches, as in bells and gongs
- even more out, and the timbral colouration of the sound begins to change
- when there is a jumble of many mutually inharmonic partials, the sound is almost totally aperiodic and loses any sense of pitchedness, and may become noiselike
The BANDSHIFT effect creates inharmonicity by adding or subtracting to the frequency of a partial or group of partials. Harmonic relationships are the result of multiplication: each higher octave is twice the vibration rate of the one below. When values are added, this relationship is broken and the frequencies overlap to varying degrees, with the proportions between them compressing (when the amount is > 0) or expanding (when the amount is < 0):
- octave multiplication of 220: 220 x 2 = 440 (220:440 = 1:2), 220 x 4 = 880 (440:880 = 1:2), 220 x 8 = 1760 (880:1760 = 1:2), etc. each successive octave maintains an exact 1:2 proportion
- adding 17 to the octaves of 220: 440 + 17 = 457 (220:457 = 1:2.077), 880 + 17 = 897 (440:897 = 1:2.039), 1760 + 17 = 1777 (880:1777 = 1:2.019) etc. notice how the ratio, the proportion between the 'octaves', is in fact getting a little smaller each time: i.e., the frequencies are being compressed together. This compression is more dramatic when larger values are being added to each partial.
- subtracting 50 from the octaves of 220: 440 - 50 = 390 (220:390 = 1:1.773), 880 - 50 = 830 (440:830 = 1:1.886), 1760 - 50 = 1710 (880:1710 = 1:1.943) etc. when the amount is < 0, it is subtracted, and the proportions between the partials expand.
Thus inharmonicity is created by adding or subtracting fixed values rather than multiplying by a fixed value. The timbral change to the sound is therefore produced by the overlapping of the frequencies and the compression or expansion of the proportions between them.
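The arithmetic of the three cases above can be reproduced in a short Python sketch (illustrative only; as in the examples, each ratio compares a shifted 'octave' with the unshifted octave below it):

```python
# Octave ratios after adding 'shift' Hz to each octave of a fundamental:
# shift = 0 keeps the exact 1:2 proportion; a positive shift compresses
# the proportions, a negative shift expands them.

def shifted_ratios(fund, shift, count=3):
    """Ratio of each shifted 'octave' to the unshifted octave below it."""
    ratios = []
    for i in range(1, count + 1):
        lower = fund * (2 ** (i - 1))       # 220, 440, 880, ...
        upper = fund * (2 ** i) + shift     # e.g. 440 + 17 = 457
        ratios.append(round(upper / lower, 3))
    return ratios

print(shifted_ratios(220.0, 0))      # [2.0, 2.0, 2.0] -- exact octaves
print(shifted_ratios(220.0, 17))     # [2.077, 2.039, 2.019] -- compressing
print(shifted_ratios(220.0, -50))    # [1.773, 1.886, 1.943] -- expanding
```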
INTERPOLATION insert intermediate values
Some processes require that you enter a series of values, but the software actually needs many more values than that in order to complete its process. For example, a transposition breakpoint file that you create may specify 0 transposition at time 0.0 sec, and 12 semitones later at time 1.0 sec. You have only had to enter two values. However, the software creates a portamento or glide or, more loosely, a glissando rising through an octave (12 semitones) over time 1 sec. To do this, it automatically creates all the intermediate values required. This is 'interpolation'. In the CDP software, it is always done automatically by the software.
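Linear interpolation between breakpoints can be sketched in Python (an illustration of the principle, not CDP's internal code):

```python
# Expand a two-point breakpoint 'file' (time, value pairs) into the
# intermediate values the process actually needs, by linear interpolation.

def interpolate(breakpoints, times):
    """Linearly interpolate a list of (time, value) pairs at the given times."""
    out = []
    for t in times:
        for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
            if t0 <= t <= t1:
                out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
                break
    return out

# 0 semitones at 0.0 sec, 12 semitones at 1.0 sec -- the octave glide above:
print(interpolate([(0.0, 0.0), (1.0, 12.0)], [0.0, 0.25, 0.5, 1.0]))
# [0.0, 3.0, 6.0, 12.0]
```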
LATENCY perceptible delay in hearing the processed sound
When processing in 'real-time', we expect to hear the processed sound with no perceptible delay. The processing does take a certain amount of time, so the key here is 'perceptible'. When the processing is completed and the sound restored to our ears within about 20ms (0.020 sec.), there is normally no noticeable delay. Above this amount of time, the time gap between hearing the original sound and hearing the processed sound becomes increasingly apparent and unacceptable.
In a real-time process, one is grabbing a buffer of input sound and then doing something to it, and the latency is determined by how long it takes to do that something before you can output the result. In the CDP real-time FFT applications, the main factor that affects latency is the window size, which is actually the product FFTsize * Window size. Basically, it means how many samples are being processed in each frame, before moving on to the next frame. Lower values mean lower latency. The FFTsize and Window size can be adjusted in analysis settings.
The above latency factor is added to the latency imposed by the audio subsystem of the soundcard in the computer. Lowest latency is provided by modern cards with ASIO or WDM drivers supporting kernel streaming.
LOGARITHMIC finding the exponent of a number
The logarithmic curve turns out to be the opposite of the exponential curve: it is the inverse, the mirror image. This is quickly seen in BRKEDIT when a linear segment is changed first to 'exponential' and then to 'logarithmic'. Both of these functions deal with exponents: the exponential function calculates the result of raising a number to a given exponent. The logarithmic function works out which exponent results in the given number.
The expression log₂8 = 3 is usually 'read' as 'log base 2 of 8 equals 3'. This is already a very coded way of reading the expression, so it may be more helpful to read it like this: 'the exponent of 2 that yields 8 is 3'.
Logarithmic curves rise or fall quickly and then slow down. This occurs because the y value is increasing linearly (it is the exponent) while the x value is doubling:
Similarly, with values for x less than 1 halving, the value of y increases linearly in a negative direction, but x never reaches the y axis:
- log₂x = y
- log₂0.5 = -1
- log₂1 = 0
- log₂2 = 1
- log₂4 = 2
- log₂8 = 3
Thus, as the x value doubles, the y value increases by 1. The curve therefore moves quickly in the y direction, but then slows down as it spreads across the x axis. And in the negative direction, the descent moves slowly at first because the points on the x axis start relatively far apart; then the motion speeds up as the x distance between the points diminishes, but the y value continues to move at the same (linear) rate.
- log₂x = y
- log₂0.50000 = -1
- log₂0.25000 = -2
- log₂0.12500 = -3
- log₂0.06250 = -4
- log₂0.03125 = -5
The terminology 'starts quickly and then slows down' etc. is used because, in this musical context, the x axis represents time.
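The tables above can be checked directly with Python's math.log2:

```python
# As x doubles, log2(x) increases by exactly 1 ...
import math

for x in [0.5, 1, 2, 4, 8]:
    print(x, math.log2(x))     # -1.0, 0.0, 1.0, 2.0, 3.0

# ... and as x halves below 1, log2(x) decreases linearly,
# while x itself never reaches 0:
for x in [0.5, 0.25, 0.125, 0.0625, 0.03125]:
    print(x, math.log2(x))     # -1.0, -2.0, -3.0, -4.0, -5.0
```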
MIDI PITCH values used for pitch in the MIDI system
Pitch and other parameters such as velocity in the MIDI system ('Musical Instrument Digital Interface') use integer values from 0 to 127. Middle C is set at 60, though there is some variation as to which octave this is considered to be. CDP sets it at C-5, for example, but others set it at C-3; this presumably depends on the extent of the pitch range being used.
MIDI instruments can also apply pitch bend (microtonal deviations), but this only slides them away from or back to the integer value. In CDP, MIDI values can also be specified microtonally, up to two decimal places. This means that the division of the semitone into 'cents' (hundredths) can be specified. E.g. 60.50 would be a quarter tone above Middle C. This can be very useful when creating new types of harmony in a convenient way.
The CDP Chart of Equivalent Pitch Notations shows pitch values in the MIDI, frequency and Csound octave notations. CDP also has facilities to convert between precise frequency and (microtonal) MIDI values.
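Assuming the common convention that places A = 440 Hz at MIDI value 69, a (possibly microtonal) MIDI value converts to frequency as follows. This is an illustrative formula, not CDP's own conversion utility:

```python
# MIDI-to-frequency conversion: each semitone is a factor of 2^(1/12),
# so fractional (microtonal) MIDI values work directly in the formula.

def midi_to_hz(midi):
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)

print(midi_to_hz(69))     # 440.0 -- A above Middle C
print(midi_to_hz(60))     # ~261.63 -- Middle C
print(midi_to_hz(60.5))   # a quarter tone above Middle C
```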
MODULATOR data which acts upon and alters other data
Rather like a gear change, a modulator is one value or set of values acting on another value or set of values.
OVERLOAD amplitude level above the upper limit
Amplitude is measured on various scales of values. When the amplitude level of a sound exceeds the maximum limit of the scale, it means that it is exceeding what both the software and the physical playback equipment can handle, and audio distortion results. To correct this, one needs to apply a gain_factor to reduce the level of the sound. It is essential to do this before processing rather than attempt to bring down the level of a sound that has already distorted. Once a sound has been distorted, its actual waveshape is altered, and reducing its level will not change this fact: you will get merely a quieter version of the distorted sound.
Another term for overload is 'overmodulation'.
PAN position or move sounds in the virtual space between loudspeakers
An orchestra layout spreads the instrument groups in various types of formation in physical space. Additionally, solo instruments can be placed some distance away from the rest of the orchestra or physically move in space as they play, such as choirs of trumpets on a balcony, a clarinettist on top of a ladder, or a solo bagpipe coming down the main aisle. These placements help clarify the overall sound mass for the listener as well as add dramatic interest.
In electroacoustic music, the positioning of the sound in the virtual space between loudspeakers is similarly important. In complex 'diffusion' systems (such as The Birmingham ElectroAcoustic Sound Theatre BEAST, developed by Jonty Harrison), arrays of speakers are positioned around the concert hall, as well as in low, mid and high positions.
Although the loudspeakers, whether a stereo pair or a complex array, are in fixed positions, it is possible to locate sound between (and even beyond) the speakers. This process is called PAN or PANNING and is based on adjusting the relative volume outputs of two or more speakers. (There are more complex methods as well.) This makes the electroacoustic setup more flexible than the orchestra's and has led to the evolution of the art of 'diffusion', the dynamic positioning of sound during a live performance. Sometimes spatial location is built into the composition itself, sometimes it is left entirely to live performance, and sometimes it is a mixture of the two. It is an evolving art.
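One common panning method, constant-power panning, can be sketched in Python. This illustrates the general technique of adjusting relative speaker levels, not CDP's PAN program itself:

```python
# Constant-power pan sketch: a position from -1 (hard left) to +1 (hard
# right) maps to left/right gains along a quarter-circle, so the overall
# power (left^2 + right^2) stays constant as the sound moves.
import math

def pan_gains(position):
    """position: -1.0 = hard left, 0.0 = centre, +1.0 = hard right."""
    angle = (position + 1.0) * math.pi / 4.0     # 0 .. pi/2
    return math.cos(angle), math.sin(angle)      # (left gain, right gain)

left, right = pan_gains(0.0)
print(left, right)                  # both ~0.707 at the centre
print(left * left + right * right)  # ~1.0 -- power is preserved
```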
PAN can locate sound in fixed positions, or can cause it to move in time from one speaker to another. Stockhausen's Oktophony uses 8 channels and rich timbral sounds move slowly between them in ever-changing configurations. British composer Denis Smalley has written extensively on the musical importance and technique of sound spatialisation, as has Trevor Wishart in On Sonic Art.
PAN files are described in the CDP Files & Codes document.
PARAMETER feature of a sound that can be adjusted
A sound is a complex entity with many properties, from simple things like its duration or its average loudness, to subtle features like the moment-to-moment variation of the spectral energy.
We can change features of a sound using transformation processes, and these processes will usually have a number of 'parameters'. We can think of the parameters like the things we change by turning a knob or moving a slider. On an amplifier, we can change the overall loudness of the music by turning the gain control (thus changing the loudness parameter) and alter the spectral envelope by moving the Equalisation (EQ) sliders (thus changing various parameters of the spectral envelope).
Parameters in CDP processes can usually be controlled through time, so we can not only move the knob or fader, but we can describe exactly how the knobs and sliders are to move as the sound progresses.
The sound transformation algorithms access various parameters and enable the user to adjust them, usually within limited ranges of values. Thus the command to run a sound transformation process calls the program, names an infile and an outfile and then lists the values to be used for each parameter relevant to that process.
It helps to try to think about parameters and their values in as visual a way as possible, such as the contour line created by connecting up the tops of different value 'heights'. Being able to think about, visualise and adjust parameter values is essential in computer-based music, so developing fluency in 'parametric thinking' is advantageous.
PARTIALS frequency components of a sound
A partial is a frequency component of a sound, whether harmonic or inharmonic. Before starting to work with sound in the spectral domain, we may not realise that each sound is a huge amalgam of many partials. This is what gives a sound its richness, its complexity and ever-changing colouration. The task of spectral analysis is to find these partials so that the composer can then alter them in inventive ways in order to transform the sound.
PHASE point along a waveform at which a full oscillation is said to begin
A full oscillation can be described as a 360° circle which, when unfurled (rolled along), forms a curving up-and-down shape. It is useful to plot where a given waveform actually begins: at 0°, at 90° (0.25 of the cycle), etc. After 360° it will be at the same point in the next oscillation.
Similarly, two identical waveforms may begin at different times relative to one another. For example, a second waveform may begin at the 0.25 point (90°) relative to the first one (which begins at 0.0, i.e., 0°). They are therefore said to be 90° 'out of phase'. If two waveforms are 180° out of phase, they cancel each other out, because the + amplitude and the - amplitude levels are equal and opposite.
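The 180° cancellation can be demonstrated in a few lines of Python:

```python
# Two identical sine waves 180 degrees out of phase sum to silence:
# every + amplitude in one meets an equal - amplitude in the other.
import math

N = 8                                                             # samples per cycle
a = [math.sin(2 * math.pi * n / N) for n in range(N)]
b = [math.sin(2 * math.pi * n / N + math.pi) for n in range(N)]   # 180 deg shift
mixed = [x + y for x, y in zip(a, b)]
print(all(abs(v) < 1e-9 for v in mixed))   # True -- total cancellation
```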
PHASE VOCODER software tool which performs the FFT analysis
The Phase Vocoder performs digital spectrum analysis using a complex mathematical program called a 'fast Fourier transform' (FFT). The analysis process creates a list of each sinusoidal frequency component together with its amplitude and phase. The analysis moves through the sound, capturing the changing detail in a series of analysis frames / windows.
The results of analysis can be a little confused when the input sound has a significant noise component, rather than a clear, periodic waveform. A long analysis window (larger number of samples in each one) improves capturing high frequencies (short wavelengths), while short analysis windows are good for capturing transient detail. Thus there is a constant trade-off between high-frequency resolution and tracking transient detail, and the best solution will differ from sound to sound. Adjusting the analysis settings enables you to fine-tune this trade-off.
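The core idea of per-frame spectrum analysis can be sketched with a naive DFT in Python. The Phase Vocoder uses the much faster FFT and also tracks phase; this is only the simplest illustration of measuring the amplitude in each frequency channel of one frame:

```python
# Naive one-frame spectrum analysis: for each analysis channel ('bin'),
# measure the amplitude of that frequency component in the frame.
import cmath
import math

def dft_magnitudes(frame):
    N = len(frame)
    mags = []
    for k in range(N // 2 + 1):        # channels up to the Nyquist frequency
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        mags.append(abs(s) / N)
    return mags

# A frame containing a sinusoid at 3 cycles per frame: energy lands in bin 3.
N = 32
frame = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]
mags = dft_magnitudes(frame)
print(mags.index(max(mags)))   # 3
```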
The CDP Phase Vocoder is an evolved, 'streaming' form developed by Richard Dobson from the original Phase Vocoder created by CARL (the Computer Audio Research Laboratory of the University of California at San Diego). It is now also built into Csound and the Spectral Transformer plugin (a CDP creation) written for Cakewalk's Project 5.
PITCH the perceived frequency level of a periodic waveform
We are so used to talking about pitches that we may not realise that it is actually a technical term with a very specific meaning. A pitch is a focused tone in which the partials present are dominated by a single frequency, which is called the fundamental. A fundamental frequency (which may really be there or may be only a mental construct) is perceived because most of the other frequencies are in sync with it, and they are in sync because they are integer multiples of the fundamental, i.e., harmonics. The net result of these synchronised frequencies is the perception of a pulse at regular intervals (i.e., 'periodic'), heard as a 'tonal' sound as opposed to a noisy sound.
If the fundamental were to be 100 Hz, then the first harmonic would be 100 x 2, or 200 Hz. In graphic terms, this first harmonic vibrates exactly twice as fast as the fundamental, such that two cycles of the harmonic end at the same precise moment as one cycle of the fundamental etc. for higher harmonics. This integer relationship synchronisation aurally locks in the harmonics with the fundamental, and we hear mainly the fundamental, but with colouration dependent on which harmonics are present.
When the partials of a sound go slightly out of sync, we begin to hear separate pitches, as in bells and some gongs. As they go further out of sync, they become spoken of as 'frequency complexes', and eventually as 'noise'. All of these terms and states of sound have a place when working with sound material.
The psycho-acoustics of pitch is more complex than the above suggests. Trevor Wishart observes: "The fundamental frequency of the perceived pitch may be entirely absent from the sound, as happens, for example, in the lower strings of the piano. Pitch is essentially a mental construct from the harmonic analysis done by the ear. In the case of the lower notes of the piano, the brain implies a pitch from the existing partials, without having a corresponding fundamental. In a similar way, the pitches heard in inharmonic sounds are not usually those of the partials, but the pitches that the brain thinks are implied by the relationships between the partials. The brain is looking for pitch relationships, and finds what it can."
'Q' slope of amplitude reduction
'Q' is discussed in the section on filtering.
QUANTISE snap data to a range which has regular divisions
To quantise is to move in fixed steps. It implies that more complex data is rounded off so that it fits into these fixed steps. The term is often used in connection with rhythms. Music is usually notated in fixed steps of half-, quarter-, eighth- and sixteenth-notes, etc. But when played, it is seldom absolutely regular in this way in fact it would sound wooden if it was. When music is played into the computer via a MIDI instrument, notation programs would (and do) make quite a mess on the page when they respond to every timing nuance of the performer. Therefore the software sieves the performance through a time grid so that the notations use the conventional fixed steps. This is the process of quantisation.
A similar situation can occur with sliders for changing numerical values, this time causing it to take too long to move between numbers or 'land' on simple numbers if every possible division of the integer is included. Therefore the process of moving the slider is quantised, so that the movement is in steps: small enough to be useful, large enough to enable the user to move through the range of available values in a reasonable amount of time.
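Snapping values to a grid takes only a line or two of Python (an illustrative sketch of the principle):

```python
# Quantising: snap each value to the nearest point on a fixed grid of steps.

def quantise(value, step):
    return round(value / step) * step

# Timing nuances sieved onto a sixteenth-note grid (0.25-beat steps):
played = [0.02, 0.27, 0.49, 0.77]
print([quantise(t, 0.25) for t in played])   # [0.0, 0.25, 0.5, 0.75]
```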
SAMPLE-HOLD pick up and sustain lengths of sound
The original meaning of sample-hold is a signal sampled at a low rate (a few Hz), with each sampled value held until the next sample instant. The equivalent in Csound is "randh". The hold part of the procedure means that the tone sampled is sustained, creating a harmonic colouration as the different pitch levels are prolonged.
CDP approaches this technique in a characteristically flexible manner.
One aspect of sample-hold is that it can convert a continuously changing, but unpulsed sound into a pulsed sound by sample-holding at regular intervals. This is implemented in CDP's FOCUS STEP, which can convert a continuous and continuously changing, but unpulsed, sound into a sequence of regularly pulsed sustained events. The step parameter defines the length of the hold.
Another way is to set start times for segments of varying lengths and then move forwards or backwards from the specified time (FOCUS FREEZE).
A third approach is to freeze lengths of soundfile with a randomised delay and a number of other parameters designed to give a 'natural' result (EXTEND FREEZE) in the same way that iteration produces a more naturalistic extension of a sound than any kind of pure looping.
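The classic sample-hold behaviour can be sketched in Python (illustrative only; compare the step parameter of FOCUS STEP):

```python
# Sample-hold: read the input at a low rate and hold each value until the
# next sample instant, turning a continuously changing signal into a
# stepped one.

def sample_hold(signal, hold):
    """Sample every 'hold'-th value and hold it across the positions between."""
    out = []
    held = signal[0]
    for n, x in enumerate(signal):
        if n % hold == 0:
            held = x
        out.append(held)
    return out

print(sample_hold([0, 1, 2, 3, 4, 5, 6, 7], 4))   # [0, 0, 0, 0, 4, 4, 4, 4]
```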
SAMPLE RATE the number of times per second that a reading is taken on the amplitude level of a sound
When sound is digitised, signal level (and implicitly, phase) is calculated at a series of time points. This process is called 'sampling' and involves analogue-to-digital conversion (ADC), i.e., conversion from a steady electrical energy stream to a discrete set of numerical values. The number of times per second that the signal level is calculated, i.e., a digital sample created, is called the sample rate. A common sample rate is 44100 times per second.
It is important to realise that the analogue signal is continuous and the digital sample stream discrete. That is to say, nothing is recorded between the sampling time points, even if the sound is in 'real life' changing during this time. The advantage of digital versions of a signal is that the numbers can be manipulated without the loss of signal involved in re-recording analogue tape. The disadvantage is the loss of information between sample points and the introduction of another kind of noise: digital noise. This is why sample rates have become progressively higher, to reduce the time between sample points.
SEGMENT a relatively short length of soundfile
In the classical tape studio, magnetic tape was literally cut into various lengths so that sections of a sound could be rearranged or different sounds could be interpolated. A well-known example of this is John Cage's Fontana Mix, in which several tapes were cut into hundreds of segments and randomly spliced back together again.
With the advent of the computer, digital editing made it possible to do this much more easily and to segment and rearrange sounds in many different ways. The CDP software abounds in segmentation options, with programs such as BRASSAGE, DRUNK, FREEZE, LOOP, ITERATE, SAUSAGE, SCRAMBLE, STEP, WEAVE and ZIGZAG.
The musical roles of segmentation are many and form fascinating additions to musical technique:
- roughen the surface
- create nonsense speech
- obscure the source of a sound: make more abstract
- completely scramble or shred a sound
- randomise contrasts
- introduce pulsations
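A bare-bones segment scramble can be sketched in Python (in the spirit of SCRAMBLE, but not CDP's implementation; a real program would also dovetail each splice):

```python
# Cut a 'soundfile' (here just a list of samples) into equal segments
# and splice them back together in random order.
import random

def scramble(samples, seg_len, seed=None):
    rng = random.Random(seed)
    segs = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
    rng.shuffle(segs)
    return [s for seg in segs for s in seg]

src = list(range(12))
out = scramble(src, 3, seed=1)
print(out)                    # same samples, segments reordered
print(sorted(out) == src)     # True -- nothing lost, only rearranged
```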
SPECTRAL DOMAIN digital representation of sound as frequency and amplitude data
The spectral domain is really a special digital realm in which data about sound is held in a way that gives direct access to its frequency components, the partials. This data is achieved by an FFT analysis, producing a file of analysis data.
Every sound has its own frequency profile, an ever-changing mix of partial components that give the sound its (ever-changing) timbral colouration. For example, a piercing trumpet tone will start with a rich assortment of high frequency components and then settle down to a more harmonically ordered set of partials, giving a warmer steady state tone. Access to frequency components therefore means access to timbral colouration: which frequencies are present and how loud they are and how this profile changes as the sound progresses through time (analysis windows). The frequency profile is called the spectral envelope.
What is important for the composer is to have a modest understanding of what happens when the different aspects of analysis data are altered. The main idea here is that once the data is available, play can commence. The following gives a brief synopsis of what is involved and the terminology used.
- channel Each frame contains a number of bands of frequency, channels, also called 'bins', in which the analysis process 'looks' to see what frequencies at what amplitudes are there. The profile of these frequencies and amplitudes in a frame comprises its spectral envelope. A channel can contain more than one partial, or none at all. This can and almost invariably does change as one moves from window to window (i.e., from frame to frame) in the analysis. The number of (vertical frequency) bands into which the sound is divided is the frequency resolution of the analysis, and as with windows, more channels can introduce a latency problem. Manipulating the sound by removing channels (HILITE TRACE) means that the partial data in those channels is eliminated, thus altering the timbral colouration or even reducing the sound first to essential components and then to only a mere 'trace' of itself.
- partial This is a frequency component of a sound. A partial can be harmonic or inharmonic. The main factor in the timbral colouration of a sound, the partials can be transposed (multiplication factor), eliminated, selected by type, or shifted (addition factor).
- amplitude As usual, amplitude is a measure of loudness, here relating to partials. It is the contour produced by joining up the amplitude levels of the partials in a single frame that gives us the graphic representation of the spectral envelope.
- window An analysis window is the set of frequency and amplitude values obtained for a particular time slice (frame) in the source sound. As the windows derive from time slices, the size of these slices determines the temporal resolution of the analysis. Smaller slices give finer time resolution: there are more of them, which can result in a smoother analysis, but also in much more data, which can cause a latency problem: i.e., a perceived delay before hearing the processed sound again; the frequency resolution is also diminished. Also, the order of the windows can be rearranged (BLUR WEAVE) or shuffled like a deck of cards (BLUR SHUFFLE), something that has a big effect on what we hear! Accumulating data from previous windows (FOCUS ACCU) both builds up frequency components and introduces sustaining in the interior of the sound. These windows / frames are given an amplitude pattern to avoid glitching, e.g., the 'Hamming window'. Take note: the 'window' here is not the superimposed amplitude contour but the frame, the time slice itself.
- overlap the number of samples after which the next frame begins, e.g., 256. The overlap factor is calculated as the FFT window size divided by the number of overlap samples: e.g., 1024 ÷ 256 gives an overlap factor of 4. This frame overlap improves the quality of the analysis.
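The channel and overlap arithmetic above can be sketched as follows. This is a minimal illustration, not CDP code; the sample rate, window size and hop values are just typical figures from the text.

```python
def bin_centre_frequency(bin_index, sample_rate=44100, window_size=1024):
    """Centre frequency (Hz) of an analysis channel ('bin'),
    assuming evenly spaced bins of width sample_rate / window_size."""
    return bin_index * sample_rate / window_size

def overlap_factor(window_size, hop_samples):
    """Overlap factor = window size / samples between frame starts."""
    return window_size // hop_samples

# Bin 10 sits at 10 * (44100 / 1024) ≈ 430.66 Hz.
print(round(bin_centre_frequency(10), 2))  # 430.66
# A 1024-sample frame advanced 256 samples at a time: overlap factor 4.
print(overlap_factor(1024, 256))  # 4
```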
SPECTRAL ENVELOPE amplitude profile of the frequencies in one analysis frame
Each frame of a spectral analysis contains the frequency and amplitude information for the partials (if any) in each of the channels ('bins') of the analysis. Each frame is therefore a snapshot, as it were, of the state of the partials in that time-slice, with the height of the vertical bars representing the amplitude level (energy) of the frequencies in the various channels. The graph moves from the lowest frequency on the left to the highest frequency on the right.
Thus the overall profile (which partials are present and their respective energy levels) of a single analysis frame is its spectral envelope. The spectral envelope of each successive frame normally differs, and the overall timbral pattern of the sound is built up as a series of overlapping frames. The frame shown here is within the attack portion of a Tibetan singing bowl sound. The whole sequence of frames (of varying contents) gives the overall 'timbral envelope' of the sound.
To be clear about this we need to distinguish between a number of different ways the term 'envelope' is used:
Some understanding of these various terms can be helpful when working on the CDP spectral processes involving window manipulations, formants, envelope transfers, transitions, and morphing.
- In the Time Domain, the 'envelope' averages the amplitude of the samples at time points in varying degrees of proximity. Csound refers to this as 'control data', as it is usually much slower than the sample (audio) rate.
- In the Spectral Domain, there is the amplitude pattern (contour shape) that is imposed on the block of samples to be analysed. This time-slice is a 'frame', and it is amplitude-shaped by a window function. This windowing contour shape is a form of envelope, but it is best to treat 'window' in this context as meaning the time-slice itself; the presence of a contour shape is taken for granted. That is, beware confusing the enveloped time-slice with the spectral envelope.
- In the Spectral Domain, the spectral envelope is the profile of the frequency energy levels in the various bins / channels of a single frame (time-slice). Thus it forms the (amplitude) profile across the frequencies from low to high for that instant. The full sequence of (overlapping) frames gives the time-varying spectral envelope, the ever-changing timbral content / colouration of the sound, often shown as a 'mountain' display.
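As a minimal illustration of the last sense, the spectral envelope of a single frame, here is a sketch that computes the magnitude of each bin with a naive DFT. Real analysis uses an FFT plus a window function such as the Hamming window; the frame here is a synthetic sine, chosen so all its energy lands in one bin.

```python
import cmath
import math

def frame_spectral_envelope(frame):
    """Magnitude of each frequency bin in one analysis frame.
    Naive DFT for illustration only (an FFT gives the same result faster)."""
    n = len(frame)
    env = []
    for k in range(n // 2 + 1):          # bins from DC up to Nyquist
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
        env.append(abs(s) / n)
    return env

# A pure sine at 4 cycles per frame puts all its energy in bin 4.
n = 32
frame = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
env = frame_spectral_envelope(frame)
print(env.index(max(env)))  # 4
```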
SPECTRUM the changing frequency content of a sound
The word 'spectrum' refers to a range of vibrations. The vibrations of the electro-magnetic field in space cover a huge range of wavelengths, from many light years to the wavelengths defined by the very granularity of space-time itself. Sound, on the other hand, is vibration in the physical medium of air. We are unable to hear sounds whose wavelengths are too long (the sounds are too low in pitch) or too short (the sounds are too high in pitch). Between these limits (normally 20 to 20,000 Hz), we may hear vibrations at any frequency. The mix of vibrations in any sound that we do hear is known as its spectrum.
The spectrum of a sound refers specifically to its overall vibrational content i.e., frequencies. Different frequencies come and go during the course of a sound, and this frequency content is always changing in some way. The FFT analysis finds the partials in each frame, and the sequence of frames gives the overall spectrum.
Most sounds are quite a complex set of (ever-changing) frequencies, and it is this which gives them their timbral colouration.
SPLICE join sounds together
In the early days of musique concrète, when tape recorders and magnetic tape were used, sounds were joined together physically with the help of a splicing bar and white splicing tape. The splicing bar had a groove into which the magnetic tape fitted snugly, and slanting (oblique) and vertical slots into which a razor blade was inserted in order to cut the tape. The oblique slot gave a soft join in which the two sounds overlapped, while the vertical slot gave a 'butt' join so that the second sound began without any overlap. Composers in the 'classical tape studio' were used to performing this operation hundreds of times during a composition.
Thus in the digital domain, this early terminology is still used:
- The act of 'splicing' means joining sounds together.
- The 'splice window' is the length of the overlap: one sound fades while the second gets louder.
- The 'splice window' or just 'splice' is usually calculated in milliseconds, and can be 0 (= a butt join). 15ms is generally accepted as the default (0.015 sec.) but in the CDP software can be as much as 5000ms (5 sec.) for very gradual, smooth, transitions, or as short as you wish.
- Long joins may cause dips in the amplitude level because the first sound is fading to 0 while the second is rising from 0; if the signals are low, there may be a section of low amplitude before the second sound rises in amplitude.
- Butt joins, because they do not necessarily begin at 0 amplitude, may produce clicks. A practical application would be to use a butt splice to split a large soundfile into chunks for writing to floppy disks, for later reconstruction on another machine. A splice window in this situation would cause dips in amplitude when the chunks were rejoined. Other than this, butt joins should never be used unless the first sound falls to zero and the second begins at zero. CDP's VIEWSF enables you to see the envelope at a single sample zoom, thereby precisely identifying the zero crossings.
- Splices as short as 1ms are often audibly acceptable.
The degree of accuracy needed when splicing may vary. Current software is fairly robust in making the cuts and joins without causing clicks, but when it is vital that there be not the slightest hint of a click or otherwise unwanted 'glitch', the CUT points for material to be spliced should be made at zero crossings.
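A splice window can be sketched as a simple crossfade over the overlap region. This is a hypothetical illustration in Python, not CDP's actual splice code (CDP may use a different fade shape); a linear fade is assumed.

```python
def splice(first, second, sample_rate=44100, splice_ms=15):
    """Join two sample lists with a linear crossfade of splice_ms.
    splice_ms = 0 gives a butt join (no overlap)."""
    n = int(sample_rate * splice_ms / 1000)
    n = min(n, len(first), len(second))
    out = first[:len(first) - n]
    for i in range(n):                      # overlap region
        fade = i / n                        # 0.0 -> 1.0 across the window
        out.append(first[len(first) - n + i] * (1 - fade) + second[i] * fade)
    out.extend(second[n:])
    return out

a = [1.0] * 100                             # 100 samples of full level
b = [0.0] * 100                             # 100 samples of silence
joined = splice(a, b, sample_rate=1000, splice_ms=20)  # 20 ms = 20 samples
print(len(joined))  # 180: the two sounds overlap by 20 samples
```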
TIMBRE the tone/'color' qualities of a sound
There are so few words with which to describe the tone quality of sounds! Those that we do find ourselves using are mostly inaccurate: thin, fuzzy, smooth, glowing, metallic, dull, bright ..., but these give us some idea of what is meant by the timbre of a sound. In particular, it refers to those qualities resulting from the frequency content of the sound: which vibrations are present, and how loud each of them is. The frequency content of the sound is its spectrum. This is not a single, fixed frequency configuration, but something which is constantly changing, especially as the sound moves from the all-important attack portion to its areas of 'sustain' and 'release'. The time-varying character of a sound's timbre is therefore absolutely crucial, and this is what the Phase Vocoder FFT analysis captures in intimate detail.
We are most familiar with tone quality as the recognisable sound of different musical instruments: flute, oboe, violin, horn, clarinet, trumpet etc. In the spectral dimension, tone qualities can be transformed in many amazing and subtle ways.
TIME time values
We are familiar with Metronome markings, e.g., crotchet (quarter note) = 60. This means 60 beats per minute, which is obviously 1 second duration for each beat. The general formula for calculating the duration of a note event from the Metronome mark is to divide 60 by the Metronome indication. Thus 60 (seconds in a minute) ÷ MM=60 = 1 sec., while 60 ÷ MM=120 = 0.5 sec.
Time in a digital music context is usually given in hours, minutes, seconds and milliseconds. Milliseconds can be relatively unfamiliar ground, as can varying MIDI clock rates and SMPTE frame rates (24 per second for film and 30 per second for video). How all this relates to beats and tempos can become rather complicated.
A millisecond (ms) divides the second into 1000 parts. Thus 1 ms = 0.001 second, 10 ms = 0.01 second and 100 ms = 0.1 second. A crotchet (quarter note) at MM=60 will be 1 second long. Thus a quaver (eighth note) at this tempo will be ½ second (500 ms), and a semi-quaver (sixteenth note) will be ¼ second (250 ms). For comparison, the smallest grainsize in CDP is 12.5 ms, or 0.0125 sec., producing a very smooth granular flow.
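The Metronome-to-milliseconds arithmetic above can be condensed into one small helper. A hypothetical sketch, not part of CDP:

```python
def beat_duration_ms(metronome_mark):
    """Duration of one beat in milliseconds: (60 / MM) * 1000."""
    return 60.0 / metronome_mark * 1000.0

print(beat_duration_ms(60))       # 1000.0 (crotchet = 1 second at MM=60)
print(beat_duration_ms(120))      # 500.0
print(beat_duration_ms(120) / 2)  # 250.0  (a quaver at MM=120)
```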
Very small values expressed in milliseconds are used for splice slopes (i.e., attack transients and decay times), delay times, loop lengths and steps, grainsizes etc. Events less than about 40 ms apart are perceived as virtually simultaneous, but at around 60 ms apart they start to be perceived as discrete (separate). The default splice slope in CDP is 15 ms. This is perceptually virtually simultaneous with the start of the sound, but it does smooth the beginning of the sound nonetheless. Longer splice slopes noticeably reduce the sharpness of the attack, if not remove it altogether. Applying this to DOVETAIL, which smooths the beginning and end of a sound, a value of 0.01 to 0.05 seconds (10 and 50 ms respectively) will have a smoothing effect without perceptually altering the attack. At the end of the sound, values of ½ second or 1 second give a smooth fade.
The TEXTURE Set currently requires that event durations be entered in seconds. The easiest way to do this is to relate everything to 1 second and then use the tempo parameter (called mult) when available. Our Time Chart describes how to 'bring numbers to life' and covers a range of calculations relating to beats, durations, tempo, correspondence between durations and musical notation, and hit points in film & video.
TIME DOMAIN digital representation of sound as amplitude and time sample data
To digitally sample a sound, we first convert the sound wave into an electrical wave which is an 'analogue' of the sound wave in the air, using a microphone. In the electrical wave, the time-varying displacement of the air (the amplitude of the wave) is represented by the time-varying voltage of the electrical signal.
When a sound is 'sampled', its analogue features (voltages) are converted into digital samples, each of which records an instantaneous amplitude at a given time. The number of samples per second of the sound is its 'sample rate'. There is no data in between these sample times, which is why the digital samples are said to be 'discrete' (separate) rather than continuous like the original displacement of air and its corresponding electrical signal.
We see sounds graphed as a waveform in sound editors, showing us the amplitude profile along the duration of the sound. Some editors can zoom this display down to the individual sample. The centre-line is zero; above this line is the positive part of the wave, and below it the negative.
This specific digital representation of the sound creates the 'time domain', and the sound can be further manipulated by altering those two parameters: amplitude and time. For example, the amplitude contour can be reshaped, the order of the samples can be reversed or shuffled, etc. Transposition in the time domain makes sounds faster (and therefore shorter) when the pitch level is raised, slower (and therefore longer) when lowered. Thus voices become fast and squeaky, or slow and growly. This does not happen in the spectral domain, where transposition does not affect the duration of the sound.
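The link between transposition and duration in the time domain can be illustrated with a crude resampling sketch. This is a hypothetical example using nearest-sample resampling (no interpolation), not CDP's transposition code:

```python
def transpose_time_domain(samples, ratio):
    """Transpose by resampling: ratio 2.0 raises the pitch an octave and
    halves the duration; 0.5 lowers an octave and doubles it."""
    out_len = int(len(samples) / ratio)
    return [samples[int(i * ratio)] for i in range(out_len)]

src = list(range(1000))                     # a 1000-sample sound
up = transpose_time_domain(src, 2.0)        # octave up: fast and squeaky
down = transpose_time_domain(src, 0.5)      # octave down: slow and growly
print(len(up), len(down))  # 500 2000
```

A spectral-domain transposition, by contrast, shifts the partials in each analysis window and leaves the number of windows, and hence the duration, unchanged.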
WAVECYCLE a complete wavelength of a sound
A wavecycle is a complete wavelength of a sound, before the pattern repeats itself. The key word here is 'pattern'. It is the recurring pattern of oscillation that enables us to identify a wavelength. The sine wave is the simplest and clearest example: it starts at 0 or some other non-zero amplitude, rises to its peak positive amplitude, falls to its peak negative amplitude, crossing 0 to do so, and then rises again back to the 0 or other non-zero amplitude at which it started, thus completing its full wavelength.
Sounds with a high noise content oscillate in a wild and randomised way. Regular pattern is absent, and we cannot speak in terms of 'wavecycles'.
Also see the description of wavesets, another subtly different wave concept that forms the basis of CDP's distortion programs.
WAVEFORM a single oscillation containing a positive and negative phase
An oscillation comprises an up/down, in/out, back and forth motion, however one wants to look at it. A vibrating drumskin goes up and down, the membrane of a speaker cone goes in and out, a violin string goes up and down before repeating. One direction is represented by positive numbers, and the other by negative numbers, with the central position being zero.
When the numbers for a waveform are shown graphically on an X-Y axis, as on an oscilloscope, we clearly see that each oscillation has a characteristic shape, normally very complex. The simplest forms of these contour shapes define a basic set of shapes, as shown in the diagram below:
WAVESET (pseudo-)wavecycles between zero crossings
A complex waveform makes zero crossings at irregularly spaced time points. This forms the basis of the waveset distortion techniques developed by Trevor Wishart as part of the CDP software.
A waveset is defined as that part of the signal between every 2nd zero crossing. In general a waveset and a wavecycle are not the same thing. With a simple sine wave, the waveform crosses zero twice in each cycle (once in the middle and once at the end), but in general a waveform may cross zero any (even) number of times within a wavecycle, or, with noise, zero crossings may have no relationship to a regular wavecycle at all. Hence wavesets rarely correspond to wavecycles.
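The definition can be sketched directly in code: find the zero crossings, then take the signal between every second one. A minimal illustration of the idea, not CDP's implementation (its boundary handling may differ):

```python
import math

def wavesets(samples):
    """Split a signal into wavesets: segments between every 2nd
    zero crossing (sign change)."""
    crossings = [i for i in range(1, len(samples))
                 if (samples[i - 1] < 0) != (samples[i] < 0)]
    # Each waveset spans two consecutive crossings, so pair them up.
    return [samples[crossings[k]:crossings[k + 2]]
            for k in range(0, len(crossings) - 2, 2)]

# For a simple sine wave, every waveset is one full cycle (here 32 samples);
# the half-sample phase offset keeps the zeros between samples.
sig = [math.sin(2 * math.pi * (t + 0.5) / 32) for t in range(320)]
print({len(w) for w in wavesets(sig)})  # {32}
```

On noisy material the same function returns segments of wildly varying lengths, which is exactly what the waveset distortion processes exploit.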
Also see the description in the Reference Manual.
WINDOW the momentary spectrum of a sound derived by FFT analysis
The FFT analysis converts a block of sound samples (a frame) into a block of spectral data, which describes the (momentary) spectrum of the sound at the time where the block-of-samples was found. The momentary spectrum produced is referred to as a spectral window. For technical reasons, the frame of sound has to be given an amplitude contour shape before being converted to a spectral window. The amplitude contour is provided by a window function. (The CARL Phase Vocoder implemented in the CDP software uses the Hamming window function. Other contour shapes are the Kaiser, Blackman-Harris and Triangular.) Note that this window function is applied to the sound before the spectrum is produced and has nothing whatever to do with the shape of the spectrum (the spectral envelope) that results from the FFT analysis.
The 'time slice' here is the block of samples that will be analysed at one time. This block of samples is a frame, also referred to as a window. It is taken for granted that this 'window' will be given an amplitude contour shape prior to FFT analysis in order to prevent glitches (sudden changes in amplitude level), but we mustn't confuse the window as time slice with the contour shape imposed on it (even though these contour shapes are named, e.g., the 'Hamming window'), nor the contour shape with the spectral envelope.
In practical use, we focus on the fact that the window is a time slice, and take the presence of a contour shape for granted. Thus, when a CDP program such as BLUR SHUFFLE or BLUR WEAVE talks about moving windows around, it is referring to time slices of the original soundfile, i.e., (extremely tiny) segments of soundfile.
Richard Dobson has provided us with a detailed technical discussion of these matters in his article The Operation of the Phase Vocoder, particularly Section 4., The FFT Window.
The analysis looks through this time slice / frame / window in a series of frequency bands (or 'bins') that may or may not contain significant partials. The profile of the partials it finds in all the bins of that time slice is called the spectral envelope.
Another issue with the time slice is that there is a frequency vs. time resolution trade-off. A typical window size is 1024 samples. The window size determines the number of frequency bands (= 'bins' = 'channels') into which that time slice is divided for analysis. A longer window has a finer frequency resolution (it picks out the partials more effectively), but at the cost of time resolution (some time-varying detail is lost). The sample rate divided by the window size gives the frequency resolution: e.g., a 44100 sample rate ÷ 1024 samples in the analysis block = 43.07 Hz. Thus, the larger the window size, the finer the frequency resolution. This is discussed in considerably more detail in The analysis 'bandwidth' and Channels sections of the Phase Vocoder Manual, in Section 5 of Richard Dobson's article: Window length, and above under Analysis Settings.
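The trade-off can be tabulated with a couple of lines of arithmetic. A minimal sketch; the figures simply follow from the formulas above:

```python
def analysis_resolutions(sample_rate, window_size):
    """Frequency resolution (Hz per channel) and time resolution
    (ms per frame) for a given FFT window size."""
    freq_res_hz = sample_rate / window_size
    time_res_ms = window_size / sample_rate * 1000.0
    return freq_res_hz, time_res_ms

for size in (256, 1024, 4096):
    f, t = analysis_resolutions(44100, size)
    print(size, round(f, 2), round(t, 2))
# 256  -> 172.27 Hz, 5.8 ms   (coarse frequency, fine time)
# 1024 -> 43.07 Hz, 23.22 ms  (the typical compromise)
# 4096 -> 10.77 Hz, 92.88 ms  (fine frequency, coarse time)
```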
Also see the section on Windows in the Phase Vocoder Manual.
ZERO CROSSING a sample in a waveform with zero amplitude
When a speaker cone moves back and forth, it creates a forward and backward vibration that passes through a mid-point. The depth of the movement equates to the amplitude of the sound, and the mid-point therefore equates to zero amplitude. When this movement is converted to digital data, it is shown in graph form, with the mid-line as the zero amplitude mid-point, forward movement above the line and backward movement below it.
A simple sine tone moves smoothly and steadily above and below the mid-line in every full oscillation, but more complex waves have more intricate patterns above and below before they pass through the mid-point.
Passing through the mid-point is called the 'zero-crossing' (amplitude = zero), and this may happen at regular or irregular time intervals depending on the complexity of the waveform. CDP refers to portions of the signal spanning two zero-crossings as wavesets or 'pseudo-wavecycles', and they form the basis of its (Trevor Wishart's) waveset distortion routines.
Editing (i.e., CUT) a soundfile at zero-crossings ensures that there will not be a click at the start of the cut portion of sound.
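Finding a suitable cut point can be sketched as a search for the sign change nearest the desired edit position. A hypothetical helper for illustration, not a CDP function:

```python
def nearest_zero_crossing(samples, index):
    """Return the sample index of the sign change closest to `index`,
    a sensible place to cut to avoid clicks."""
    best, best_dist = None, None
    for i in range(1, len(samples)):
        if (samples[i - 1] < 0) != (samples[i] < 0):   # sign change
            d = abs(i - index)
            if best is None or d < best_dist:
                best, best_dist = i, d
    return best

sig = [1.0, 0.5, -0.5, -1.0, -0.5, 0.5, 1.0]
# Sign changes occur at indices 2 and 5; index 5 is nearer to 4.
print(nearest_zero_crossing(sig, 4))  # 5
```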
Last updated: 2 October 2004 - AE
Contributors: Written by A Endrich, with vital suggestions, technical corrections and additions by T Wishart, R Dobson and R Fraser
© 2004 A Endrich & CDP. This material may be reproduced without permission from CDP. Any suggestions for additional terms or regarding technical accuracy would be appreciated.