Each vocal sound has physical characteristics that can be related to perceptual evaluations of that sound. Frequency (measured in hertz, Hz) corresponds to perceived pitch, sound level (measured in decibels, dB) to loudness, spectrum characteristics to perceived vocal timbre, and duration patterns to rhythm. All four of these physical characteristics, and their perceptual counterparts, can be quantified. Here we focus only on the measurement of the physical characteristics of the sound, in other words, on acoustic analysis.

Voice is the result of the interaction between three subsystems that constitute the vocal apparatus (see Figure 1): the power source (the air stream driven from the lungs), the sound source (the modulated airflow generated when the closing vocal folds interrupt the air stream), and the sound modifiers (the articulators that modify the length and shape of the vocal tract and thus its resonance frequencies) (Sundberg, 1987).

Figure 1. Subsystems that constitute the vocal apparatus (adapted from Welch & Sundberg, 2002: 253)

The sound source creates pressure variations which travel back and forth through the vocal tract, are radiated mostly through the mouth, and may reach our ears (Herbst, 2017). Figure 2 represents the propagation of a pure tone. Such pressure variations can be captured by a microphone, which converts them into an electrical signal; a computer sound card or external device then converts this into a digital signal. The digital signal is recorded by software and stored as an audio file.

Figure 2. Graphical representation of a simple waveform (sine wave) (above) and corresponding distribution of air particles (below).

Different sounds have distinct acoustic pressure waveforms. To characterise these waveforms quantitatively, one can apply a mathematical theorem, Fourier's theorem, which in digital analysis is usually computed with the Fast Fourier Transform (FFT) algorithm. The theorem states that any periodic waveform can be described as a set of individual frequency components (Fourier components). Each Fourier component is a sine wave and, when summed, they produce the complex periodic waveform analysed. All sine waves that constitute a complex periodic waveform have a special relationship: their frequencies are all integer multiples (1, 2, 3, 4, 5, etc.) of the fundamental frequency (fo), i.e., 1×fo, 2×fo, 3×fo, 4×fo, 5×fo, etc. This bouquet of frequency components is referred to as a harmonic series. Figure 3 (left) represents a complex periodic waveform with 10 harmonics, each a sine wave. The frequencies of these 10 harmonics are all integer multiples of fo, and the lower ones form specific musical intervals with one another (e.g., an octave between the first and the second harmonic). A sound can also be represented graphically as a spectrum displaying all its components, which decomposes the time-domain waveform into its frequency-domain spectrum. Figure 3 (right) displays such a spectrum. Here, the vertical axis represents the amplitude of each individual harmonic whereas the horizontal axis represents its frequency.

Figure 3. Harmonic series of a complex tone (left) and its graphical representation as a spectrum display at a single moment in time of the waveform (right) (adapted from Howard & Murphy, 2008: 12).
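The decomposition of a complex periodic waveform into its harmonics can be illustrated numerically. The following is a minimal sketch in Python, using only the standard library: it builds a complex tone from 10 sine-wave harmonics and then recovers their frequencies and amplitudes with a direct discrete Fourier transform (an FFT computes the same result, only faster). The fundamental frequency (100 Hz), the sampling rate (8000 Hz) and the 1/k amplitude pattern are assumptions chosen for illustration, not values taken from the text.

```python
import cmath
import math

FS = 8000          # sampling rate in Hz (assumed)
FO = 100.0         # fundamental frequency in Hz (assumed)
N = 800            # 0.1 s of signal = exactly 10 periods of fo
N_HARMONICS = 10

# Complex periodic waveform: sum of sine waves at integer multiples of fo,
# with amplitude 1/k for the k-th harmonic (an arbitrary illustrative choice).
signal = [
    sum(math.sin(2 * math.pi * k * FO * n / FS) / k
        for k in range(1, N_HARMONICS + 1))
    for n in range(N)
]

def amplitude_spectrum(x):
    """Return (frequencies, amplitudes) for bins 0..N/2 via a direct DFT."""
    n_samples = len(x)
    freqs, amps = [], []
    for b in range(n_samples // 2 + 1):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * b * n / n_samples)
                  for n in range(n_samples))
        freqs.append(b * FS / n_samples)
        # Scale to sine amplitude (valid for bins between 0 Hz and Nyquist).
        amps.append(2 * abs(acc) / n_samples)
    return freqs, amps

freqs, amps = amplitude_spectrum(signal)

# Keep only components that clearly stand out from numerical noise.
harmonics = [(f, a) for f, a in zip(freqs, amps) if a > 0.01]
for f, a in harmonics:
    print(f"{f:7.1f} Hz  amplitude {a:.3f}")
```

Because the analysed segment contains a whole number of periods, each harmonic falls exactly on one analysis bin. With real recordings this is rarely the case, and a tapered analysis window is normally applied first.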

Spectral shape relates to perceived timbre (Howard & Murphy, 2008; Titze & Verdolini, 2012). Therefore, acoustic analyses are commonly used in clinical and educational settings. The study of the captured acoustic pressure variations over time, i.e., waveforms, allows both perceptual and acoustical evaluations of voice production. In the voice clinic, analysing the acoustical signal aids the diagnosis of voice function and vocal health (Behlau et al., 2022; Hillenbrand, 2011). In addition, these types of analysis can be applied to quantify the effects of therapeutic and surgical interventions (Behlau et al., 2022; Hillenbrand, 2011; Sundberg, 1987; Titze & Verdolini, 2012). In educational settings, voice teachers use information from acoustic analysis as real-time visual feedback to guide students in developing knowledge of results (Welch et al., 2005) and thus promote their autonomy in finding strategies for solving problems (Lã, 2012) and for achieving particular aesthetic goals (Vennard, 1967; Nair, 1999; McCoy, 2004).

The spectrum of a single moment in time is called a power spectrum (Lã, 2012), where the horizontal axis represents frequency and the vertical axis amplitude (see Figure 4), although in some software the axes are switched. Measures based on spectra include spectral slope (i.e., the average slope of the spectrum envelope), H1-H2 (i.e., the amplitude difference between the first and the second harmonic partials), and the Long-Term Average Spectrum, LTAS (i.e., the spectrum averaged over a given period of time) (Howard & Murphy, 2008).

Figure 4. Power spectrum of an audio signal at a single moment in time, displaying frequencies and intensities of the lowest 11 harmonics.
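As an illustration of a spectrum-based measure, the sketch below computes H1-H2 as the level difference, in dB, between the first and second harmonic. It assumes that the spectrum is already available as lists of bin frequencies and linear amplitudes and that fo is known; the function name, the search tolerance and the example spectrum are hypothetical.

```python
import math

def h1_minus_h2_db(freqs, amps, fo, tolerance=0.1):
    """Level difference in dB between the first and second harmonic.

    freqs/amps: spectrum bin frequencies (Hz) and linear amplitudes.
    fo: fundamental frequency in Hz (assumed to be known already).
    tolerance: search range around each harmonic, as a fraction of fo.
    """
    def peak_near(target):
        candidates = [a for f, a in zip(freqs, amps)
                      if abs(f - target) <= tolerance * fo]
        if not candidates:
            raise ValueError(f"no spectrum bin near {target} Hz")
        return max(candidates)

    h1 = peak_near(fo)
    h2 = peak_near(2 * fo)
    return 20 * math.log10(h1 / h2)

# Hypothetical spectrum: bins every 50 Hz, harmonics of a 200 Hz tone.
freqs = [50.0 * i for i in range(21)]
amps = [0.001] * len(freqs)
amps[4] = 1.0    # 200 Hz, first harmonic
amps[8] = 0.5    # 400 Hz, second harmonic
amps[12] = 0.2   # 600 Hz, third harmonic

print(f"H1-H2 = {h1_minus_h2_db(freqs, amps, fo=200.0):.1f} dB")
```

In this example the first harmonic has twice the amplitude of the second, which corresponds to a difference of about 6 dB.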

A spectrogram is a graphic representation displaying variations of spectra over time. Thus, spectrograms contain three-dimensional information, where time is represented on the X axis, frequency on the Y axis, and intensity by the colour. Spectrograms show timbral characteristics in sound sequences or sustained sounds. They allow analysis of the combined effects of glottal behaviour and vocal tract movements, such as vowels and consonants during speech and singing (Lã, 2012; Stemple et al., 2020).
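The idea of a spectrogram as a sequence of spectra can be sketched as follows, in Python with the standard library only. The signal (a 200 Hz tone followed by a 400 Hz tone), the sampling rate and the frame length are assumptions for illustration. For simplicity the frames do not overlap and no tapering window is applied, whereas spectrogram software normally uses overlapping, tapered frames.

```python
import cmath
import math

FS = 4000      # sampling rate in Hz (assumed)
FRAME = 100    # samples per analysis frame (25 ms, assumed)

# Test signal: 0.1 s at 200 Hz followed by 0.1 s at 400 Hz.
signal = [math.sin(2 * math.pi * 200 * n / FS) for n in range(400)]
signal += [math.sin(2 * math.pi * 400 * n / FS) for n in range(400)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..len(frame)/2 for one frame."""
    size = len(frame)
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * b * n / size)
                for n in range(size)))
        for b in range(size // 2 + 1)
    ]

# Spectrogram as a list of spectra: rows are time frames, columns are
# frequency bins, and each value is the intensity at that time and frequency.
spectrogram = [
    magnitude_spectrum(signal[start:start + FRAME])
    for start in range(0, len(signal) - FRAME + 1, FRAME)
]

bin_hz = FS / FRAME  # frequency spacing between adjacent bins
peak_freqs = [
    max(range(len(spectrum)), key=spectrum.__getitem__) * bin_hz
    for spectrum in spectrogram
]
for i, freq in enumerate(peak_freqs):
    print(f"t = {i * FRAME / FS:.3f} s  strongest component: {freq:.0f} Hz")
```

In a plotted spectrogram, each row of this table becomes one vertical slice of the image, with the values mapped to colour.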

Spectrograms can be set to present data according to the concentration of energy either in wide frequency bands (wide band spectrograms) or in narrow frequency bands (narrow band spectrograms) (see Figure 5). The former is more appropriate for showing how formant frequencies vary in time, i.e., resonance strategies, whereas the latter offers a more detailed display of how individual harmonic partials vary with time (Lã, 2012).

With wide band spectrograms it is possible to recognise vowels, consonants, and vocal tract resonance or formant frequencies in terms of worm-like patterns, reflecting articulatory gestures. Such information is commonly used in phonetic studies aiming at identifying outcomes of vocal tract configurations. It is also applied in forensics, when experts need to differentiate individuals’ idiomatic speech patterns or voice characteristics. In addition, it can be applied by teachers of singing (Koenig, 1986; Lã, 2012; Sundberg, 1987; Welch et al., 2005). Narrow band spectrograms display how partials vary through time. It is possible to visualise the fundamental frequency (fo) contour in a sentence or in a sustained tone for analysing prosody, vibrato, pitch matching, register breaks and many other voice source related properties (Lã, 2012; Roubeau et al., 2009). Besides the vibration of the vocal folds, spectrographic evaluation can assess the underlying interactions of glottal vibrations and air stream during voice onsets (Lã, 2012).

Figure 5. Wide (left) and narrow (right) band spectrograms, displaying the vowel /a/ sung in D4 with modal register (adapted from Lã, 2012: 101).
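The difference between the two displays comes from the length of the analysis window: a short window gives fine time resolution but wide frequency bands, and a long window gives the opposite. The sketch below illustrates this with a rough rule of thumb, assuming a Hann window, whose main lobe is 4/T Hz wide (null to null) for a window of T seconds. The window lengths of 5 ms and 30 ms are typical values assumed here for illustration and are not taken from the text; the function names are hypothetical.

```python
def hann_mainlobe_width_hz(window_s):
    """Null-to-null main-lobe width of a Hann window of the given duration."""
    return 4.0 / window_s

def harmonics_resolved(window_s, fo_hz):
    """Rule of thumb: harmonics spaced fo apart appear as separate lines
    when the main lobe is narrower than their spacing."""
    return hann_mainlobe_width_hz(window_s) < fo_hz

FO_D4 = 293.66  # Hz, D4 in equal temperament with A4 = 440 Hz

for label, window_s in [("wide band", 0.005), ("narrow band", 0.030)]:
    width = hann_mainlobe_width_hz(window_s)
    verdict = "resolved" if harmonics_resolved(window_s, FO_D4) else "merged"
    print(f"{label:11s} window {window_s * 1000:4.0f} ms  "
          f"main lobe {width:5.0f} Hz  harmonics {verdict}")
```

Under these assumptions the short window smears neighbouring harmonics of D4 into broad bands, which is what makes formant regions stand out, while the long window keeps the harmonics apart.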

The Long-Term Average Spectrum (LTAS) is a commonly used tool for displaying important physical properties of sounds. It shows the average amplitude in different frequency bands along the vertical axis and the frequencies of the bands along the horizontal one (see Figure 6). Peaks in an LTAS represent averages of formant frequencies. A commonly used type of measure emerging from LTAS analysis is spectral balance, i.e., the amplitude difference between a low and a high frequency part of the LTAS curve. For example, the alpha ratio specifies the relationship between the energy above and below 1000 Hz in an LTAS, a measure that often correlates with vocal loudness but also with timbral brilliance.

Figure 6. Long-term-average spectrum (LTAS) of a soprano singing “O mio babbino caro” from the opera Gianni Schicchi by Giacomo Puccini. SPL stands for sound pressure level.
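A minimal sketch of an LTAS and of an alpha ratio derived from it is given below, in Python with the standard library only. The LTAS is obtained by averaging the power spectra of consecutive frames, and the alpha ratio is taken as the energy at or above 1000 Hz relative to the energy below it, expressed in dB. The test signal, sampling rate, frame length and the exact band limits (0 Hz to 1000 Hz, and 1000 Hz to half the sampling rate) are assumptions for illustration; published studies may use other band limits.

```python
import cmath
import math

FS = 8000     # sampling rate in Hz (assumed)
FRAME = 160   # samples per frame: 20 ms, 50 Hz bin spacing (assumed)

# Hypothetical signal: a strong 500 Hz component plus a weaker 2000 Hz one.
signal = [
    math.sin(2 * math.pi * 500 * n / FS)
    + 0.1 * math.sin(2 * math.pi * 2000 * n / FS)
    for n in range(5 * FRAME)
]

def power_spectrum(frame):
    """Power of DFT bins 0..len(frame)/2 for one frame."""
    size = len(frame)
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * b * n / size)
                for n in range(size))) ** 2
        for b in range(size // 2 + 1)
    ]

def ltas(x, frame_size):
    """Average the power spectra of consecutive non-overlapping frames."""
    spectra = [power_spectrum(x[i:i + frame_size])
               for i in range(0, len(x) - frame_size + 1, frame_size)]
    return [sum(column) / len(spectra) for column in zip(*spectra)]

def alpha_ratio_db(ltas_power, bin_hz, split_hz=1000.0):
    """Energy at or above split_hz relative to energy below it, in dB."""
    low = sum(p for b, p in enumerate(ltas_power) if b * bin_hz < split_hz)
    high = sum(p for b, p in enumerate(ltas_power) if b * bin_hz >= split_hz)
    return 10 * math.log10(high / low)

average_spectrum = ltas(signal, FRAME)
alpha = alpha_ratio_db(average_spectrum, FS / FRAME)
print(f"alpha ratio = {alpha:.1f} dB")
```

Here the high-frequency component has one tenth of the amplitude of the low-frequency one, i.e., one hundredth of its power, so the alpha ratio is about -20 dB. A louder or more brilliant voice would typically move this value towards zero.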

Besides qualitative analysis of voice timbre, acoustic analysis provides quantitative information on the regularity of the vocal fold vibrations in terms of voice perturbation measures. Traditionally, voice regularity has been assessed monoparametrically in terms of frequency and amplitude perturbations and noise levels. Jitter measures the cycle-to-cycle variation of fo. It can be extracted either directly from consecutive cycles (absolute and relative jitter) or after smoothing over neighbouring cycles (e.g., a three-cycle average for the Relative Average Perturbation, RAP). Shimmer reflects the amplitude variation. Like jitter, shimmer can be assessed cycle-to-cycle (absolute and relative shimmer) or after smoothing over neighbouring cycles (e.g., an 11-cycle average for the Amplitude Perturbation Quotient, APQ). The rationale for smoothing over several cycles is to reduce the sensitivity of the measure to short-term variation caused, e.g., by fo or articulatory changes (Hillenbrand, 2011; Stemple et al., 2020).
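The perturbation measures described above can be sketched in a few lines of Python. The sketch assumes that the duration and peak amplitude of each glottal cycle have already been extracted, which is the difficult part in practice. The formulas follow commonly used definitions (mean absolute difference between consecutive cycles, optionally relative to the mean, or relative to a local average of 3 or 11 cycles); the function names and the example values are hypothetical, and individual software packages may differ in detail.

```python
import math

def mean(values):
    return sum(values) / len(values)

def local_perturbation(values):
    """Mean absolute difference between consecutive cycles."""
    return mean([abs(b - a) for a, b in zip(values, values[1:])])

def relative_perturbation(values):
    """Cycle-to-cycle perturbation as a percentage of the mean value."""
    return 100 * local_perturbation(values) / mean(values)

def smoothed_perturbation(values, span):
    """Mean absolute deviation of each cycle from the average of the `span`
    cycles centred on it, as a percentage of the mean value.
    span=3 corresponds to RAP (periods), span=11 to APQ (amplitudes)."""
    if len(values) < span:
        raise ValueError("need at least `span` cycles")
    half = span // 2
    deviations = [
        abs(values[i] - mean(values[i - half:i + half + 1]))
        for i in range(half, len(values) - half)
    ]
    return 100 * mean(deviations) / mean(values)

def shimmer_db(amplitudes):
    """Mean absolute level difference between consecutive cycles, in dB."""
    return mean([abs(20 * math.log10(b / a))
                 for a, b in zip(amplitudes, amplitudes[1:])])

# Hypothetical glottal cycle measurements (period in s, peak amplitude).
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100, 0.0103, 0.0098]
amplitudes = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.78]

print(f"absolute jitter : {local_perturbation(periods) * 1e6:.0f} microseconds")
print(f"relative jitter : {relative_perturbation(periods):.2f} %")
print(f"RAP             : {smoothed_perturbation(periods, 3):.2f} %")
print(f"relative shimmer: {relative_perturbation(amplitudes):.2f} %")
print(f"shimmer (dB)    : {shimmer_db(amplitudes):.3f} dB")
```

Because the smoothed variants compare each cycle with a local average rather than with its immediate neighbour, a slow drift in fo or amplitude contributes less to them than to the cycle-to-cycle measures.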

The human voice also generates non-periodic signals, or noise (see Figure 7). Measuring noise levels in voice offers information on the degree of periodicity of the glottal cycles and on incomplete glottal closure during vocal fold vibration. For example, Harmonic-to-Noise ratio (HNR), Noise-to-Harmonic ratio (NHR) and Glottal-to-Noise Excitation (GNE) relate the noise components to the harmonic components of the signal, presenting measurable data that are related to the perception of breathiness and roughness (Behlau et al., 2022; Hillenbrand, 2011).

Figure 7. Example of spectra of a non-breathy (left) and breathy (right) voice.
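One common way of estimating a harmonics-to-noise ratio is from the autocorrelation of the signal: if r is the normalised autocorrelation at a lag of one period, the periodic part carries a fraction r of the energy and the noise a fraction 1 - r. The sketch below follows that idea, under the assumptions that the period is known in advance and that the signal is a synthetic tone with added random noise. The signal parameters and the function name are hypothetical.

```python
import math
import random

def hnr_db(x, period_samples):
    """Harmonics-to-noise ratio from the normalised autocorrelation
    at a lag of one period (the period is assumed to be known)."""
    a = x[:-period_samples]
    b = x[period_samples:]
    r = sum(p * q for p, q in zip(a, b)) / math.sqrt(
        sum(p * p for p in a) * sum(q * q for q in b))
    r = min(max(r, 1e-12), 1 - 1e-12)   # keep the logarithm finite
    return 10 * math.log10(r / (1 - r))

FS = 8000                 # sampling rate in Hz (assumed)
FO = 200                  # fundamental frequency in Hz (assumed)
PERIOD = FS // FO         # 40 samples per cycle

rng = random.Random(0)    # fixed seed so the example is repeatable
clean = [math.sin(2 * math.pi * FO * n / FS) for n in range(4000)]
noisy = [s + rng.gauss(0.0, 0.1) for s in clean]  # same tone plus noise

print(f"clean tone: HNR = {hnr_db(clean, PERIOD):.1f} dB")
print(f"noisy tone: HNR = {hnr_db(noisy, PERIOD):.1f} dB")
```

The perfectly periodic tone reaches the upper limit imposed by the clamping of r, whereas the added noise lowers the estimate substantially. NHR expresses the same relationship the other way round.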

Some measures, such as jitter and shimmer, are difficult to relate to perceptual characteristics of voice, and there is no consensus on their clinical relevance (Behlau et al., 2022; Hillenbrand, 2011).

Another visual representation of sound is the cepstrum, which is the spectrum of the spectrum: the logarithmic magnitude spectrum is treated as if it were a signal and is transformed once more, which converts it back to a time-like domain known as quefrency (Behlau et al., 2022; Stemple et al., 2020). It reflects both periodicity and spectral properties of the signal (i.e., the harmonic components): a strongly periodic voice produces a prominent cepstral peak at the quefrency corresponding to the fundamental period. A commonly used measure from cepstrum analysis is Smoothed Cepstral Peak Prominence (CPPS), which has been shown to be clinically related to voice disorders (Englert et al., 2020).
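The following sketch shows, under idealised assumptions, how a cepstrum can be computed and how its main peak relates to the fundamental period. It takes the spectrum of a synthetic harmonic signal, converts it to a logarithmic magnitude scale, and transforms it again. The signal parameters and the pitch search range of 60 to 400 Hz are assumptions for illustration, and the sketch stops at locating the peak; it does not compute CPPS itself.

```python
import cmath
import math

FS = 4000       # sampling rate in Hz (assumed)
FO = 100.0      # fundamental frequency in Hz (assumed)
N = 400         # 0.1 s of signal = exactly 10 periods

# Harmonic-rich periodic signal: 15 harmonics with 1/k amplitudes.
signal = [
    sum(math.sin(2 * math.pi * k * FO * n / FS) / k for k in range(1, 16))
    for n in range(N)
]

def dft(x, sign=-1):
    """Direct discrete Fourier transform (sign=+1: unscaled inverse)."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(sign * 2j * math.pi * b * n / size)
            for n in range(size))
        for b in range(size)
    ]

# Step 1: spectrum of the signal, on a logarithmic magnitude scale.
# The small constant avoids log(0) in empty bins.
log_spectrum = [math.log(abs(c) + 1e-6) for c in dft(signal)]

# Step 2: transform the log spectrum again; the result is indexed by
# quefrency, which has units of time (here: samples).
cepstrum = [c.real / N for c in dft(log_spectrum, sign=+1)]

# Step 3: the strongest peak in a plausible pitch range marks the period.
low_q = int(FS / 400)    # shortest period considered (fo = 400 Hz)
high_q = int(FS / 60)    # longest period considered (fo = 60 Hz)
peak_q = max(range(low_q, high_q + 1), key=cepstrum.__getitem__)

print(f"cepstral peak at {peak_q / FS * 1000:.1f} ms "
      f"-> fo estimate {FS / peak_q:.1f} Hz")
```

The regularly spaced harmonics in the log spectrum behave like a periodic pattern along the frequency axis, and the second transform turns that regularity into a single peak at the period of the voice. The weaker or noisier the harmonic structure, the less this peak stands out.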

Currently, multiparametric analyses are being adopted in order to improve the accuracy of acoustic evaluations and their clinical relevance. One example is the Acoustic Voice Quality Index (AVQI), which combines a number of measures (CPPS, HNR, relative and absolute shimmer, and overall spectrum slope) into one single final score (Behlau et al., 2022; Maryn et al., 2010). Another recent form of acoustic analysis is the application of non-linear dynamic analyses. Such analyses have been reported as suitable for any kind of voice, even completely aperiodic ones. Examples of non-linear analysis are the Hilbert-Huang Transform and Wavelet Analysis (Lopes et al., 2019).

Whatever type of acoustical analysis is applied, careful attention must be paid to the recording procedures. Several factors can affect the results of acoustic measurements, such as microphone type and placement (Titze & Winholtz, 1993), vowels (Kiliç et al., 2004), vocal intensity (de Oliveira Florencio et al., 2021) and sample length (Englert et al., 2020). Specific recommendations on how to perform acoustical analysis are described elsewhere (Patel et al., 2018). Detailed guidelines on microphone selection and voice intensity measures from acoustical signals can be found in two articles by Švec and Granqvist (2010; 2018). Careful choice of the microphone is extremely important. Microphones for stage or studio use often have a peak in their frequency response around 5 to 10 kHz and are therefore not suitable for acoustical measurements. Dynamic/directional microphones have a distance-dependent sensitivity curve. Thus, omnidirectional microphones with a flat response over the entire audible frequency range are recommended if the aim is to perform acoustic analysis in both clinical and educational settings.

Further reading:

Behlau, M., Madazio, G., Vaiano, T., Pacheco, C., & Badaró, F. (2022). Voice evaluation–contribution of the speech-language pathologist voice specialist–SLP-V: Part B. Acoustic analysis, physical examination and correlation of all steps with the medical diagnoses. Hearing, Balance and Communication, 1-7.

de Oliveira Florencio, V., Almeida, A. A., Balata, P., Nascimento, S., Brockmann-Bauser, M., & Lopes, L. W. (2021). Differences and Reliability of Linear and Nonlinear Acoustic Measures as a Function of Vocal Intensity in Individuals With Voice Disorders. Journal of Voice. In Press.

Englert, M., Lima, L., Latoszek, B. B. V., & Behlau, M. (2020). Influence of the voice sample length in perceptual and acoustic voice quality analysis. Journal of Voice. In Press.

Herbst, C. T. (2017). A review of singing voice subsystem interactions—toward an extended physiological model of “support”. Journal of Voice, 31(2), 249-e13.

Hillenbrand, J. M. (2011). Acoustic analysis of voice: a tutorial. Perspectives on Speech Science and Orofacial Disorders, 21(2), 31-43.

Howard, D.M. & Murphy, D.T. (2008). Voice Science Acoustics and Recording. Plural Publishing.

Kiliç, M. A., Öğüt, F., Dursun, G., Okur, E., Yildirim, I., & Midilli, R. (2004). The effects of vowels on voice perturbation measures. Journal of Voice, 18(3), 318-324.

Koenig, B. E. (1986). Spectrographic voice identification: a forensic survey. The Journal of the Acoustical Society of America, 79(6), 2088-2090.

Lã, F. M. (2012). Teaching singing and technology. Aspects of singing II-unit in understanding-Diversity in aesthetics, 88-109.

Lopes, L., Dajer, E. & Camargo, Z. (2019). Análise acústica na clínica vocal. In L. Lopes, F. Moreti, L. L. Ribeiro & E. C. Pereira (eds). Fundamentos e atualidades em voz clínica. Thieme Revinter. 31-47.

Maryn, Y., De Bodt, M., & Roy, N. (2010). The Acoustic Voice Quality Index: toward improved treatment outcomes assessment in voice disorders. Journal of Communication Disorders, 43(3), 161-174.

McCoy, S. (2004). Your voice: An Inside View: multimedia voice science and pedagogy. Inside View Press.

Miller, D. G. (2008). Resonance in singing: Voice building through acoustic feedback. Inside View Press.

Nair, G. (1999). Voice – Tradition and Technology: a state-of-the-art studio. Singular Publishing Group.

Patel, R. R., Awan, S. N., Barkmeier-Kraemer, J., Courey, M., Deliyski, D., Eadie, T., … & Hillman, R. (2018). Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function. American Journal of Speech-Language Pathology, 27(3), 887-905.

Roubeau, B., Henrich, N. & Castellengo, M. (2009). Laryngeal vibratory mechanisms: the notion of vocal register revisited. Journal of Voice, 23(4), 425-438.

Stemple, J. C., Roy, N. & Klaben, B. K. (2020). Clinical voice pathology: Theory and management. Plural Publishing.

Švec, J. G. & Granqvist, S. (2010). Guidelines for selecting microphones for human voice production research. American Journal of Speech-Language Pathology, 19, 356-368.

Švec, J. G. & Granqvist, S. (2018). Tutorial and Guidelines on Measurement of Sound Pressure Level in Voice and Speech. Journal of Speech, Language and Hearing Research, 61, 441-461.

Sundberg, J. (1987). The science of the singing voice. Northern Illinois University Press.

Titze, I. R. & Verdolini Abbott, K. (2012). Vocology: The science and practice of voice habilitation. The National Center for Voice and Speech.

Titze, I.R. & Winholtz, W. S. (1993). Effect of microphone type and placement on voice perturbation measurements. Journal of Speech, Language, and Hearing Research, 36(6), 1177-1190.

Vennard, W. (1967). Singing: the mechanism and the technic. Carl Fischer.

Welch, G. F., Howard, D. M., Himonides, E., & Brereton, J. (2005). Real-time feedback in the singing studio: an innovatory action-research project using new voice technology. Music Education Research, 7(2), 225-249.

Welch, G.F. & Sundberg, J. (2002). Solo Voice. In R. Parncutt & G.E. McPherson (eds). The Science and Psychology of Music Performance: creative strategies for teaching and learning. Oxford University Press.