Publications

Audio Quality Research Program publications are listed here in reverse chronological order.

We develop two complementary advances for training no-reference (NR) speech quality estimators with independent datasets. Multi-dataset finetuning (MDF) pretrains an NR estimator on a single dataset and then finetunes it on multiple datasets at once,...

Jaden Pieper, Stephen D. Voran, and Kenneth R. Tilley, “Improving Speech Audio for Prerecorded and Live Online Conference Sessions,” Special Publication NTIA SP 24-572, July 2024

Learn how to identify impaired speech and avoid its causes. Listen to audio clips that demonstrate six useful tips for online conference and meeting organizers, as well as contributors who plan to speak during a live online event or submit a pre-reco...

The short-time Fourier transform (STFT) represents a window of audio samples as a set of complex coefficients. These are advantageously viewed as magnitudes and phases and the overall distribution of phases is very often assumed to be uniform. We sho...

Recently, prerecorded audio and video presentations, as well as virtual meetings, have become a common component of professional life, due to health and environmental considerations. This places new responsibility on participants to generate audio th...

Andrew A. Catellier and Stephen D. Voran, “Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities,” Journal Article, November 2023

Speech quality and speech intelligibility can vary dramatically across the wide range of currently available telecommunications systems, devices, and operating environments. This creates a strong demand for efficient real-time measurements of quality...

Jaden Pieper and Stephen D. Voran, “Relationships between Gilbert-Elliot Burst Error Model Parameters and Error Statistics,” Technical Memorandum TM-23-565, January 2023

The Gilbert-Elliot model is a popular and effective tool for treating bursty (nonindependent) errors in communication links. This memorandum provides linkages between model parameters and error statistics. The motivation is that these linkages can al...

Jaden Pieper and Stephen D. Voran, “Gilbert-Elliot Model Software Tools,” Software, December 2022

The Gilbert-Elliot burst error model is a popular and effective tool for treating bursty (non-independent) errors in communication links. This software accompanies the following publication: Pieper J; Voran S, "Relationships between Gilbert-Elliot Bu...

We demonstrate that the optimal audio signal processing frame duration in oracle binary masking and oracle magnitude restoration is determined by joint minimization of two antagonistic artifacts: temporal blurring (which increases with frame duration...

Objective speech quality and intelligibility estimators do not correctly assess speech generated by deep neural networks (DNNs). We use 256 speech files and subjective scores that cover 14 DNN speech conditions and 18 non-DNN speech conditions to show...

We present a set of relatively small-scale proof-of-concept experiments where we construct no-reference (NR) speech quality estimators that give reliable values of system-under-test (SUT) input speech quality in spite of the fact that NR estimators c...

Andrew A. Catellier and Stephen D. Voran, “WAWEnets Reference Implementations,” Software, May 2020

This GitHub repository presents MATLAB®/Octave and C++ implementations of Wideband Audio Waveform Evaluation networks or WAWEnets. This WAWEnets implementation produces one or more speech quality or intelligibility values for each input speech signal...

Building on prior work we have developed a no-reference (NR) waveform-based convolutional neural network (CNN) architecture that can accurately estimate speech quality or intelligibility of narrowband and wideband speech segments. These Wideband Audi...

Andrew A. Catellier and Stephen D. Voran, “WEnets: A Convolutional Framework for Evaluating Audio Waveforms,” September 2019

In this white paper, we describe a new convolutional framework for waveform evaluation, WEnets, and build a Narrowband Audio Waveform Evaluation Network, or NAWEnet, using this framework. NAWEnet is single-ended (or no-reference) and was trained thre...

Stephen D. Voran and Andrew A. Catellier, “Intelligibility Robustness of Five Speech Codec Modes in Frame-Erasure and Background-Noise Environments,” Technical Report TR-18-529, December 2017

Frame erasures and background noise are two factors that can interact with speech coding to reduce speech intelligibility and thus impair public safety mission-critical voice communications. We conducted two tests of intelligibility in the face of th...

Separating an acoustic signal into desired and undesired components is an important and well-established problem. It is commonly addressed by decomposing spectral magnitudes after exponentiation and the choice of exponent has been studied from numero...

We present ABC-MRT16—a new algorithm for objective estimation of speech intelligibility following the Modified Rhyme Test (MRT) paradigm. ABC-MRT16 is simple, effective and robust. When compared to subjective MRT data from 367 diverse conditions that...

Stephen D. Voran and Andrew A. Catellier, “A Crowdsourced Speech Intelligibility Test that Agrees with, Has Higher Repeatability than, Lab Tests,” Technical Memorandum TM-17-523, February 2017

Crowdsourcing of subjective speech, audio, and video quality of experience (QoE) tests has received much interest and study, but crowdsourcing of speech intelligibility testing has not. We hypothesize that speech intelligibility tests offer a unique ...

Andrew A. Catellier and Stephen D. Voran, “Intelligibility of Selected Speech Codecs in Frame-Erasure Conditions,” Technical Report TR-17-522, November 2016

We describe the design, implementation, and analysis of a speech intelligibility test. The test included five codec modes, four frame-erasure rates, and two background noise environments, for a total of 40 conditions. The test protocol required twent...

Stephen D. Voran, “Exploration of the Additivity Approximation for Spectral Magnitudes,” Conference Paper, October 2015

The separation of acoustic signals is often accomplished through subtractive decompositions of frequency-domain representations. This is typically enabled by the zero phase approximation or the uncorrelated signals approximation but both of these are...

Stephen D. Voran and Andrew A. Catellier, “Speech Codec Intelligibility Testing in Support of Mission-Critical Voice Applications for LTE,” Technical Report TR-15-520, September 2015

We describe a major effort to quantify the speech intelligibility associated with a range of narrowband, wideband, and fullband digital audio coding algorithms in various acoustic noise environments. The work emphasizes the relationship between these...

We present an objective estimator of speech intelligibility that follows the paradigm of the Modified Rhyme Test (MRT). For each input, the estimator uses temporal correlations within articulation index bands to select one of six possible words from ...

Stephen D. Voran, “Lossless Compression of G.711 Speech Using Only Look-Up Tables,” Conference Paper, May 2013

The lossless compression algorithm specified in ITU-T Recommendation G.711.0 provides bit-exact G.711 speech coding at reduced bit-rates. We introduce two Look-Up Coders (LUCs) that also offer bit-exact G.711 speech coding at reduced rates but the LU...

Stephen D. Voran and Andrew A. Catellier, “When Should a Speech Coding Quality Increase be Allowed Within a Talk-Spurt?,” Conference Paper, May 2013

The value or harm associated with an increase in speech coding quality depends on the type of the increase as well as the temporal location of the increase in an utterance. For example, some increases in speech coding bandwidth can be perceived as im...

This report describes a modified rhyme test (MRT) conducted to characterize the behavior of digital and analog communication in the presence of background noise and moderate RF channel degradation. This is done through the use of reference systems to...

In an extended P25/VoLTE public safety communication system voice signals will pass through both Multi-Band Excitation (MBE) and Adaptive Multi-Rate (AMR) speech coders. Thus it is important to quantify the speech quality that can be expected for MBE...

David J. Atkinson, Stephen D. Voran, and Andrew A. Catellier, “Intelligibility of the Adaptive Multi-Rate Speech Coder in Emergency-Response Environments,” Technical Report TR-13-493, December 2012

This report describes speech intelligibility testing conducted on the Adaptive Multi-Rate (AMR) speech coder in several different environments simulating emergency response conditions and especially fireground conditions. The intelligibility testing ...

Stephen D. Voran and Andrew A. Catellier, “Gradient Ascent Subjective Multimedia Quality Testing,” Journal Article, March 2011

Subjective testing is the most direct means of assessing multimedia quality as experienced by users. When multiple dimensions must be evaluated, these tests can become slow and costly. We present gradient ascent subjective testing (GAST) as an effic...

Stephen D. Voran and Andrew A. Catellier, “Multiple Description Speech Coding Using Speech Polarity Decomposition,” Conference Paper, December 2010

We present and evaluate a new multiple-description coding extension to the international standard for pulse code modulation speech coding (ITU-T Rec. G.711). This extension is inserted between the G.711 encoder and decoder. It uses speech-polarity de...

In advanced heterogeneous telecommunication networks, network resources can dynamically dictate the type of speech coding that is used. An increase in resources allows for lower coding distortion or it might also be used to provide wideband speech in...

Andrew A. Catellier and Stephen D. Voran, “Low Rate Speech Coding and Random Bit Errors: A Subjective Speech Quality Matching Experiment,” Technical Report TR-10-462, October 2009

When bit errors are introduced between a speech encoder and a speech decoder, the quality of the received speech is reduced. The specific relationship between speech quality and bit error rate (BER) can be different for each speech coding and channel...

Stephen D. Voran and Andrew A. Catellier, “Gradient Ascent Paired-Comparison Subjective Quality Testing,” Conference Paper, July 2009

(This paper won the QoMEX 2009 Best Paper Award.) Subjective testing is the most direct means of assessing audio, video, and multimedia quality as experienced by users and maximizing the information gathered while minimizing the number of trials ...

Andrew A. Catellier and Stephen D. Voran, “Relationships Between Intelligibility, Speaker Identification, and the Detection of Dramatized Urgency,” Technical Report NTIA TR-09-459, November 2008

The systems used for public safety speech communications must be intelligible. It is also desirable that they transmit secondary information, such as the attributes of a speaker's voice. This secondary information can allow a user to identify the spe...

David J. Atkinson and Andrew A. Catellier, “Intelligibility of Selected Radio Systems in the Presence of Fireground Noise: Test Plan and Results,” Technical Report TR-08-453, June 2008

This report describes an experiment conducted to measure the intelligibility of selected radio communication systems when those systems are employed in high-background-noise environments experienced by firefighters. The test plan for a Modified Rhyme...

Andrew A. Catellier and Stephen D. Voran, “Speaker Identification in Low-Rate Coded Speech,” Conference Paper, May 2008

While useful speech communication systems must be intelligible, most systems aim to transmit secondary information, such as attributes of a speaker's voice, as well. This secondary information can allow a listener to identify the speaker and his emot...

Stephen D. Voran, “Listener Detection of Talker Stress in Low-rate Coded Speech,” Conference Paper, March 2008

We describe an experiment where listeners were asked to detect two specific forms of stress in talkers' recorded voices heard via six different simulated communication systems. Both task-induced stress and dramatized urgency were used. Communication ...

Stephen D. Voran, “Lossless Audio Coding with Bandwidth Extension Layers,” Conference Paper, October 2007

Layered audio coding typically offers reduced distortion as bit rate is increased, but that distortion is spread across the entire band until the lossless coding bit rate is reached and distortion is eliminated. We propose a layered audio coding para...

Stephen D. Voran, “Reducing Quantization Error by Matching Pseudoerror Statistics,” Conference Paper, September 2006

We investigate the use of an adaptive processor (a quantizer pseudoinverse) and the statistics of the associated pseudoerror signal to reduce quantization error in scalar quantizers when a small amount of prior knowledge about the signal x is availab...

We have designed, conducted, and analyzed a subjective speech quality experiment with unrestricted timing where subjects can vote whenever their opinions are fully formed, rather than at fixed time intervals. Analysis of the resulting listening times...

Stephen D. Voran, “A Basic Experiment on Time-Varying Speech Quality,” Conference Paper, June 2005

We present a general formulation of a basic open question regarding the perception of time-varying speech quality. We then describe the design, implementation, conduct, and analysis of a practical experiment that addresses a small but fundamental par...

We describe new 2-channel multiple-description speech coders based on the ITU-T Recommendation G.711 PCM speech coder. The new coders operate in the PCM code domain in order to exploit the companding gain of PCM. They apply pairs of complementary asy...

We describe a 2-channel multiple-description speech coder based on the ITU-T Recommendation G.711 PCM speech coder. The new coder operates in the PCM code domain in order to exploit the companding gain of PCM. It applies a pair of 2-dimensional struc...

When objectively estimating speech, audio, or video quality, it is often necessary to compensate for a system gain or to "gain match" two or more signals. One can take three views of a system, leading to three different definitions of gain, and three...

In packetized speech transmission, end-to-end delay can vary, even over short timescales. Estimating the resulting speech delay histories is critical to diagnostic and quality estimation efforts. We present a new bottom-up algorithm for estimating ti...

Temporal discontinuities in received speech are a reality of Internet Telephony or Voice over Internet Protocol (VoIP) systems. These relatively new impairments pose unique challenges to objective estimators of perceived speech quality. We suggest th...

Stephen D. Voran, “The Channel-Optimized Multiple-Description Scalar Quantizer,” Conference Paper, October 2002

Multiple-description coding is one way to gain robustness against lossy channels. We extend the multiple-description scalar quantizer (MDSQ) to a channel-optimized MDSQ (COMDSQ) that minimizes mean-squared error for a given channel environment. We di...

Stephen D. Voran, “An Iterated Nested Least-Squares Algorithm for Fitting Multiple Data Sets,” Technical Memorandum NTIA TM-03-397, October 2002

A multiple data set fitting problem often arises in conjunction with the development of objective estimators of perceived audio or video quality. In such development work, we often seek the best linear relationship between a set of objective audio or...

Stephen D. Voran, “Compensating for System Gain: Motivations, Derivations, and Relations for Three Common Solutions,” Technical Memorandum NTIA TM-03-398, October 2002

It is often desirable to compensate for system gain, especially before objectively estimating perceived audio or video quality from system inputs and outputs. A common approach is to scale the system output to compensate for system gain. One can take...

Stephen D. Voran, “Estimation of System Gain and Bias Using Noisy Observations with Known Noise Power Ratio,” Technical Report NTIA TR-02-395, September 2002

The identification of linear systems from input and output observations is an important and well-studied topic. When both the input and output observations are noisy, the resulting problem is sometimes called the "errors in variables" problem. Existi...

This paper identifies optimum levels of reverse water-filling for codebook-based coding of noise and speech signals. We find that there is little to be gained from optimizing an effective rate parameter. We identify trade-offs between SNR and log-spe...

Stephen D. Voran and Stephen Wolf, “Objective Estimation of Video and Speech Quality to Support Network QoS Efforts,” Conference Paper, February 2000

One of the questions that ongoing QoS efforts seek to answer is: "Given fixed network resources, how does one provide the highest possible quality of service to the maximal number of users in a fair way, even when those users are generating competing...

Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts ...

Part 1 of this paper describes a new approach to the objective estimation of perceived speech quality. This new approach uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizin...

Stephen D. Voran, “Advances in Objective Estimation of Perceived Speech Quality,” Conference Paper, June 1999

We present two techniques that can be used to enhance objective estimators of perceived speech quality. Frame normalization and frame-energy plane partitioning are described and applied to a log-spectral-error-based estimator. The resulting estimator...

Stephen D. Voran, “Observations on Frequency-Domain Companding for Audio Coding,” Conference Paper, August 1998

Frequency-domain companding can be used in conjunction with audio coders that produce white coding noise. In [1-2] it is demonstrated empirically that this technique colors white coding noise so that it is better masked by audio signals, resulting in...

ITU-T Recommendation P.861 describes an objective speech quality assessment algorithm for speech codecs. This algorithm transforms codec input and output speech signals into a perceptual domain, compares them, and generates a noise disturbance value,...

Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts ...

Stephen D. Voran, “Perception-Based Bit-Allocation Algorithms for Audio Coding,” Conference Paper, October 1997

We describe six algorithms for bit allocation in audio coding. Each algorithm stems from the minimization of a different perceptually motivated objective function. Three of these objective functions are extensions of existing ones, and three are new....

David J. Atkinson and Stephen D. Voran, “Summary of Objective Audio Quality Measure Performance Data Presented to T1A1,” Technical Contribution, October 1997

This contribution aggregates the available performance data on the MNB and P.861 objective speech quality measures. Specifically, results presented in contributions T1A1.7/97-032 and T1A1.7/97-034 are examined. Based on examination of the aggregated ...

We describe a new approach to the estimation of perceived speech quality. The approach uses a simple, but effective, perceptual transformation to emulate hearing and a hierarchy of Measuring Normalizing Blocks (MNBs) to emulate auditory judgment. Th...

Stephen D. Voran, “Listener Ratings of Speech Passbands,” Conference Paper, September 1997

We describe a listening experiment that measures the perceived speech quality of 19 speech passbands using 8 talkers and 28 listeners. Results are referenced to the traditional wide-band and narrow-band telephony passbands. Our findings may help thos...

Stephen D. Voran, “An Algorithm for Estimating the Delay of Telephony Speech,” Technical Contribution, September 1996

This contribution is provided for informational purposes. It contains a description of an algorithm that has successfully been used to estimate the delay of telephony band speech. The algorithm features a coarse stage that uses speech envelopes and a...

Stephen D. Voran, “Observations on Auditory Excitation and Masking Patterns,” Conference Paper, October 1995

Excitation patterns and masking patterns are used extensively in perceptual audio coders and quality assessment algorithms. Numerous algorithms for calculating these patterns have been proposed. This paper provides comparisons among the patterns gene...

Stephen D. Voran and C. Sholl, “Perception-based Objective Estimators of Speech Quality,” Conference Paper, September 1995

Four proposed perception-based techniques for objectively estimating speech quality and three traditional estimators are applied to coded speech samples. Agreement between objective estimates and corresponding subjective test scores is reported. Seve...

Stephen D. Voran, “Techniques for Comparing Objective and Subjective Speech Quality Tests,” Conference Paper, November 1994

Objective (or instrumental) tests of speech quality have been proposed as ways to reduce the need for expensive and time-consuming subjective (or auditory) tests. Both types of tests attempt to quantify the range of opinions that listeners express in...

Stephen D. Voran and Stephen Wolf, “Proposed Framework for Subjective Audiovisual Testing,” Technical Contribution, November 1993

Working Group T1A1.5 is supporting ITU-T Study Group 12 in developing subjective audiovisual testing methods under Question 22/12 which addresses audiovisual quality in multimedia services. A previous contribution from Bellcore, T1A1.5/93-104, descri...

Stephen D. Voran, “Observations on the T-Reference Condition for Speech Coder Evaluation: SQ 13.92,” Technical Contribution, February 1992

In a Study Group XII (Experts Group on Speech Quality) Contribution dated September 1991, John Rosenberger and Bill Cotton of Bellcore introduced an algorithm for generating temporally correlated distortion on 8 kHz sampled speech data. This distorti...