Publications & Talks

Audio Quality Research Program publications and talks are listed here in reverse chronological order. Abstracts and slideshows are available in HTML format, and documents are available in Adobe Portable Document Format (PDF).

We demonstrate that the optimal audio signal processing frame duration in oracle binary masking and oracle magnitude restoration is determined by joint minimization of two antagonistic artifacts: temporal blurring (which increases with frame duration...

Objective speech quality and intelligibility estimators do not correctly assess speech generated by deep neural networks (DNNs). We use 256 speech files and subjective scores that cover 14 DNN speech conditions and 18 nonDNN speech conditions to show...

We present a set of relatively small-scale proof-of-concept experiments where we construct no-reference (NR) speech quality estimators that give reliable values of system-under-test (SUT) input speech quality in spite of the fact that NR estimators c...

Building on prior work we have developed a no-reference (NR) waveform-based convolutional neural network (CNN) architecture that can accurately estimate speech quality or intelligibility of narrowband and wideband speech segments. These Wideband Audi...

Andrew A. Catellier; Stephen D. Voran, "WEnets: A Convolutional Framework for Evaluating Audio Waveforms", September 2019

In this white paper, we describe a new convolutional framework for waveform evaluation, WEnets, and build a Narrowband Audio Waveform Evaluation Network, or NAWEnet, using this framework. NAWEnet is single-ended (or no-reference) and was trained thre...

Stephen D. Voran; Andrew A. Catellier, "Intelligibility Robustness of Five Speech Codec Modes in Frame-Erasure and Background-Noise Environments", NTIA Technical Report TR-18-529, December 2017

Frame erasures and background noise are two factors that can interact with speech coding to reduce speech intelligibility and thus impair public safety mission-critical voice communications. We conducted two tests of intelligibility in the face of th...

Separating an acoustic signal into desired and undesired components is an important and well-established problem. It is commonly addressed by decomposing spectral magnitudes after exponentiation and the choice of exponent has been studied from numero...

We present ABC-MRT16—a new algorithm for objective estimation of speech intelligibility following the Modified Rhyme Test (MRT) paradigm. ABC-MRT16 is simple, effective and robust. When compared to subjective MRT data from 367 diverse conditions that...

Stephen D. Voran; Andrew A. Catellier, "A Crowdsourced Speech Intelligibility Test that Agrees with, Has Higher Repeatability than, Lab Tests", NTIA Technical Memo TM-17-523, February 2017

Crowdsourcing of subjective speech, audio, and video quality of experience (QoE) tests has received much interest and study, but crowdsourcing of speech intelligibility testing has not. We hypothesize that speech intelligibility tests offer a unique ...

Andrew A. Catellier; Stephen D. Voran, "Intelligibility of Selected Speech Codecs in Frame-Erasure Conditions", NTIA Technical Report TR-17-522, November 2016

We describe the design, implementation, and analysis of a speech intelligibility test. The test included five codec modes, four frame-erasure rates, and two background noise environments, for a total of 40 conditions. The test protocol required twent...

Stephen D. Voran, "Exploration of the Additivity Approximation for Spectral Magnitudes", Conference Paper, October 2015

The separation of acoustic signals is often accomplished through subtractive decompositions of frequency-domain representations. This is typically enabled by the zero phase approximation or the uncorrelated signals approximation but both of these are...

Stephen D. Voran; Andrew A. Catellier, "Speech Codec Intelligibility Testing in Support of Mission-Critical Voice Applications for LTE", NTIA Technical Report TR-15-520, September 2015

We describe a major effort to quantify the speech intelligibility associated with a range of narrowband, wideband, and fullband digital audio coding algorithms in various acoustic noise environments. The work emphasizes the relationship between these...

We present an objective estimator of speech intelligibility that follows the paradigm of the Modified Rhyme Test (MRT). For each input, the estimator uses temporal correlations within articulation index bands to select one of six possible words from ...

Stephen D. Voran, "Lossless Compression of G.711 Speech Using Only Look-Up Tables", Conference Paper, May 2013

The lossless compression algorithm specified in ITU-T Recommendation G.711.0 provides bit-exact G.711 speech coding at reduced bit-rates. We introduce two Look-Up Coders (LUCs) that also offer bit-exact G.711 speech coding at reduced rates but the LU...

Stephen D. Voran; Andrew A. Catellier, "When Should a Speech Coding Quality Increase be Allowed Within a Talk-Spurt?", Conference Paper, May 2013

The value or harm associated with an increase in speech coding quality depends on the type of the increase as well as the temporal location of the increase in an utterance. For example, some increases in speech coding bandwidth can be perceived as im...

David J. Atkinson; Andrew A. Catellier, "Intelligibility of Analog FM and Updated P25 Radio Systems in the Presence of Fireground Noise: Test Plan and Results", NTIA Technical Report TR-13-495, May 2013

This report describes a modified rhyme test (MRT) conducted to characterize the behavior of digital and analog communication in the presence of background noise and moderate RF channel degradation. This is done through the use of reference systems to...

In an extended P25/VoLTE public safety communication system voice signals will pass through both Multi-Band Excitation (MBE) and Adaptive Multi-Rate (AMR) speech coders. Thus it is important to quantify the speech quality that can be expected for MBE...

David J. Atkinson; Stephen D. Voran; Andrew A. Catellier, "Intelligibility of the Adaptive Multi-Rate Speech Coder in Emergency-Response Environments", NTIA Technical Report TR-13-493, December 2012

This report describes speech intelligibility testing conducted on the Adaptive Multi-Rate (AMR) speech coder in several different environments simulating emergency response conditions and especially fireground conditions. The intelligibility testing ...

Stephen D. Voran; Andrew A. Catellier, "Gradient Ascent Subjective Multimedia Quality Testing", Journal Article, March 2011

Subjective testing is the most direct means of assessing multimedia quality as experienced by users. When multiple dimensions must be evaluated, these tests can become slow and costly. We present gradient ascent subjective testing (GAST) as an efficie...

Stephen D. Voran; Andrew A. Catellier, "Multiple Description Speech Coding Using Speech Polarity Decomposition", Conference Paper, December 2010

We present and evaluate a new multiple-description coding extension to the international standard for pulse code modulation speech coding (ITU-T Rec. G.711). This extension is inserted between the G.711 encoder and decoder. It uses speech-polarity de...

In advanced heterogeneous telecommunication networks, network resources can dynamically dictate the type of speech coding that is used. An increase in resources allows for lower coding distortion or it might also be used to provide wideband speech in...

Andrew A. Catellier; Stephen D. Voran, "Low Rate Speech Coding and Random Bit Errors: A Subjective Speech Quality Matching Experiment", NTIA Technical Report TR-10-462, October 2009

When bit errors are introduced between a speech encoder and a speech decoder, the quality of the received speech is reduced. The specific relationship between speech quality and bit error rate (BER) can be different for each speech coding and channel...

Stephen D. Voran; Andrew A. Catellier, "Gradient Ascent Paired-Comparison Subjective Quality Testing", Conference Paper, July 2009

(This paper won the QoMEX 2009 Best Paper Award.) Subjective testing is the most direct means of assessing audio, video, and multimedia quality as experienced by users and maximizing the information gathered while minimizing the number of trials ...

Andrew A. Catellier; Stephen D. Voran, "Relationships Between Intelligibility, Speaker Identification, and the Detection of Dramatized Urgency", NTIA Technical Report TR-09-459, November 2008

The systems used for public safety speech communications must be intelligible. It is also desirable that they transmit secondary information, such as the attributes of a speaker's voice. This secondary information can allow a user to identify the spe...

David J. Atkinson; Andrew A. Catellier, "Intelligibility of Selected Radio Systems in the Presence of Fireground Noise: Test Plan and Results", NTIA Technical Report TR-08-453, June 2008

This report describes an experiment conducted to measure the intelligibility of selected radio communication systems when those systems are employed in high-background-noise environments experienced by firefighters. The test plan for a Modified Rhyme...

Andrew A. Catellier; Stephen D. Voran, "Speaker Identification in Low-Rate Coded Speech", Conference Paper, May 2008

While useful speech communication systems must be intelligible, most systems aim to transmit secondary information, such as attributes of a speaker's voice, as well. This secondary information can allow a listener to identify the speaker and his emot...

S. Voran, "Listener Detection of Talker Stress in Low-rate Coded Speech", Conference Paper, March 2008

We describe an experiment where listeners were asked to detect two specific forms of stress in talkers' recorded voices heard via six different simulated communication systems. Both task-induced stress and dramatized urgency were used. Communication ...

S. Voran, "Lossless Audio Coding with Bandwidth Extension Layers", Conference Paper, October 2007

Layered audio coding typically offers reduced distortion as bit rate is increased, but that distortion is spread across the entire band until the lossless coding bit rate is reached and distortion is eliminated. We propose a layered audio coding para...

S. Voran, "Reducing Quantization Error by Matching Pseudoerror Statistics", Conference Paper, September 2006

We investigate the use of an adaptive processor (a quantizer pseudoinverse) and the statistics of the associated pseudoerror signal to reduce quantization error in scalar quantizers when a small amount of prior knowledge about the signal x is availab...

We have designed, conducted, and analyzed a subjective speech quality experiment with unrestricted timing where subjects can vote whenever their opinions are fully formed, rather than at fixed time intervals. Analysis of the resulting listening times...

S. Voran, "A Basic Experiment on Time-Varying Speech Quality", Conference Paper, June 2005

We present a general formulation of a basic open question regarding the perception of time-varying speech quality. We then describe the design, implementation, conduct, and analysis of a practical experiment that addresses a small but fundamental par...

We describe new 2-channel multiple-description speech coders based on the ITU-T Recommendation G.711 PCM speech coder. The new coders operate in the PCM code domain in order to exploit the companding gain of PCM. They apply pairs of complementary asy...

We describe a 2-channel multiple-description speech coder based on the ITU-T Recommendation G.711 PCM speech coder. The new coder operates in the PCM code domain in order to exploit the companding gain of PCM. It applies a pair of 2-dimensional struc...

When objectively estimating speech, audio, or video quality, it is often necessary to compensate for a system gain or to "gain match" two or more signals. One can take three views of a system, leading to three different definitions of gain, and three...

In packetized speech transmission, end-to-end delay can vary, even over short timescales. Estimating the resulting speech delay histories is critical to diagnostic and quality estimation efforts. We present a new bottom-up algorithm for estimating ti...

Temporal discontinuities in received speech are a reality of Internet Telephony or Voice over Internet Protocol (VoIP) systems. These relatively new impairments pose unique challenges to objective estimators of perceived speech quality. We suggest th...

S. Voran, "The Channel-Optimized Multiple-Description Scalar Quantizer", Conference Paper, October 2002

Multiple-description coding is one way to gain robustness against lossy channels. We extend the multiple-description scalar quantizer (MDSQ) to a channel-optimized MDSQ (COMDSQ) that minimizes mean-squared error for a given channel environment. We di...

Stephen D. Voran, "An iterated nested least-squares algorithm for fitting multiple data sets", NTIA Technical Memo TM-03-397, October 2002

A multiple data set fitting problem often arises in conjunction with the development of objective estimators of perceived audio or video quality. In such development work, we often seek the best linear relationship between a set of objective audio or...

It is often desirable to compensate for system gain, especially before objectively estimating perceived audio or video quality from system inputs and outputs. A common approach is to scale the system output to compensate for system gain. One can take...

Stephen D. Voran, "Estimation of system gain and bias using noisy observations with known noise power ratio", NTIA Technical Report TR-02-395, September 2002

The identification of linear systems from input and output observations is an important and well-studied topic. When both the input and output observations are noisy, the resulting problem is sometimes called the "errors in variables" problem. Existi...

This paper identifies optimum levels of reverse water-filling for codebook-based coding of noise and speech signals. We find that there is little to be gained from optimizing an effective rate parameter. We identify trade-offs between SNR and log-spe...

Stephen D. Voran; Stephen Wolf, "Objective Estimation of Video and Speech Quality to Support Network QoS Efforts", Conference Paper, February 2000

One of the questions that ongoing QoS efforts seek to answer is: "Given fixed network resources, how does one provide the highest possible quality of service to the maximal number of users in a fair way, even when those users are generating competing...

Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts ...

Part 1 of this paper describes a new approach to the objective estimation of perceived speech quality. This new approach uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizin...

S. Voran, "Advances in Objective Estimation of Perceived Speech Quality", Conference Paper, June 1999

We present two techniques that can be used to enhance objective estimators of perceived speech quality. Frame normalization and frame-energy plane partitioning are described and applied to a log-spectral-error-based estimator. The resulting estimator...

S. Voran, "Observations on Frequency-Domain Companding for Audio Coding", Conference Paper, August 1998

Frequency-domain companding can be used in conjunction with audio coders that produce white coding noise. In [1-2] it is demonstrated empirically that this technique colors white coding noise so that it is better masked by audio signals, resulting in...

ITU-T Recommendation P.861 describes an objective speech quality assessment algorithm for speech codecs. This algorithm transforms codec input and output speech signals into a perceptual domain, compares them, and generates a noise disturbance value,...

Stephen D. Voran, "Objective Estimation of Perceived Speech Quality Using Measuring Normalizing Blocks", NTIA Technical Report TR-98-347, April 1998

Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts ...

S. Voran, "Perception-Based Bit-Allocation Algorithms for Audio Coding", Conference Paper, October 1997

We describe six algorithms for bit allocation in audio coding. Each algorithm stems from the minimization of a different perceptually-motivated objective function. Three of these objective functions are extensions of existing ones, and three are new....

David J. Atkinson; Stephen D. Voran, "Summary of Objective Audio Quality Measure Performance Data Presented to T1A1", Technical Contribution, October 1997

This contribution aggregates the available performance data on the MNB and P.861 objective speech quality measures. Specifically, results presented in contributions T1A1.7/97-032 and T1A1.7/97-034 are examined. Based on examination of the aggregated ...

We describe a new approach to the estimation of perceived speech quality. The approach uses a simple, but effective, perceptual transformation to emulate hearing and a hierarchy of Measuring Normalizing Blocks (MNBs) to emulate auditory judgment. Th...

S. Voran, "Listener Ratings of Speech Passbands", Conference Paper, September 1997

We describe a listening experiment that measures the perceived speech quality of 19 speech passbands using 8 talkers and 28 listeners. Results are referenced to the traditional wide-band and narrow-band telephony passbands. Our findings may help thos...

S. Voran, "An Algorithm for Estimating the Delay of Telephony Speech", Technical Contribution, September 1996

This contribution is provided for informational purposes. It contains a description of an algorithm that has successfully been used to estimate the delay of telephony band speech. The algorithm features a coarse stage that uses speech envelopes and a ...

S. Voran, "Observations on Auditory Excitation and Masking Patterns", Conference Paper, October 1995

Excitation patterns and masking patterns are used extensively in perceptual audio coders and quality assessment algorithms. Numerous algorithms for calculating these patterns have been proposed. This paper provides comparisons among the patterns gene...

S. Voran; C. Sholl, "Perception-based Objective Estimators of Speech Quality", Conference Paper, September 1995

Four proposed perception-based techniques for objectively estimating speech quality and three traditional estimators are applied to coded speech samples. Agreement between objective estimates and corresponding subjective test scores is reported. Seve...

Objective (or instrumental) tests of speech quality have been proposed as ways to reduce the need for expensive and time-consuming subjective (or auditory) tests. Both types of tests attempt to quantify the range of opinions that listeners express in...

Stephen Voran; Stephen Wolf, "Proposed Framework for Subjective Audiovisual Testing", Technical Contribution, November 1993

Working Group T1A1.5 is supporting ITU-T Study Group 12 in developing subjective audiovisual testing methods under Question 22/12 which addresses audiovisual quality in multimedia services. A previous contribution from Bellcore, T1A1.5/93-104, descri...

S. Voran, "Observations on the T-Reference Condition for Speech Coder Evaluation", Technical Contribution, February 1992

In a Study Group XII Contribution dated September 1991, John Rosenberger and Bill Cotton of Bellcore introduced an algorithm for generating temporally correlated distortion on 8 kHz sampled speech data. This distortion is parameterized by a single in...