Publications
Audio Quality Research Program publications are listed here in reverse chronological order.
We develop the wideband fixed-size modulation spectra (FMS) and show that they contain the necessary information to perform perceptually consistent evaluation of speech. We compare FMS with the already established frame-based modulation spectra as r...
We develop two complementary advances for training no-reference (NR) speech quality estimators with independent datasets. Multi-dataset finetuning (MDF) pretrains an NR estimator on a single dataset and then finetunes it on multiple datasets at once,...
Learn how to identify impaired speech and avoid its causes. Listen to audio clips that demonstrate six useful tips for online conference and meeting organizers, as well as contributors who plan to speak during a live online event or submit a pre-reco...
The short-time Fourier transform (STFT) represents a window of audio samples as a set of complex coefficients. These are advantageously viewed as magnitudes and phases and the overall distribution of phases is very often assumed to be uniform. We sho...
Recently, prerecorded audio and video presentations, as well as virtual meetings, have become a common component of professional life, due to health and environmental considerations. This places new responsibility on participants to generate audio th...
Speech quality and speech intelligibility can vary dramatically across the wide range of currently available telecommunications systems, devices, and operating environments. This creates a strong demand for efficient real-time measurements of quality...
The Gilbert-Elliot model is a popular and effective tool for treating bursty (nonindependent) errors in communication links. This memorandum provides linkages between model parameters and error statistics. The motivation is that these linkages can al...
The Gilbert-Elliot burst error model is a popular and effective tool for treating bursty (non-independent) errors in communication links. This software accompanies the following publication: Pieper J; Voran S, "Relationships between Gilbert-Elliot Bu...
We demonstrate that the optimal audio signal processing frame duration in oracle binary masking and oracle magnitude restoration is determined by joint minimization of two antagonistic artifacts: temporal blurring (which increases with frame duration...
Objective speech quality and intelligibility estimators do not correctly assess speech generated by deep neural networks (DNNs). We use 256 speech files and subjective scores that cover 14 DNN speech conditions and 18 non-DNN speech conditions to show...
We present a set of relatively small-scale proof-of-concept experiments where we construct no-reference (NR) speech quality estimators that give reliable values of system-under-test (SUT) input speech quality in spite of the fact that NR estimators c...
This GitHub repository presents MATLAB®/Octave and C++ implementations of Wideband Audio Waveform Evaluation networks or WAWEnets. This WAWEnets implementation produces one or more speech quality or intelligibility values for each input speech signal...
Building on prior work we have developed a no-reference (NR) waveform-based convolutional neural network (CNN) architecture that can accurately estimate speech quality or intelligibility of narrowband and wideband speech segments. These Wideband Audi...
In this white paper, we describe a new convolutional framework for waveform evaluation, WEnets, and build a Narrowband Audio Waveform Evaluation Network, or NAWEnet, using this framework. NAWEnet is single-ended (or no-reference) and was trained thre...
Frame erasures and background noise are two factors that can interact with speech coding to reduce speech intelligibility and thus impair public safety mission-critical voice communications. We conducted two tests of intelligibility in the face of th...
Separating an acoustic signal into desired and undesired components is an important and well-established problem. It is commonly addressed by decomposing spectral magnitudes after exponentiation and the choice of exponent has been studied from numero...
We present ABC-MRT16—a new algorithm for objective estimation of speech intelligibility following the Modified Rhyme Test (MRT) paradigm. ABC-MRT16 is simple, effective and robust. When compared to subjective MRT data from 367 diverse conditions that...
Crowdsourcing of subjective speech, audio, and video quality of experience (QoE) tests has received much interest and study, but crowdsourcing of speech intelligibility testing has not. We hypothesize that speech intelligibility tests offer a unique ...
We describe the design, implementation, and analysis of a speech intelligibility test. The test included five codec modes, four frame-erasure rates, and two background noise environments, for a total of 40 conditions. The test protocol required twent...
The separation of acoustic signals is often accomplished through subtractive decompositions of frequency-domain representations. This is typically enabled by the zero phase approximation or the uncorrelated signals approximation but both of these are...
We describe a major effort to quantify the speech intelligibility associated with a range of narrowband, wideband, and fullband digital audio coding algorithms in various acoustic noise environments. The work emphasizes the relationship between these...
We present an objective estimator of speech intelligibility that follows the paradigm of the Modified Rhyme Test (MRT). For each input, the estimator uses temporal correlations within articulation index bands to select one of six possible words from ...
The lossless compression algorithm specified in ITU-T Recommendation G.711.0 provides bit-exact G.711 speech coding at reduced bit-rates. We introduce two Look-Up Coders (LUCs) that also offer bit-exact G.711 speech coding at reduced rates but the LU...
The value or harm associated with an increase in speech coding quality depends on the type of the increase as well as the temporal location of the increase in an utterance. For example, some increases in speech coding bandwidth can be perceived as im...
This report describes a modified rhyme test (MRT) conducted to characterize the behavior of digital and analog communication in the presence of background noise and moderate RF channel degradation. This is done through the use of reference systems to...
In an extended P25/VoLTE public safety communication system, voice signals will pass through both Multi-Band Excitation (MBE) and Adaptive Multi-Rate (AMR) speech coders. Thus it is important to quantify the speech quality that can be expected for MBE...
This report describes speech intelligibility testing conducted on the Adaptive Multi-Rate (AMR) speech coder in several different environments simulating emergency response conditions and especially fireground conditions. The intelligibility testing ...
Subjective testing is the most direct means of assessing multimedia quality as experienced by users. When multiple dimensions must be evaluated, these tests can become slow and costly. We present gradient ascent subjective testing (GAST) as an effic...
We present and evaluate a new multiple-description coding extension to the international standard for pulse code modulation speech coding (ITU-T Rec. G.711). This extension is inserted between the G.711 encoder and decoder. It uses speech-polarity de...
In advanced heterogeneous telecommunication networks, network resources can dynamically dictate the type of speech coding that is used. An increase in resources allows for lower coding distortion or it might also be used to provide wideband speech in...
When bit errors are introduced between a speech encoder and a speech decoder, the quality of the received speech is reduced. The specific relationship between speech quality and bit error rate (BER) can be different for each speech coding and channel...
(This paper won the QoMEX 2009 Best Paper Award.) Subjective testing is the most direct means of assessing audio, video, and multimedia quality as experienced by users and maximizing the information gathered while minimizing the number of trials ...
The systems used for public safety speech communications must be intelligible. It is also desirable that they transmit secondary information, such as the attributes of a speaker's voice. This secondary information can allow a user to identify the spe...
This report describes an experiment conducted to measure the intelligibility of selected radio communication systems when those systems are employed in high-background-noise environments experienced by firefighters. The test plan for a Modified Rhyme...
While useful speech communication systems must be intelligible, most systems aim to transmit secondary information, such as attributes of a speaker's voice, as well. This secondary information can allow a listener to identify the speaker and his emot...
We describe an experiment where listeners were asked to detect two specific forms of stress in talkers' recorded voices heard via six different simulated communication systems. Both task-induced stress and dramatized urgency were used. Communication ...
Layered audio coding typically offers reduced distortion as bit rate is increased, but that distortion is spread across the entire band until the lossless coding bit rate is reached and distortion is eliminated. We propose a layered audio coding para...
We investigate the use of an adaptive processor (a quantizer pseudoinverse) and the statistics of the associated pseudoerror signal to reduce quantization error in scalar quantizers when a small amount of prior knowledge about the signal x is availab...
We have designed, conducted, and analyzed a subjective speech quality experiment with unrestricted timing where subjects can vote whenever their opinions are fully formed, rather than at fixed time intervals. Analysis of the resulting listening times...
We present a general formulation of a basic open question regarding the perception of time-varying speech quality. We then describe the design, implementation, conduct, and analysis of a practical experiment that addresses a small but fundamental par...
We describe new 2-channel multiple-description speech coders based on the ITU-T Recommendation G.711 PCM speech coder. The new coders operate in the PCM code domain in order to exploit the companding gain of PCM. They apply pairs of complementary asy...
We describe a 2-channel multiple-description speech coder based on the ITU-T Recommendation G.711 PCM speech coder. The new coder operates in the PCM code domain in order to exploit the companding gain of PCM. It applies a pair of 2-dimensional struc...
When objectively estimating speech, audio, or video quality, it is often necessary to compensate for a system gain or to "gain match" two or more signals. One can take three views of a system, leading to three different definitions of gain, and three...
In packetized speech transmission, end-to-end delay can vary, even over short timescales. Estimating the resulting speech delay histories is critical to diagnostic and quality estimation efforts. We present a new bottom-up algorithm for estimating ti...
Temporal discontinuities in received speech are a reality of Internet Telephony or Voice over Internet Protocol (VoIP) systems. These relatively new impairments pose unique challenges to objective estimators of perceived speech quality. We suggest th...
Multiple-description coding is one way to gain robustness against lossy channels. We extend the multiple-description scalar quantizer (MDSQ) to a channel-optimized MDSQ (COMDSQ) that minimizes mean-squared error for a given channel environment. We di...
A multiple data set fitting problem often arises in conjunction with the development of objective estimators of perceived audio or video quality. In such development work, we often seek the best linear relationship between a set of objective audio or...
It is often desirable to compensate for system gain, especially before objectively estimating perceived audio or video quality from system inputs and outputs. A common approach is to scale the system output to compensate for system gain. One can take...
The identification of linear systems from input and output observations is an important and well-studied topic. When both the input and output observations are noisy, the resulting problem is sometimes called the "errors in variables" problem. Existi...
This paper identifies optimum levels of reverse water-filling for codebook-based coding of noise and speech signals. We find that there is little to be gained from optimizing an effective rate parameter. We identify trade-offs between SNR and log-spe...
One of the questions that ongoing QoS efforts seek to answer is: "Given fixed network resources, how does one provide the highest possible quality of service to the maximal number of users in a fair way, even when those users are generating competing...
Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts ...
Part 1 of this paper describes a new approach to the objective estimation of perceived speech quality. This new approach uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizin...
We present two techniques that can be used to enhance objective estimators of perceived speech quality. Frame normalization and frame-energy plane partitioning are described and applied to a log-spectral-error-based estimator. The resulting estimator...
Frequency-domain companding can be used in conjunction with audio coders that produce white coding noise. In [1-2] it is demonstrated empirically that this technique colors white coding noise so that it is better masked by audio signals, resulting in...
ITU-T Recommendation P.861 describes an objective speech quality assessment algorithm for speech codecs. This algorithm transforms codec input and output speech signals into a perceptual domain, compares them, and generates a noise disturbance value,...
Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts ...
We describe six algorithms for bit allocation in audio coding. Each algorithm stems from the minimization of a different perceptually-motivated objective function. Three of these objective functions are extensions of existing ones, and three are new....
This contribution aggregates the available performance data on the MNB and P.861 objective speech quality measures. Specifically, results presented in contributions T1A1.7/97-032 and T1A1.7/97-034 are examined. Based on examination of the aggregated ...
We describe a new approach to the estimation of perceived speech quality. The approach uses a simple, but effective, perceptual transformation to emulate hearing and a hierarchy of Measuring Normalizing Blocks (MNB's) to emulate auditory judgment. Th...
We describe a listening experiment that measures the perceived speech quality of 19 speech passbands using 8 talkers and 28 listeners. Results are referenced to the traditional wide-band and narrow-band telephony passbands. Our findings may help thos...
This contribution is provided for informational purposes. It contains a description of an algorithm that has successfully been used to estimate the delay of telephony band speech. The algorithm features a coarse stage that uses speech envelopes and a...
Excitation patterns and masking patterns are used extensively in perceptual audio coders and quality assessment algorithms. Numerous algorithms for calculating these patterns have been proposed. This paper provides comparisons among the patterns gene...
Four proposed perception-based techniques for objectively estimating speech quality and three traditional estimators are applied to coded speech samples. Agreement between objective estimates and corresponding subjective test scores is reported. Seve...
Objective (or instrumental) tests of speech quality have been proposed as ways to reduce the need for expensive and time-consuming subjective (or auditory) tests. Both types of tests attempt to quantify the range of opinions that listeners express in...
Working Group T1A1.5 is supporting ITU-T Study Group 12 in developing subjective audiovisual testing methods under Question 22/12 which addresses audiovisual quality in multimedia services. A previous contribution from Bellcore, T1A1.5/93-104, descri...
In a Study Group XII (Experts Group on Speech Quality) Contribution dated September 1991, John Rosenberger and Bill Cotton of Bellcore introduced an algorithm for generating temporally correlated distortion on 8 kHz sampled speech data. This distorti...