
ITS Audio Quality Research Program

The ITS Audio Quality Research Program addresses selected open questions in digital speech and audio quality assessment, enhancement, compression, and transmission. The breadth and depth of our contributions are most easily appreciated via our Publications page. Here are some of the more recent highlights:

Dataset Concealment: As we continue to study how machine learning (ML) is used to create no-reference (NR) speech quality estimators, we often recognize a lack of robustness and clarity in evaluating and reporting innovations. Small changes in correlation are often said to be meaningful, but their proper interpretation depends strongly on context. To address this situation, we have developed Dataset Concealment (DSC), a rigorous new procedure for evaluating and interpreting NR speech quality estimators. DSC quantifies and decomposes the performance gap between research results and real-world application requirements, while offering context and additional insights into estimator behavior and dataset characteristics. We report this work in our ICASSP 2026 paper and we have made code available as well. The paper also shows the benefits of addressing the corpus effect by using the dataset Aligner from AlignNet when training models with multiple datasets. One key result is that adding the 1000-parameter dataset Aligner to the 94-million-parameter Wav2Vec model during training significantly improves the resulting model’s ability to estimate speech quality for unseen data.

Frequency-Domain SNRs: Restoration of degraded audio signals is commonly performed on complex-valued frequency-domain (FD) representations via manipulation of magnitudes and phases or manipulation of real and imaginary parts. In general, these manipulations do not produce consistent representations. The consequence is that the magnitudes and phases (or real and imaginary parts) of the restored time-domain signal (which are always consistent) do not match the generally inconsistent values imposed during FD restoration. One might say “What we get is not what we asked for.” In order to better understand the complex interplay at work, we developed two-dimensional FD SNR frameworks that visually reveal how consistency enforcement changes the applied FD restorations to arrive at the achieved FD restorations. We report this work in our WASPAA 2025 paper and we have made code available as well. Examples in the paper show how extended Griffin-Lim algorithms can reduce and direct, but not eliminate, the changes produced by consistency enforcement. Also in the paper, we show how objective estimators can connect this work to estimated speech quality and intelligibility.
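The gap between applied and achieved FD values can be illustrated with a minimal Python sketch. This is not the code released with the paper. The frame length, hop, square-root Hann window, random test signal, random per-cell gains, and all function names are assumptions chosen for illustration. The sketch computes an STFT, applies magnitude gains, returns to the time domain with a least-squares overlap-add inverse, and then re-analyzes the result to see what was actually achieved.

```python
import cmath
import math
import random

N = 8    # frame length in samples (illustrative assumption)
HOP = 4  # 50% overlap (illustrative assumption)
# Periodic square-root Hann window (an assumed choice).
WIN = [math.sin(math.pi * n / N) for n in range(N)]


def dft(frame):
    """Plain O(N^2) discrete Fourier transform of one frame."""
    size = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / size)
                for n in range(size)) for k in range(size)]


def idft(spec):
    """Inverse of dft()."""
    size = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * n / size)
                for k in range(size)) / size for n in range(size)]


def stft(x):
    """Windowed DFT of each overlapping frame of x."""
    return [dft([WIN[n] * x[s + n] for n in range(N)])
            for s in range(0, len(x) - N + 1, HOP)]


def istft(spec, length):
    """Least-squares overlap-add inverse; always yields a real signal."""
    y = [0.0] * length
    norm = [0.0] * length
    for m, frame_spec in enumerate(spec):
        frame = idft(frame_spec)
        for n in range(N):
            y[m * HOP + n] += WIN[n] * frame[n].real
            norm[m * HOP + n] += WIN[n] ** 2
    # Samples that no window reaches (weight 0) are set to zero.
    return [v / w if w > 1e-12 else 0.0 for v, w in zip(y, norm)]


def apply_gains(spec, rng):
    """Scale each time-frequency cell by a random gain in [0, 1]."""
    out = []
    for frame_spec in spec:
        half = [rng.uniform(0.0, 1.0) for _ in range(N // 2 + 1)]
        gains = half + half[-2:0:-1]  # mirror to keep conjugate symmetry
        out.append([g * c for g, c in zip(gains, frame_spec)])
    return out


def distance(a, b):
    """Euclidean distance between two STFTs of the same shape."""
    return math.sqrt(sum(abs(u - v) ** 2
                         for fa, fb in zip(a, b) for u, v in zip(fa, fb)))


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(40)]  # stand-in for audio

X = stft(x)                                  # consistent by construction
roundtrip_error = distance(stft(istft(X, len(x))), X)

Y = apply_gains(X, rng)                      # applied FD restoration
y = istft(Y, len(x))                         # restored time-domain signal
Z = stft(y)                                  # achieved FD restoration
gap = distance(Z, Y)

print("round trip of unmodified STFT:", roundtrip_error)
print("applied vs. achieved distance:", gap)
```

In this sketch `roundtrip_error` sits at floating-point precision because the unmodified STFT is consistent, while `gap` is clearly nonzero: the gains requested in `Y` are not the gains that the time-domain signal `y` actually carries.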

AlignNet: We have made significant progress in using machine learning (ML) to create no-reference (NR) speech quality estimators. ML approaches are powerful and effective, but also highly dependent on the quantity and diversity of listening experiment results that comprise the ground-truth training data. This motivates combining the results of multiple listening experiments to achieve the needed quantity and diversity of data, but combining is often impossible due to dataset inconsistencies.

We have developed two complementary machine-learning advances that address this issue. Multi-dataset finetuning (MDF) pretrains an NR estimator on a single dataset and then fine-tunes it on multiple datasets at once, including the dataset used for pretraining. AlignNet uses an AudioNet to generate intermediate score estimates before using the Aligner to map intermediate estimates to the appropriate score range. AlignNet is agnostic to the choice of AudioNet, so any successful NR speech quality estimator can benefit from its Aligner. The methods can be used in tandem, and we have completed two studies that show how they improve on current solutions by efficiently and effectively removing inconsistencies that impair the learning process, thus enabling successful training with larger amounts of more diverse data.

We presented this work at Interspeech 2024 in September. The paper and the code are available now.

Audio Signal STFT Phase Distributions: The discrete-time short-time Fourier transform (STFT) is a ubiquitous tool for analysis and processing of audio signals. It is commonly assumed that STFT coefficient phases are uniformly distributed. While this is approximately true for global distributions, our analysis of individual STFT coefficients and our analysis by coefficient magnitude both reveal highly nonuniform phase distributions. We have used mathematical derivations to identify the source of the nonuniformities. We combined these results with tone-based simulations and successfully reproduced the nonuniform phase distributions we have observed in audio signals.

We suggest that when audio signal processing requires a prior phase distribution, one should consider using a per-frequency, per-band, or per-magnitude level nonuniform prior distribution to see if that additional specificity leads to improved performance or efficiency compared to simply adopting the uniform prior. This could lead to improved separation and enhancement of speech and audio signals.

We presented this work at the 2024 IEEE International Conference on Multimedia and Expo. The paper is available.

Real World Speech Quality Measurements: This technical memorandum documents our work to apply a variety of ML- and DSP-based no-reference speech quality estimators to excerpts from over 2500 different prerecorded conference presentations. The recordings contain unscripted speech recorded by individual conference presenters in their uncontrolled local environments, using available equipment. The recordings are truly “in-the-wild” and there are no assumptions or artificial motivations of any sort. This is a significant departure from the speech databases that are typically used in the development and evaluation of no-reference speech quality tools.

We find that the ML and DSP tools can easily identify problematic recordings, and that different tools key in on different issues, so the use of multiple tools can identify a wider set of problematic recordings than any single tool can identify. The memorandum provides a reality check for no-reference speech quality estimators by applying them to real-world recordings rather than the recordings typically found in development and testing databases. The ML tools show great utility, but this work also identifies some limitations that are important to recognize.

WAWEnets: We have completed simplification, unification, regularization, further training, and more thorough analysis of the ITS Wideband Audio Waveform Evaluation Networks (WAWEnets). WAWEnets are fully convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. Example evaluation scales are overall speech quality, intelligibility, and noisiness. WAWEnets are no-reference networks because they do not require “reference” (original or undistorted) versions of the waveforms they evaluate. Our work has leveraged 334 hours of speech in 13 languages, more than two million full-reference target values, and more than 93,000 subjective mean opinion scores.

Our paper provides full details and our codebase is available as well. We first introduced WAWEnets at ICASSP 2020 in this paper.

Bursty packet losses or bit errors: Our technical memorandum and supporting code are available. These result from our work on the mathematical links between the Gilbert-Elliot model parameters (2-, 3-, and 4-parameter cases) and the resulting packet loss or bit error statistics. These links allow one to set parameters to obtain desired average loss rates, average burst lengths, loss covariances, etc. The code can estimate models and parameters from loss patterns and can generate error patterns dictated by model parameters or error statistics.
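For the simplest case, the link between parameters and statistics can be sketched in a few lines of Python. This is not the released code. It assumes a two-state chain in which the good state never loses a packet and the bad state always does, and it uses the hypothetical names `p` (probability of moving from good to bad) and `r` (probability of moving from bad to good). Under that assumption the stationary loss rate is p / (p + r) and the mean burst length is 1 / r, so both parameters follow directly from the desired statistics.

```python
import random


def gilbert_params(loss_rate, mean_burst_len):
    """Return (p, r) giving the requested loss rate and mean burst length.

    Assumes a two-state chain: good state = no loss, bad state = loss.
    """
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    if mean_burst_len < 1.0:
        raise ValueError("mean_burst_len must be at least 1")
    r = 1.0 / mean_burst_len               # bursts are geometric in length
    p = r * loss_rate / (1.0 - loss_rate)  # solves p / (p + r) = loss_rate
    if p > 1.0:
        raise ValueError("requested statistics are not achievable")
    return p, r


def generate_losses(p, r, n, seed=0):
    """Generate n loss indicators (1 = lost) from the two-state chain."""
    rng = random.Random(seed)
    bad = rng.random() < p / (p + r)  # start from the stationary distribution
    pattern = []
    for _ in range(n):
        pattern.append(1 if bad else 0)
        if bad:
            bad = rng.random() >= r   # remain in the bad state w.p. 1 - r
        else:
            bad = rng.random() < p    # enter the bad state w.p. p
    return pattern


def loss_stats(pattern):
    """Return (loss rate, mean burst length) measured from a loss pattern."""
    bursts = []
    run = 0
    for lost in pattern:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:
        bursts.append(run)
    rate = sum(pattern) / len(pattern)
    mean_burst = sum(bursts) / len(bursts) if bursts else 0.0
    return rate, mean_burst


# Usage: ask for 10% loss in bursts averaging 4 packets, then measure.
p, r = gilbert_params(loss_rate=0.1, mean_burst_len=4.0)
measured_rate, measured_burst = loss_stats(generate_losses(p, r, 100_000))
print("p =", p, "r =", r)
print("measured loss rate:", measured_rate)
print("measured mean burst length:", measured_burst)
```

The measured values should approach the requested ones as the pattern grows longer. The 3- and 4-parameter cases, where losses can also occur in the good state or be absent in the bad state, need the fuller relationships given in the memorandum and are not covered by this sketch.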

Optimal Frame Durations: We presented our work addressing optimal frame durations for separation of audio signals at the IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP 2021). The paper "Optimal Frame Duration for Oracle Audio Signal Separation is Determined by Joint Minimization of Two Antagonistic Artifacts" was one of the papers nominated for the Best Paper Award at the conference. The work shows that optimal processing frame duration in oracle binary masking and oracle magnitude restoration is determined by joint minimization of two antagonistic artifacts: temporal blurring (which increases with frame duration) and log-spectral-error change per unit time (which decreases with frame duration). These effects are related to the stationarity of the signals, but saying that “stationarity determines optimal frame duration” falls far short of describing the true nature and complexity of the interaction. The supporting demonstration and code are available on a separate page: Audio Demos for Frame Duration Study.

The Bigger Picture: The quality of speech sent over a telecommunication system depends on a variety of factors, such as the background noise and reverberation in the environment, the equipment and algorithms used to capture, enhance, encode, transmit, decode, and reproduce the speech signal, the bandwidth used in transmitting the speech signal, and others. The ITS Audio Quality Research Program supports community-wide efforts towards robust and adaptable telecommunication speech services and equipment with high quality and intelligibility. Application areas include wireless and wired traditional telephony, conferencing tools, and mission critical voice communications for first responders.

In the program we identify and address carefully selected open issues in these areas. We often apply signal processing and machine learning to develop novel algorithms that better characterize the user experience of speech quality and speech intelligibility. We also investigate numerous related issues. The best way to appreciate the full breadth and depth of our work is to visit the Publications & Talks page.
