Real World Solutions for Improving Audio Quality in Virtual Conference Settings
ITS has released a new technical memorandum titled “Joint Analyses of No-Reference Speech Quality Estimation Tools and Conference Speech Recorded in Diverse Real-World Conditions.” This work started in the ITS Audio Quality Research Program and was significantly enhanced when Ken Tilley agreed to create an accompanying video that illustrates key content: “Real World Solutions for Improving Audio Quality in Virtual Conference Settings.”
Program staff participated in a major hybrid electrical engineering conference in 2023 and noticed that the prerecorded presentations varied widely in audio quality, including far too many with appallingly low quality and even marginal intelligibility. This situation unfortunately impaired the tutorial value of some contributions: it was hard for participants to extract the information that presenters were attempting to share.
Even though these presentations were made in individual non-studio locations with consumer-grade equipment, program staff knew that much better results could be easily achieved. In addition, staff knew of tools that can automatically detect issues, allowing conference organizers to ask for replacement versions and to ensure some basic audio quality standards without major manual effort.
This motivated staff to apply multiple available speech quality estimation tools to excerpts from over 2500 different conference recordings. These tools are typically developed through machine learning (ML) and produce estimates of perceived speech quality on a five-point scale (“one” means “bad” and “five” means “excellent”). Staff also developed and applied some new signal processing (SP) tools to identify excess noise, poor frequency response, and excessive clipping. The memorandum provides details of these analyses and shows that a variety of tools can reliably identify the problematic recordings, and that different tools tend to key in on different issues. Thus, the use of multiple tools can identify a wider set of problematic recordings than a single tool can identify.
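As a rough illustration of how an ML-style quality estimate and a basic signal check might be combined, here is a minimal Python sketch. It is not the memorandum's method: the function names, the clipping test, and both thresholds (`ml_threshold`, `clip_threshold`) are hypothetical choices made only for this example, and the ML scores are assumed to come from whatever estimators are available.

```python
def clipping_fraction(samples, full_scale=1.0, tolerance=0.001):
    """Fraction of samples whose magnitude is within `tolerance` of full scale.

    Hypothetical clipping indicator; real tools may use more robust tests.
    """
    if not samples:
        return 0.0
    limit = full_scale * (1.0 - tolerance)
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def flag_recording(ml_scores, samples, ml_threshold=2.0, clip_threshold=0.01):
    """Flag a recording using ML estimates (1 = bad, 5 = excellent) and a clipping check.

    Both thresholds are illustrative assumptions, not values from the memorandum.
    """
    # Flag if any ML estimator rates the recording below the threshold.
    ml_flag = any(score < ml_threshold for score in ml_scores)
    # Flag if too large a share of samples sits at full scale.
    sp_flag = clipping_fraction(samples) > clip_threshold
    # Using either flag catches a wider set of problems than one tool alone.
    return {"ml": ml_flag, "sp": sp_flag, "any": ml_flag or sp_flag}


# Usage: good ML scores, but half of the samples are at full scale.
result = flag_recording(ml_scores=[3.5, 4.0], samples=[0.0, 1.0, -1.0, 0.5])
```

In this toy case the ML scores alone would pass the recording, while the clipping check flags it, which mirrors the point that different tools key in on different issues.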
For example, using one set of thresholds, the ML tools identified 88 very bad recordings. Of these, 25 were also identified by the SP tools, while 63 were not. This emphasizes that speech quality is not fully characterized by the basic signal analyses performed by the SP tools. This makes sense, as digital speech coding and noise suppression artifacts are prevalent in these recordings, and these impairments are not easily detected by the basic SP tools.
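The arithmetic behind these counts can be sketched with simple set operations. The recording IDs below are placeholders invented for illustration. Only the counts (88, 25, 63) come from the text, and recordings flagged by the SP tools alone are left out because that number is not given here.

```python
def overlap_summary(ml_flagged, sp_flagged):
    """Summarize how two sets of flagged recording IDs overlap."""
    both = ml_flagged & sp_flagged
    ml_only = ml_flagged - sp_flagged
    return {
        "ml_total": len(ml_flagged),
        "both": len(both),
        "ml_only": len(ml_only),
        # Share of ML-flagged recordings that the SP tools also caught.
        "share_also_sp": len(both) / len(ml_flagged) if ml_flagged else 0.0,
    }


# Placeholder IDs: 88 ML-flagged recordings, 25 of which are also SP-flagged.
ml_ids = set(range(88))
sp_ids = set(range(25))
summary = overlap_summary(ml_ids, sp_ids)
```

Here `summary["ml_only"]` is 63 and `summary["share_also_sp"]` is 25/88, so the basic SP tools caught only about 28 percent of the recordings that the ML tools rated as very bad.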
The memorandum also offers low-cost, low-effort solutions that can reduce or eliminate many of the issues identified, allowing presenters to communicate their message more successfully. The accompanying video makes the work more accessible and brings it to life with examples of the various impairments found in the conference recordings and the techniques that can reduce or eliminate them.
While the memorandum addresses very practical and important issues for conference participants and organizers, it also supports and informs those who are seeking to build no-reference speech quality estimators. That work typically starts from clean speech recorded in a single, controlled environment and then simulates noise, reverberation, and other impairments in software. Some work has moved a bit beyond this paradigm by using scripted voicemail messages sent through real phone connections and spontaneous speech recorded while walking through outdoor and indoor public spaces.
This most recent ITS contribution to the body of work takes realism to a new level. The ITS work analyzes samples from the population of real-world conference-related recordings. These recordings contain unscripted speech recorded by individual conference presenters in their uncontrolled local environments, using available equipment. The recordings are truly “in-the-wild” and there are no assumptions or artificial motivations of any sort.
Thus, this latest ITS Technical Memorandum and accompanying video accomplish several distinct but related goals. They document real problems with recorded presentations and provide practical solutions to move conference participants and organizers beyond those problems. The video contributes greatly towards this goal by making the key points very accessible and even entertaining.
The memorandum also provides a reality check for ML estimators of speech quality by applying them to real-world recordings rather than the recordings typically found in development and testing databases. The ML tools show great utility, but the ITS results also identify some limitations that are important to recognize.