A Powerful, Fixed-Size Modulation Spectrum Representation for Perceptually Consistent Speech Evaluation

Stephen D. Voran

September 2024 | Technical Memorandum NTIA TM-24-574

doi: 10.70220/klnzhwkg

Stephen D. Voran and Jaden Pieper

Abstract:

We develop the wideband fixed-size modulation spectra (FMS) and show that they contain the information necessary for perceptually consistent evaluation of speech. We compare FMS with the established frame-based modulation spectra as representations for estimating speech quality and speech naturalness. We feed the two representations into equally sized, relatively small, fully connected networks in five proof-of-concept experiments and find that the two representations perform similarly when estimating speech quality and speech naturalness. However, the FMS representation captures an entire speech file in a fixed-size representation, so additional temporal processing is neither needed nor possible. This contrasts with frame-based representations, where additional processing can either destroy information (e.g., averaging over time) or lead to more complex, difficult-to-train networks.

We also demonstrate that when speech quality changes within a speech file, FMS has a further distinct advantage: it can efficiently and reliably identify different situations in a way that the frame-based approach does not address well. Our experiments include more than 274 hours of speech in two languages, contained in 139,000 speech files, each with a subjective score. File lengths range from 0.6 to 28 seconds, and five different sample rates are present. Supporting software is available on GitHub.
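
To make the idea of a fixed-size representation concrete, the following Python sketch maps speech files of any length to an array of the same shape. It is an illustration only and is not the FMS algorithm of the memorandum. The function names, the uniform grouping of FFT bins into bands (a simple stand-in for a mel filterbank), the frame settings, and the list of modulation frequencies are all assumptions made for this example. The sketch computes a log-power envelope per band and then projects each envelope, over the whole file, onto a fixed set of modulation frequencies.

```python
import numpy as np

def band_envelopes(x, fs, frame_len=0.032, hop=0.008, n_bands=16):
    """Return (n_frames, n_bands) log-power envelopes and the frame rate in Hz.

    Hypothetical helper: bands are uniform groups of FFT bins, not mel bands.
    """
    n = int(round(frame_len * fs))   # samples per frame
    h = int(round(hop * fs))         # samples between frame starts
    n_frames = 1 + (len(x) - n) // h
    window = np.hanning(n)
    frames = np.stack([x[i * h : i * h + n] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    groups = np.array_split(power, n_bands, axis=1)
    bands = np.stack([g.sum(axis=1) for g in groups], axis=1)
    return np.log10(bands + 1e-10), fs / h

def fixed_size_modulation(env, frame_rate, mod_freqs=(0.5, 1, 2, 4, 8, 16, 32)):
    """Project each band envelope onto fixed modulation frequencies (Hz).

    The output shape is (n_bands, len(mod_freqs)) for any number of frames,
    because the projection sums over the entire file.
    """
    t = np.arange(env.shape[0]) / frame_rate
    basis = np.exp(-2j * np.pi * np.outer(mod_freqs, t))  # (n_mod, n_frames)
    centered = env - env.mean(axis=0)                     # remove per-band mean
    return np.abs(basis @ centered).T / env.shape[0]

# Usage: two noise signals of very different lengths give the same output shape.
rng = np.random.default_rng(0)
fs = 16000
for seconds in (0.6, 28.0):
    x = rng.standard_normal(int(seconds * fs))
    env, frame_rate = band_envelopes(x, fs)
    rep = fixed_size_modulation(env, frame_rate)
    print(seconds, env.shape[0], rep.shape)  # rep.shape is (16, 7) both times
```

In this sketch the number of frames grows with file length, while the final array does not, so a small fully connected network could consume it directly without averaging over frames or adding recurrent layers.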

Keywords: speech quality; time-varying speech quality; machine learning; auditory perception; fixed-size modulation spectra (FMS); mel spectrum; modulation spectrum; speech naturalness

(TM-24-574.pdf)

Related Links:

Link to GitHub

For technical information concerning this report, contact:

Stephen D. Voran
Institute for Telecommunication Sciences
(720) 446-6425
svoran@ntia.gov

Disclaimer:

Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.

This document contains software developed by NTIA. NTIA does not make any warranty of any kind, express, implied or statutory, including, without limitation, the implied warranty of merchantability, fitness for a particular purpose, non-infringement and data accuracy. NTIA does not warrant or make any representations regarding the use of the software or the results thereof, including but not limited to the correctness, accuracy, reliability or usefulness of the software or the results. You can use, copy, modify, and redistribute the NTIA-developed software upon your acceptance of these terms and conditions and upon your express agreement to provide appropriate acknowledgments of NTIA's ownership of and development of the software by keeping this exact text present in any copied or derivative works.

For questions or information on this or any other NTIA scientific publication, contact the ITS Publications Office at ITSinfo@ntia.gov or 303-497-3572.
