September 2024 | Technical Memorandum NTIA TM-24-274
A Powerful, Fixed-Size Modulation Spectrum Representation for Perceptually Consistent Speech Evaluation
Cite This Publication
Stephen D. Voran and Jaden Pieper, “A Powerful, Fixed-Size Modulation Spectrum Representation for Perceptually Consistent Speech Evaluation,” Technical Memorandum NTIA TM-24-274, U.S. Department of Commerce, National Telecommunications and Information Administration, Institute for Telecommunication Sciences, September 2024.
Stephen D. Voran and Jaden Pieper
Abstract:
We develop the wideband fixed-size modulation spectra (FMS) and show that they contain the necessary information to perform perceptually consistent evaluation of speech. We compare FMS with the already established frame-based modulation spectra as representations for estimating speech quality and speech naturalness. We feed the two representations into equally sized, relatively small, fully connected networks for five proof-of-concept experiments and find that the two representations perform similarly when estimating speech quality and speech naturalness. However, the FMS representation captures an entire speech file in a fixed-size representation, which means that additional temporal processing is neither needed nor possible. This is in contrast to the frame-based representations, where additional processing can either destroy information (e.g., averaging over time) or lead to more complex and difficult-to-train networks.
We also demonstrate that when speech quality changes within a speech file, FMS has another distinct advantage: it can efficiently and reliably identify different situations in a way that is not well addressed by the frame-based approach. Our experiments include more than 274 hours of speech in two languages. This speech is contained in 139,000 speech files, and there is a subjective score for each file. File lengths range from 0.6 to 28 seconds, and five different sample rates are present. Supporting software is available at https://github.com/NTIA/fms.
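To make the idea of a fixed-size representation concrete, the following is a minimal illustrative sketch and not the FMS algorithm defined in the memorandum. It assumes equal-width frequency bands in place of a mel filterbank, a hypothetical set of modulation frequencies, and hypothetical function names (`band_envelopes`, `fixed_size_modulation_spectrum`). It shows only that evaluating the modulation spectrum at a fixed set of modulation frequencies in Hz gives an output whose shape does not depend on file length or sample rate. The actual implementation is in the supporting software at the URL above.

```python
import numpy as np


def band_envelopes(x, fs, n_bands=8, frame_len=0.032, hop=0.008):
    """Return (n_bands, n_frames) log-energy envelopes and the envelope rate in Hz."""
    n = int(round(frame_len * fs))  # frame length in samples
    h = int(round(hop * fs))        # hop size in samples
    n_frames = 1 + (len(x) - n) // h
    window = np.hanning(n)
    frames = np.stack([x[i * h : i * h + n] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_bins)
    # Assumption: equal-width bands stand in for a mel filterbank.
    bands = np.array_split(power, n_bands, axis=1)
    env = np.stack([b.sum(axis=1) for b in bands])    # (n_bands, n_frames)
    return np.log10(env + 1e-10), fs / h


def fixed_size_modulation_spectrum(x, fs, mod_freqs=(1, 2, 4, 8, 16, 32)):
    """Return an (n_bands, n_mod) array, whatever the length of x."""
    env, env_rate = band_envelopes(x, fs)
    env = env - env.mean(axis=1, keepdims=True)  # remove the DC component
    t = np.arange(env.shape[1]) / env_rate       # frame times in seconds
    # Evaluate the Fourier transform of each envelope at fixed modulation
    # frequencies (Hz) rather than at length-dependent DFT bins.
    basis = np.exp(-2j * np.pi * np.outer(t, mod_freqs))  # (n_frames, n_mod)
    return np.abs(env @ basis) / env.shape[1]


# Two synthetic "files" with different durations and sample rates.
rng = np.random.default_rng(0)
short_file = rng.standard_normal(int(0.6 * 16000))  # 0.6 s at 16 kHz
long_file = rng.standard_normal(28 * 8000)          # 28 s at 8 kHz
a = fixed_size_modulation_spectrum(short_file, 16000)
b = fixed_size_modulation_spectrum(long_file, 8000)
print(a.shape, b.shape)  # both are (8, 6)
```

Because both outputs have the same shape, they could be fed directly to a fully connected network with a fixed input size, whereas a frame-based representation would produce 72 frames for the first signal and 3497 for the second in this sketch.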
Keywords: speech quality; time-varying speech quality; machine learning; auditory perception; fixed-size modulation spectra (FMS); mel spectrum; modulation spectrum; speech naturalness
For technical information concerning this report, contact:
Stephen D. Voran
Institute for Telecommunication Sciences
(303) 497-3839
svoran@ntia.gov
Disclaimer: Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.
This document contains software developed by NTIA. NTIA does not make any warranty of any kind, express, implied or statutory, including, without limitation, the implied warranty of merchantability, fitness for a particular purpose, non-infringement and data accuracy. NTIA does not warrant or make any representations regarding the use of the software or the results thereof, including but not limited to the correctness, accuracy, reliability or usefulness of the software or the results. You can use, copy, modify, and redistribute the NTIA-developed software upon your acceptance of these terms and conditions and upon your express agreement to provide appropriate acknowledgments of NTIA's ownership of and development of the software by keeping this exact text present in any copied or derivative works.
For questions or information on this or any other NTIA scientific publication, contact the ITS Publications Office at ITSinfo@ntia.gov or 303-497-3572.