Proceedings of the IEEE Thirteenth International Workshop on Quality of Multimedia Experience (QoMEX 2021), Montreal, June 14-17, 2021
Stephen D. Voran
Objective speech quality and intelligibility estimators do not correctly assess speech generated by deep neural networks (DNNs). We use 256 speech files and subjective scores that cover 14 DNN speech conditions and 18 non-DNN speech conditions to show that eight different full-reference (FR) estimators consistently underestimate subjective scores for the DNN conditions. Conversely, we find that five no-reference (NR) estimators consistently overestimate subjective scores for the DNN conditions. We show that a rudimentary but effective solution to these shortcomings is to simply average an FR result with an NR result. We also explore root causes and propose more fundamental solutions. It has been previously suggested that FR estimators over-penalize inaudible timing variations, or jitter. We conduct several experiments that measure and remove jitter from spectral representations of DNN speech inside FR estimators. Jitter removal compensates for some of the underestimation, confirming that jitter is part of the cause. In additional experiments we show that power mismatches on a syllabic time scale also contribute to the underestimation in FR estimators. Regarding NR estimators, we suggest that they can be trained to accurately rate DNN speech when sufficient speech signals and corresponding subjective scores are available.
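The averaging idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the FR and NR estimates have already been mapped to a common scale (for example, a 1 to 5 opinion-score scale) and that they are paired per speech file or condition. The function name and the example values are hypothetical and are not taken from the paper, which may use a different mapping or weighting.

```python
from typing import List, Sequence


def average_fr_nr(fr_scores: Sequence[float], nr_scores: Sequence[float]) -> List[float]:
    """Average paired FR and NR estimates, element by element.

    Assumption: both inputs are on the same scale and are ordered so that
    fr_scores[i] and nr_scores[i] refer to the same speech file or condition.
    """
    if len(fr_scores) != len(nr_scores):
        raise ValueError("FR and NR score lists must have the same length")
    # An FR underestimate and an NR overestimate partially cancel in the mean.
    return [(fr + nr) / 2.0 for fr, nr in zip(fr_scores, nr_scores)]


# Hypothetical values: FR estimates run low and NR estimates run high.
fr_example = [2.0, 3.0]
nr_example = [4.0, 4.0]
combined = average_fr_nr(fr_example, nr_example)
print(combined)
```

An unweighted mean is the simplest choice here; whether a weighted combination would perform better is not something this sketch addresses.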
Watch the recording of Voran's presentation on the NTIA YouTube channel.
Keywords: speech quality; speech intelligibility; objective estimator; DNN speech; neural speech
For technical information concerning this report, contact:
Stephen D. Voran
Institute for Telecommunication Sciences
Disclaimer: Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.