Full-Reference and No-Reference Objective Evaluation of Deep Neural Network Speech

Stephen D. Voran

June 2021 | Conference Paper

Abstract:

Objective speech quality and intelligibility estimators do not correctly assess speech generated by deep neural networks (DNNs). We use 256 speech files and subjective scores that cover 14 DNN speech conditions and 18 non-DNN speech conditions to show that eight different full-reference (FR) estimators consistently underestimate subjective scores for the DNN conditions. Conversely, we find that five no-reference (NR) estimators consistently overestimate subjective scores for the DNN conditions. We show that a rudimentary but effective solution to these shortcomings is to simply average an FR result with an NR result. We also explore root causes and propose more fundamental solutions. It has been previously suggested that FR estimators over-penalize inaudible timing variations, or jitter. We conduct several experiments that measure and remove jitter from spectral representations of DNN speech inside FR estimators. Jitter removal compensates for some of the underestimation, thus confirming that jitter is part of the cause. In additional experiments we show that power mismatches on a syllabic time scale also contribute to the underestimation in FR estimators. Regarding NR estimators, we suggest that they can be trained to accurately rate DNN speech when sufficient speech signals and corresponding subjective scores are available.
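
The averaging step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the FR and NR estimates have already been mapped to a common scale (for example, a 1 to 5 MOS-like scale). The function name `average_fr_nr` and the numeric values are hypothetical and are not taken from the paper; the paper's actual estimators and any score mapping are not reproduced here.

```python
def average_fr_nr(fr_scores, nr_scores):
    """Average paired FR and NR estimates, one pair per speech condition.

    Assumption: both lists are on the same scale and in the same order.
    """
    if len(fr_scores) != len(nr_scores):
        raise ValueError("FR and NR score lists must have the same length")
    # Equal-weight mean: FR underestimation and NR overestimation partly cancel.
    return [(fr + nr) / 2.0 for fr, nr in zip(fr_scores, nr_scores)]


# Hypothetical per-condition estimates (made-up values for illustration only).
fr = [2.0, 3.0, 4.0]   # FR estimates, tending low for DNN speech
nr = [4.0, 4.0, 4.5]   # NR estimates, tending high for DNN speech
combined = average_fr_nr(fr, nr)
```

An equal-weight mean is used here only because the abstract describes a simple average; whether other weightings would perform better is not addressed in this sketch.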

Watch the recording of Voran's presentation on the NTIA YouTube channel.

Keywords: speech quality; speech intelligibility; objective estimator; DNN speech; neural speech

(qomex2021voran1.pdf)

For technical information concerning this report, contact:

Stephen D. Voran
Institute for Telecommunication Sciences
(720) 446-6425
svoran@ntia.gov

Disclaimer:

Certain commercial equipment, components, and software may be identified in this report to specify adequately the technical aspects of the reported results. In no case does such identification imply recommendation or endorsement by the National Telecommunications and Information Administration, nor does it imply that the equipment or software identified is necessarily the best available for the particular application or uses.

For questions or information on this or any other NTIA scientific publication, contact the ITS Publications Office at ITSinfo@ntia.gov or 303-497-3572.
