October 1, 2021
Video streaming is a highly competitive market that dominates internet traffic. Video consumes 65% of worldwide mobile downstream traffic. Upcoming video technologies have steadily increasing bandwidth requirements: ultra-high definition (UHD) video surveillance (16 Mbps), self-driving vehicle diagnostics (20 Mbps), cloud gaming (30 Mbps), 8K wall televisions (100 Mbps), and UHD virtual reality (500 Mbps).
The most accurate way to assess video quality is a subjective video quality test, conducted according to ITU-R Rec. BT.500 or ITU-T Rec. P.913. In such a test, a panel of subjects rates the quality of video sequences within a carefully controlled experiment. Large companies like Intel, AT&T, and YouTube conduct subjective tests when making major business decisions. Subjective tests are expensive and time consuming, and thus rare.
Companies mostly depend on ad-hoc quality assessments to choose the optimal tradeoff between bandwidth and quality of experience (QoE). One to three engineers or stakeholders might run several videos through competing vendor equipment and choose the system that looks best. Ad-hoc evaluators then assume—without proof—that consumers will agree with their conclusions.
NTIA Technical Report TR-21-550, "Confidence Intervals for Subjective Tests and Objective Metrics That Assess Image, Video, Speech, or Audiovisual Quality," a groundbreaking report issued by ITS's Video Quality Research team, measures the relative accuracy of subjective tests and ad-hoc quality assessments. To find answers, ITS analyzed data from 60 subjective tests in which 2,331 subjects rated the quality of 17,665 media. The data included 90 lab-to-lab comparisons, where two labs conducted the same subjective test.
A 24-person subjective test will typically conclude that two videos have equivalent quality if their ratings differ by less than 13% of the subjective scale. This assumes that the subjective scale spans the full range of video quality, from excellent to bad. With 200 subjects, the threshold can be reduced to 5% of the subjective scale, but that would be expensive and unnecessary. Businesses do not base their decisions on the quality of individual videos; they make decisions about systems. Those decisions can be much more accurate, because their precision depends on the number of videos used to characterize each system and on how well those videos represent the company's service.
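The link between panel size and precision follows from the standard error of the difference between two mean opinion scores, which shrinks with the square root of the number of subjects. A minimal sketch, assuming a per-subject rating standard deviation of 0.8 on a 5-point scale (an illustrative value, not a figure from the report):

```python
import math

def equivalence_threshold(n_subjects, sigma=0.8, scale_range=4.0, z=1.96):
    """Half-width of a 95% confidence interval for the difference of two
    mean opinion scores, as a fraction of the rating scale.

    sigma is an assumed per-subject standard deviation; the true value
    varies by test and is not taken from the report.
    """
    half_width = z * sigma * math.sqrt(2.0 / n_subjects)
    return half_width / scale_range

# Roughly 11% for 24 subjects and 4% for 200 subjects under these
# assumptions, in the same ballpark as the 13% and 5% figures above.
print(f"24 subjects:  {equivalence_threshold(24):.1%}")
print(f"200 subjects: {equivalence_threshold(200):.1%}")
```

Halving the threshold requires roughly quadrupling the panel, which is why a 200-person test is rarely worth the cost.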
Subjective tests have extremely low error rates. When two labs conduct the same test, the odds that they will disagree on the relative ranking of two media are less than 1% and typically ≈0.17%. Higher error rates may mean subjects are having difficulty reaching decisions, such as how to rate a very funny YouTube video with crummy picture quality.
Ad-hoc quality assessments are entirely different. When comparing video systems with similar quality, half of all better/worse conclusions are false distinctions; that is, a subjective test would conclude that the videos have equivalent quality. A one-person ad-hoc quality assessment has an 11.4% error rate, dropping to 8.5% with two people and 6.8% with three. The actual error rates for a single person range from 3% to 30%, depending on the person, but there is no easy way to choose the right person. An ad-hoc evaluator's confidence in their own judgments is a particularly bad selection method, because self-confidence is uncorrelated with accuracy.
These analyses highlight the need for video quality metrics: computer programs that emulate subjective test results. To build trust, U.S. industry needs to understand the accuracy and precision of these metrics relative to subjective testing. ITS provides a solution: new statistical methods that compute a metric's confidence interval and, when confidence intervals are used to make decisions, determine whether the metric performs similarly to a subjective test with 15 or 24 subjects. When confidence intervals are not used, the metric's precision is likened to that of a certain number of people in an ad-hoc quality assessment. The methods in the report are developed and evaluated using speech quality, video quality, image quality, and audiovisual quality datasets. Code implementing these methods is available on GitHub.
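Decision-making with a metric's confidence interval can be sketched as follows. The function name and CI value are hypothetical illustrations of the general idea, not the report's actual algorithm or published code:

```python
def compare_systems(metric_a, metric_b, ci):
    """Compare two systems' metric scores using a confidence interval.

    ci is the metric's confidence interval, computed once per metric by
    statistical methods like those the report describes (0.5 on a
    [1, 5] scale is a made-up example value). Returns 'equivalent' when
    the score difference falls within the CI, otherwise reports which
    system scored better.
    """
    diff = metric_a - metric_b
    if abs(diff) <= ci:
        return "equivalent"
    return "A better" if diff > 0 else "B better"

print(compare_systems(4.1, 3.9, ci=0.5))  # equivalent
print(compare_systems(4.1, 3.2, ci=0.5))  # A better
```

Treating small score differences as "equivalent" rather than forcing a better/worse call is what lets a metric mimic the behavior of a subjective test panel.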