
Research Related to New Audiovisual Quality Recommendation, 2011

This report is intended to help progress the draft new Recommendation J.av-dist, "Methods for subjectively assessing audiovisual quality of internet video and distribution quality television, including separate assessment of video quality and audio quality".

This report summarizes documented research into the best way to perform subjective tests for modern audiovisual devices. One goal is to ensure that J.av-dist contains the best available methodology. Another goal is to avoid requirements that are either unnecessary or harmful for this application.

Conclusions follow the summary of documented research.

Absolute Category Rating (ACR) Subjective Scale in ITU Recommendations

The Absolute Category Rating (ACR) scale is defined in ITU-T Rec. P.800 and ITU-T Rec. P.910. This is a single stimulus rating method: the subject is presented with each stimulus once, then asked to rate it on a discrete, five-point scale. Translation of the scale into other languages is allowed.

ITU-T P.800 specifies a discrete 5-point scale, with three choices of wording:

Listening-quality scale, rating the quality of the speech:

5 Excellent

4 Good

3 Fair

2 Poor

1 Bad

Listening-effort scale, rating effort required to understand the meaning of the sentences:

5 Complete relaxation possible; no effort required

4 Attention necessary; no appreciable effort required

3 Moderate effort required

2 Considerable effort required

1 No meaning understood with any feasible effort

Loudness-preference scale, rating loudness preference:

5 Much louder than preferred

4 Louder than preferred

3 Preferred

2 Quieter than preferred

1 Much quieter than preferred

ITU-T P.800 allows the use of alternate wording choices only when the above three opinion scales do not meet the needs of the experimenter.

ITU-T P.910 specifies a discrete 5-point scale, with one choice of wording:

5 Excellent

4 Good

3 Fair

2 Poor

1 Bad

ITU-T P.910 allows the use of a nine-level scale, an eleven-level scale, or a continuous scale if more discriminative power is required. For example, the nine-level scale is as follows:

9 Excellent

8

7 Good

6

5 Fair

4

3 Poor

2

1 Bad

ITU-T Rec. P.910 identifies a variant of ACR, ACR with hidden reference (ACR-HR). In ACR-HR, the source video is included as one of the stimuli but not identified as such. During data processing, the ratings of the hidden source sequence are subtracted from the ratings of the corresponding processed sequences, yielding differential scores.
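
To illustrate this data processing step, the following is a minimal sketch of hidden reference removal. The data layout, the dictionary keys, and the convention of shifting differential scores back onto the 5-point scale (so that 5 means "as good as the source") are illustrative assumptions, not requirements stated in this report.

    # Minimal sketch of ACR-HR hidden reference removal (illustrative only).
    # ratings maps (subject, source clip, processing) -> raw ACR score, where the
    # processing label "REF" marks the hidden (unprocessed) reference.
    def hidden_reference_removal(ratings):
        """Convert raw ACR scores to differential scores."""
        diff_scores = {}
        for (subject, src, proc), score in ratings.items():
            if proc == "REF":
                continue  # the hidden reference is used only as a baseline
            ref_score = ratings[(subject, src, "REF")]
            # Subtract the subject's score for the hidden reference, then shift
            # back onto the 5-point scale: 5 means "as good as the source".
            diff_scores[(subject, src, proc)] = score - ref_score + 5
        return diff_scores

    example = {
        ("subj1", "clipA", "REF"): 5, ("subj1", "clipA", "codec1"): 3,
        ("subj2", "clipA", "REF"): 4, ("subj2", "clipA", "codec1"): 3,
    }
    print(hidden_reference_removal(example))
    # {('subj1', 'clipA', 'codec1'): 3, ('subj2', 'clipA', 'codec1'): 4}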

SAMVIQ / MUSHRA

ITU-R Recommendation BS.1534 describes the method "Multiple stimulus with hidden reference and anchor" (MUSHRA) for audio quality evaluation. ITU-R Rec. BT.1788 describes this same method for video quality evaluation under the name "Subjective assessment of multimedia video quality" (SAMVIQ). MUSHRA/SAMVIQ is an alternative to paired comparison tests for a narrow range of quality. One example is testing post-processing for HDTV.

MUSHRA is a double-blind, multi-stimulus test method with a hidden reference and hidden anchor(s). The test is performed from a computer interface, where the subject is presented with multiple versions of the same source sequence. The subject may play each stimulus multiple times and may choose the order in which the stimuli are rated. One of the stimuli is the reference video, labelled as such. The subject rates each version of the source sequence and adjusts the ratings relative to each other. SAMVIQ is basically the same method but does not use hidden anchors; however, hidden anchors are not generally used in video quality testing.

Degradation Category Rating (DCR), also known as Double Stimulus Impairment Scale (DSIS)

The degradation category rating (DCR) method from ITU-T Rec. P.910 and ITU-T Rec. P.800 also appears in the current version of ITU-R Rec. BT.500 under the name double stimulus impairment scale (DSIS).

DCR presents stimuli to subjects in pairs. The source (or reference) sequence is presented first, and the subject knows that it is the source sequence. The stimulus to be rated is presented second. The subject rates the difference in quality on an impairment scale. ITU-T Rec. P.910 and ITU-R Rec. BT.500 use the following labels:

5 Imperceptible

4 Perceptible but not annoying

3 Slightly annoying

2 Annoying

1 Very annoying

ITU-T Rec. P.800 uses the following alternative labels:

5 Degradation is inaudible

4 Degradation is audible but not annoying

3 Degradation is slightly annoying

2 Degradation is annoying

1 Degradation is very annoying

Alternate wordings are mentioned in the literature, though not approved in the existing ITU Recommendations. For example, MPEG video compression testing uses DCR with the ACR labels excellent, good, fair, poor, and bad (see [1]). This same variant of the DCR method is being used for the MPEG 3D subjective test effort currently underway.

Comparison Category Rating (CCR), also known as Double Stimulus Comparison Scale (DSCS)

ITU-T Rec. P.800 also identifies the comparison category rating (CCR) method. A pair of stimuli is presented to the subject; however, the order of the stimuli is randomized. The subject rates the quality of the second stimulus compared to the quality of the first on the following scale:

3 Much better

2 Better

1 Slightly better

0 About the same

-1 Slightly worse

-2 Worse

-3 Much worse

CCR is also mentioned in ITU-R Rec. BT.500, including a variant that uses a continuous scale instead of a discrete, 7-point scale. An obsolete version of BT.500 referred to this method as double stimulus comparison scale (DSCS). CCR is appropriate for comparisons between impaired stimuli, though this may complicate the data analysis.
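
As an illustration of why the randomized order complicates the analysis, the sketch below folds CCR votes for one pair of clips onto a common orientation by negating the score whenever the second-listed clip was shown first. The data layout and the sign convention are assumptions made for this example, not procedures taken from P.800.

    # Minimal sketch: average CCR votes for a pair of clips (A, B), folding the
    # randomized presentation order so every vote expresses "B relative to A".
    # The (score, b_shown_second) layout is an illustrative assumption.
    def mean_ccr_score(votes):
        folded = []
        for score, b_shown_second in votes:
            # The raw score rates the second presentation against the first;
            # flip the sign when B was shown first so all votes mean "B vs. A".
            folded.append(score if b_shown_second else -score)
        return sum(folded) / len(folded)

    votes = [(-2, True), (1, False), (-1, True), (2, False)]
    print(mean_ccr_score(votes))  # -1.5: on average, clip B is rated worse than A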

Double Stimulus Continuous Quality Scale (DSCQS)

The double stimulus continuous quality scale (DSCQS) method involves four presentations of two stimuli, A and B. One of these is the reference stimulus, assigned randomly to position A or B. The subject is presented with stimulus A, then B, then A again, then B again. Afterward, the subject rates A and B separately, each on a continuous or 100-level scale annotated with the ACR labels (excellent, good, fair, poor, and bad).

Single Stimulus Continuous Quality Evaluation (SSCQE)

The single stimulus continuous quality evaluation (SSCQE) method presents the subject with a stimulus of long duration. The subject continuously adjusts a slider so that it reflects his or her current opinion of the video quality. Ratings are sampled every half second. SSCQE was intended for the development of monitoring applications, such as no-reference video quality metrics.
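
As an illustration of the data this method produces, the sketch below reduces a single subject's SSCQE slider trace to per-interval mean scores. The 0-100 slider range and the 10-second aggregation interval are assumptions made for the example; the half-second sampling is the only detail taken from the description above.

    # Minimal sketch: collapse an SSCQE slider trace (one sample every 0.5 s,
    # on an assumed 0-100 scale) into per-interval mean scores. The 10 s
    # interval is an illustrative choice, not part of the method definition.
    def sscqe_interval_means(samples, sample_period=0.5, interval=10.0):
        per_interval = int(interval / sample_period)  # 20 samples per interval here
        means = []
        for start in range(0, len(samples), per_interval):
            chunk = samples[start:start + per_interval]
            means.append(sum(chunk) / len(chunk))
        return means

    trace = [72, 70, 65, 61] * 10          # 40 samples = 20 seconds of ratings
    print(sscqe_interval_means(trace))     # two 10-second means: [67.0, 67.0]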

Comparing Different Numbers of Rating Levels

A paper by Huynh-Thu [2] compared subjective test results from four varieties of ACR scales, using the same video samples. This experiment shows a very strong linear relationship between the 5-level discrete, 9-level discrete, 5-level continuous, and 11-level continuous scales. No statistically significant difference was found between the subjective results obtained with the different scales. These results indicate that the ACR single-stimulus presentation method produces very repeatable subjective results, even across different groups of participants, provided that the test design and instructions are carefully prepared.
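
As a sketch of how such a comparison can be made, the code below computes the Pearson correlation between per-clip MOS values collected on two different ACR scales. The MOS values are invented for illustration; they are not data from [2].

    # Minimal sketch: measure the linear relationship between MOS values
    # gathered on two ACR scales (e.g., 5-level vs. 9-level).
    from statistics import mean

    def pearson(x, y):
        mx, my = mean(x), mean(y)
        num = sum((a - mx) * (b - my) for a, b in zip(x, y))
        den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
        return num / den

    mos_5level = [1.8, 2.6, 3.4, 4.1, 4.7]  # per-clip MOS on the 5-level scale
    mos_9level = [2.4, 4.3, 6.0, 7.5, 8.6]  # same clips on the 9-level scale
    print(round(pearson(mos_5level, mos_9level), 3))  # near 1.0 => strongly linear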

Comparing Rating Scales: ACR, SAMVIQ / MUSHRA, DCR, DSCQS and SSCQE

Papers by Péchard [3], Brotherton [4], and Huynh-Thu [5] compare the ACR and SAMVIQ methods using the same set of video sequences. The two methods are shown to have similar behaviors, with correlations ranging from 0.899 to 0.969. SAMVIQ with 15 subjects is as precise as ACR with more than 22 subjects [3]. SAMVIQ takes approximately twice as long as ACR for the same number of stimuli, due to the ability of subjects to replay test sequences and change their votes.

A paper by Tominaga [6] compares DSCQS, DCR, SAMVIQ, and ACR. Two versions of SAMVIQ were considered: with and without hidden reference removal (that is, subtracting the source sequence scores from all impaired versions). Four versions of ACR were considered: a discrete 5-level scale (ACR5) and a discrete 11-level scale (ACR11), both with and without hidden reference removal. The same videos were rated on all scales by a single set of subjects. In a questionnaire after the subjective testing, subjects were asked to rate the ease of using each method on a scale of: difficult (1), slightly difficult (2), fair (3), slightly easy (4), and easy (5). The impairments in [6] spanned a wide range of quality.

Correlations between scores for all methods were very high. This indicates that the choice of rating scale has only a minor impact on data accuracy. The total assessment time for each method, from fastest to slowest, was: ACR5 (12 sec), ACR11 (14 sec), DCR (20 sec), SAMVIQ (29 sec), and DSCQS (41 sec). ACR5 was rated the easiest form of evaluation (4.33), followed by DSIS (3.92), SAMVIQ (3.48), DSCQS (3.31), and ACR11 (3.25). DCR, ACR5, and SAMVIQ had smaller normalized confidence intervals than ACR11 and DSCQS. Overall, [6] concluded that ACR5 is the most suitable method for quality assessment of mobile video services.

A paper by Pinson [7] compares SSCQE with DSCQS and CCR. This paper took samples from existing DSCQS and CCR experiments, then rated them with SSCQE. The data processing included hidden reference removal and multiple randomized viewer orderings. With these constraints, SSCQE was as accurate as either DSCQS or CCR. By extension of this research and [6], CCR and DSCQS likely have similar accuracy (this comparison was not made).

SSCQE has an added advantage of speed (it is the fastest of all methods), but the subjective testing community has not warmed to this method. This is perhaps because the data evaluation becomes more difficult, occasionally requiring time series analysis. SSCQE has the potential to allow the most evaluations from a single subject in a short time, as there are no pauses between stimuli for ratings. SSCQE may be an awkward choice for subjective testing of mobile devices, as these are often hands-on devices; thus, no hand may be available to adjust the slider.

Comparing Double Stimulus (DS) Rating Scales

The double stimulus (DS) methods are MUSHRA/SAMVIQ, DCR, CCR, and DSCQS. DS methods allow a type of evaluation that ACR cannot provide: an explicit comparison between stimuli. This can reveal types of impairments that a single stimulus method cannot. One example is color impairments. Changes to the color schema in a video may not be detected as impairments in ACR, because people expect colors to change in response to different lighting conditions. Between them, the DS methods ask one of two basic questions.

The first DS question is "how well does this impaired sequence reproduce the reference sequence?" DSCQS, DCR, and MUSHRA/SAMVIQ can all be used to answer this question. Of these, DCR is identified in [6] as more desirable than DSCQS or SAMVIQ, due to improved speed and ease of use, without loss of accuracy.

The second DS question is "which of these two sequences do you like better?" Either CCR or MUSHRA/SAMVIQ can be used to answer this question. CCR performs more similarly to DCR than to DSCQS or SAMVIQ in terms of speed. Given the results in [6], CCR seems an obvious choice to include in J.av-dist to answer this question.

The case for or against MUSHRA/SAMVIQ is less obvious. There is currently no systematic evidence that this method is superior to CCR, DSCQS, or DCR. However, MUSHRA/SAMVIQ can be used to answer both types of DS questions simultaneously. This could be an advantage for an experiment that focuses on a narrow range of quality. The studies mentioned so far focus on wide ranges of quality.

Impact of Language, Environment, Labels, and Number of Subjects on an ACR Test

A paper by Pinson [8] compared results of a single ACR experiment run in different laboratories and different environments. A summary of this research will be presented orally to the ITU-T JRG-MMQA at the Video Quality Experts Group (VQEG, www.vqeg.org) meeting in December 2011. This experiment used a wide range of audiovisual sequences and the 5-point discrete ACR scale. Data were collected by six different laboratories in four different countries (USA, France, Germany, and Poland). Some sessions were conducted in controlled environments and others in public environments. This experiment showed that ACR audiovisual subjective tests are highly repeatable from one laboratory and environment to the next. The number of subjects was the most important factor.

Based on [8], 24 or more subjects are recommended for Absolute Category Rating (ACR) tests. With 24 or more subjects, lab-to-lab correlations were always 0.97 or above. In public environments, approximately 35 subjects are required to obtain the same Student t-test sensitivity. The second most important variable was individual differences between subjects (i.e., personal opinion). Other environmental factors had minimal impact, including language, country, lighting, background noise, color blindness, wall color, monitor calibration, and translation of the ACR scale. These analyses indicate that the results of experiments conducted in the environments described by ITU-T Recs. P.800 and P.910 are highly representative of those devices in actual use, in a typical user environment.
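
The following sketch illustrates why more subjects are needed in a noisier environment: the confidence interval on a MOS narrows as the number of ratings grows. The rating standard deviations (0.8 for a controlled room, 1.0 for a public space) and the approximate Student t critical value are illustrative assumptions, not figures from [8].

    # Minimal sketch: approximate 95% confidence-interval half-width of a MOS
    # as a function of the number of subjects.
    import math

    def mos_ci_halfwidth(stdev, n_subjects, t_crit=2.06):
        # t_crit ~ 2.06 approximates the Student t value for ~25 degrees of
        # freedom; a table (or scipy.stats.t.ppf) would give the exact value.
        return t_crit * stdev / math.sqrt(n_subjects)

    for n in (16, 24, 35):
        controlled = mos_ci_halfwidth(0.8, n)   # assumed controlled-room noise
        public = mos_ci_halfwidth(1.0, n)       # assumed public-space noise
        print(n, round(controlled, 3), round(public, 3))
    # More subjects (or less rating noise) narrows the interval, which is what
    # makes small quality differences detectable with a Student t-test.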

For the purposes outlined in the scope of the draft new Recommendation J.av-dist, a highly repeatable experiment can be performed in a controlled room without detailed specification of the room itself (e.g., background noise, monitor calibration, lighting, walls). When conducted in a public environment, such an experiment appears only to require additional subjects.

A paper by Cai [9] used ITU-T Rec. P.800 with ACR and speech samples to compare a set of coding conditions in three different languages: Chinese, Japanese and English. Unlike [8], this experiment used different audio source samples for each language. The lab-to-lab correlations were lower (0.903 to 0.95).

A paper by Zielinski [10] summarizes literature that investigates sources of bias in subjective testing. Zielinski shows that the perceptual meaning of ACR labels shifts in magnitude as the labels are translated; however, a direct study comparing ACR with an unlabelled scale yielded identical results. This agrees with the results shown in [8]. Given the same source stimuli, the translation of labels did not impact ratings, and a slightly uneven perceptual distribution of labels did not impact ratings. These results indicate that labels may be modified to suit the test, without fear that the test accuracy will be negatively impacted.

The ACR Variant, ACR-HR

The Video Quality Experts Group has successfully used ACR-HR to validate video quality models. These efforts resulted in ITU-T Recommendations J.247, J.246, J.340, and J.341; plus ITU-R Recommendations BT.1866 and BT.1867. This ACR variation has proven value for evaluating the performance of objective video quality metrics, where the choice of methods must be a compromise between competing priorities (e.g., evaluating no-reference and full-reference models on the same subjective data).

Conclusions

Based on this research, the following seem appropriate for Draft New Recommendation J.av-dist:

Include three methods:

o ACR method as a discrete 5-level scale

o CCR method as a discrete 7-level scale

o DCR method as a discrete 5-level scale

Consider further whether or not MUSHRA/SAMVIQ should be included as an option.

Require different numbers of subjects, depending upon the method and environment

o ACR, CCR and DCR:

24 subjects minimum in a controlled environment

35 subjects minimum in a public environment

o Allow smaller groups of subjects for pilot studies, to identify trends

Allow modification to labels

o Change must be mentioned in the subjective test description

o Allow modified labels to suit experiment

Allow ACR-HR as an option.

Disallow a change to the number of levels.

o A comparison of rating methods indicates that additional levels do not increase method accuracy, yet they make the subject's task more difficult.

o If modifications to the number of levels are allowed, a warning might be appropriate that this modification will not increase the method's accuracy. However, the standard deviations of ratings will display fewer quantization effects.

Provide two environment options: controlled environment and public environment.

o The controlled environment is a room devoted only to the experiment at that time (e.g., a sound isolation booth, an office, or a simulated living room).

o The public environment is a multi-purpose area that includes people not involved with the experiment. The public environment is an area where someone would commonly use the audiovisual device.

o Instead of specifying environment details in this Draft New Recommendation, the experiment description should specify pertinent details of the environment that should be measured and reported (e.g., background noise level, lighting level, whether or not people uninvolved with the experiment were present).

o A picture of the environment may be appropriate.

References

[1] C. Fenimore, V. Baroncini, T. Oelbaum, and T. Tan, "Subjective testing methodology in MPEG video verification," SPIE Conference on Applications of Digital Image Processing XXVII, 2004.

[2] Q. Huynh-Thu, M. Garcia, F. Speranza, P. Corriveau, and A. Raake, "Study of rating scales for subjective quality assessment of high-definition video," IEEE Transactions on Broadcasting, vol. 57, no. 1, pp. 1-14, Mar. 2011.

[3] S. Péchard, R. Pépion, and P. Le Callet, "Suitable methodology in subjective video quality assessment: a resolution dependent paradigm." IMQA 2008.

[4] M. Brotherton, Q. Huynh-Thu, D. Hands, and K. Brunnström, "Subjective multimedia quality assessment," IEICE Trans. Fundamentals, Electron. Commun. Comput. Sci., vol. E89-A, no. 11, pp. 2920-2932, 2006.

[5] Q. Huynh-Thu and M. Ghanbari, "A comparison of subjective video quality assessment methods for low-bit rate and low-resolution video," in Proceedings of Signal and Image Processing, M.W. Marcellin, Ed., Honolulu, Hawaii, USA, 2005, vol. 479.

[6] T. Tominaga, T. Hayashi, J. Okamoto, and A. Takahashi, "Performance comparisons of subjective quality assessment methods for mobile video," Quality of Multimedia Experience (QoMEX), Jun. 2010.

[7] M. Pinson and S. Wolf, "Comparing subjective video quality testing methodologies," SPIE Video Communications and Image Processing Conference, Lugano, Switzerland, Jul. 2003.

[8] M. Pinson, L. Janowski, R. Pépion, Q. Huynh-Thu, C. Schmidmer, P. Corriveau, A. Younkin, P. Le Callet, M. Barkowsky, and W. Ingram, "The influence of environment on audiovisual subjective tests: an international study," publication pending.

[9] Z. Cai, N. Kitawaki, T. Yamada, and S. Makino, "Comparison of MOS evaluation characteristics for Chinese, Japanese, and English in IP Telephony," 4th International Universal Communication Symposium (IUCS), Oct. 2010.

[10] S. Zielinski, F. Rumsey, and S. Bech, "On some biases encountered in modern audio quality listening tests - a review," Journal of the Audio Engineering Society, vol. 56, no. 6, Jun. 2008.