
A Missing Factor in NR Metrics: Object Size and Artistic Intent

By Margaret H. Pinson, July 2025

During the June 29, 2025, meeting of the Video Quality Experts Group (VQEG) Subjective and objective assessment of GenAI content (SOGAI) group, Abhijay Ghildyal of Sony Interactive Entertainment and Portland State University shared research insights on using no reference (NR) metrics to predict Wölfflin's System for Characterizing Art. This effort builds upon his PhD thesis and offers insights into the value of NR metrics for assessing scales that do not assume a spectrum from good to bad. This article focuses on one of Wölfflin's factors, which extends from unity (a single object) to multiplicity (multiple objects).

Abhijay's research on Wölfflin's System for Characterizing Art may be a critical part of improving the accuracy and reliability of NR metrics. My hypothesis is that unity vs. multiplicity (and other artistic factors) influences human perception of video impairments.

To support this hypothesis, we will begin with a dataset designed for NR metric research, then explore an algorithm trained on this dataset, and close with my observations on NR metric accuracy versus object size and, by extension, art characterization. Throughout, I will provide historical context and links to other resources.

Related insights on challenges to NR metric development can be found in Section VII, "Caveats and Complications" of "Why No Reference Metrics for Image and Video Quality Lack Accuracy and Reproducibility." 

One frame from ITS4S, video PublicSafety_002-Psrc22A_0915K (tiny people)

Overview of the ITS4S Dataset for NR Metric Development

In 2007, ITS filmed crowd scenes at the University of Colorado at Boulder to support research into video quality assessment for first responder tasks. Our goal was to make available studio-quality, simulated surveillance footage that systematically depicts people at different sizes, quantified by the number of rows of stadium seats visible. This work was sponsored by the U.S. Department of Homeland Security (DHS). The full set of crowd scene footage can be downloaded from the Consumer Digital Video Library (CDVL) by selecting "Crowd Scenes" on the advanced search "Dataset" pulldown menu. For the purposes of this discussion, these crowd scenes depict the same object (people) filmed at different sizes.

Many of these crowd scenes appear in ITS4S, a subjective video quality dataset from 2018. ITS4S was designed to provide insights into improved experiment designs for training NR video quality metrics. The dataset focuses on two factors. First, metric performance must degrade gracefully in response to new content (i.e., subject matter, camera, editing). Second, the metric must accurately predict the quality of original videos (e.g., broadcast quality, contribution quality, professional cameras, prosumer cameras).

To accurately assess the ITS4S mean opinion scores (MOS), an NR metric must accurately predict the quality of video sequences that do not contain coding artifacts. To address these needs, ITS4S contains 813 unique video sequences, 35% of which contain no compression artifacts. The remaining 65% contain simple impairments, chosen to minimize the confound between coding impairments and the original video's quality as the coding bitrate falls.

The videos in the ITS4S dataset are each unique (no repeated use of the same source). Of particular interest is a series of videos that depict crowds of people at an American football stadium, filmed at different zooms, with file names "PublicSafety_001..." through "PublicSafety_034_...". The size of the person depicted increases as you progress from 001 through 034. Scroll down for sample frames.

One frame from ITS4S, video PublicSafety_013-Psrc22D_2340K (medium size people)

Optimal Edge Filter Sizes for Video Quality Metrics

Let us now turn our attention to objective metrics. 

In 2000, we linked the point of failure of the ITS reduced reference (RR) metric in the VQEG FRTV Phase I validation test to our use of the Sobel edge detection filter. Loosely speaking, people care more about large edges than small edges. This insight led to the development of our spatial information (SI) filter. Replacing the (3×3) Sobel filter with a (13×13) SI filter led to our later success with the NTIA General Model (nicknamed VQM) and the other RR metrics described here.
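To make the distinction concrete, here is a minimal sketch of an SI-style edge filter, assuming the weight profile w(x) = (x/c)·exp(−x²/(2c²)) from the published SI13 design. The constant c, the normalization, and the row-replicated 2D layout below are illustrative simplifications, not the exact NTIA filter definition.

    import numpy as np
    from scipy.ndimage import convolve

    def si_kernels(size=13, c=2.0):
        # Build horizontal and vertical kernels for an SI-style edge filter.
        # Weight profile w(x) = (x/c) * exp(-x^2 / (2*c^2)); c = 2 and the
        # normalization below are illustrative, not the exact NTIA constants.
        half = size // 2
        x = np.arange(-half, half + 1, dtype=float)
        w = (x / c) * np.exp(-0.5 * (x / c) ** 2)
        w /= np.abs(w).sum()              # rough normalization (assumption)
        h = np.tile(w, (size, 1))         # each row repeats the 1-D profile
        return h, h.T                     # h finds vertical edges; h.T horizontal

    def si_edge_energy(luma, size=13, c=2.0):
        # Per-pixel edge magnitude, combining the two filter orientations.
        h, v = si_kernels(size, c)
        gh = convolve(luma.astype(float), h, mode="nearest")
        gv = convolve(luma.astype(float), v, mode="nearest")
        return np.sqrt(gh ** 2 + gv ** 2)

A Sobel filter responds about as strongly to one-pixel texture as to the boundary of a large object; the wider SI profile weights its response toward edges that persist across many pixels, which better matches what viewers notice.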

At the VQEG 2017 meeting in Poland, industry stated very clearly that NR metrics must provide root cause analysis (RCA) in addition to MOS. Industry needs to know why the quality is bad and what can be done about it. This guidance led to my decision to design NR metric Sawatch as a combination of multiple smaller NR metrics, each providing RCA for a single impairment. Our RR metrics used this same design strategy.
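As a hypothetical illustration of that design strategy (not Sawatch's actual code), the parent metric can run each RCA sub-metric independently and then fold the per-impairment values into one overall score. The weights and the [1, 5] MOS clamp below are assumptions for illustration only.

    def combine_rca_metrics(luma, sub_metrics, weights):
        # sub_metrics: {impairment name -> function of the luma plane}
        # weights:     {impairment name -> penalty per unit of impairment}
        # Both mappings are hypothetical; a real metric would train the
        # weights against subjective MOS data.
        rca = {name: fn(luma) for name, fn in sub_metrics.items()}
        penalty = sum(weights[name] * value for name, value in rca.items())
        mos = max(1.0, min(5.0, 5.0 - penalty))   # clamp to the MOS scale
        return mos, rca                           # overall score plus RCA detail

Returning the per-impairment values alongside the overall score is what gives engineers the "why": each sub-metric names the impairment it measures.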

By 2019, I was examining outliers from the ITS4S dataset to understand the impact of edge energy on NR metric quality assessment. My task was to determine optimal filter sizes. These investigations led to the development of NR metric S-FineDetail, which compares small edges (5×5 filter) with large edges (15×15 filter) in the luma plane. S-FineDetail is an NR metric for RCA that assesses whether all small edges are pieces of larger edges. This could indicate up-sampling, overly aggressive noise filtering, or low bit-rate compression that erased fine details.
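The sketch below conveys the idea in the spirit of S-FineDetail, reusing the si_edge_energy helper from the earlier sketch; the threshold and the overlap statistic are my assumptions, not the published algorithm.

    def fine_detail_feature(luma, thresh=20.0):
        # Compare edge energy from a small (5x5) and a large (15x15) filter.
        small = si_edge_energy(luma, size=5)
        large = si_edge_energy(luma, size=15)
        small_edges = small > thresh              # thresh is an assumption
        overlap = small_edges & (large > thresh)
        # Fraction of small edges that coincide with large edges. A value
        # near 1.0 means nearly every small edge is part of a larger edge,
        # hinting at up-sampling, denoising, or compression that erased
        # fine detail; a lower value indicates genuine fine detail.
        return overlap.sum() / max(small_edges.sum(), 1)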

A scatter plot of MOS versus my draft NR metric's values showed very different responses within this set of crowd scenes. The size of the coding artifacts was identical, but the impact on quality differed.

One frame from ITS4S, video PublicSafety_034_Psrc22F_0915K (large people)

Visual Examination 

This white paper includes frames extracted from ITS4S videos PublicSafety_001-... through PublicSafety_034-..., showing the progression of depicted person size. These were filmed at the same event, with the same professional camera.

The first and third pictures were coded at the same bit-rate of 0.915 Mbps. Zoom in so that you can easily see single people in the first picture, where people are very small. You may need to copy this image into another app. The edges of people are very distorted, and the people seem composed of large blocks.

Now scroll down to the last picture (at the same zoom) and look at faces, such as the woman looking directly at the camera. These faces are about as tall as two entire people in the first picture. The faces consist of medium-sized blocks. The edges are distorted, but most of the faces remain distinct.

What I see in these frames is that edge noise and blockiness degrade perceived quality more severely when you look at a small object and less severely when you look at a large object.

Abhijay's metric and Wölfflin's System for Characterizing Art may provide a means to automatically detect these differences in image composition, which could in turn remove this confounding factor from NR metrics.