Assuring Video Quality
Testing Live Quality
What about tests for live encoding? To eliminate the variables introduced by network congestion, each live-encoded file was looped, archived, and then analyzed.
This approach has been used by another group of researchers at the University of Trento (UT) in Italy: Csaba Kiraly, Luca Abeni, and Renato Lo Cigno.
“For the evaluation of the received video quality, the 25fps CIF ‘foreman’ sequence has been used,” they wrote in an early 2010 report. “The sequence has been looped 4 times, to obtain 1200 frames and 100 GOPs.”
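As a sanity check on that arithmetic (a sketch of our own, not the researchers’ tooling), the standard CIF “foreman” clip is commonly distributed as 300 frames, so four loops yield the 1,200 frames they cite, which in turn implies 12-frame GOPs:

```python
# Hypothetical sanity check of the looping arithmetic described by the
# University of Trento researchers; the frame and GOP counts come from
# their report, and the 12-frame GOP length is what those numbers imply.
FOREMAN_FRAMES = 300   # the standard CIF "foreman" test sequence
LOOPS = 4
FPS = 25

total_frames = FOREMAN_FRAMES * LOOPS      # 1200 frames
gops = 100                                 # reported by the researchers
gop_length = total_frames // gops          # 12 frames per GOP
duration_s = total_frames / FPS            # 48 seconds of looped video

print(f"{total_frames} frames, {gops} GOPs of {gop_length} frames, "
      f"{duration_s:.0f}s at {FPS}fps")
```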
While we were encoding the archival files at the live encoder, using IP transport streams or an HD-SDI baseband video source, Kiraly and his fellow researchers were tackling a much larger issue facing our industry today: the effects of new forms of delivery on quality measurements.
Given the nature of adaptive bitrate (ABR) content, in which the local player defaults to the highest-quality video file listed in a given manifest, it appears impossible to test what a client sees at lower bitrates unless the content is examined as an elementary stream prior to segmentation. This held true whether we tested Adobe, Apple, or Microsoft encoding solutions and players, so we had to settle on single-bitrate video files for the subjective testing portions.
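For readers who want to reproduce that kind of single-bitrate check, here is a minimal sketch, assuming a local ffmpeg install; the file names are placeholders, not assets from our tests. It compares an encode against its reference with ffmpeg’s psnr filter, which is only practical on an elementary or single-bitrate stream, exactly the constraint described above.

```python
"""Minimal sketch of an objective check on a single-bitrate file, assuming
ffmpeg is installed; 'encoded.mp4' and 'reference.y4m' are placeholder names,
not files from the original tests."""
import subprocess


def psnr_report(encoded: str, reference: str, log_path: str = "psnr.log") -> None:
    # ffmpeg's psnr filter compares the first input (the encode) against the
    # second (the reference), writes per-frame stats to a log file, and prints
    # the average PSNR to stderr. Both inputs must match in resolution and
    # frame rate for the comparison to be meaningful.
    subprocess.run(
        [
            "ffmpeg", "-i", encoded, "-i", reference,
            "-lavfi", f"psnr=stats_file={log_path}",
            "-f", "null", "-",
        ],
        check=True,
    )


if __name__ == "__main__":
    psnr_report("encoded.mp4", "reference.y4m")
```

Once the same content is wrapped in ABR segments and left to the player’s bitrate heuristics, there is no equivalent hook for this kind of comparison.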
Likewise, the UT researchers studying the effect of peer-to-peer delivery on video quality noted that most P2P systems have a “way of splitting a stream in chunks [that we refer to … as] ‘media unaware,’ in contrast to … a ‘media aware’ distribution strategy [that] uses some knowledge about the encoded media stream to optimize its distribution and improve the streaming performance.”
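To make that distinction concrete, here is a toy sketch, not the researchers’ implementation: a media-unaware splitter cuts the bitstream into fixed-size chunks wherever the byte count lands, while a media-aware splitter cuts only on GOP boundaries so every chunk starts at a decodable point. The function names and sizes are invented for illustration.

```python
# Toy illustration of the two chunking strategies; the GOP sizes below are
# invented for the example, not drawn from the Trento study.

def media_unaware_chunks(bitstream: bytes, chunk_size: int) -> list[bytes]:
    # Fixed-size chunks: boundaries fall wherever the byte count dictates,
    # possibly mid-GOP, so a lost chunk can damage neighboring pictures.
    return [bitstream[i:i + chunk_size]
            for i in range(0, len(bitstream), chunk_size)]


def media_aware_chunks(gops: list[bytes], target_size: int) -> list[bytes]:
    # Pack whole GOPs into chunks, never splitting a GOP, so each chunk is
    # independently decodable and loss is confined to the GOPs it carries.
    chunks, current = [], b""
    for gop in gops:
        if current and len(current) + len(gop) > target_size:
            chunks.append(current)
            current = b""
        current += gop
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Pretend GOPs of varying encoded sizes (runs of b"x" stand in for data).
    gops = [b"x" * n for n in (900, 1100, 800, 1200, 1000)]
    stream = b"".join(gops)
    print(len(media_unaware_chunks(stream, 2000)), "media-unaware chunks")
    print(len(media_aware_chunks(gops, 2000)), "media-aware chunks")
```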
To test their theory of quality effects from P2P distribution, the UT researchers used a novel testing scheme: They simulated an overlay of 1,000 peers, connected according to a “random n-regular graph of degree 20,” in which each peer and the source had an upload bandwidth limit of 1Mbps. The researchers then applied the loop model described above, with the 4x loop yielding 100 groups of pictures (GOPs), and disseminated 2,000 chunks, of which only the middle 100 were used to evaluate video quality.
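The overlay itself is easy to reproduce in outline. The sketch below assumes the networkx library (not whatever simulator the UT team actually used) and simply builds the 1,000-peer, degree-20 random regular graph with a 1Mbps upload attribute on every node:

```python
"""Sketch of the overlay topology described above, using networkx as an
assumed tool; it is not the simulator the Trento researchers used."""
import networkx as nx

PEERS = 1000            # simulated peers
DEGREE = 20             # random n-regular graph of degree 20
UPLOAD_BPS = 1_000_000  # 1Mbps upload cap on every peer and the source

# Each node ends up with exactly DEGREE neighbors.
overlay = nx.random_regular_graph(DEGREE, PEERS, seed=42)
nx.set_node_attributes(overlay, UPLOAD_BPS, "upload_bps")

print(overlay.number_of_nodes(), "peers,",
      overlay.number_of_edges(), "links")  # 1000 peers, 10000 links
```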
The industry needs to assess whether alternative metrics, such as J.120 or SSIM, are better measures of objective quality than the current status quo, PSNR.
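For a sense of what switching metrics involves, the following sketch scores the same frame pair with both PSNR and SSIM using scikit-image; the frames are synthetic stand-ins, and J.120-style perceptual measurement is beyond a short example:

```python
"""Sketch of scoring one frame pair with PSNR and SSIM, assuming NumPy and
scikit-image are available; the frames are synthetic, not test footage."""
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)

# Stand-in luma frames: a CIF-sized reference and a mildly distorted copy.
reference = rng.integers(0, 256, size=(288, 352), dtype=np.uint8)
noise = rng.normal(0, 4, size=reference.shape)
distorted = np.clip(reference.astype(float) + noise, 0, 255).astype(np.uint8)

psnr = peak_signal_noise_ratio(reference, distorted, data_range=255)
ssim = structural_similarity(reference, distorted, data_range=255)

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```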
We also need to jettison the straw-man argument that every system uses the same base codec from the same codec firm. The thinking goes that if everyone’s using the same codec, quality differences must arise from each system’s preprocessing and other software modules peripheral to the main codec. This argument carried some weight at the outset of the Transitions testing, until the PSNR results proved widely divergent even between systems using the same manufacturer’s codec. In addition, several of the tested systems had their own H.264 codecs. Suffice it to say that the variability in codec usage is great enough that the “consistent codec” argument is not yet a reflection of reality, and is not likely to be for some time.
In conclusion, while these types of research tests sound like sheer torture to the average content owner, it is precisely this type of consistency in evaluation that will lead to better overall quality reference points. The television industry had 25 years of tinkering before it adopted standards, but the streaming industry (or P2P or ABR delivery via HTTP, for that matter) doesn’t have the luxury of waiting a quarter of a century to nail down quality settings.
This article originally ran in the 2011 Streaming Media Industry Sourcebook as "Questioning Quality."