Balancing Video Fidelity and Gameplay in the Cloud
Video Compression on Stadia
Video compression is a core technology used by Stadia to deliver cloud gaming experiences to our customers. Ultra-low-latency encoding of video game content is a hard problem: we have to push video encoders to their limits to produce good video quality in a resource-constrained environment.
Our engineers work on finding the balance between visual quality, encoding latency and video processing cost that is best for our players and the business. Tradeoffs between these metrics are complex because their impact on user experience is nonlinear and depends on factors such as game content and network conditions. At Google we make data-driven decisions, and for that we need a reliable way to measure these quantities.
In this blog post, we outline why measuring video quality is challenging, and share how we approach the problem at Stadia.
Perceptual Video Quality
When measuring video quality, we would like to understand perceptual quality, not the absolute pixel difference between the original and compressed streams. We would like to know which details in the video are redundant and can be removed imperceptibly, reducing the size of the video delivered to our gamers without affecting their enjoyment of Stadia.
If you watched an old movie or played a favorite video game from childhood recently, in addition to potentially feeling a sense of nostalgia, you might have also remarked on how different the visual and auditory experiences felt compared to more recent content. When you enjoyed the same content all those years ago, you probably didn’t perceive it to be “low quality.”
This last point reveals a foundation of cognition: Our perceptions change over time without us knowing it. Our brains don’t directly perceive reality; rather, they construct and reconstruct oversimplified models of the world. Perception is a context-dependent moving target.
In order to build high-quality products in an environment with ever-increasing standards for perceptual quality, our team needs to constantly measure and monitor perceptual quality and adjust our products to new user needs.
A proven way to measure perceptual video quality is to run a subjective test, which is usually expensive and time-consuming since it involves human raters. During the test, raters are shown the same encoded sequences and score each one on a scale from 1 to 5 according to its perceived quality, where 1 stands for “Unacceptable” and 5 for “Excellent.” Averaging these scores across raters gives the Mean Opinion Score (MOS).
Perceptual quality is inherently subjective and varies from person to person. Where one gamer finds a video acceptable, another might notice significant distortions that ruin their experience. Therefore, relying on one person to assess video quality is not sufficient, and the same videos are usually shown to multiple people to get more representative data.
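As a rough illustration of why several raters are needed, the sketch below averages one clip's ratings into a MOS and attaches a 95% confidence half-width. The function name and the normal-approximation factor of 1.96 are our assumptions for this example, not a description of Stadia's test tooling.

```python
from math import sqrt
from statistics import fmean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for one clip.

    `ratings` holds one score per rater on the 1-5 scale.
    The half-width uses a normal approximation (assumption).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = fmean(ratings)
    if len(ratings) < 2:
        # A single rater gives no information about spread.
        return mos, float("inf")
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    mos, half_width = mean_opinion_score([4, 4, 5, 3, 4])
    print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of raters the interval stays wide, which is one reason subjective tests need many participants and become expensive.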
A traditional alternative to running expensive subjective tests is using objective video quality metrics like peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). PSNR is derived from the mean squared pixel error between the original and distorted videos, while SSIM measures the structural similarity between them. Still, neither metric correlates well enough with MOS for us to rely on it when working on our products. In the graphs below, the red line represents the ideal correlation. You can see many data points (blue dots) that are far from the red line. These outliers make it hard to base video quality decisions on the metric, because it might score good-looking videos low and vice versa.
The red line represents the ideal correlation between MOS and a given metric. The first image shows PSNR; the second shows SSIM.
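To make the comparison concrete, here is a minimal sketch of how PSNR is computed for 8-bit samples and how a metric's agreement with MOS can be summarized with a Pearson correlation coefficient. Frames are represented as flat lists of pixel values, and both helper names are ours for illustration; this is not the evaluation code behind the graphs.

```python
import math


def psnr(reference, distorted, max_value=255):
    """PSNR in dB between two equally sized lists of pixel values."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)


def pearson(xs, ys):
    """Pearson correlation between metric scores and MOS values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(psnr([100, 100, 100, 100], [110, 90, 100, 100]))
```

A coefficient near 1 would correspond to points hugging the red line; the scattered outliers described above pull it down. Rank-based coefficients such as Spearman's are also commonly reported for this kind of comparison.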
To measure perceptual visual quality accurately without running time-consuming subjective tests, the Stadia team partnered with the Neural Compression Team from Google Research to build a perceptual video metric based on deep learning. The goal was a metric that correlates with MOS better than the existing metrics used by the industry, so that we can keep building state-of-the-art products as expectations for visual quality continue to rise.
The Video Perceptual Metric (VPM) is a reference-based video quality metric built using a deep neural network, which we trained on data from subjective tests our team conducted with internal human raters. It takes both the original and the distorted video streams and predicts a score in the range from 1 to 5.
The metric processes uncompressed videos, which are typically enormous. For instance, a 10-second uncompressed 4K video at 60 fps takes around 7 GiB of space. Our training data set comprises around 3000 videos at the moment. Training a deep neural network to predict MOS directly from 20+ terabytes of data would be very slow. Therefore, the metric first precomputes some key features from the input videos and then passes these features, together with MOS data from subjective tests, into TensorFlow for training. The feature values usually vary within a video stream, so they are computed for each video frame, and a histogram of the per-frame values is passed to the training stage.
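The sizes quoted above can be checked with some quick arithmetic, and the histogram step can be sketched in a few lines. The example assumes 3840x2160 frames stored as 8-bit 4:2:0 (1.5 bytes per pixel), which the post does not state; the histogram helper, its bin count and its value range are likewise hypothetical and stand in for whatever features VPM actually uses.

```python
GIB = 2 ** 30
TIB = 2 ** 40


def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=1.5):
    """Size of an uncompressed clip; 1.5 B/pixel assumes 8-bit 4:2:0."""
    return int(width * height * bytes_per_pixel * fps * seconds)


def feature_histogram(frame_values, bins, low, high):
    """Pool one per-frame feature into a fixed-length normalized histogram."""
    if bins < 1 or high <= low:
        raise ValueError("need bins >= 1 and high > low")
    if not frame_values:
        raise ValueError("at least one frame value is required")
    counts = [0] * bins
    for value in frame_values:
        position = (value - low) / (high - low)
        # Clamp out-of-range values into the first or last bin.
        index = min(max(int(position * bins), 0), bins - 1)
        counts[index] += 1
    return [count / len(frame_values) for count in counts]


clip_bytes = raw_video_bytes(3840, 2160, fps=60, seconds=10)
clip_gib = clip_bytes / GIB            # just under 7 GiB per clip
dataset_tib = 3000 * clip_bytes / TIB  # a little over 20 TiB in total

if __name__ == "__main__":
    print(f"{clip_gib:.2f} GiB per clip, {dataset_tib:.1f} TiB for 3000 clips")
    print(feature_histogram([0.0, 0.1, 0.5, 0.9, 1.0], bins=2, low=0.0, high=1.0))
```

Whatever the clip length, the histogram has a fixed number of entries, so the training stage sees a small, fixed-size input per video instead of gigabytes of raw frames.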
In contrast to traditional metrics like PSNR and SSIM, which compute a quality score for each frame separately, VPM also tries to capture how motion and transitions between frames would be perceived by Stadia gamers. To achieve that, the metric introduces a set of temporal features that look specifically at those characteristics.
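The post does not spell out which temporal features VPM uses. Purely as a hypothetical illustration of the idea, the sketch below measures how much each frame changes from the previous one and then compares that motion profile between the reference and the distorted stream. All function names are ours, and frames are flat lists of luma values.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two frames."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def temporal_activity(frames):
    """Change between each pair of consecutive frames in one stream."""
    return [mean_abs_diff(prev, cur) for prev, cur in zip(frames, frames[1:])]


def temporal_mismatch(reference_frames, distorted_frames):
    """Hypothetical temporal feature: how far the distorted stream's
    frame-to-frame change departs from the reference's, per transition."""
    ref = temporal_activity(reference_frames)
    dist = temporal_activity(distorted_frames)
    if len(ref) != len(dist):
        raise ValueError("streams must have the same number of frames")
    return [abs(r - d) for r, d in zip(ref, dist)]


if __name__ == "__main__":
    reference = [[0, 0], [10, 10], [10, 10]]
    distorted = [[0, 0], [10, 10], [20, 20]]
    print(temporal_mismatch(reference, distorted))
```

A per-frame metric compares frame N of one stream with frame N of the other; a feature like this one instead looks at how each stream evolves over time, which is where artifacts such as flicker show up.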
What makes VPM a perceptual video quality metric is that it tries to separate distortions that can be safely ignored from the ones that have side effects and might ruin the experience of Stadia gamers.
By using machine learning and training our models on data from subjective tests, we obtained the tools we need to tune our infrastructure and build next-level experiences for our gamers. ML continues to help us solve hard problems and achieve results that would otherwise take years to accomplish with human raters alone.
The red line represents the ideal correlation between MOS and a given metric. The first image shows PSNR; the second shows VPM.
Stadia operates in a rapidly evolving environment where quality standards and expectations are constantly growing. In order for our technological innovations to reflect how humans perceive reality, we must make sure we’re building metrics that model the world in the ways we do. VPM is one of the ways that we strive to provide the best user experience for gamers. We will continue to iterate on this and many of our core technologies as cloud gaming continues to evolve.
This project is conducted in collaboration with many people. Thanks to Danielle Perszyk, Brad Rathke, George Toderici, Eirikur Agustsson, Ramachandra Tahasildar, Danny Hong, Jani Huoponen and Richard Xie.
-- Alex Sukhanov, Software Engineer @ Google Stadia and Troy Chinen, Software Engineer @ Google Research