On Monday, Netflix engineers posted an engaging explainer of how the company uses “prioritized load shedding” to keep users’ viewing experience as uninterrupted as possible. As recently as last year, the streaming service suffered outages caused by load congestion. It now has a “priority throttling filter” that can shed unnecessary server requests in real time whenever there is a problem on the backend.
In a nutshell, the filter, which runs in Netflix’s Zuul API gateway, prioritizes traffic based on how much a user needs it for playback. The system uses three buckets to categorize server requests: non-critical, degraded experience, and critical.
Non-critical items include logs and background requests, which, according to engineers, make up a large portion of system throughput. Even so, these requests can usually be ignored once the server load reaches a certain threshold.
Degraded-experience items are not necessary for playback of content but are used to improve the user experience. Stop and pause markers, language selection in the player, and viewing history are examples of server requests that can be shed when problems arise on the backend. Most of the time, users will not even notice that these items are missing, particularly while watching content.
The critical bucket is for traffic that affects users’ ability to play content. If these requests go down, trying to play a movie or show will result in an error message.
As a first step, Zuul assigns each request a priority score between 1 and 100. If problems develop on the backend, or even with Zuul itself, the filter can throttle the lowest-priority requests first. Serving playback content always gets preferential treatment over everything else, so when there are hiccups, they go largely unnoticed by most viewers.
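To make the idea concrete, here is a minimal Python sketch of bucket-based scoring and progressive shedding. It is an illustration only, not Netflix's implementation: the request kinds, the score assigned to each bucket, the `stress` measure, and the convention that a higher score means higher priority are all assumptions made for this example.

```python
from enum import Enum


class Bucket(Enum):
    NON_CRITICAL = "non-critical"
    DEGRADED_EXPERIENCE = "degraded experience"
    CRITICAL = "critical"


# Hypothetical mapping from request kind to bucket, based on the examples
# named in the article. The kind strings themselves are invented.
KIND_TO_BUCKET = {
    "log": Bucket.NON_CRITICAL,
    "background": Bucket.NON_CRITICAL,
    "pause_marker": Bucket.DEGRADED_EXPERIENCE,
    "language_selection": Bucket.DEGRADED_EXPERIENCE,
    "viewing_history": Bucket.DEGRADED_EXPERIENCE,
    "playback": Bucket.CRITICAL,
}

# Assumed scores on the 1-100 scale; here a higher score means higher priority.
BUCKET_SCORE = {
    Bucket.NON_CRITICAL: 10,
    Bucket.DEGRADED_EXPERIENCE: 50,
    Bucket.CRITICAL: 100,
}


def priority_score(kind: str) -> int:
    """Return the assumed 1-100 priority score for a request kind."""
    return BUCKET_SCORE[KIND_TO_BUCKET[kind]]


def shed_threshold(stress: float) -> int:
    """Map backend stress in [0.0, 1.0] to a cutoff score.

    Requests scoring strictly below the cutoff are shed, so the cutoff
    rises as stress rises and lower-priority traffic is dropped first.
    """
    clamped = min(max(stress, 0.0), 1.0)
    return int(clamped * 100)


def should_shed(kind: str, stress: float) -> bool:
    """Decide whether a request of this kind is dropped at this stress level."""
    return priority_score(kind) < shed_threshold(stress)


if __name__ == "__main__":
    for stress in (0.0, 0.3, 0.6, 1.0):
        shed = [kind for kind in KIND_TO_BUCKET if should_shed(kind, stress)]
        print(f"stress={stress}: shed {shed}")
```

In this sketch, mild stress drops only logs and background requests, heavier stress also drops the degraded-experience requests, and playback requests are never dropped because their score of 100 can never fall below the cutoff.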
As to the system’s effectiveness, Netflix points to a 2019 outage that prevented a “sizable percentage” of subscribers from playing content. Earlier this year, just days after implementing the filter, Netflix experienced a similar failure. However, this time Zuul kicked in and started shedding loads until the backend was stable. Users on the frontend experienced no interruptions.
“Unlike then [the 2019 outage], Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members’ ability to play at all,” say engineers. “Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure.”
We have provided just a brief overview of how the system functions. If you are interested in the technical details, Netflix has a full writeup on Zuul. It’s a good read for anyone curious about the backend workings of online services.