Resolving Stuck Congestion Window in QUIC with CUBIC: A Step-by-Step Guide

Overview

When implementing congestion control for QUIC, developers often port existing TCP algorithms like CUBIC. However, subtle differences between TCP and QUIC can lead to unexpected behavior. This guide recounts a real-world bug in which a change borrowed from the Linux kernel, altering CUBIC's treatment of application-limited flows, caused the congestion window (cwnd) to remain permanently pinned at its minimum after heavy loss early in the connection. We'll walk through the root cause, the flawed code, and the elegant one-line fix that restored normal operation. By the end, you'll understand how to avoid this pitfall in your own QUIC implementations.

Source: blog.cloudflare.com

Prerequisites

Before diving in, ensure you have:

  • Basic understanding of TCP and QUIC transport protocols.
  • Familiarity with congestion control algorithms, especially CUBIC (RFC 9438).
  • Access to the source code of a QUIC implementation (e.g., Cloudflare's quiche or similar).
  • Ability to read C/C++ code for congestion control logic.
  • A test environment to reproduce the issue (e.g., a network simulator with high initial loss).

Step-by-Step Instructions

Step 1: Understand CUBIC's Normal Behavior

CUBIC uses a cubic function to grow the congestion window (cwnd) after a loss event. During congestion avoidance, cwnd increases slowly near the previous maximum window and faster elsewhere. The key state variables include:

  • cwnd: sender-side limit on bytes in flight.
  • ssthresh: slow start threshold, set after loss.
  • W_max: window size just before last loss.

When packet loss occurs, CUBIC multiplies cwnd by a factor β (0.7 in RFC 9438) and sets ssthresh to the same reduced value. After that, it probes for new bandwidth along the cubic curve, sketched below.
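
To make the growth curve concrete, here is a minimal C sketch of the window function from RFC 9438, W_cubic(t) = C*(t - K)^3 + W_max, where K is the time at which the curve regains W_max. The constants come from the RFC; the function and variable names are illustrative, not taken from any particular implementation:

#include <math.h>

// Constants from RFC 9438: C controls aggressiveness, beta is the
// multiplicative decrease factor applied on loss.
#define CUBIC_C    0.4
#define CUBIC_BETA 0.7

// Window (in MSS units) t seconds after the last loss event.
// K is the time at which the curve returns to w_max.
double w_cubic(double t, double w_max) {
    double k = cbrt(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);
    return CUBIC_C * pow(t - k, 3.0) + w_max;
}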

Step 2: The Bug Context – App-Limited Exclusion

An app-limited flow is one where the application restricts the sending rate (e.g., it does not have enough data to fill the window). RFC 9438 (§4.2) recommends that CUBIC not increase cwnd during app-limited periods, because growth in those periods is not backed by evidence that the network can carry a larger window. In 2023, a Linux kernel patch enforced this exclusion more strictly: cwnd growth was suppressed whenever the connection was app-limited, even during recovery.
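
For illustration, an app-limited check in this spirit might look like the sketch below. The names and the exact condition are hypothetical; the idea is simply that a flow is app-limited when the application has nothing queued and the window is not full:

#include <stdbool.h>
#include <stdint.h>

// Hypothetical app-limited check (names and condition are illustrative):
// the flow is app-limited when the send queue is empty and bytes in
// flight do not fill the current window.
bool is_app_limited(uint64_t bytes_in_flight, uint64_t cwnd, bool send_queue_empty) {
    return send_queue_empty && bytes_in_flight < cwnd;
}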

In TCP, this change worked fine because app-limited events are transient. However, when ported to QUIC (which operates over UDP), a new interaction emerged: QUIC's packet pacing and delayed acknowledgments could prolong the app-limited state during early connection loss.

Step 3: Reproduce the Failure

To see the bug, set up a test with high initial loss (e.g., 50% packet loss for the first few RTTs). In quiche, a CUBIC implementation that adopted the kernel patch's behavior would respond as follows:

  1. After the first loss, cwnd drops to the minimum (e.g., 2 MSS).
  2. The connection enters the app-limited state because the sender cannot fill even that tiny window (the application may not have data yet, or pacing delays cause idle periods).
  3. Because the app-limited exclusion is active, CUBIC never attempts to increase cwnd after recovery.
  4. cwnd stays at the minimum indefinitely, effectively stalling throughput.

Here's an example of the buggy code (simplified):

// In CUBIC's congestion_avoidance function:
if (app_limited) {
    // Do nothing – this is the app-limited exclusion
    return;
}
// Otherwise, grow cwnd based on cubic function
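
To watch the stall end to end, here is a hypothetical, heavily simplified simulation of that logic. Reno-style growth stands in for the cubic curve; the point is only that the early return pins cwnd at its floor:

#include <stdint.h>
#include <stdio.h>

#define MSS 1200
static uint64_t cwnd = 2 * MSS;  // pinned at the floor after the loss

static void on_ack(int app_limited) {
    if (app_limited)
        return;                   // blanket exclusion: the bug
    cwnd += MSS * MSS / cwnd;     // simplified growth, stand-in for cubic
}

int main(void) {
    // The tiny window keeps the sender app-limited, so growth never runs.
    for (int rtt = 0; rtt < 100; rtt++)
        on_ack(1);
    printf("cwnd after 100 RTTs: %llu bytes\n", (unsigned long long)cwnd);
    return 0;
}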

Step 4: Identify the Root Cause

The problem is the blanket application of the app-limited exclusion during recovery. After a loss event, the connection is naturally in a vulnerable state; preventing cwnd growth entirely means it can never escape the minimum. The RFC intended the exclusion for normal congestion avoidance phases, not for recovery. The fix is to allow cwnd growth when the connection is recovering from loss, even if it is app-limited.


Step 5: Apply the One-Line Fix

Modify the condition to check the recovery state:

// Fixed version:
if (app_limited && !in_recovery) {
    // Keep app-limited exclusion only when not in recovery
    return;
}
// Otherwise, proceed with cubic growth

This ensures that during recovery (after loss), cwnd can still increase, albeit cautiously (the cubic function already limits growth near W_max).
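
How in_recovery is derived depends on the stack. RFC 9002's sample pseudocode treats a packet as sent during recovery if its send time is not later than the start of the current recovery period; a C transliteration of that check might look like:

#include <stdbool.h>
#include <stdint.h>

// Following RFC 9002's sample pseudocode: a packet counts as sent during
// recovery if its send time is not later than the start of the current
// recovery period.
bool in_congestion_recovery(uint64_t sent_time_us, uint64_t recovery_start_us) {
    return sent_time_us <= recovery_start_us;
}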

Step 6: Verify the Fix

Re-run the same high-loss test. The connection should now recover normally: cwnd climbs from minimum back to a useful size within a few RTTs. In Cloudflare's testing, the failure rate dropped from 61% to 0%.

Common Mistakes

Applying App-Limited Exclusion Too Broadly

Developers often copy the kernel's app-limited check verbatim without considering the phase of congestion control. Always limit such exclusions to non-recovery states.

Ignoring QUIC-TCP Differences

QUIC's packet pacing and delayed ACKs can make app-limited events more frequent. Test with realistic network conditions (loss, delay, application traffic patterns).
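
As a point of reference, RFC 9002 recommends pacing at a rate proportional to cwnd divided by the smoothed RTT, scaled by a small factor N (commonly around 1.25). A sketch of that calculation, with illustrative names:

// Pacing-rate sketch following RFC 9002's recommendation:
// rate ~= N * congestion_window / smoothed_rtt.
double pacing_rate_bytes_per_sec(double cwnd_bytes, double smoothed_rtt_s, double n) {
    return n * cwnd_bytes / smoothed_rtt_s;
}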

Forgetting to Update ssthresh

After fixing cwnd growth, ensure ssthresh is also set appropriately during recovery. Otherwise, the connection might re-enter slow start more often than intended.
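
As a sketch, a loss-event handler that keeps the two consistent might look like the following, assuming RFC 9438's β of 0.7 and a two-datagram floor as in RFC 9002 (the struct and field names are illustrative):

#include <stdint.h>

#define MSS 1200

struct cc_state {
    uint64_t cwnd;
    uint64_t ssthresh;
    uint64_t w_max;
};

static void on_congestion_event(struct cc_state *cc) {
    cc->w_max = cc->cwnd;              // remember the window at loss for the cubic curve
    cc->ssthresh = cc->cwnd * 7 / 10;  // multiplicative decrease, beta = 0.7
    if (cc->ssthresh < 2 * MSS)
        cc->ssthresh = 2 * MSS;        // floor, as in RFC 9002's minimum window
    cc->cwnd = cc->ssthresh;
}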

Relying Solely on Unit Tests

Integration tests with simulated loss are crucial. A static unit test may not trigger the bug because it doesn't model the full QUIC stack behavior.

Summary

This guide covered a subtle yet critical bug in porting CUBIC's app-limited exclusion to QUIC. By understanding the interaction between recovery state and application limits, you can apply a simple condition check to prevent permanent cwnd starvation. The key takeaway: always evaluate congestion control changes in the context of the actual transport layer (QUIC vs TCP) and test edge cases like high initial loss.
