Internet-Draft | Automatic Network Congestion Relief | March 2025
Zhao & Zhang | Expires 4 September 2025
This document introduces an automatic congestion relief mechanism based on intelligent traffic analysis and dynamic regulation. In the event of congestion caused by fiber optic failures, it can respond intelligently and self-heal in real time, ensuring the stable operation of the network.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 4 September 2025.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Fiber optic failures occur frequently and lead to network congestion, a common pain point for operators. Handling them requires dedicated staff to inspect traffic daily and to adjust configurations manually, as often as every hour, which significantly increases the difficulty of network maintenance.¶
This draft introduces an automatic congestion relief mechanism based on intelligent traffic analysis and automatic regulation. When a fiber optic failure causes congestion, the mechanism responds and starts self-healing in real time, which relieves operators of the congestion and maintenance burden caused by such failures and keeps the network operating stably.¶
The mechanism relieves congestion within seconds and is automated by an intelligent module within the device. Using intelligent traffic analysis, the module calculates the volume of traffic that needs to be redistributed. It then redirects this traffic to the paired device through inter-device protocol announcements and automatic adjustment of routing priorities.¶
+-+-+-+-+-+-+-+-+-+-+       +-+-+-+-+-+-+-+-+-+-+-+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Traffic Modeling  |------>| Traffic Monitoring  |---->| Intelligent policy generation |
+-+-+-+-+-+-+-+-+-+-+       +-+-+-+-+-+-+-+-+-+-+-+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                                        |
                                                                        |
                            +-+-+-+-+-+-+-+-+-+-+-+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                            | Policy Reversion    |<----| Policy Regulation           |
                            +-+-+-+-+-+-+-+-+-+-+-+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The forwarding chip of the device performs real-time traffic sorting using full-flow data and identifies the top N traffic flows.¶
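As an illustration only, top N identification can be sketched in software as follows. The draft places this function in the forwarding chip and does not define a flow key; keying flows by destination prefix, the sample format, and the name top_n_flows are assumptions made for this sketch.

```python
import heapq
from collections import Counter


def top_n_flows(samples, n):
    """Return the n flows with the most bytes as (flow_key, byte_count) pairs.

    samples is an iterable of (flow_key, byte_count) records. Using the
    destination prefix as the flow key is an assumption of this sketch.
    """
    totals = Counter()
    for flow_key, byte_count in samples:
        totals[flow_key] += byte_count
    # nlargest avoids sorting the whole table when n is small.
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])


# Hypothetical per-flow byte counts taken from full-flow data.
samples = [
    ("203.0.113.0/24", 700),
    ("198.51.100.0/24", 300),
    ("203.0.113.0/24", 500),
    ("192.0.2.0/24", 900),
]
top_two = top_n_flows(samples, 2)
```

With these samples, top_two holds the two heaviest prefixes, ordered from largest to smallest byte count.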
The intelligent component of the AI chip subscribes to the BGP RIB-out (Routing Information Base-outbound) and employs intelligent flow recognition algorithms to perform AI-based traffic modeling. This modeling approach, considering factors such as historical traffic patterns and flow behavior, provides a solid basis for subsequent traffic detection and regulation.¶
The intelligent flow feature statistics cover multiple dimensions, arranged in a logical order from macroscopic to microscopic traffic characteristics, including flow rate, packet length, the proportion of TCP/UDP traffic, the proportion of fragmented packets, and the proportion of SYN packets.¶
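The statistics listed above could be computed per flow roughly as follows. This is a minimal sketch, not the device implementation: the Packet record, the field names, and the choice to express proportions as a share of packets (rather than of bytes) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    length: int       # packet length in bytes
    protocol: str     # "TCP", "UDP", or another protocol name
    fragmented: bool  # True if the packet is an IP fragment
    syn: bool         # True if the TCP SYN flag is set


def flow_features(packets, interval_seconds):
    """Summarize one flow over one collection interval."""
    count = len(packets)
    if count == 0 or interval_seconds <= 0:
        raise ValueError("need at least one packet and a positive interval")
    total_bytes = sum(p.length for p in packets)

    def share(predicate):
        # Proportion of packets that satisfy the predicate.
        return sum(1 for p in packets if predicate(p)) / count

    return {
        "rate_bps": total_bytes * 8 / interval_seconds,
        "mean_packet_length": total_bytes / count,
        "tcp_share": share(lambda p: p.protocol == "TCP"),
        "udp_share": share(lambda p: p.protocol == "UDP"),
        "fragment_share": share(lambda p: p.fragmented),
        "syn_share": share(lambda p: p.syn),
    }


# Hypothetical packets observed for one flow during a one-second interval.
packets = [
    Packet(1500, "TCP", False, False),
    Packet(40, "TCP", False, True),
    Packet(500, "UDP", True, False),
    Packet(460, "UDP", False, False),
]
features = flow_features(packets, interval_seconds=1)
```

The dictionary keys follow the order given in the text, from flow rate down to the proportion of SYN packets.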
Through an extension of the BGP-LS protocol, the device obtains inter-domain link bandwidth and load changes in real time.¶
When link bandwidth utilization exceeds the configured congestion threshold, the threshold crossing is reported promptly.¶
Interface statistics are collected at intervals on the order of seconds.¶
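A threshold check over per-second interface samples might look like the sketch below. The draft does not give a threshold value or say whether a single sample is enough to report congestion; the 80 percent threshold and the requirement of three consecutive samples are assumptions chosen here to avoid reacting to a momentary spike.

```python
def detect_congestion(samples, capacity, threshold_percent=80, consecutive=3):
    """Return the index of the sample at which congestion is reported.

    samples holds one traffic rate per collection interval, in the same
    unit as capacity. Returns None if congestion is never reported.
    threshold_percent and consecutive are hypothetical parameters.
    """
    limit = capacity * threshold_percent / 100
    run = 0
    for index, rate in enumerate(samples):
        if rate > limit:
            run += 1
            if run >= consecutive:
                return index
        else:
            run = 0  # the overload did not persist, start counting again
    return None


# Hypothetical per-second rates in Gbit/s on a 100 Gbit/s link.
samples = [70, 85, 90, 60, 82, 88, 95]
reported_at = detect_congestion(samples, capacity=100)
```

In this example the first overload lasts only two samples and is ignored; the second one persists for three samples and is reported at the last sample.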
The device uses the Utilized Bandwidth TLV to aggregate the bandwidth and bandwidth utilization rate of inter-domain BGP Egress Peer Engineering (EPE) links. The BGP-LS Utilized Bandwidth TLV reuses the Maximum Link Bandwidth TLV (Type 1089) [RFC5305]. The format of the BGP-LS Utilized Bandwidth TLV is as follows.¶
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              Type             |             Length            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Utilized Bandwidth                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
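The layout in the figure can be encoded and decoded as sketched below. The figure fixes a 16-bit Type, a 16-bit Length, and a 32-bit value; interpreting the value as an IEEE 32-bit floating point number in bytes per second is an assumption carried over from the Maximum Link Bandwidth encoding in [RFC5305], and the function names are hypothetical.

```python
import struct

UTILIZED_BANDWIDTH_TLV_TYPE = 1089  # type code stated in the text
VALUE_LENGTH = 4                    # one 32-bit value, per the figure

# Network byte order: 16-bit Type, 16-bit Length, 32-bit IEEE float.
# The float interpretation of the value field is an assumption.
_TLV = struct.Struct("!HHf")


def encode_utilized_bandwidth(bytes_per_second):
    """Build the 8-octet TLV for the given utilized bandwidth."""
    return _TLV.pack(UTILIZED_BANDWIDTH_TLV_TYPE, VALUE_LENGTH, bytes_per_second)


def decode_utilized_bandwidth(data):
    """Parse the TLV and return the utilized bandwidth in bytes per second."""
    if len(data) != _TLV.size:
        raise ValueError("unexpected TLV size")
    tlv_type, length, value = _TLV.unpack(data)
    if tlv_type != UTILIZED_BANDWIDTH_TLV_TYPE or length != VALUE_LENGTH:
        raise ValueError("not a Utilized Bandwidth TLV")
    return value


# 100 Gbit/s expressed in bytes per second.
encoded = encode_utilized_bandwidth(12.5e9)
decoded = decode_utilized_bandwidth(encoded)
```

Because the value is carried as a 32-bit float in this sketch, the decoded number matches the input only to single precision.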
When link utilization exceeds the predefined congestion threshold, the device launches the intelligent module, which identifies the traffic that needs to be adjusted and generates policies based on traffic analysis.¶
One such policy takes traffic characteristics such as flow rate and type into account so that the device can calculate the adjustment more accurately. Because the routing prefixes sent to a given network domain (RIB-out) are the same, it is only necessary to find the top N routing prefixes to be adjusted and then change their attributes to lower the priority of the faulty plane.¶
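One possible way to choose which of the top N prefixes to adjust is a greedy selection, sketched below. The draft does not specify the selection algorithm or which route attribute is changed; the greedy rule, the percentage threshold, and the name select_prefixes_to_shift are assumptions for illustration.

```python
def select_prefixes_to_shift(prefix_rates, load, capacity, threshold_percent):
    """Pick prefixes, heaviest first, until the link falls under the threshold.

    prefix_rates maps each top N prefix to its traffic rate, in the same
    unit as load and capacity. Returns the prefixes whose priority on the
    congested plane would be lowered.
    """
    target = load - capacity * threshold_percent / 100
    if target <= 0:
        return []  # the link is already under the threshold
    chosen = []
    moved = 0
    ranked = sorted(prefix_rates.items(), key=lambda item: item[1], reverse=True)
    for prefix, rate in ranked:
        if moved >= target:
            break
        chosen.append(prefix)
        moved += rate
    return chosen


# Hypothetical rates in Gbit/s for the top four prefixes on a 100 Gbit/s link.
prefix_rates = {
    "192.0.2.0/24": 10,
    "198.51.100.0/24": 8,
    "203.0.113.0/24": 4,
    "2001:db8::/32": 2,
}
to_shift = select_prefixes_to_shift(
    prefix_rates, load=95, capacity=100, threshold_percent=80
)
```

Here 15 Gbit/s must leave the link, so the two heaviest prefixes (18 Gbit/s together) are selected and the remaining prefixes keep their attributes.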
The device announces the routing priority adjustment through the BGP RPD protocol, and the calculated traffic is then automatically shifted to the lightly loaded plane. The end-to-end process completes within seconds and alleviates the congestion on the original plane. It can precisely control neighbors and paths without affecting the existing routing policies in the network.¶
After the interrupted link recovers, the optimization policies will be gradually withdrawn. With the revocation of the policies, the network traffic will progressively return to the load-sharing state before the failure.¶
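Gradual withdrawal could be planned as in the following sketch. The draft states only that policies are withdrawn gradually; withdrawing one policy at a time and stopping before the recovered link would cross the congestion threshold are assumptions, as are all names and values used here.

```python
def plan_reversion(policies, current_load, capacity, threshold_percent):
    """Decide which policies can be withdrawn without re-congesting the link.

    policies is a list of (prefix, rate) pairs in the order in which they
    would be withdrawn. Withdrawing a policy returns its traffic to the
    recovered link. Returns the withdrawn prefixes and the resulting load.
    """
    limit = capacity * threshold_percent / 100
    withdrawn = []
    load = current_load
    for prefix, rate in policies:
        if load + rate > limit:
            break  # keep the remaining policies until more capacity is back
        load += rate
        withdrawn.append(prefix)
    return withdrawn, load


# Hypothetical active policies with rates in Gbit/s.
policies = [("192.0.2.0/24", 10), ("198.51.100.0/24", 8)]
withdrawn, load_after = plan_reversion(
    policies, current_load=60, capacity=100, threshold_percent=80
)
```

If only part of the capacity has recovered, the same function withdraws fewer policies or none, and it can be run again after the next link comes back.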
----------------------------------------
|   --------                --------   |
|   |  C2  |    Network2    |  C1  |   |
|   --------                --------   |
------|--------------------------|------
      |                          |
      |            x             |
  100GB x 8        x         100GB x 8
      |                          |
------|--------------------------|------
|   --------                --------   |
|   | CR2  |    Network1    | CR1  |   |
|   --------                --------   |
----------------------------------------
Establish topology mapping in Network 1 between CR1 and CR2, and between CR1/CR2 and C1/C2 in Network 2, clarifying the connection relationships. Set up BGP-LS peering between CR1 and CR2 to exchange topology and bandwidth information. Enable BGP EPE functionality via EBGP peering between CR1 and C1, and between CR2 and C2, to obtain link state and bandwidth-related information and generate BGP-LS LINK routes. Activate the BGP RPD neighbor function on CR1 and CR2 to receive optimized routing policies.¶
Step 1: AI Traffic Modeling¶
CR1 and CR2 use their built-in algorithms to model the top N traffic on the link automatically and in real time, so that the traffic conditions of the link are known without human intervention. The system observes that the total bandwidth of each of the CR1-C1 and CR2-C2 traffic channels is 8 x 100GB and that the current traffic on both paths is 600GB.¶
Step 2: Traffic Monitoring¶
When CR1 detects that five of the links between CR1 and C1 have failed, leaving only three, the system automatically determines that congestion will occur on the CR1-C1 link.¶
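The arithmetic behind this determination can be worked through as follows. The sketch reads the 600 figure as the traffic carried on the CR1-C1 path and treats traffic above the remaining capacity as congestion; both readings are assumptions, since the draft does not state the threshold used in this example.

```python
LINK_CAPACITY = 100    # capacity of one member link, unit as in the figure
TOTAL_LINKS = 8        # member links between CR1 and C1
FAILED_LINKS = 5       # links reported as failed in this step
OFFERED_TRAFFIC = 600  # traffic on the CR1-C1 path (assumed to be per path)

# Before the failure the path runs at 600 / 800 of its capacity.
utilization_before = OFFERED_TRAFFIC / (TOTAL_LINKS * LINK_CAPACITY)

# Three surviving links leave 300 units of capacity.
remaining_capacity = (TOTAL_LINKS - FAILED_LINKS) * LINK_CAPACITY

# Traffic that no longer fits on the surviving links.
overload = OFFERED_TRAFFIC - remaining_capacity
congested = OFFERED_TRAFFIC > remaining_capacity
```

Under these assumptions the path goes from 75 percent utilization to carrying twice what the three surviving links can hold, so 300 units of traffic have to be moved.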
Step 3: Intelligent Policy Generation¶
As the primary adjustment device, CR1 uses its intelligent module to generate optimization policies automatically from the established top N traffic model and the prefix information collected in real time. Because no human intervention is required, the time from failure discovery to policy formulation is greatly shortened, which allows a timely response to network emergencies.¶
Step 4: Policy Propagation¶
After CR1 generates the optimization policies, it automatically propagates them to C1. Upon receiving the policies, C1 automatically guides the remote routers to adjust their routing paths.¶
The system identifies the two services with the highest priority and automatically diverts them from the CR1-C1 link to the CR2-C2 link (through C2) to alleviate the congestion on the CR1-C1 link. After the policy adjustment, the system observes that the traffic on the CR1-C1 link is reduced to 300G and the traffic on the CR2-C2 link is increased to 800G. Automated policy propagation and traffic diversion thus improve the utilization of network resources and quickly alleviate the link congestion.¶
Step 5: Policy Reversion¶
When the failed links between CR1 and C1 are restored, the system automatically detects the link status change and gradually withdraws the relevant optimization policies. As the policies are withdrawn, the network traffic gradually returns to the load-sharing state that existed before the fault.¶
This automated mechanism lets the network return to normal operation quickly after the fault is cleared, which reduces the cost of human intervention and improves the self-healing ability of the network.¶
TBD.¶