Over the last 40 years, as manufacturing processes have evolved, specialized memory devices have come to market to meet the storage needs of different systems. The availability of so many options means system architects and designers have several decisions to make in order to choose the appropriate memory subsystem for their application. Specifically, for networking applications, architects are faced with the challenges of meeting the ever-increasing demands of the network traffic.
Network traffic is estimated to grow at a compound annual growth rate (CAGR) of 22 percent from 2015 to 2020, primarily due to the proliferation of wireless devices and the increasing use of video on all devices. The performance of routers and switches, the backbone of Internet traffic, is directly proportional to the random access performance of the memory subsystem used (measured as Random Transaction Rate, or RTR), because packet processing is random in nature. This article describes how Quad Data Rate-IV (QDR-IV) SRAMs can be used to address performance bottlenecks in network design. It also covers how to optimize a design to take advantage of the performance provided by QDR-IV SRAMs, using detailed architectures for two applications: a statistics counter and a forwarding table lookup for Ethernet line card rates of 200 to 400 Gbps.
Memory Subsystems in a Switch Line Card
Figure 1 shows a typical 400 Gbps data plane line card, illustrating the functional blocks, chipsets and memory subsystems.
The Media Access Controller (MAC) provides addressing and channel access control mechanisms in a shared media network such as Ethernet, allowing network nodes to communicate with each other. Attached to the MAC is the over-subscription buffer (OS buffer). It enables the system designer to over-subscribe the front end relative to the line card bandwidth (e.g., a 120G front end on a 100G line card). The over-subscription buffer stores the “extra” data for a specific time period. The buffering requirements can run to a few milliseconds of data, which translates to several gigabits of memory. Cost per bit is the main decision criterion here, which is why SDRAM is the optimal choice for this buffer.
The Network Processor (NPU) performs various functions that include parsing the data to identify protocols, verifying the integrity of the packet, and looking up the next hop based on the destination address. In addition, the NPU collects statistics on the packets in the flow for billing and network management purposes. The following are the memory subsystems attached to the NPU:
- Classification Lookup – The characteristics of the incoming packets are examined and a decision is made whether or not to allow the packet. Lookups are performed for the source port, destination port, source address, destination address and the protocol used. Lookups are performed on a per-packet basis on a long string. The preferred memory for this is Ternary Content Addressable Memory (TCAM). TCAMs allow searching the content using binary as well as “don’t care” states, enabling TCAMs to perform broader searches based on pattern matching.
- Forwarding Lookup – The Forwarding Information Base (FIB) stores the potential destination addresses of the next hop in the route. The lookup is an iterative process and therefore involves multiple accesses to memory. Each packet requires between 2 and 8 random accesses to memory, which translates to a high RTR requirement. For a high RTR, QDR-IV SRAM is an ideal fit.
- Statistics & Flow States – Routers maintain statistics on a per-packet and per-flow (stream of related packets) basis, in the form of counters. Each application could have many such counters, which store information on prefixes, flows, and packet classification. Updating the counters therefore requires high-performance memories that can accommodate multiple read-modify-write operations. The same memory in a line card can be shared for statistics and flow states. Because of the high RTR requirements, QDR-IV SRAM is the optimal choice.
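The ternary matching used for classification lookup can be illustrated with a small software model. This is only a sketch of the matching rule (binary bits plus “don’t care” bits); the entry layout, the rule values and the function name `ternary_match` are hypothetical, and a real TCAM compares all entries in parallel in hardware rather than in a loop.

```python
# Hypothetical software model of a ternary (TCAM-style) match.
# Each entry is (value, mask, action); a mask bit of 0 means "don't care".

def ternary_match(key, entries):
    """Return the action of the first entry whose cared-about bits equal the key's."""
    for value, mask, action in entries:
        if (key & mask) == (value & mask):
            return action
    return None  # no entry matched


# Illustrative 8-bit rules, highest priority first (values are made up).
RULES = [
    (0b1010_0000, 0b1111_0000, "deny"),    # matches 1010xxxx
    (0b1000_0000, 0b1000_0000, "permit"),  # matches 1xxxxxxx
]

print(ternary_match(0b1010_1111, RULES))  # deny
print(ternary_match(0b1100_0001, RULES))  # permit
print(ternary_match(0b0000_0001, RULES))  # None
```

Ordering the rules by priority is an assumption of this sketch; it stands in for the priority encoder that selects among multiple matching entries.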
Traffic Manager (TM) – Traffic Managers arbitrate access to a shared media bandwidth on per-packet granularity and manage congestion during periods of bandwidth over-subscription. Quality of Service (QoS) is handled by the TM whereby packets are grouped into several classes based on a hierarchy. The key memory subsystems here are:
- Packet Buffer – The packet buffer stores packets to dispatch to the switch fabric. The density of the buffer is determined by the line card rate and the Round Trip Time (RTT), which can be up to 250 ms. The choice of packet buffer memory is dictated by density and cost rather than performance, so SDRAM is the memory of choice in this case. However, to overcome the inherent limitations of DRAM, customers implement a hierarchy of SRAM and DRAM where fast SRAM is used as a head/tail cache to complement the slower, bulk DRAM. QDR-IV SRAM serves as a highly efficient head/tail cache for such applications.
- Scheduler Database – Scheduling is the process of deciding when to send a packet onto the switch fabric and is determined based on the destination of the packet and the Quality of Service (QoS) or Class of Service (CoS) required. Packets are grouped into several classes, each of which relates to a tiered service offering (revenue segment for service providers). Typically, the scheduling application requires one READ plus one WRITE per packet. While the scheduler is not sensitive to latency cycles, the absolute latency in nanoseconds is very important. Schedulers require a queue/de-queue (READ/WRITE) to take place within one minimum-size packet arrival time. Hence, a memory with large READ/WRITE latencies is unsuitable for this application.
While SDRAM is used for large packet buffers, SRAMs are used as Head-Tail cache. The packets that are arriving in the line card are stored in a tail cache and transferred to the slower bulk DRAM. Similarly, in preparation for when the packets have to depart, blocks of packets are transferred from the DRAM into the SRAM head cache.
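To see why density and cost dominate the choice of bulk buffer memory, the buffer size can be estimated as line rate multiplied by the time the data must be held. The product rule is an assumption here (the text only says density is determined by line rate and RTT), and the 5 ms figure is a hypothetical stand-in for “a few milliseconds” of over-subscription buffering.

```python
def buffer_gigabits(line_rate_gbps, hold_time_ms):
    """Gigabits needed to hold hold_time_ms of traffic arriving at line rate."""
    return line_rate_gbps * hold_time_ms / 1000


print(buffer_gigabits(400, 250))  # 100.0  (packet buffer at the 250 ms upper bound)
print(buffer_gigabits(400, 5))    # 2.0    (hypothetical 5 ms over-subscription buffer)
```

Under these assumptions a 400 Gbps packet buffer reaches tens of gigabits or more, far beyond the 72 and 144 Mb SRAM densities discussed below, which is consistent with using SRAM only for the head/tail cache.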
Both the scheduler database and the head/tail cache are efficiently served by QDR SRAM. Table 1 shows the RTR requirements for various memory subsystems in a router in millions of transactions per second (MT/s), where R is the packet rate in millions of packets per second (Mpps).
QDR-IV SRAM Overview
QDR-IV SRAM has a synchronous interface and can perform two data WRITEs or two data READs or a combination of READ and WRITE per clock cycle. QDR-IV has 2 bi-directional data ports A and B, each of which can be used for READ or WRITE transactions. This gives architects the added flexibility of using QDR-IV SRAMs in applications where the R/W ratio is not necessarily balanced. Each port allows data transfers on both clock edges (DDR operation) and is burst oriented with a burst length of two words (each word is X18 or X36) per clock cycle. The address bus is common and also supports double data rate with the rising and falling edges providing the address for port A and B respectively. QDR-IV supports embedded ECC to virtually eliminate soft errors and improve the reliability of the memory array.
QDR-IV SRAMs are available in two flavors: QDR-IV High Performance (HP) and QDR-IV Xtreme Performance (XP). The HP device can operate at a maximum frequency of 667 MHz while the XP operates at up to 1066 MHz. The following table lists the key architectural features of QDR-IV. Both devices are available in densities of 72 and 144 Mb and support width and depth expansion for additional density or performance.
One key difference between QDR-IV HP and XP is the use of banks. QDR-IV XP increases performance by dividing the memory space into 8 banks, selected by the 3 least significant bits (LSBs) of the address. The access rule is that the two ports must address different banks within the same clock cycle. From one cycle to the next, any bank can be accessed, so system designers can use the full RTR performance of XP devices by planning the system architecture to allocate bank addresses accordingly.
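The banking rule can be expressed as a simple check on the two addresses presented in one clock cycle. This is a minimal sketch of the rule as stated above; the function names are hypothetical and the addresses are arbitrary examples.

```python
BANK_BITS = 3  # 8 banks selected by the 3 address LSBs, as described above


def bank_of(address):
    """Bank index encoded in the least significant address bits."""
    return address & ((1 << BANK_BITS) - 1)


def same_cycle_ok(addr_port_a, addr_port_b):
    """True if ports A and B may both be accessed in one clock cycle."""
    return bank_of(addr_port_a) != bank_of(addr_port_b)


print(same_cycle_ok(0x1000, 0x2001))  # True: banks 0 and 1
print(same_cycle_ok(0x1008, 0x2000))  # False: both addresses fall in bank 0
```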
The examples below showcase two applications, one using QDR-IV HP for a forwarding table lookup and another using QDR-IV XP for a statistics counter application at 400 Gbps line rate.
Forwarding Table Lookup at 400 Gbps
As discussed earlier, forwarding table lookup requires anywhere from 2R to 8R accesses, where R is the packet rate in millions of packets per second (Mpps). For 400 Gbps, R = 600, which requires an RTR of 1200 to 4800 MT/s. Let’s take the example of a 400 Gbps L2 switch with the following forwarding table requirements:
1M x 144 bit entries
1 Destination Address (DA) look up per packet (READ)
1 Source Address (SA) look up per packet (READ)
1 SA registration (WRITE) or CPU access at 60 Mpps
Based on the above:
Density required: 1M x 144 bits => 144 Mb
READ access rate (DA + SA) = 600 + 600 = 1200 Mpps
WRITE update rate = 60 Mpps
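The packet rate R = 600 Mpps used above can be sanity-checked against the line rate. The sketch below assumes that R corresponds to back-to-back minimum-size Ethernet frames (64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap); the article does not state how R was derived, so treat this as one plausible reading.

```python
MIN_FRAME_BYTES = 64   # minimum Ethernet frame
PREAMBLE_BYTES = 8     # preamble plus start-of-frame delimiter
IFG_BYTES = 12         # minimum inter-frame gap


def max_packet_rate_mpps(line_rate_gbps):
    """Worst-case packet rate in Mpps for minimum-size frames at line rate."""
    bits_per_packet = (MIN_FRAME_BYTES + PREAMBLE_BYTES + IFG_BYTES) * 8
    return line_rate_gbps * 1e9 / bits_per_packet / 1e6


r = max_packet_rate_mpps(400)
print(round(r, 1))  # 595.2
```

The result is close to the rounded figure of 600 Mpps used in this example, and 2R to 8R then gives roughly the 1200 to 4800 MT/s range quoted earlier.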
Since the data bus is common for READs and WRITEs, the bus must turn around when a WRITE operation follows a READ operation. Every switch from a READ to a WRITE command therefore takes 4 cycles for a QDR-IV HP SRAM. There is no penalty for switching from WRITE to READ commands because of the latencies: READ has 5 cycles of latency and WRITE has 3. The bandwidth loss due to bus turnaround is minimized by collecting up to 4 updates in a FIFO before writing to the lookup table. A four-deep cache of the latest entries is maintained for the requests sitting in the FIFO, and an SA or DA lookup is served from this cache if there is an address match. The WRITE overhead per WRITE update is:
= (4 WRITE cycles + 4 bus turnaround cycles) / 4 updates = 2 cycles per update
RTR used for WRITE updates = WRITE overhead × WRITE update rate
= 2 × 60 = 120 MT/s
Total RTR required = 600 + 600 + 120 = 1320 MT/s
The memory subsystem for this lookup function is shown below using two 72 Mb, X36 QDR-IV HP devices operating at 667 MHz. The two QDR devices are used in width expansion mode to support 144 bits of data on port [A0, A1] and port [B0, B1] per clock cycle. The scheduler arbitrates between the DA and SA READ requests and table WRITE requests (4 WRITEs at a time) and sends them to either port [A0, A1] or port [B0, B1]. There is sufficient bandwidth to service all the requests. This approach is scalable to support more table entries or more bits per entry by using additional QDR-IV HP devices in depth or width expansion mode respectively.
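The bandwidth budget for this design can be checked with a few lines of arithmetic. The sketch below follows the article's accounting, in which each WRITE costs one cycle plus a share of the 4-cycle turnaround, and assumes the available RTR is two transactions per clock cycle (one per port); the function names are hypothetical.

```python
def write_overhead_cycles(batch_size, turnaround_cycles=4):
    """Cycles consumed per WRITE when one turnaround is amortized over a batch."""
    return (batch_size + turnaround_cycles) / batch_size


def required_rtr(read_mpps, write_mpps, batch_size):
    """Total transaction rate (MT/s) needed for lookups plus batched updates."""
    return read_mpps + write_overhead_cycles(batch_size) * write_mpps


def available_rtr(clock_mhz, ports=2):
    """Assumed peak transaction rate: one transaction per port per clock."""
    return clock_mhz * ports


need = required_rtr(read_mpps=1200, write_mpps=60, batch_size=4)
have = available_rtr(667)
print(need, have, need <= have)  # 1320.0 1334 True
```

Under the same model, writing updates one at a time (`batch_size=1`) would need 1500 MT/s and exceed the 1334 MT/s budget, which suggests why the updates are collected four at a time.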
Statistics Counters at 400 Gbps
As discussed earlier, statistics counters require anywhere from 2R to 16R accesses, where R is the packet rate in millions of packets per second (Mpps). For 400 Gbps, R = 600, which requires an RTR of 1200 to 9600 MT/s. Let’s take the example of a router with the following statistics counter requirements:
1 Million counter pairs for ingress
1 Million counter pairs for egress
72 bits per counter pair
800 Million updates/sec
Density required: 2 M x 72 = 144 Mb
RTR required = 1600 MT/s (2x the rate of packet updates, due to read-modify-write operation)
This requirement is easily met by one X36 QDR-IV XP device running at 800 MHz. Each location in the SRAM stores a pair of 36-bit counters, for a total of 2M counter pairs. A typical counter pair is the “total number of packets received” and the “total byte count”, which can easily be stored at the same address location in the SRAM. Port A is used for READ operations and port B for WRITE operations, eliminating the need for bus turnaround.
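The read-modify-write on a packed counter pair can be sketched in software as follows. The bit layout (packet count in the upper 36 bits, byte count in the lower 36 bits) and the wrap-around on overflow are assumptions made for illustration; the article only states that two 36-bit counters share one 72-bit location.

```python
COUNTER_BITS = 36
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def pack(packets, byte_count):
    """Combine two 36-bit counters into one 72-bit word (assumed layout)."""
    return ((packets & COUNTER_MASK) << COUNTER_BITS) | (byte_count & COUNTER_MASK)


def unpack(word):
    """Split a 72-bit word back into (packets, byte_count)."""
    return (word >> COUNTER_BITS) & COUNTER_MASK, word & COUNTER_MASK


def update(memory, address, packet_len):
    """One statistics update: a READ followed by a WRITE to the same address."""
    packets, byte_count = unpack(memory[address])                  # READ (port A)
    memory[address] = pack(packets + 1, byte_count + packet_len)   # WRITE (port B)


memory = [0] * 16  # tiny stand-in for the 2M-location array
update(memory, 5, 64)
update(memory, 5, 1500)
print(unpack(memory[5]))  # (2, 1564)
```

Each call to `update` costs two memory transactions, which is where the factor of 2 between 800 million updates/sec and 1600 MT/s comes from.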
The addresses accessed for counter updates are random, due to the random nature of packet arrival. The QDR-IV XP is banked with 8 banks and within a cycle there cannot be two accesses to the same bank. As a result, the memory controller will issue a flow control back to the statistics update logic in order to avoid a banking violation. Figure 3 shows an FPGA-based memory controller implementation. Each 800 MHz port in the QDR-IV is split into 4 channels running at 200 MHz within the FPGA.
The memory controller asserts the Busy signal any time there is a banking violation between ports A and B and pushes the violating transaction to the next clock cycle. The system will not meet the update rate requirement if the Busy signal is asserted. This can be resolved by allocating the counters to memory array addresses systematically. QDR-IV XP has 8 banks selected by the 3 LSBs of the memory address. For the current example, the LSB A[0] of the QDR-IV address is mapped to ingress (A[0]=1) and egress (A[0]=0) to form odd and even addresses for scheduling. With the underlying assumption that ingress and egress counters cannot exceed 400 million updates/sec each, Figure 4 shows how the memory subsystem can be implemented for the statistics update function.
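The effect of the odd/even address mapping can be checked with a short simulation. The scheduling policy below is an assumption, not taken from Figure 4: in each cycle port A reads a counter of one direction while port B writes a counter of the other direction, and the directions swap every cycle. At 800 MHz that gives each direction 400 million reads and 400 million writes per second, consistent with the stated limit.

```python
import random


def counter_address(index, ingress):
    """Place ingress counters at odd addresses and egress counters at even ones."""
    return (index << 1) | (1 if ingress else 0)


def bank_of(address):
    return address & 0x7  # 3 LSBs select one of 8 banks


def conflict_free(cycles, seed=0):
    """Simulate random counter indices under the assumed alternating schedule."""
    rng = random.Random(seed)
    for cycle in range(cycles):
        read_ingress = (cycle % 2 == 0)  # direction served by port A this cycle
        a = counter_address(rng.randrange(1 << 20), read_ingress)      # port A READ
        b = counter_address(rng.randrange(1 << 20), not read_ingress)  # port B WRITE
        if bank_of(a) == bank_of(b):
            return False  # banking violation: Busy would be asserted
    return True


print(conflict_free(100_000))  # True
```

Because the two ports always differ in address bit 0, their bank indices differ regardless of which counters are updated, so no random access pattern can trigger the Busy signal under this schedule.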
Arriving packets, and hence the corresponding counters, naturally fall into categories. These counter categories need to be allocated to different banks to avoid banking violations and take full advantage of the performance of QDR-IV XP device. Depending on the overall system architecture, these categories can be:
1. Ingress- or egress-based counter classification
2. If multiple NPUs are feeding into the same stats counters FPGA, each NPU is assigned to different banks
3. Ethernet link-based counter classification and memory bank allocation
4. Virtual lane (VL) or Class of Service (CoS) based counter classification and memory array bank allocation
5. Flow-based counter classification and appropriate bank allocation
6. Any other packet/counter classification scheme used in a custom solution
The high RTR of QDR-IV SRAMs makes them an efficient memory for high performance networking applications. Application examples for forwarding table lookup and statistics counters using QDR-IV HP and XP devices have been described. System architects and designers can use these as starting points for their own designs and customize them based upon specific application requirements. Both QDR-IV HP and XP devices are supported by memory controller IP directly available from Altera and Xilinx, enabling memory subsystem support for 100G to 400G line cards.
Avi Avanindra is Director of Systems Engineering for the Memory Products Division at Cypress Semiconductor Corp., where he helps customers develop memory solutions for embedded systems. A graduate of Iowa State University with a degree in Electrical and Computer Engineering, he previously worked at Cisco Systems designing ASICs for switches and routers.