Weighted Round-Robin Schedulers
for Advanced QoS in High Speed Networks

Computer Architecture and VLSI Systems Division,
Institute of Computer Science (ICS), FORTH
Science and Technology Park of Crete, P.O.Box 1385, Heraklion, Crete, GR 711 10 Greece

In order to provide advanced Quality of Service (QoS) architectures in high speed networks, it is commonly agreed, nowadays, that one needs (i) per-flow queueing, and (ii) weighted round-robin scheduling; the combination of the two is frequently referred to as Weighted Fair Queueing. This Division has been active in research and development work on both of these technologies since 1985. For our per-flow queueing work, see that link; our work on (weighted) round-robin scheduling is described below.

Our work started in 1985 with an investigation of several techniques for performing round-robin scheduling at high speed among a large number of flows. Later, in 1990, we designed a weighted round-robin scheduler for hundreds of flows, using a content-addressable memory (CAM) in CMOS VLSI. More recently, since 1996, we developed techniques for weighted round-robin scheduling among many thousands of flows at a relatively low cost, using standard SRAM. The logic around such standard SRAM is simple enough to fit within part of an FPGA, and fast enough to operate up to gigabit line rates in one case of significant practical importance. These are described below, starting with the most recent work.

1. Per-Class Urgency Counters

The smoothest operation of a weighted round-robin scheduler (operation with the least amount of service time jitter) is achieved when the weight factors of the (possibly many thousands of) flows are picked from a relatively small ``menu'' of allowed factor values --e.g. say 8 or 16 different weight factor values. We call (service) class all those flows that share the same weight factor; for each class, we maintain an urgency counter. Based on the urgency counters, we perform weighted round-robin scheduling among the classes, and then use plain round-robin scheduling inside each class to select the next-to-be-serviced flow.

We simulated the performance of this scheduling algorithm, and found that the average service-time jitter never exceeded 2 % of the service time. The heart of this scheduler (the urgency counter and decision logic) fits in less than one third of an Altera 10K40 FPGA; with a 50 MHz clock, it can service 4 million events per second, which corresponds to an aggregate I/O line throughput of 1.6 Gbps in the case of scheduling ATM cells.

For more details, follow this link.

2. Heap (Priority Queue) Management in Hardware

Several algorithms have been proposed in the literature for weighted round-robin scheduling in the general case of arbitrary weight factors. Their common substrate is that, for each flow or packet or cell, a priority value is maintained --oftentimes corresponding to the next time-to-service value-- and the scheduler has to choose the minimum among these values, corresponding e.g. to the closest next time-to-service. Maintaining a set of many thousands of (priority) values and quickly finding the smallest of them is best done using a Heap (Priority Queue) data structure.

We looked at methods to manage heap data structures at high speed, using specialized hardware. First, we analyzed implementations involving one FPGA and external SRAM, and found that a few tens of clock cycles per heap operation are needed for priority queues containing several thousand elements; our detailed design was verified by simulation in Verilog. Second, we proposed a novel algorithm for top-to-bottom heap insertions, thus enabling one to manage a heap in a pipelined fashion, in order to achieve very high speed; we are now proceeding to a concrete design using that method.

For more details, follow this link.

3. Weighted Round-Robin Scheduler using VLSI CAM (ca. 1990)

M. Katevenis, S. Sidiropoulos, C. Courcoubetis: ``Weighted Round-Robin Cell Multiplexing in a General-Purpose ATM Switch Chip'', IEEE Journal on Selected Areas in Communications, vol. 9, no. 8, October 1991, pp. 1265-1279.

ABSTRACT: We present the architecture of a general-purpose B-ISDN switch chip and in particular its novel feature: the cell (packet) multiplexing algorithm and its implementation in hardware. The chip is intended to be connected, via its bit-parallel links, to other similar switch chips or to link-interface chips, providing maximum flexibility in building small or large networks. Our chip implements buffering, routing, flow control, cell scheduling, and cut-through all in hardware. We present a detailed design of how our chip implements its scheduling function consisting of a weighted round-robin multiplexing scheme, using a counter, frequency weights stored in reverse bit order, and a content-addressable memory. This algorithm allocates the available bandwidth to the virtual circuits that can use it, in proportion to the prescribed weights. Other features are chip-to-chip flow control with a built-in window mechanism, a pool of shared buffers, and a set of dedicated buffers, thus offering powerful buffer management capabilities. The above features are key for our chip to successfully switch traffic with different service requirements such as real-time traffic and non-interactive data traffic; the mechanisms built into the chip can be used by the network manager to offer guaranteed service performance to the real-time traffic, and to fully utilize the spare capacity of the links by serving the lower priority traffic without allowing congestion (fairly allocating buffers and bandwidth). We are currently designing this chip for a full-custom CMOS technology; its crucial parts have been laid out and simulated, thus proving its feasibility.

4. Round-Robin Scheduling using Merge Sort (ca. 1986)

M. Katevenis: ``Fast Switching and Fair Control of Congested Flow in Broad-Band Networks'', IEEE Journal on Selected Areas in Communications, vol. 5, no. 8, October 1987, pp. 1315-1326.

ABSTRACT: A new switching architecture is proposed, based on the tradeoffs of modern VLSI technology -- inexpensive memory and two-dimensional layout structures. Today, it is economically feasible to preallocate buffer space individually to each virtual circuit in every node, so that ``congestion'' ceases to have negative effects. On the contrary, when some low-priority circuits offer more traffic than the network can carry, full utilization of the link bandwidth is achieved. In this context, the allocation of bandwidth can be done automatically and in a ``fair'' way, if packets are multiplexed by circularly scanning all virtual circuits and transmitting one packet from each ``ready'' circuit. This multiplexing algorithm equally distributes all the available link BW to all the VC's that can use it (other than equal distribution is also possible), while it also guarantees an upper bound for the total packet delay through non-congested VC's (VC's that use less than their share of BW). We present methods for hardware implementation of fast such circular scans, and propose a structure for the switching nodes of such networks, consisting of a cross-bar arrangement like a systolic array that performs merge-sorting. It is ideally suited for physical layout on printed-circuit boards or with wafer-scale integration.


[ Original Site in Greece ] [ North European Mirror ] [ North American Mirror ]
[ Up to Packet Switch Architecture R&R at CARV ]
© copyright ICS-FORTH, Crete, Greece.
Last updated: March 1999, by M. Katevenis.