Heap (Priority Queue) Management in Hardware
for Weighted Fair Queueing in High Speed Networks

Manolis Katevenis, Ioannis Mavroidis, and Aggelos Ioannou

Computer Architecture and VLSI Systems Division,
Institute of Computer Science (ICS), FORTH
Science and Technology Park of Crete, P.O.Box 1385, Heraklion, Crete, GR 711 10 Greece

© copyright 1997-99 by FORTH, Crete, Greece


In order to provide advanced Quality of Service (QoS) architectures in high speed networks, Weighted Fair Queueing is needed. We have been working since 1985 on weighted fair queueing, i.e. on: (i) per-flow queueing, and (ii) weighted round-robin scheduling. In weighted round-robin scheduling, we make a distinction between the case where the weight factors are picked from a relatively small ``menu'' of allowed factor values, and the general case of arbitrary weight factors. For the former case, we have proposed and designed a scheduler with per-class urgency counters. For the latter case, our work is as described below.

Several algorithms have been proposed in the literature for weighted round-robin scheduling in the general case of arbitrary weight factors. Their common substrate is that, for each flow or packet or cell, a priority value is maintained --oftentimes corresponding to the next time-to-service value-- and the scheduler has to choose the minimum among these values, corresponding e.g. to the closest next time-to-service. Maintaining a set of many thousands of (priority) values and quickly finding the smallest of them is best done using a Heap (Priority Queue) data structure. We looked at methods to manage heap data structures at high speed, using specialized hardware.

First, we analyzed implementations involving one FPGA and external SRAM, and found that a few tens of clock cycles per heap operation are needed for priority queues containing several thousand elements. For example, a 16 K element heap needs 40 clock cycles per "increment" operation when implemented with one (32-bit wide) SRAM, or 24 clock cycles when two SRAMs are used; "insert" and "delete minimum" operations are less expensive: 16 clock cycles suffice, with either one or two SRAMs. Our detailed design was verified by simulation in Verilog, and is described in the TR-222 referenced below.

Second, we proposed (1997) a novel algorithm for top-to-bottom heap insertions, thus enabling one to manage a heap in a pipelined fashion, in order to achieve very high speed. In the traditional heap management algorithm, deletions (of the root) proceed from top to bottom, while insertions of new elements proceed in the reverse direction. In our modified algorithm, both operations proceed in the top-to-bottom direction; to achieve this, inserted elements are "steered", left or right, at each level of the tree, depending on which of the two subtrees the open slot to be occupied resides in.

If each level of the heap tree is stored in a separate memory, then we can have multiple heap operations proceeding in parallel, each of them operating on a different level of the tree. Having so many off-chip memories would be too expensive, but in an ASIC implementation these memories can be readily integrated on the same chip. We are investigating the use of wide, two-port memories, trying to achieve management rates of just one or two clock cycles per heap management operation. With a 100 MHz clock, this rate corresponds e.g. to one operation per ATM cell in a 20 to 40 Gbit/s aggregate-throughput system.


The FPGA and external SRAM design, and the top-to-bottom heap insertion algorithm are described in:

Heap Management in Hardware

by Ioannis Mavroidis

Technical Report FORTH-ICS/TR-222,
Institute of Computer Science, FORTH, Heraklion, Crete, Greece
Bachelor of Science Thesis, Department of Computer Science, University of Crete
July 1998

ABSTRACT:

Critical time priority queues require hardware instead of software management. The heap data structure is an efficient way to organize a priority queue. In this report we present the various design issues and evaluate the hardware complexity and timing requirements of different possible hardware organizations of a heap using one FPGA and external memory. We also describe the organization and the datapaths we simulated in Verilog. Finally we present some more efficient techniques applicable only to ASIC implementations.

The full Technical Report is available in:

© Copyright 1998 by FORTH.
Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are made for personal use, they are not made or distributed for profit or commercial advantage, the FORTH copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Foundation for Research & Technology -- Hellas (FORTH). To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific written permission and/or a fee.


[ Original Site in Greece ] [ North European Mirror ] [ North American Mirror ]
[ Up to Weighted Round-Robin Scheduling R&D at CARV-ICS-FORTH ]
Last updated: March 1999, by M. Katevenis.