# Bit-Serial, VLSI Architecture for the Implementation of Maximum-Length Number-Theoretic Transforms Using Mixed Basis Representations 

M.G.Parker,M.Benaissa<br>School of Eng., Univ.of Huddersfield, Huddersfield HD1 3DH, UK.


#### Abstract

Fermat and Mersenne NTTs are relatively easy to implement, but unsuitable for many DSP applications, due to small block length over wordlength. This paper presents VLSI design techniques appropriate for a wider range of NTTs, including maximum-length NTTs, and presents a systolic architecture exploiting blocklength factorisation to decompose the architecture into sub-modules, themselves systolic NTTs. The design avoids the explicit implementation of modular addition, accumulation and multiplication by using systolic basis converters which, inherently, perform the aforementioned tasks. The implementation of a maximum-length, 60pt NTT is described as an example. It uses binary to ternary basis conversion to perform all addition and multiplication, and 'ternary basis compression' to perform accumulation.


## 1 Introduction

A general 1-D NTT of a sequence $\mathrm{x}[\mathrm{n}]$ is defined as,

$$
\begin{equation*}
X[k]=\left\langle\sum_{n=0}^{N-1} x[n] \cdot p^{n k}\right\rangle_{M} \quad 0 \leq n, k \leq N-1 \tag{1}
\end{equation*}
$$

where $p$ has order $\mathrm{N} \bmod \mathrm{M},\left\langle N^{-1}\right\rangle_{M}$ exists and is unique, and an N-point NTT can be defined over all the moduli factors of M [1]. By choosing all NTTs with 2 as the kernel, over a modulus, $M$, where $\left\langle 2^{N}\right\rangle_{M}=1$, a wide range of block lengths, for a given dynamic range, become available $[2,3]$. As with FNTs and MNTs, multiplication can be implemented as bit shifts but, in general, the problem of modulus overflow is non-trivial. This paper
shows how NTT block length factorisation enables a regular and highly-pipelined decomposition. Basis converters [4] distribute the addition over a series of simple cells, whilst converting the data basis to a form which eliminates multiplication. Finally, the concepts are illustrated by considering the specific implementation of a 60 pt NTT, mod 61, using binary and ternary representations.

## 2 Factorisation of NTTs

An N-length NTT is factorised as follows,

$$
X[k]=X_{\mu}[\theta] \text { for } k=N_{0} \cdot \theta+\mu
$$

where $0 \leq \theta<N_{1}, 0 \leq k<N$, $0 \leq \mu<N_{0}$ and $N=N_{0} \cdot N_{1}$,
and,

$$
\begin{aligned}
& X_{\mu}[\theta]= \\
& \left\langle\sum _ { n _ { 0 } = 0 } ^ { N _ { 0 } - 1 } \sum _ { n _ { 1 } = 0 } ^ { N _ { 1 } - 1 } g ^ { n _ { 1 } , \theta } \cdot \left[ g^{D_{N_{0}}\left(\left(n_{1}+N_{1}, n_{0}\right) \cdot \mu\right)} .\right.\right. \\
& \left.\left.p^{\left(\left(n_{1}+N_{1} \cdot n_{0}\right), \mu\right\rangle_{N_{0}} \cdot x\left[n_{1}+N_{1} \cdot n_{0}\right]}\right]\right\rangle_{M}
\end{aligned}
$$

where $g=\left\langle p^{N_{0}}\right\rangle_{M}, D_{z}()=()$ DIV $z$, and

$$
n=n_{1}+N_{1} \cdot n_{0} \text { where }
$$

$$
0 \leq n_{0}<N_{0}, 0 \leq n_{1}<N_{1}
$$

We have expressed $X_{\mu}[\theta]$ as a summation of $N_{0}$ seperate $N_{1}$ pt NTTs, where $x[n]$ is pre-processed to become,

$$
g^{D_{N_{0}}\left(\left(n_{1}+N_{1} \cdot n_{0}\right) \cdot \mu\right)} \cdot p^{\left(\left(n_{1}+N_{1} \cdot n_{0}\right) \cdot \mu\right)_{N_{0}}} \cdot x\left[n_{1}+N_{1} \cdot n_{0}\right]
$$

The architectures, developed, perform the preprocessing by pipelining the small powers of $p$ multiplications, ( $p^{0} \ldots p^{N_{0}-1}$ ), followed by the small power of $g$ multiplications, (implemented as a modification to the subsequent $N_{1}$ pt NTT architectures).

Fig 1 shows the realisation of this decomposition. Gate rotators re-route the stream of partial
products, for successive $x[n]$, between each processing stage and are general to any block length. Fig 1 also shows the equivalent, unfactorised, architecture. The design combines $N_{0}, N_{1} \mathrm{pt}$, NTTs, together with a pre-processing stage, to form the complete NTT, and, if $N_{1}$ is, itself, factorisable, the decomposition is repeated for each $N_{1}$ pt NTT.

The $N_{1}$ pt NTT modification, necessary for the extra small powers of $g$ multiplications, manifests itself as a series of cumulative pulses to the $G R N_{1}$ modules. These pulses update the re-routing pattern within the rotators accordingly.

## 3 NTT Multiplications Using Basis Conversion

To simplify designs, we assume $\alpha$, the $\mathrm{i} / \mathrm{p}$ basis parameter (see appendix) $=p$, the kernel of the transform. (Binary input data $\Rightarrow \alpha=p=2$ ). Thus small power of p multiplications become a few additions, $\bmod M$. For instance, if $\left\langle p^{N_{h}}\right\rangle_{M}=h=$ $p^{0}+p^{1}$, multiplications by $p^{0} \ldots p^{N_{h}-1}$ become single additions of bit re-orderings of $x[n]$. As shown in the previous section, we require multiplications by $p^{0}$.. $p^{N_{0}-1}$. These multiplications become single additions if $N_{h} \geq N_{0}$. Ideally $N_{h}=N_{0}$. These additions can be incorporated into the basis conversion process.

The converter changes the data from a $p$-basis to a $g$-basis to simplify multiplication by small powers of $g$. $g$ is the kernel of the $N_{1}$ pt NTTs and, as $\left\langle g^{N_{1}}\right\rangle_{M}=1$, power of $g$ multiplications become bit re-orderings.

Each converter cell contains a number of digits, ('bits' for binary digits), which are related by a state transition table and identical cells are 'joined' together to form the complete systolic converter.

If $N_{1}$ is, itself, factored, the same process is repeated and $g$-basis to 'power of $g$ '-basis converters will be required, (and so on).

## 4 Compression Replaces Accumulation

Each accumulator sums $N$ data words over a gbasis to calculate $X[k]$. By interpreting these $N$ serial data words as one long data word with an identical basis flow, (apart from $j_{N}^{\prime}=N . j^{\prime}$, where $j_{N}^{\prime}$ refers to the long word) (see appendix), this single data word $=X[k-z]$, (where $g^{j^{\prime}}=p^{z} \Rightarrow z=$ $\left.N_{0} \cdot j^{\prime}\right)$. As the mapping from $X[k]$ to $X[k-z]$ is one-to-one, accumulation is not strictly necessary, and for many applications, (such as convolution), accumulators can be omitted from the design, giving substantial hardware savings.

If it is desired to reduce the value of $j_{N}^{\prime}$ to a minimum, basis compressors, similar in design to the basis converter, but with minimal feedback inserted between cells, are used.

## 5 Optimal NTTs

The design principles, developed in previous sections, are applicable to any NTT. One class of NTTs for which they are particularly suited is maximum-length NTTs. These occur when $M$ is prime and $N=M-1$ (eqn 1). Table 1 list a few such NTTs for $p=2$. When $N$ is factorisable, a suitable $N_{h} \geq N_{0}$ should be found so that $h$ is a simple additive combination of small powers of 2, where $h=\left\langle 2^{N_{h}}\right\rangle_{M}$. As explained in the last section, these simple additions can then be incorporated into the basis conversion where the basis converters are required to convert from a binary basis to a $g$-basis. Suitable $N_{h}, N_{0}, h$ and $g$ are given in Table 1 for each example, along with the expression of $h$ as a simple modular addition.

| M | N | $N_{h}$ | $N_{0}$ | $h$ | $h_{\text {Add }}$ | g |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 11 | 10 | 8 | 5 | 3 | $2^{0}+2^{1}$ | 32 |
| 13 | 12 | 9 | 6 | 5 | $2^{0}+2^{2}$ | 64 |
| 29 | 28 | 5 | 4 | 3 | $2^{0}+2^{1}$ | 16 |
| 61 | 60 | 6 | 6 | 3 | $2^{0}+2^{1}$ | 3 |
| 4093 | 4092 | 12 | 12 | 3 | $2^{0}+2^{1}$ | 3 |

Table 1: Possible First-Stage Factorisation of Selected Prime Moduli

## 6 High Thr'put VLSI Implementation of a 60pt NTT, Mod 61

This example of a maximum-length NTT implements eqn 1 when $p=2, M=61$ and $N=60$, and is shown in Fig 2. The input basis flow is given by,

$$
-2^{5}, 2^{4}, 2^{3}, 2^{2}, 2^{1}, 2^{0}
$$

where $j=j^{\prime}=6$ and $d=1$, and, after binary to ternary conversion, by,

$$
\leftarrow \begin{array}{lllll}
3^{4}, & 3^{3}, & 3^{2}, & 3^{1}, & 3^{0} \\
3^{9}, & 3^{8}, & 3^{7}, & 3^{6}, & 3^{5}
\end{array}
$$

where $\left\langle 3^{10}\right\rangle_{61}=1, j=5, j^{\prime}=6$ and $d=2$.
Both bases span the data range $0-60$ and are therefore acceptable.

We factorise $N=60$ as 2.3.5.2 $=N_{0} \cdot N_{1} \cdot N_{2} \cdot N_{3}$. As $\left\langle 2^{6}\right\rangle_{61}=3=2^{0}+2^{1}$, multiplications by
$2^{0}, 2^{1}, 2^{2}, 2^{3}, 2^{4}, 2^{5}$ are implemented as single additions of bit re-orderings of $x[n]$. These reorderings can be factorised into $2^{0}, 2^{1}$, followed by $4^{0}, 4^{1}, 4^{2}$ multiplications. The additions are incorporated into the basis conversion. The 10 pt NTTs , ( $N_{2} . N_{3}$ ), use a kernel, $g,=\left\langle 2^{6}\right\rangle_{61}=3$. The basis flow, generated by the converter, contains all possible powers of $3, \bmod 61$. Thus, multiplications by $3^{0}, 3^{1}, 3^{2}, 3^{3}, 3^{4}$ become bit re-orderings which are performed on output from the converters. Finally, multiplications by $-1^{0},-1^{1}$ are achieved by re-routing the two basis flow lines of the ternary basis, as $\left\langle 3^{5}\right\rangle_{61}=-1$,

By interpreting the convertor output as one, long, word, the output becomes $X[k-z]$ instead of $X[k]$, where $\left\langle 3^{6}\right\rangle_{61}=\left\langle 2^{z}\right\rangle_{61} \Rightarrow z=36$.

Should basis compression be desired, a basis converter architecture with feedback can be used.

The complete NTT design has been successfully simulated down to the logic level and is currently being built.

## 7 Conclusion

In this paper, a systolic architecture for a wide range of NTTs, especially maximum-length NTTs, has been developed. This architecture is based on the block-length decomposition of the NTT and the use of mixed-basis arithmetic. This allows efficient and more practical NTT-based systems to be implemented in VLSI.

## 8 Appendix - 'Basis' and 'Basis Flow'

'Basis' refers to the weightings given to a group of digits which, when evaluated and summed, represent a number. An $\alpha$-basis means all weightings are successive powers of $\alpha$. In this paper, each digit is a binary digit, ('bit'), having two values, 0,1 , although the theory can be extended to any digit range. Ternary, in this context, implies $\alpha=3$, whilst each digit is a 'bit'.
'Basis Flow' refers to the flow characteristics of the basis data and the weighting pattern is shown below for one word, as a $j^{\prime} * d$ vector.

where the vector is assumed to be travelling right to left. $j^{\prime}$ is the word period and $j^{\prime}-j$ is the inter-word delay, both in clock cycles and $d$ is the
data path width. With no delay between successive words, $j^{\prime}=j$ and when $d=1$, the data moves in a digit-serial fashion.

## References

[1] J.H.McClellan,C.M.Rader, Number Theory in Digital Signal Processing, Prentice Hall, '79
[2] O.R.Hinton, "Implementation and Application of Generalised Number-Theoretic Transforms," IEE Colloqium on Signal Processing Applications of Finite Field Mathematics, June '89
[3] H.Lu,S.C.Lee, "A New Approach to Solve the Sequence-Length Constraint Problem in Circular Convolution Using Number-Theoretic Transform," IEEE Trans on Signal Processing, Vol 39, No 6, pp 1314-1321, '91
[4] B.Parhami, "Systolic Number Radix Converters", The Computer Journal, Vol 35, No 4, '92


