# Throughput Optimized Implementations of QUAD

Jason R. Hamlet  $\,\cdot\,$  Robert W. Brocato

Abstract We present several software and hardware implementations of QUAD, a recently introduced stream cipher designed to be provably secure and practical to implement. The software implementations target both a personal computer and an ARM microprocessor. The hardware implementations target field programmable gate arrays. The purpose of our work was to first find the baseline performance of QUAD implementations, then to optimize our implementations for throughput. Our software implementations perform comparably to prior work. Our hardware implementations are the first known implementations to use random coefficients, in agreement with QUAD's security argument, and achieve much higher throughput than prior implementations.

**Keywords** QUAD  $\cdot$  stream cipher  $\cdot$  throughput optimization  $\cdot$  hardware acceleration

# 1 Introduction

The QUAD algorithm is a stream cipher proposed by Berbain, Gilbert, and Patarin and is intended to be provably secure and practical to implement [1]. QUADs security is derived from the difficulty of solving the multivariate quadratic (MQ) problem. That is, the security of the QUAD cipher is provably reducible to the NP-hard problem of finding a solution to a multivariate quadratic system of m quadratic equations in n variables over a finite field, GF(q). Each equation in the system of kn mul-

J. R. Hamlet

Sandia National Laboratories, Albuquerque, NM 87123 USA Tel.: +505-845-0903 E-mail: jrhamle@sandia.gov

R. W. Brocato

Sandia National Laboratories, Albuquerque, NM 87123 USA E-mail: rwbroca@sandia.gov

tivariate quadratic equations used in QUAD is written as

$$Q(x) = \sum_{1 \le i \le j \le n} \alpha_{i,j} x_i x_j + \sum_{1 \le i \le n} \beta_i x_i + \gamma$$
(1)

In QUAD, a system of m = kn equations,  $S(x) = (Q_1(x), \ldots, Q_{kn}(x))$  are iterated. On each iteration, n bits are used to update the internal state, and the remaining m - n are output as keystream values. To do this, we let  $S_{out}(x) = (Q_{n+1}(x), \ldots, Q_{kn}(x))$  and  $S_{it}(x) = (Q_1(x), \ldots, Q_n(x))$ . The n polynomials in  $S_{it}$  are used to update the internal state, while  $S_{out}$  produces the keystream. As such, one round of QUAD entails calculating  $S(x) = (S_{it}(x), S_{out}(x))$  with current state x, outputting the n bits generated with  $S_{out}(x)$ , and then updating x with the n bits generated by  $S_{it}(x)$ .

Inspection of Equation 1 reveals that there are no conditional branches required in implementing QUAD. Consequently, QUAD permits constant time implementations, and so side-channel timing attacks [17] on QUAD are unlikely. Unfortunately, the key initialization procedure described in [1] does include conditional branching and so may be susceptible to such attacks. However, this initialization procedure is non-standard and was removed from QUAD in [4]. Depending on the implementation, there is a possibility of hardware leaking the key or initialization vector (IV) bits during key and IV setup, and there are likely simple power analysis (SPA) or differential power analysis (DPA) attacks [18]. QUAD's resistance to timing attacks is beneficial, and its potential susceptibility to hardware leakage or power analysis attacks is consistent with other ciphers [14].

To satisfy the security proof in [1] the coefficients defining the polynomials  $S_x$  must be randomly generated, but they are not secret values. Prior implementations replace these random coefficients with values output from a pseudo random number generator (PRNG) [5,6]. While this leads to more compact implementations, QUAD's security proof has not been extended to the pseudo random case. For the first time, we present hardware results for

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

QUAD implementations using random coefficients, in accordance with QUAD's security argument.

In this paper, we report on our efforts to measure computational performance of the QUAD algorithm on a personal computer (PC), an ARM Cortex A8 embedded microprocessor, and in Altera Cyclone V and Xilinx Virtex-4 field programmable gate arrays (FPGAs). We implemented the QUAD algorithm with a number of different variations on each platform in an effort to optimize performance on each. The standard measure of performance that we seek to optimize throughout these tests is the rate of keystream production, measured in bits/second.

In this work we consider only n = 128, k = 2, and coefficients in GF(2). The solution to Equation (1) is the m = 256 bit value  $S(x) = Q_1(x), \ldots, Q_{256}(x)$ . Values  $S_{out}(x) = Q_{129}(x), \ldots, Q_{256}(x)$  are output as the keystream, and values  $S_{it}(x) = Q_1(x), \ldots, Q_{128}(x)$  are used to update the internal state. The  $\alpha$ ,  $\beta$ , and  $\gamma$  coefficients are random but public values. There are  $256\binom{128}{2} =$ 2,080,768 bits of  $\alpha$ , which we term nonlinear coefficients,  $256 \times 128 = 32,768$  bits of linear coefficients  $\beta$ , and 256 bits of  $\gamma$ , for a total of 2,113,792 bits. QUAD's security argument requires these coefficients to be chosen randomly, and its designers state that bad choices of coefficients are unlikely, though this has not been proven [1]. The coefficients used in our implementations were generated using the random number generator in OpenSSL [3].

## 2 Software Implementations

Using the C programming language, we implemented four different software versions of QUAD. Each version used the same random coefficients. For each version we targeted both a PC and an ARM microprocessor and measured the resulting throughput, which varies significantly between implementations. In this section we describe each of the implementations and our results.

#### 2.1 Software Implementation: Overview

The initialization used in all of our software and hardware implementations of the QUAD algorithm differs from that presented in [1], which describes an initialization procedure that uses two different multivariate quadratic systems, S0 and S1, of n equations in n unknowns. The internal state x, which has been set to an initial value K, is used to select either the output of  $S_0(x)$  or of  $S_1(x)$ , depending on the sequentially selected value of the internal state. However, this approach is non-standard and has been removed from cryptographic standards that include QUAD [4]. For our software implementations we simply make a call to the Unix function /dev/random to provide an entropy source to seed the internal state of the algorithm. In practice, one might concatenate the output of an entropy source with a personalization string and then apply a cryptographic hash function to the result to seed the internal state of the algorithm. For the purpose of this work initialization approaches are inconsequential, since we have removed the effects of that delay from our throughput time measurements. Consequently, the initialization can be viewed as an initial delay that is identical between implementations and whose impact decreases as the length of the generated keystream increases.

Each of our software implementations uses a 128-bit internal state with a 128-bit keystream output for each update cycle. We tested these programs on both a PC and an ARM microprocessor. The PC used for testing has a 3.0GHz Intel Core 2 Duo processor with 6 Mbytes of cache memory and 8 Gbytes of random access memory (RAM) running Red Hat Enterprise Linux version 5. We compiled our code with the GNU compiler version 4.1.2. We also ran each program on a Cortex A8 ARM core that is part of a DaVinci DM3730 microprocessor. The DM3730 microprocessor has an additional C64x digital signal processor (DSP) core. We attempted to compile our programs to run on the DSP core, but we had insufficient development time to replace the key C language functions used to measure algorithm execution times. Consequently, our reported speeds are limited to the PC and the ARM microprocessor.

# 2.2 Software Implementation: Results

# 2.2.1 QUAD1

Our first software implementation, QUAD1, computes the internal state value and keystream output by means of the most computationally simplistic approach. It was used to derive test vectors for the other software implementations. In this version, separate computations are performed to update the 128-bit internal state register and the 128bit keystream output register. That is, the internal state and keystream are treated as two separate registers in QUAD1. Computations for the nonlinear, linear, and constant terms are performed separately. No effort was made to streamline computations in QUAD1, and the bit-wise computations required for the QUAD algorithm are not well suited to the register-based computations of a PC running a C program. Due to these factors, this first software version of QUAD achieves an average speed of only 4.7 kbits/sec on the PC and 970 bits/sec on the ARM microprocessor.

# 2.2.2 QUAD2

Our second version, QUAD2, was created in an effort to speed up the implementation of the algorithm by performing block matrix-vector multiplication. Most of the computations required in the algorithm take place in the quadratic  $(\alpha_{ij}x_ix_j)$  terms. To speed up these computations, the arithmetic in QUAD2 is performed on words sized to fit the register size in the microprocessor, which is 32 bits for the ARM and 64 bits for the Intel Duo. This version was also designed to be easily implemented on a smaller microprocessor with 8-bit or 16-bit registers. A set of appropriately sized Hadamard transform functions were created that implement a parity popcount to count the number of ones in the input register. The arithmetic is then performed using these and a bitwise logical AND function. QUAD2 achieved an average speed of 34.6 kbits/sec on the PC and 5.1 kbits/sec on the ARM processor.

# 2.2.3 QUAD3

A third software version, QUAD3, was created as a specialization of QUAD2. This version targets the 32-bit architecture found on the ARM processor. In this version, all of the computations are routed through a parity function that can be implemented in a DSP hardware instruction for a 32-bit parity popcount. This version ran at an average speed of 111 kbits/sec on the PC and 16.2 kbits/sec on the ARM processor.

## 2.2.4 QUAD4

Our fourth software version of QUAD, named QUAD4, was created by moving as much computation as possible outside of the loops that are used to compute each coefficient. It makes use of a custom logical XOR inline function to XOR all bits in a register. The basic version of QUAD4 generated an average of 1.25 Mbits/sec of keystream output on the PC and 163 kbits/sec on the ARM. Versions of the QUAD4 implementation, named QUAD4\_pc\_32bit and QUAD4\_pc\_64bit, were created with relatively minor changes to C function implementations. These provided little speed improvement over the basic QUAD4 program.

Versions of QUAD4 named QUAD4\_dsp, QUAD4\_dsp2, and QUAD4\_dsp3 were created to progressively remove more high level C-language functions and replace them with low-level implementations. This was done to enable the QUAD algorithm to run on the C64x digital signal processor core that is present in the DM3730 microprocessor, along with the ARM microprocessor. It was possible to get the program to execute on the C64x core; however, the C-language function *time*, used to benchmark program performance, does not function in the C64x. No suitable replacement for this function was able to be created in the available development time. As a result, the DSP optimized versions of QUAD were only tested on the PC and on the ARM processor.

In QUAD4\_dsp the coefficient table was moved directly into program memory from the external file that

|                    | PC     | PC            | ARM $\mu P$ | ARM $\mu P$   |
|--------------------|--------|---------------|-------------|---------------|
|                    | (Mb/s) | (cycles/byte) | (Mb/s)      | (cycles/byte) |
| Quad1              | 0.0047 | 5,106,383     | 0.00097     | 8,247,423     |
| Quad2              | 0.0364 | 693,642       | 0.0051      | 1,568,627     |
| Quad3              | 0.111  | 216,216       | 0.0162      | 493,827       |
| Quad4              | 1.21   | 19,835        | 0.163       | 49,080        |
| Quad4_pc_32bit     | 0.845  | 28,402        | 0.132       | 60,606        |
| Quad4_pc_64bit     | 1.25   | 19,200        | N/A         | N/A           |
| Quad4_dsp          | 3.302  | 7,268         | 0.547       | 14,625        |
| Quad4_dsp2         | 3.298  | 7,277         | 0.582       | 13,746        |
| Pentium IV [1]     | 4.6    | 4,347         |             |               |
| Opteron 64-bit [1] | 7.7    | 2,176         |             |               |

was used in previous versions. Also, the entropy source used for initialization was hard-coded into program memory. These two actions approximately tripled the throughput to 3.30 Mbits/sec on the PC and 547 kbits/sec on the ARM. QUAD4\_dsp2 incorporates minor changes to output data handling. It achieves only minor performance improvements over QUAD4\_dsp. In QUAD4\_dsp3, the bit test used to decide the condition of the state bit was performed using a look-up table. This decreased throughput to 2.19 Mbits/sec in the PC and 451 kbits/sec in the ARM processor.

#### 2.2.5 Results

Throughput results for our software implementations are summarized in Table 1. Our fastest performing version uses a custom, inline, register-based XOR function with coefficients stored in program memory and some high level C language functions replaced by low level operations. Implementation differences between the best version for the PC and the best version for the ARM microprocessor are minimal.

Our fastest speeds on the PC are slower than those reported in [1] and [19]. The optimizations used in the previous work include generating only the nonzero  $x_i x_j$ terms. Since the  $x_i$  are random there is a  $\frac{3}{4}$  probability that any such term will be 0, allowing it to be excluded from computation. Precomputing the indices of non-zero variables and pairs of non-zero  $x_i x_j$  terms results in a reduction in the number of non-linear terms from  $\binom{n}{2} = \frac{n(n-1)}{2}$  to  $\frac{n(n+1)}{8}$ , which reduces the number of computations and the accumulation overhead substantially [19]. The other optimization used in previous work takes advantage of a function that is equivalent to S(x) but that is more computationally efficient to evaluate when the Hamming distance of x is greater than  $\frac{n}{2}$  [19]. Our optimizations are based on using DSP hardware instructions, custom XOR inline functions, and in replacing high level C-language functions with low-level implementations. These optimizations produce a 700x increase in throughput when compared to our baseline designs, suggesting that the results of [1] could be enhanced by incorporating our optimizations.

# **3** Hardware Implementations

#### 3.1 Hardware Implementations: Overview

There has been some previous work on area efficient hardware implementations of QUAD [5,6], but that work abandons the formal security proof of [1] by generating the function S pseudo-randomly. Additionally, while area efficient, the designs in [5,6] provide maximum throughput of 4.1Mbps, which is not significantly faster than our software implementations. In this work, we present hardware FPGA implementations of QUAD that retain random functions S and that achieve much higher throughput.

We describe two FPGA implementations of QUAD. To achieve high-throughput, area-efficient implementations, both of these hardware designs are tailored specifically for the Cyclone V FPGA architecture [5]. The target device primarily impacts memory layout and the specifics of look-up table (LUT) based combinatorial logic. With appropriate modifications, similar results can be achieved with other FPGA architectures. To demonstrate this, we provide specific suggestions and implementation results for the Xilinx Virtex-4 architecture. Though this approach results in less portable hardware development language (HDL) code, attention to FPGA architecture allows faster, smaller designs.

As with the software implementations, the hardware implementations of QUAD that we describe do not include the key initialization procedure described in [5]. Instead, the state is initialized to a constant. Our hardware designs are over GF(2) with key length n = 128 and k = 2, resulting in a set of m = kn = 256 equations. Other key lengths in GF(2) would have similar designs. Implementations over larger finite fields are possible, but are much less secure and so are of less practical interest [2]. We do not consider them here.

The Q(x) in equation 1 have a regular structure that allows many different hardware implementations. For instance, the Q(x) could be solved serially. If a serial design requires  $\zeta$  cycles to compute one of the Q(x) then such an approach would need  $\zeta \times n \times k$  cycles to generate the *nk*-bit result. If  $\eta$  of the Q(x) are computed in parallel then  $\frac{\zeta}{\eta} \times n \times k$  cycles would be needed, but the hardware cost would also increase roughly by a factor of  $\eta$ .

In our designs we divide the Q(x) terms into a nonlinear portion  $\sum_{1 \leq i \leq j \leq n} \alpha_{i,j} x_i x_j$  and a linear portion  $\sum_{1 \leq i \leq n} \beta_i x_i$  that are computed separately and then combined with  $\gamma$ . Our first design splits computation of the Q(x) equations into two stages. In each stage, 128 of the Q(x) terms are computed in parallel. The same combinatorial logic and memory resources are used in the two stages. As such, the linear and nonlinear processing units described in the following sections each appear 128 times in the first design. The second design computes all 256



Fig. 1 At the top level, our design has separate linear and nonlinear processing units that operate on the function as the current state. The design is readily parallelized by replicating these processing units.

of the Q(x) in parallel and requires 256 instantiations of the processing units. Since logic reuse is not possible in this design, it is larger, but it has a greater throughput. The basic structure of these designs is shown in Figure 1, which depicts the data path for computing one of the Q(x) terms. Duplication of the linear and nonlinear processing units is required to compute the Q(x) terms in parallel.

# 3.2 Nonlinear Computation: Combinations of State

Computation of the nonlinear portion requires generation of the  $\binom{n}{2} = \binom{128}{2}$  combinations of the current state, x, multiplying each of these combinations by the appropriate  $\alpha_{i,j}$ , and finally performing a summation of these  $\binom{128}{2}$  values to a single bit. First, we consider generating the combinations of the current state, that is, generating  $x_i x_j, 1 \leq i \leq j \leq n$ . For state size n, a straightforward approach is to generate n bits per cycle. This can be accomplished with two registers. Initially, one register holds the current state, x, and the second register holds x rotated by one bit. The contents of the two registers are pair-wise ANDed to produce the combinations  $x_i x_j$ . Then the second register is rotated by a single bit and the operation is repeated. For n = 128, it will take 64 cycles to generate the  $\binom{128}{2}$  combinations. On the last cycle, the upper 64 bits of the result will hold the same combinations as the lower 64 bits. This can be corrected by masking the duplicate combinations with zeros in the  $\alpha_{i,j}$  coefficients. By making the registers smaller than n bits wide, fewer combinations can be generated each cycle, although this requires the two registers to be periodically updated with new portions of the current state. More than n bits can be generated in each cycle by using more registers. For instance, we could use three registers, two of them initially

containing x and the third containing x rotated by n/2 bits. We could then generate 2n combinations in each cycle. In this approach the second and third registers would each be rotated by one bit each cycle. Straightforward extensions to this scheme would allow more combinations to be generated each cycle. Our FPGA designs both have 128-bit data paths. They use two 128-bit registers, and so require 64 cycles to generate all of the combinations.

## 3.3 Nonlinear Coefficients

Each of the Q(x) equations requires  $\binom{128}{2} = 8128\alpha_{i,j}$  values. It is convenient to store these either in a ROM with output word width equal to the number of combinations of  $x_i x_j$ , generated each cycle, or in a collection of ROMs whose combined output word width equals this number of combinations. A more detailed discussion of ROM geometries appears in Section 3.6. Just as generation of the  $x_i x_j$  combinations will generally produce some redundant combinations, the chosen ROM geometries are likely to store more than 8128 bits. The unused bits should be set to zeros so that they will mask the redundant  $x_i x_j$ . Additionally, the  $\alpha_{i,j}$  should be permuted prior to storage to compensate for the ordering of the generated sequence of  $x_i x_j$  combinations.

# 3.4 Combining the Nonlinear Coefficients and Combinations of State

The  $\binom{128}{2} = 8128$  combinations of state and nonlinear coefficients have to be bitwise multiplied and then summed to produce a single bit. In the straightforward approach, during each clock cycle the combinations of state generated by the registers are ANDed with the nonlinear coefficients. The results are then input to an XOR tree. This XOR tree will generally be several layers deep, so it can be pipelined to improve throughput. While simple, this approach does not consider the FPGA architecture or resources. More efficient designs are possible by tailoring the bitwise AND multiplication and wide XOR summation to the available FPGA resources.

Modern FPGAs consist of an array of reconfigurable units, each containing look-up tables (LUTs), memory elements, routing, and other resources such as multiplexors or adders. The particulars of these units vary by manufacturer and device family. The LUTs are used to implement user-defined logic functions. Area efficient designs should attempt to make full use of the LUTs. For example, if a device provides 4-1 LUTs then 1, 2, 3, and 4-input, 1output functions all require one LUT. Partitioning the design into functions of fewer than 4 inputs makes inefficient use of the LUTs and wastes resources.

The Cyclone V devices targeted in this work contain 8 input fracturable LUTs [6]. Each of these LUTs can be



Fig. 2 Detailed depiction of our pipelined, LUT-based nonlinear (top) and linear (bottom) processing units.

configured to implement one 6-input function, any two 4input functions, any combination of one 3-input function and one 5-output function, or various more complicated configurations with functions having shared inputs. To make efficient use of the FPGA resources, we organize our design to make full use of these LUTs. An overview of our nonlinear processing unit is shown in Figure 2a, which we will refer to for the remainder of this discussion. The AND-XOR LUT blocks each consist of a 6-1 LUT and each implement three of the multiplications and two of the summations required for the nonlinear processing. That is, each AND-XOR block in the nonlinear portion computes  $f(x_i, \alpha_i) = \sum_{0 \le i \le 2} x_i \alpha_i$  where each of the  $x_i$ is the previously computed pairwise AND of two bits of state and the  $\alpha_i$  are the corresponding nonlinear coefficients. Since our designs have a 128-bit data path there are 42 of these 6-input AND-XOR functions and one 4input AND-XOR in each non-linear processing unit. The result is a 43-bit value that must be XORed down to 1 bit. For this, we have a pipelined architecture that again makes full use of the available LUT resources. The first stage of the XOR tree consists of seven 6-input LUTs. each of which computes a 6-bit wide XOR. Finally, the seven output bits from the first stage of the XOR tree, the  $43^{rd}$  input to the XOR tree, and the previous output from the XOR tree are combined by a 9-input XOR. The feedback path is required because each of the Q(x)terms requires operations on  $\binom{128}{2}$  combinations of state and nonlinear coefficients, and we have a 128-bit data path. Consequently, it takes 64 rounds of this processing flow for the nonlinear processing unit to generate a single output bit.

Now, we briefly consider implementation of the nonlinear processing in Virtex-4 devices. Each Virtex-4 slice contains two 4-input LUTs [6]. To accommodate this structure, the AND-XOR LUTs in Figure 2a were redesigned to implement  $f(x_i, \alpha_i) = \sum_{i=0,1} x_i \alpha_i$  where each of the  $x_i$  is a previously computed pairwise AND of two bits of state and the  $\alpha_i$  are the corresponding nonlinear coeffi-

| Table 2 | Memory | resources | available | $_{in}$ | Cyclone | V | devices |
|---------|--------|-----------|-----------|---------|---------|---|---------|
|---------|--------|-----------|-----------|---------|---------|---|---------|

|                                                                 | M10K                                                                    | MLAB                                  |
|-----------------------------------------------------------------|-------------------------------------------------------------------------|---------------------------------------|
| $(\stackrel{Configuration}{(\text{depth} \times \text{width})}$ | $256x32, 256x40, 512x16 \\512x20, 1kx8, 1kx10 \\2kx4, 2kx5, 4kx2, 8kx1$ | 32x16, 32x18, 32x20                   |
| $f_{max}$ (MHz)                                                 | 315                                                                     | 450                                   |
| Memory modes                                                    | single port,<br>true dual port, ROM                                     | single port,<br>simple dual port, ROM |

cients. For our 128-bit data path, 64 AND-XOR LUTs would be required. The first stage of the XOR tree would consist of 16 4-input XOR LUTs. The second stage of the XOR tree contains four 4-inputs XOR LUTs, and a final third stage combines the resulting four bits with the feedback value.

## 3.5 Linear Computation

The linear portion of QUAD is more easily computed than the nonlinear portion. Since we require 64 cycles to calculate the non-linear portion and there are 128 terms in the summation of the linear portion, we operate on two bits of state and two linear coefficients each clock cycle. In Figure 2a the AND-XOR LUT computes  $g(x_i, \beta_i) = \sum_{i=0,1} x_i \beta_i$ where the  $x_i$  are taps off the shifted version of the current state. Our design uses the leftmost  $(127^{th})$  and  $63^{rd}$  bits of this register, but any two bits separated by 64 positions would be acceptable. As with the linear coefficients, the  $\beta_i$  coefficients should be permuted prior to storage to compensate for the utilized sequence of  $\beta_i$ . In our designs the ROMS that store the  $\beta$  coefficients are designed to make efficient use of the available memory resources. The details are provided in Section 3.6.

In Virtex-4 devices the AND-XOR LUT in Figure 2b can be implemented in a single 4-input LUT. This makes more efficient use of the device resources than the Cyclone V design, which uses only four of the six available LUT inputs.

#### 3.6 Memory layout

For QUAD implementations with state size n = 128 and k = 2 there are  $256 \left( \binom{128}{2} + 128 + 1 \right)$  coefficients that must be stored. Given this large number of coefficients, it is important to choose ROM geometries that efficiently store the coefficients while allowing the full set of coefficients to be accessed in the number of cycles required by the pipelined architecture and throughput constraints. For designs targeting large throughput many memories are required so that many coefficients can be accessed concurrently. Dual and quad-port memories are also useful in this regard.

Cyclone V devices contain two types of memory blocks [8]. The relevant features of these blocks are summarized in Table 2. Each of the nk = 256 equations Q(x) requires

storing  $\binom{128}{2} = 8128$  nonlinear coefficients,  $\alpha_{i,j} 1 \leq i \leq j \leq n$ , so a single M10K block is large enough to store the  $\alpha_{i,j}$  for one of the Q(x). However, with a maximum configuration width of 40 bits, a dual port design would require at least 102 cycles to access all of the  $\alpha_{i,j}$ . To increase throughput several M10K blocks can be used. Unfortunately, this increase in throughput causes inefficient memory usage and increased resource consumption. Due to their small size, the MLAB memory resources in Cyclone V devices are not an attractive option for storing these coefficients.

Each of the Q(x) equations also requires storing 128 linear coefficients,  $\beta_i, 1 \leq i \leq n$ . Four M10K blocks could be used to store all of the  $\beta_i$ . Using a dual port configuration, such an approach would require at least 103 cycles to access all of the  $\beta_i$  coefficients. Greater throughput could be achieved by using more M10K blocks, but this would also result in inefficient use of the memories. MLAB resources are another option for storing the  $\beta_i$ . In the 32x16 configuration each MLAB can store the  $\beta_i$  coefficient for four of the Q(x) equations. This makes efficient use of the memory resources, and requires 32 cycles to read all of the  $\beta_i$ .

Since the  $\alpha_{i,j}$  are used in calculating the nonlinear portion of the Q(x) terms, the  $\beta_i$  coefficients are used in finding the linear portion, and the results of these distinct portions are combined with the  $\gamma$  to produce Q(x), it is desirable for the linear and nonlinear portions to be calculated in parallel and for these calculations to take the same amount of time. To achieve this goal, the  $\alpha_{i,j}$ and  $\beta_i$  coefficients should be stored so that it takes the same number of cycles to access them, and they should be stored in a manner that makes efficient use of the memory resources.

Our first design splits the calculation of the Q(x) equations into two rounds. The first 128 of the Q(x) are solved in the first round, and the remaining 128 are solved in the second. This permits the same memory resources to be used in each round. This design uses 128 memories to store the  $\alpha_{i,j}$  coefficients. Each of these memories is composed of four M10K blocks, each 32 bits wide and 256 words deep. This was necessary to achieve the 128 bit wide memories necessary to support our pipeline, which requires 128 bits of new  $\alpha_{i,j}$  values in each clock cycle. Unfortunately, this leaves half of each M10K block empty. Switching to a true dual port configuration does not improve the memory usage, as a 128 bit wide, 256 word deep dual-port ROM requires 8 M10K blocks. The memory layout for the 128  $\alpha$  coefficient memories is shown in Table 3. This design uses 16 memories to store the  $\beta_i$  coefficients. Each memory is 16 bits wide by 128 words deep, completely fills four MLAB resources, and stores the  $\beta_i$ coefficients for 16 of the Q(x) equations. The memory layout is also shown in Table 3. We chose this configuration so that the  $\beta$  memories could be addressed identically to

the  $\alpha$  memories. This design requires 512 MLABs and 512 M10K blocks.

Our second design solves all 256 of the Q(x) terms in parallel. In this design, we can no longer reuse memories, so the layouts are changed so that each memory holds half as many values as in the first design. As before, we store the  $\beta_i$  coefficients in MLABs. Here, each memory consists of two MLABs configured to be 16 bits wide x 64 words deep and holds the  $\beta_i$  coefficient for eight of the Q(x) equations.

The  $\alpha$  coefficients are stored in M10K memories. Each memory stores the  $\alpha$  for a single Q(x) equation and consists of four M10K blocks configured as 64 bits wide x 256 words deep. These memories are true dual-port ROMs with addresses n and n + 64 read simultaneously to permit the 128-bit reads necessary to support our pipeline. Half of the words are unused. As with the first design, the  $\alpha$  and  $\beta$  memories share address signals, and all of the coefficients can be read in 64 cycles. This design requires 512 MLABs and 1024 M10K blocks.

In Virtex 4 devices, two block RAMs can be configured in a single-port configuration 128 bits wide and 256 words deep. In our first design, half of these words are used to store the nonlinear coefficients for two of the Q(x). For our second design, a true dual-port configuration 64 bits wide by 512 words deep is also accommodated by two block RAMs. These configurations allow nonlinear coefficient storage in the same manner as we have implemented in the Cyclone V devices. In both devices, half of the memory words are unused. In Virtex 4 devices, the linear coefficients can be stored in block RAM or distributed memory. For our first design, 16 block RAMs are used to store all of the linear coefficients. They are arranged as 16 bit wide by 128 word deep memories. For the second design 32 block RAMs are configured as a 16 bit wide by 64 word deep memory. In both cases, the majority of the memory capacity is unused. Alternatively, each slice can be configured as a 64-bit memory. Then 32 slices could store the linear coefficients for 16 of the Q(x) equations in the configuration from Table 3 and 16 slices could store

the coefficients for 8 of the Q(x) equations in the configuration from Table 4.

## 3.7 Timing Optimization

The QUAD algorithm has a regular structure that is readily parallelizable. The processing consists of simple combinatorial logic that is easily pipelined to maintain fast clock frequencies. To ensure fast clocks in parallel QUAD implementations, it is important to floorplan the design to keep the logic and memories associated with each Q(x)equation close together. Due to the structure of the FP-GAs, the dedicated memory resources storing the nonlinear coefficients will generally be spread across the device and will dictate placement of the processing logic. Due to this and the large fanouts of some signals, such as memory addresses and the combinations of state, it is also helpful to duplicate some of these signals.

## 3.8 Results

A comparison of our results to previously published designs is shown in Table 5. Note that the previous QUAD implementations, QUAD low and QUAD medium, replace the random coefficients with pseudorandom functions. This violates the security argument in [1] but greatly reduces the circuit area. Cyclone-V ALUTs and Virtex-4 slices are not equivalent structures, although they are similar. Using order of magnitude estimates, our first Cyclonebased design is about 100x larger in physical area than the QUAD low, Virtex-based design, but it has over 1000Xgreater throughput, measured in bits-per-second of keystream generated. It is more than 10x larger than the QUAD medium, Virtex-based design, but its throughput is about 40x higher. Our first Virtex-4 design is 85x larger than QUAD low and 18x larger than QUAD medium, but its throughput is over 16,000X larger than QUAD low and 64x larger than QUAD medium. Our second Cyclone-based design is about 1000X larger in physical area than the QUAD low, Virtex-based design, but it has

**Table 3** The memory layout used in our first design. The subscripts on  $\alpha$  and  $\beta$  indicate which of the Q(x) terms the coefficients are associated with.

|            |                         | 4  M10K (128 bits wide  |                             |                            |
|------------|-------------------------|-------------------------|-----------------------------|----------------------------|
|            | (16                     | x 256 words deep)       |                             |                            |
| addr 0     | $\beta_n(10)$           | $\beta_{n+1}(10)$       | <br>$\beta_{n+7}(10)$       | $\alpha_n(1270)$           |
| :          |                         |                         | :                           | :                          |
| addr 63    | $\beta_n(127126)$       | $\beta_{n+1}(127126)$   | <br>$\beta_{n+7}(127126)$   | $\alpha_n(81918064)$       |
| addr 64    | $\beta_{n+128}(10)$     | $\beta_{n+129}(10)$     | <br>$\beta_{n+135}(10)$     | $\alpha_{n+128}(1270)$     |
| :          | :                       | :                       | :                           | :                          |
| addr 128   | $\beta_{n+128}(127126)$ | $\beta_{n+129}(127126)$ | <br>$\beta_{n+135}(127126)$ | $\alpha_{n+128}(81918064)$ |
| addr 129   |                         |                         |                             | 0                          |
| :          |                         |                         |                             | :                          |
| addr $255$ | ]                       |                         |                             | 0                          |

7

a throughput over 15,000x greater. It is also about 40x larger and 60x faster than the QUAD medium design. Our Second Virtex-4 design is 184x larger than QUAD low and 38x larger than QUAD medium, but it's throughput is 23,000x and 91x larger than QUAD low and QUAD medium.

Studying Table 5, it is clear that hardware implementations of QUAD can be small and slow, or large and fast. As a result, it lags other stream ciphers and symmetric ciphers, such as AES, in throughput/area. QUAD, however, is equipped with a security proof that the other ciphers lack and is much more efficient than other provably secure stream ciphers [1, 15, 16]. Moreover, QUAD's security is easily increased by using larger n, making it viable for long term use. These are all attractive properties of QUAD that might encourage its use over some of the more efficient alternatives.

Notice that, while our QUAD2 design is essentially the double of QUAD1, it does not achieve twice the throughput. Since the processing units in QUAD2 are the same as those in QUAD1, the decreased throughput is related to routing and propagation delays. We attribute this primarily to routing congestion and an inability to place all of the linear and nonlinear processing units close to the memories storing the coefficients that they utilize. The large memories required by QUAD, and the physical placement of those memories in the FPGA fabric, limit the performance of QUAD and the extent to which it can be parallelized. For instance, in Section 3.2 we indicate methods for further parallelizing QUAD, but unfortunately, the increased memory requirements of such approaches quickly exhaust the resources available in current FPGAs. Custom ASICs may be appropriate for QUAD, since such designs allow more efficient layouts and alleviate memory geometry concerns.

Although the security of QUAD depends on the Galois field and parameter choices, with some combinations being insecure [2], over GF(2) we assume that QUAD has approximately  $2^{n/2}$  bits of security [6]. Consequently, QUAD implementations with n = 512 bits of state offer security roughly equivalent to AES-256, whereas QUAD implementations with n = 224 bits are comparable to 3DES [14]. Given the nature of QUAD, which can be easily parallelized or serialized, scaling our implementations to these larger state lengths can be accomplished with relatively little impact on area, but approximately linear decrease in throughput, or with an approximate linear increase in area and relatively small decrease in  $f_{max}$ , due primarily to routing and resource placement in the FP-GAs. In particular, in parallel implementations, scaling up from an *n*-bit state to an *n'*-bit state will increase the logic area by about  $2\left(\frac{n'}{n}\right)$ , which accounts for the increased width of the nonlinear computation and the increased number of parallel paths. In either case, memory requirements for storing the coefficients are  $2n\left(\binom{n}{2}+n\right)+$ 2n, which scales quadratically with *n*.

# 3.9 Increasing Throughput

The throughput can be further increased, at the cost of additional hardware, by further parallelizing the computation. As described in Section 3.2, generating more than 128 combinations of state per cycle is straightforward. For instance, our second design could be adapted to have a 256-bit data path. This would eliminate 32 cycles from the pipeline, but would also require 256 additional linear and nonlinear processing units and modifications to the memories. The area and throughput of the design would both approximately double, although the increased logic complexity would likely reduce the achievable clock frequency and limit the throughput increase to a factor less than two. Moreover, increased memory requirements will quickly deplete the available FPGA resources, limiting achievable throughput. If eliminating the random coefficients and instead using a PRNG, as in [5,6], is acceptable then this problem can be avoided. PRNGs are small, and so many PRNGs could be included in future designs. This would allow each of the QUAD processing units to be located near a PRNG, increasing throughput.

**Table 4** The memory layout used in our second design. The subscripts on  $\alpha$  and  $\beta$  indicate which of the Q(x) terms the coefficients are associated with.

|          | 2 MLABs                           |                       |  |                       |  | 4 M10K (64 bits wide       |
|----------|-----------------------------------|-----------------------|--|-----------------------|--|----------------------------|
|          | (16  bits wide x  64  words deep) |                       |  |                       |  | x 256 words deep)          |
| addr 0   | $\beta_n(10)$                     | $\beta_{n+1}(10)$     |  | $\beta_{n+7}(10)$     |  | $\alpha_n(630)$            |
| :        |                                   |                       |  | :                     |  | :                          |
| addr 63  | $\beta_n(127126)$                 | $\beta_{n+1}(127126)$ |  | $\beta_{n+7}(127126)$ |  | $\alpha_n(81278064)$       |
| addr 64  |                                   |                       |  |                       |  | $\alpha_{n+128}(12764)$    |
| :        |                                   |                       |  |                       |  | :                          |
| addr 128 |                                   |                       |  |                       |  | $\alpha_{n+128}(81918128)$ |
| addr 129 |                                   |                       |  |                       |  | 0                          |
| :        |                                   |                       |  |                       |  | :                          |
| addr 255 |                                   |                       |  |                       |  | 0                          |

|                                                                           | Freq. (MHz) | Area                                | Thru. (Mb/s) | Thru./Area               |
|---------------------------------------------------------------------------|-------------|-------------------------------------|--------------|--------------------------|
| QUAD1<br>(Cyclone V)                                                      | 164         | 8,723 ALUTs<br>2,097,152 mem. bits  | 157.3        | 18.0kbps/ALUT            |
| QUAD1<br>(Virtex-4)                                                       | 274.8       | 7,193 slices<br>272 RAMB16          | 262.5        | 36.5kbps/slice           |
| QUAD2<br>(Cyclone V)                                                      | 143         | 15,612 ALUTs<br>2,129,920 mem. bits | 265.2        | 17.0kbps/ALUT            |
| QUAD2<br>(Virtex-4)                                                       | 204.9       | 13,061 slices<br>544 RAMB16         | 374.7        | 28.7kbps/slice           |
| $\begin{array}{c} \text{QUAD low} \\ \text{(Virtex-4)} [2,6] \end{array}$ | 267         | 85 slices                           | 0.016        | 0.2kbps/slice            |
| QUAD med.<br>(Virtex-4) [2,6]                                             | 262         | 406 slices                          | 4.1          | 10.1kbps/slice           |
| Trivium<br>(Virtex-2) [9]                                                 | 207         | 41 slices                           | 207          | 5.05 Mbps/slice          |
| Grain-128<br>(Virtex-2) [9]                                               | 181         | 48 slices                           | 181          | 3.77 Mbps/slice          |
| MICKEY-128 2.0<br>(Virtex-2) [9]                                          | 200         | 190 slices                          | 200          | 1.05 Mbps/slice          |
| Phelix<br>(Virtex-2) [9]                                                  | 62.5        | 1213 slices                         | 1000         | 0.82  Mbps/slice         |
| Salsa20<br>(Cyclone) [12]                                                 | 30          | 3510  LEs                           | 1280         | $0.36 \mathrm{~Mbps/LE}$ |
| 3-DES<br>(Virtex-2) [11]                                                  | 258         | 604 slices                          | 917          | 1.51 Mbps/slice          |
| AES-128<br>(Virtex-2) [10]                                                | 123         | 146 slices                          | 358          | 2.45 Mbps/slice          |

 Table 5 Comparison of our results to previous FPGA implementations of QUAD and other ciphers. Note that QUAD low and QUAD medium replace S with a pseudorandom function

#### 3.10 Conclusion

We have demonstrated a variety of different implementations of the QUAD stream cipher algorithm, including software implementations in a PC and an ARM microprocessor, and hardware implementations in the Cyclone V and Virtex-4 FPGAs. All of our implementations are over GF(2) and output 128 keystream bits per iteration. Our fastest implementations are over 3.3Mbits/sec on the PC, 580kbits/sec on the ARM processor, and 374Mbits/sec in the Virtex-4 FPGA. We investigated design variations to improve the speed of the implementations, and we discussed methods for further improving the software and hardware approaches in future work.

#### References

- C. Berbain, H. Gilbert, and J. Patarin, QUAD: A practical stream cipher with provable security, in Advances in Cryptology - EUROCRYPT 2006 (S. Vaudenay, ed.), vol. 4004 of Lecture Notes in Computer Science, pp. 109128, Springer Berlin / Heidelberg, 2006.
- Yang, B.-Y., Chen, O.C.-H., Bernstein, D.J., Chen, J.-M.: Analysis of QUAD. Pages 290–308 in Fast software encryption: 14th international workshop, FSE 2007, Luxembourg, Luxembourg, March 26–28, 2007, revised selected papers, edited by Alex Biryukov. Lecture Notes in Computer Science 4593, Springer, 2007. ISBN 978-3-540-74617-1.
- 3. OpenSSL: The Open Source toolkit for SSL/TLS. http://www.openssl.org/ Accessed Jan. 28,2013.
- 4. International Organization for Standardization, ISO/IEC 18031 : 2011, Random bit generation. 2011.

- 5. D. Arditti, C. Berbain, O. Billet, and H. Gilbert, Compact FPGA implementations of QUAD, in Proceedings of the 2nd ACM symposium on Information, computer and communications security, ASIACCS 07, (New York,NY, USA), pp. 347349, ACM, 2007.
- D. Arditti, C. Berbain, O. Billet, H. Gilbert, and J. Patarin, QUAD: Overview and recent developments, in Symmetric Cryptography (E. Biham, H. Handschuh, S. Lucks, and V. Rijmen, eds.), no. 07021 in Dagstuhl Seminar Proceedings, (Dagstuhl, Germany), Internationales Begegnungsund Forschungszentrum fr Informatik (IBFI), Schloss Dagstuhl, Germany, 2007.
- Altera, Cyclone V Device Handbook, (CV-5V2), Dec. 28, 2012.
- 8. Xilinx, Virtex-4 FPGA User Guide UG070 (v2.6) Dec. 1, 2008.
- Bulens, Philippe, et al. "FPGA implementations of eS-TREAM phase-2 focus candidates with hardware profile." State of the Art of Stream Ciphers Workshop (SASC 2007), eSTREAM, ECRYPT Stream Cipher Project, Report. Vol. 24. 2007.
- Rouvroy, Gal, et al. "Compact and efficient encryption/decryption module for FPGA implementation of the AES Rijndael very well suited for small embedded applications." Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on. Vol. 2. IEEE, 2004.
- 11. Rouvroy, Gal, et al. "Design strategies and modified descriptions to optimize cipher FPGA implementations: fast and compact results for DES and triple-DES." Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays. ACM, 2003.
- Rogawski, Marcin. "Hardware evaluation of estream candidates: grain, lex, mickey128, salsa20 and trivium." State of the Art of Stream Ciphers Workshop (SASC 2007), eSTREAM, ECRYPT Stream Cipher Project, Report. Vol. 25. 2007.

- Barker, Elaine B., et al. "SP 800-57." Recommendation for Key Management, Part 1, Rev. 3. 2012.
- Gierlichs, Benedikt, et al. "Susceptibility of eSTREAM candidates towards side channel analysis." Proceedings of SASC (2008): 123-150.
- 15. Blum, Lenore, Manuel Blum, and Mike Shub. "A simple unpredictable pseudo-random number generator." SIAM Journal on computing 15.2 (1986): 364-383.
- Gennaro, Rosario. "An improved pseudo-random generator based on discrete log." Advances in CryptologyCRYPTO 2000. Springer Berlin Heidelberg, 2000.
- 17. Kocher, Paul C. "Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems." Advances in CryptologyCRYPTO96. Springer Berlin Heidelberg, 1996.
- Kocher, Paul, Joshua Jaffe, and Benjamin Jun. "Differential power analysis." Advances in CryptologyCRYPTO99. Springer Berlin Heidelberg, 1999.
- Berbain, Cme, Olivier Billet, and Henri Gilbert. "Efficient implementations of multivariate quadratic systems." Selected Areas in Cryptography. Springer Berlin Heidelberg, 2007.