# A 32Gb/s On-chip Bus with Driver Pre-emphasis Signaling

Liang Zhang, John Wilson, <sup>\*</sup>Rizwan Bashirullah, Lei Luo, Jian Xu, and Paul Franzon

Department of ECE, North Carolina State University, Raleigh, NC 27695 \* Department of ECE, University of Florida, Gainesville, FL 32611

Abstract-A 16-bit on-chip bus with driver pre-emphasis fabricated in 0.25µm CMOS technology attains an aggregate signaling data rate of 32Gb/s over 5-10mm long lossy interconnects while reducing delay latency by 28.3%, power by 15.0%, and peak current by 70% over a conventional single-ended voltage-mode static bus. The proposed bus is robust against crosstalk noise and occupies comparable routing area to a reference static bus design.

# I. INTRODUCTION

Delay, noise and power dissipation in on-chip global signaling have become critical performance metrics in scaled CMOS technologies [1]. Unlike local or intermediate interconnects, global interconnects communicate signals across a chip and do not scale down in length [2]. Conventional repeater insertion techniques have been effective at achieving lower latency and higher data throughput for on-chip RC dominated interconnects [3]. However, the number of required repeaters increases as optimal repeater insertion spacing decreases with each technology node [4]. The power dissipation and delay latency associated with repeaters themselves start to undermine the power and delay performances of global signaling.

Various techniques have been proposed for high-performance on-chip busses. In [5], a hybrid current/voltage-mode (CM/VM) bus was used to exploit the increased signaling bandwidth benefits of CM sensing while minimizing the static power dissipation, but it requires a pipelined datapath to accommodate its bus processing latency. A differential CM sensing technique was used in [6], but it exhibits decreased energy efficiency for low data switching activities. This work proposes a driver pre-emphasis technique for differential CM busses to improve the delay and power performance. It attains an aggregate bandwidth of 32Gb/s (2Gb/s/ch) across 5-10mm lossy on-chip interconnects in 0.25µm CMOS technology.

# II. DRIVER PRE-EMPHASIS BUS TECHNIQUE

Driver pre-emphasis (i.e. transmitter equalization) techniques are commonly used to reduce inter-symbol interference (ISI) and increase channel bandwidth by emphasizing the high-frequency signal components or attenuating the low-frequency components [7]. Straightforward analysis of an RC-dominated on-chip distributed interconnect channel, a pre-emphasis equalizer, and their combined response indicates a 3dB bandwidth improvement from 0.5GHz to 1GHz, as shown in Fig. 1.
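
The bandwidth extension described above can be illustrated with a rough numerical sketch. It is not the analysis behind Fig. 1: the distributed RC line is replaced by a single-pole low-pass with a 0.5GHz corner, the equalizer is modeled as the discrete-time filter y[n] = x[n] - alpha*x[n-1] at a 0.5ns bit time (2Gb/s), and the tap weight `ALPHA = 0.25` is a hypothetical value chosen only for illustration.

```python
import math


def channel_mag(f_hz, f_c_hz):
    """Magnitude of a single-pole low-pass (assumed stand-in for the RC line)."""
    return 1.0 / math.sqrt(1.0 + (f_hz / f_c_hz) ** 2)


def equalizer_mag(f_hz, alpha, bit_time_s):
    """Magnitude of y[n] = x[n] - alpha*x[n-1], normalized to 1 at DC."""
    w = 2.0 * math.pi * f_hz * bit_time_s
    mag = math.sqrt(1.0 + alpha ** 2 - 2.0 * alpha * math.cos(w))
    return mag / (1.0 - alpha)


def bandwidth_3db(mag_fn, f_max_hz=4e9, step_hz=1e6):
    """First frequency where mag_fn drops below 1/sqrt(2) of its DC value."""
    threshold = mag_fn(0.0) / math.sqrt(2.0)
    for i in range(1, int(f_max_hz / step_hz) + 1):
        f = i * step_hz
        if mag_fn(f) < threshold:
            return f
    return None  # no crossing found below f_max_hz


F_C = 0.5e9        # channel 3dB bandwidth quoted in the text
BIT_TIME = 0.5e-9  # 2Gb/s per channel
ALPHA = 0.25       # hypothetical tap weight, not taken from the paper

bw_channel = bandwidth_3db(lambda f: channel_mag(f, F_C))
bw_combined = bandwidth_3db(
    lambda f: channel_mag(f, F_C) * equalizer_mag(f, ALPHA, BIT_TIME)
)

print(f"channel alone:       {bw_channel / 1e9:.2f} GHz")
print(f"channel + equalizer: {bw_combined / 1e9:.2f} GHz")
```

Under these assumptions the combined response keeps its gain within 3dB of the DC value to roughly twice the frequency of the channel alone, which is the qualitative behavior the text describes. The exact figure depends on the tap weight and on the real distributed line.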

Fig. 2 shows the proposed driver pre-emphasis circuit with an equalization and a main driver path. The equalizer consists of a single-ended to differential converter, a one-tap FIR filter, and a simple DAC. The FIR filter determines whether the current symbol is different from the previous bit by using an inverter-based delay cell. The data sequence does not need to be pipelined or delayed as in [5] before appearing at the bus input because pre-emphasis is always determined by the previous symbols. The two tri-state gates P1/N1 and P2/N2 are activated only when there is a "0-1" or "1-0" transition for a 125mV signal swing (250mV differential) at the receiver input. The driver is sized according to the target data rate and interconnect parameters. Inverters "invA" and "invB" in the main signal path maintain the 125mV signal swing for consecutive "1"s or "0"s and are implemented using minimum-sized transistors.
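
The transition-detection behavior of the equalizer path can be summarized with a small behavioral sketch. It models only the logic (compare each bit with its predecessor), not the circuit. The function names and the assumed initial line state of 0 are illustrative and do not come from the paper.

```python
def pre_emphasis_enables(bits, initial_previous=0):
    """Return one flag per bit: True when the bit differs from the previous one.

    initial_previous is an assumed line state before the first bit arrives.
    """
    enables = []
    previous = initial_previous
    for bit in bits:
        enables.append(bit != previous)
        previous = bit
    return enables


def drive_paths(bits, initial_previous=0):
    """Label which path contributes for each bit in this simplified model."""
    labels = []
    for boosted in pre_emphasis_enables(bits, initial_previous):
        if boosted:
            labels.append("main + pre-emphasis")  # tri-state gates active
        else:
            labels.append("main only")            # invA/invB hold the swing
    return labels


data = [0, 1, 1, 1, 0, 0, 1]
for bit, label in zip(data, drive_paths(data)):
    print(bit, label)
```

Because each decision depends only on the current bit and the one before it, the data stream needs no look-ahead, which is why no pipelining is required at the bus input.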

At the receiver end, a sense-amplifier (SA) [8] is used to amplify the 125mV signal swing to a single-ended CMOS output level. Unlike a PMOS based input receiver stage in [9], the Vdd/2 bias at the receiver input in this circuit allows an NMOS selection, which operates faster and exhibits a smaller latency.



Fig. 1. Frequency responses of a distributed RC interconnect channel, pre-emphasis equalizer, and their combination.



Fig. 2. Driver pre-emphasis circuit.

The bridge termination resistor ( $R_B$ ) at the receiver balances the differential pair and sinks as much current as it sources. It doubles the resistance of the signal current path and reduces the static current. Due to the  $V_{dd}/2$  virtual ground in the middle of  $R_B$ , this differential-interconnect structure maintains the same RC time constant as a single-ended line. The driver and receiver areas are around  $350\mu m^2$  and  $500\mu m^2$ , respectively.

The reduced signal swings owing to CM sensing and small drivers result in a 70.0% reduction in peak currents (Fig. 3) and a proportional reduction in power supply noise. The static current per line is 0.126mA, or only 0.158pJ/bit at 2Gb/s.
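
The static energy figure can be checked with simple arithmetic. The sketch below assumes a 2.5V supply, the usual nominal value for 0.25µm CMOS. The supply voltage is not stated in the text, so `VDD_V` is an assumption.

```python
VDD_V = 2.5                  # assumed nominal supply, not stated in the text
STATIC_CURRENT_A = 0.126e-3  # static current per line, from the text
DATA_RATE_BPS = 2e9          # per-channel data rate, from the text

static_power_w = STATIC_CURRENT_A * VDD_V
energy_per_bit_j = static_power_w / DATA_RATE_BPS

print(f"static power per line: {static_power_w * 1e3:.3f} mW")
print(f"static energy per bit: {energy_per_bit_j * 1e12:.4f} pJ")
```

With this supply voltage the result is about 0.1575pJ/bit, which is consistent with the quoted 0.158pJ/bit after rounding.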

In order to compensate for process variations, long-channel transistors are used in the pre-emphasis circuit delay cell for "0" to "1" or "1" to "0" signal transition detection. For example, an SS process corner results in longer delays and thus additional pre-emphasis on the driver output, whereas at the FF corner the output requires less pre-emphasis and the overall delay is shorter. The variation at the SF and FS corners falls between that of the FF and SS corners. A delay variation of $\pm 18\%$ is achieved, compared to $\pm 28\%$ in a conventional single-ended VM static bus. At the receiver end, the termination transistor ($R_B$) is biased in the linear region with an overdrive voltage of $\sim$ Vdd/2 to minimize the resistance deviation caused by Vth variation. The worst-case variation is less than 6.4%, indicating a $\pm 8$mV change in the 125mV signal swing. Similarly, the Vdd/2 gate bias at the SA inputs helps build a large (Vgs-Vth) overdrive and makes the SA less sensitive to offset.

# III. EXPERIMENTAL RESULTS

### A. Bus Architecture

Fig. 4 shows the architecture of the 16-bit pre-emphasis driver bus with an on-chip pseudo-random bit sequence (PRBS) generator and bit error rate (BER) analyzer. The bus lines are routed in Metal-4, with every differential pair drawn at the minimum pitch ($P_{min}$) of 0.4µm width and 0.4µm spacing, as the pre-emphasis and CM techniques proposed herein are able to compensate for RC losses in the long interconnects. The spacing between differential pairs is set at 2µm to reduce the effect of crosstalk noise, resulting in a signal-to-signal pitch of $3.2\mu$m (i.e. $2 \times P_{min}$ per line). One ground line on each side of the overall 16-bit bus is used to shield the low-swing signal. The lines are 5mm long with three meanders or 10mm long with six meanders. Two kinds of underlying layers are tested: Metal-3 planes, and Metal-3 to Metal-1 with 50% coverage. The inductive effects on these long but narrow interconnects are still dominated by the line resistive behavior [10].
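
The routing pitch quoted above follows directly from the wire dimensions. The sketch below works in integer nanometres to avoid floating-point rounding. The variable names are illustrative.

```python
WIRE_WIDTH_NM = 400         # 0.4um wire width
INTRA_PAIR_SPACE_NM = 400   # 0.4um spacing inside a differential pair
INTER_PAIR_SPACE_NM = 2000  # 2um spacing between adjacent pairs

# Minimum pitch: one wire width plus one minimum spacing.
p_min_nm = WIRE_WIDTH_NM + INTRA_PAIR_SPACE_NM

# One differential pair: two wires, the gap between them, and the gap to the next pair.
pair_pitch_nm = 2 * WIRE_WIDTH_NM + INTRA_PAIR_SPACE_NM + INTER_PAIR_SPACE_NM

# Each pair carries one signal on two lines, so the per-line pitch is half.
per_line_pitch_nm = pair_pitch_nm // 2

print(f"P_min          = {p_min_nm / 1000} um")
print(f"pair pitch     = {pair_pitch_nm / 1000} um")
print(f"per-line pitch = {per_line_pitch_nm / 1000} um "
      f"({per_line_pitch_nm // p_min_nm} x P_min)")
```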

The die micrograph of the prototype chip is shown in Fig. 5. It implements a 16-bit 32Gb/s 5mm-long bus with driver pre-emphasis and an 8-bit 16Gb/s 10mm-long bus with pre-emphasis. A 16-bit single-ended VM static bus with similar bus routing and driver area and power-optimal repeater insertion [11] is also implemented as a benchmark for delay, power and noise performance comparisons. A chip-on-board technique was used to minimize contributions from wire bonding inductances. High-speed signals were directly measured on-chip using high-impedance probes.



Fig. 3. Peak current reduction compared to conventional single-ended





Fig. 4. 16-bit driver pre-emphasis bus architecture.



Fig. 5. Die picture.


### B. Performance Evaluation

The intra-bus crosstalk performance of the pre-emphasis bus is shown in Fig. 6. Waveforms on the victim pair at the receiver input are measured and converted to a differential signal in the oscilloscope. The crosstalk between adjacent lines behaves primarily as common-mode noise. The differential signal on a pair of quiet lines has only 36mV of noise swing in Fig. 6(a), which is 14.4% of the 250mV measured signal swing. Fig. 6(b) shows the eye diagrams of the differential signal at the receiver input and the single-ended signals of the two inputs. A 250mV differential signal swing and a 200mV eye opening are observed when all 16 bits are switching randomly. The measured driver-input to receiver-output delay is 590ps, a 28.3% reduction compared to the reference repeater bus.






Fig. 6. Measured 2Gb/s (a) waveforms and (b) eye-diagrams at the receiver input.



Fig. 7. Crosstalk from full-swing signals.

To analyze the crosstalk on the low-swing bus due to a full-swing bus, a test structure as shown in Fig. 7 is used. This consists of a full-swing 8-bit VM bus crossing orthogonally beneath the 16-bit low-swing bus at the receiver side. The worst-case crosstalk occurs when signals switch in the same direction at the same time. Fig. 7 shows that the noise is still mainly common-mode and can be ignored due to the small coupling capacitance between different layers.

Both the on-chip BER tester and an Agilent 86130A 3.6Gb/s error performance analyzer (BERT) report immeasurable BER ( $<10^{-12}$ ) with 15-minute tests. A 230ps clock offset margin is found at BER  $10^{-12}$  (Fig. 8) by adjusting the receiver sampling clock.
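
To put the 15-minute test length in context, the following sketch applies the standard zero-error confidence bound for independent bit errors, CL = 1 - exp(-N * BER). This is a textbook rule of thumb, not a procedure described by the authors, and it assumes the test runs on a single 2Gb/s channel.

```python
import math


def bits_observed(data_rate_bps, duration_s):
    """Number of bits seen on one channel during the test."""
    return data_rate_bps * duration_s


def confidence_zero_errors(ber_limit, n_bits):
    """Confidence that the true BER is below ber_limit after n_bits with no errors."""
    return 1.0 - math.exp(-n_bits * ber_limit)


def seconds_for_confidence(ber_limit, confidence, data_rate_bps):
    """Error-free test time needed to reach the given confidence level."""
    n_bits = -math.log(1.0 - confidence) / ber_limit
    return n_bits / data_rate_bps


n = bits_observed(2e9, 15 * 60)
cl = confidence_zero_errors(1e-12, n)
t95 = seconds_for_confidence(1e-12, 0.95, 2e9)

print(f"bits observed in 15 minutes: {n:.2e}")
print(f"confidence that BER < 1e-12: {cl:.3f}")
print(f"error-free time for 95% confidence: {t95 / 60:.1f} minutes")
```

Under these assumptions a 15-minute error-free run covers 1.8e12 bits per channel, which supports a BER below 1e-12 with moderate confidence. A longer run would be needed for a 95% confidence level.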

Fig. 9 shows the power dissipation measurement at different data activity factors. For activity factors above 0.1, the pre-emphasis bus reduces power by 15.0%-67.5% in comparison to the reference VM repeater bus. The relative power performance of the proposed pre-emphasis technique is lower only for data activity factors under 0.07. In addition, this technique compares favorably against typical current-mode busses, as these require activity factors higher than 0.5 (random data) to achieve better power performance.






To analyze the power performance of the proposed pre-emphasis bus in a real application, a time-based Alpha 21264 processor simulator program [12] was modified to extract data activity profiles on instruction and data (i.e. load/store) streams. A total of 100 million 32-bit instructions and 100 million 32-bit data words are collected for benchmarks from the SPECint2000 test suite. Fig. 10 shows the accumulated data activity profiles of data address (a) and instruction address (b) patterns from the GCC benchmark (i.e. C Programming Language Compiler). The data bus (act=0.121) and address bus (act=0.352) exhibit the same uniform activity distribution as the data address bus. The application of pre-emphasis can save 52.1% of power on the instruction bus, 13.2% on the data address bus, and 20.4% on the data bus. Pre-emphasis only saves 1.4% of power on the instruction address bus, but the instruction address bus exhibits high switching activity for the lower order bits, which indicates a higher spatial locality amongst the address streams, since instructions are usually stored in adjacent locations of memory. A bus scheme with pre-emphasis on the lower order bits and a traditional VM bus on the higher order bits can be proposed to take advantage of this high spatial locality, but the delay latency difference between the two techniques needs to be adjusted.
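
A per-bit activity profile of the kind plotted in Fig. 10 can be extracted from a word trace as sketched below. The sketch assumes that activity means the fraction of consecutive word pairs in which a bit toggles. The function name and the synthetic address trace are illustrative and are not taken from the modified simulator.

```python
def per_bit_activity(words, width=32):
    """Fraction of consecutive-word transitions in which each bit toggles."""
    if len(words) < 2:
        raise ValueError("need at least two words to count transitions")
    toggles = [0] * width
    for previous, current in zip(words, words[1:]):
        changed = previous ^ current  # 1 in every bit position that toggled
        for bit in range(width):
            toggles[bit] += (changed >> bit) & 1
    transitions = len(words) - 1
    return [count / transitions for count in toggles]


# Synthetic trace: sequential 4-byte instruction fetches (hypothetical addresses).
addresses = [0x1000 + 4 * i for i in range(17)]
activity = per_bit_activity(addresses)

for bit in range(8):
    print(f"bit {bit}: activity {activity[bit]:.4f}")
print(f"mean activity over 32 bits: {sum(activity) / len(activity):.4f}")
```

For a sequential trace like this one, the toggling is concentrated in the low-order address bits while the high-order bits stay quiet, which mirrors the spatial locality described for the instruction address bus.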

# IV. DISCUSSION

Delay, throughput, power, area, and noise are all important performance metrics to be considered in on-chip signaling methodologies. This work explores and applies communication techniques such as pre-emphasis to on-chip signaling while achieving various design trade-offs. Similar to the optimization of SRAM designs [13], delay or power performance is improved by trading off noise margin or signal swing, while a degradation of these metrics is allowable only within a confined domain of global buses where noise levels are tightly controlled by circuit techniques or bus structures.



Fig. 10. Data address and instruction address bus activity of an Alpha 21264 microprocessor (data bus act=0.121 and address bus act=0.352 exhibit the same uniform activity distributions as data address bus).

The proposed 16-bit bus architecture has a differential-signal pair pitch of  $3.2\mu$ m. A sparse structure with  $2\mu$ m spacing between signal pairs instead of an even structure with  $0.8\mu$ m spacing is used to improve the noise performance. This non-delay-optimal structure favors noise. Given the same bus routing area, pre-emphasis bus delay and noise can be traded for optimal performance. In general, the proposed technique yields improved delay and power while remaining robust to noise.

# V. CONCLUSION

By applying a driver pre-emphasis technique to an on-chip bus, power performance is improved by 15.0% due to low-swing signaling while minimizing the number of required repeaters; delay performance is improved by 28.3% due to increased channel bandwidth; and peak current is reduced by 70% due to the decrease in driver size. This work was demonstrated in a 0.25µm CMOS technology with an aggregate data rate of 32Gb/s over 5-10mm long lossy interconnects.

### ACKNOWLEDGMENT

The authors thank IBM in Research Triangle Park (RTP) for kindly lending us their error performance analyzer and thank Dr. Stephen Mick and Fei Gao for their valuable discussions. This work is supported by NSF under CCR-9988334 and AFRL under F29601-03-3-0135.

# References

- [1] C. Hu, "CMOS for one more century?" Custom Integrated Circuits Conference, Keynote Speech, Oct 2004.
- [2] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," Proc. IEEE, vol. 89, no. 4, pp. 490-504, Apr 2001.
- [3] H. Bakoglu, Circuits, Interconnections and Packaging for VLSI, Addison-Wesley, 1990.
- [4] J. Cong, "An interconnect-centric design flow for nanometer technologies," Proc. of the IEEE, vol. 89, no. 4, pp. 505-528, Apr 2001.
- [5] R. Bashirullah, et al., "A 16Gb/s adaptive bandwidth on-chip bus based on hybrid current/voltage mode signaling," Symp. VLSI Circuits, pp. 392-393, June 2004.
- [6] N. Tzartzanis and W. W. Walker, "Differential current-mode sensing for efficient on-chip global signaling," JSSC, vol. 40, pp. 2141-2147, Nov 2005.
- [7] W. Dally and J. Poulton, *Digital Systems Engineering*, Cambridge Univ. Press, Cambridge, UK, 1997.
- [8] B. Nikolic, et al., "Improved sense-amplifier-based flip-flop design and measurements," JSSC, vol. 35, pp. 876-884, Jun 2000.
- [9] R. Ho, K. Mai, and M. Horowitz, "Efficient on-chip global interconnects," Symp. VLSI Circuits, pp. 271-274, Jun 2003.
- [10] Y. Ismail, E. Friedman, and J. Neves, "Figures of merit to characterize the importance of on-chip inductance," IEEE Trans. VLSI, vol. 7, pp. 442-449, Dec 1999.
- [11] K. Banerjee and A. Mehrotra, "A power-optimal repeater insertion methodology for global interconnects in nanometer designs," IEEE Trans. Electron Devices, vol. 49, no. 11, pp. 2001-2007, Nov 2002.
- [12] D. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," University of Wisconsin, Madison, Technical Report CS-TR-97-1342, June 1997.
- [13] J. M. Rabaey, *Digital Integrated Circuits: A Design Perspective*, Prentice Hall, 1996.