# A Novel Low Gate-Count Pipeline Topology With Multiplexer-Flip-Flops for Serial Link

Wei-Yu Tsai, Student Member, IEEE, Ching-Te Chiu, Member, IEEE, Jen-Ming Wu, Member, IEEE, Shawn S. H. Hsu, Member, IEEE, and Yar-Sun Hsu, Member, IEEE

Abstract—This paper proposes multiplexer-flip-flops (MUX-FFs) as a high-throughput, low-cost solution for serial link transmitters. We also propose multiplexer-latches (MUX-Latches), which combine the logic function of combinational circuits with the storage capability of sequential circuits. In a pipeline built with MUX-FFs, which are composed of cascaded latches and MUX-Latches, many of the latch gates used for sequencing can be removed. Analysis and simulation results show that an 8-to-1 serializer in the pipeline topology with MUX-FFs reduces the gate count by 52% compared with the traditional pipeline topology. To verify the functions of the proposed design, two chips are implemented in 90 nm CMOS technology, one with the proposed 4-to-1 MUX-FF and one with the 8-to-1 serializer with MUX-FFs. The measured results show that the MUX-FF and the proposed serializer with MUX-FFs are almost bit-error-free (with $BER < 10^{-12}$), operating at up to 6 Gbits/s and 12 Gbits/s, respectively.

*Index Terms*—Low gate-count, MUX-FF, MUX-Latch, pipeline, serial link.

## I. INTRODUCTION

IN RECENT years, serial link interfaces have been widely adopted in high-speed interconnect transmission systems. High speed, low area, and low power are the key targets for serial link designs. The design techniques for the serial link can be categorized into two groups: the pipeline and the nonpipeline topologies. The nonpipeline topology has the advantages of low chip area and power, while the pipeline topology has the advantage of high operating speed. Nonpipeline topologies such as the large fan-in multiplexer [1] and the tree-topology multiplexer with multiphase clocks [2] have been proposed as low-cost solutions. The large fan-in multiplexer [1] processes more than two input data in a single gate. However, in a large fan-in multiplexer, the parasitic capacitance of the output node is large and causes a long MUX gate delay. In [2], the multiplexer adopts the common tree topology, but employs multiphase low-frequency clock signals rather than high-frequency clock signals. The designs in both [1] and [2] achieve

Manuscript received September 07, 2011; revised March 05, 2012; revised May 08, 2012; accepted June 06, 2012. Date of publication August 13, 2012; date of current version October 24, 2012. This work was supported by a grant from NSC 99-2221-E-007-112-MY3. This paper was recommended by Associate Editor S. Mirabbasi.

W.-Y. Tsai, J.-M. Wu, S. S. H. Hsu and Y.-S. Hsu are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 30013 (e-mail: s9861629@m98.nthu.edu.tw; jmwu@ee.nthu.edu.tw; shhsu@ee.nthu.edu.tw; yshsu@ee.nthu.edu.tw).

C.-T. Chiu is with the Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 30013 (e-mail: ctchiu@cs.nthu.edu.tw).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2012.2206494

area savings by cascading MUXs without pipeline flip-flops (FFs). However, the multiple cascaded MUX stages become critical data paths, which limits the operating speed. Due to this limitation, the pipeline topology is more commonly adopted to achieve high-speed designs.

Adopting the pipeline topology to a serial link transmitter separates the data transmission of the MUX stages and improves the data throughput. The pipelined multiplexer implemented with static logic has been reported in [3]. Using the current mode logic (CML) to implement pipelined multiplexers reaches higher data rate, but consumes more power compared to the static logic. In the hybrid designs of [4] and [5], CML circuits are adopted as the output stage to achieve high throughput, and the static logic circuits are used in the input low-speed stage. In recent years, most of the high-speed transmitters are implemented with CML in the pipeline topology with FFs as the sequential circuits (SCs) [6], called the conventional pipeline topology in this paper. The transmitters with the conventional pipeline have been reported in technologies such as SiGe [7] and InP [8], [9] at speed of 50 Gbits/s. Implementations of pipeline transmitters under various CMOS technologies have also been reported. The inductive peaking scheme is adopted to increase the speed of output stages to 15 Gbits/s or higher in [10]–[12]. In [11], the super-dynamic FFs enhance the operation speed of the sequential FFs. The feedback CMOS CML in [13] is used for high-speed operation and is more tolerant to threshold voltage fluctuation than the conventional CMOS CML. All of the above approaches in [6]–[13] adopt the conventional pipeline topology as the high-speed solution, and at the same time different design techniques are adopted in the individual circuits to make an improvement.

However, the conventional pipeline topology, which uses FFs as sequential circuits, incurs circuit overhead in a serial link. In serial link interfaces, a serializer converts low-speed parallel data into a high-speed serial data stream. The main circuit of a serializer is a MUX gate, but many FF gates are used in the conventional pipeline to separate the MUX stages and store data. The FFs account for a significant portion of the area and power overhead in the serial link circuit.

The selection between the nonpipeline and the pipeline topologies in designing a serializer becomes a trade-off between speed and area. To overcome this problem, we propose a new circuit called the MUX-Latch. A CML MUX-Latch, a combination of a CML MUX and a CML latch [14], provides both multiplexing and data storage. A. Emami-Neyestanak *et al.* have proposed an analog MUX with an embedded latch that is used in a decision feedback equalizer (DFE) receiver [15]. Our proposed MUX-Latch architecture is similar to theirs except for two points. First, the selecting and clocking transistors are stacked in their design while ours are not.



Fig. 1. Schematics of conventional (a) CML MUX and (b) CML Latch.

Second, their design is used in the receiver while ours is used in the transmitter for serialization. The clock signals in their MUX with embedded latch run at a quarter of the rate of the input signals, whereas ours run at four times the rate of the input signals.

A MUX-FF is composed of cascaded MUX-Latches and latches [16]. In this paper, we provide a detailed area and delay analysis of the MUX-FF, which performs multiplexing and clock-edge-triggered sampling like the MUXs and the FFs in the conventional topology. We describe the pipeline behavior of the MUX-Latch and the clock-edge-triggered sampling function of the MUX-FF. The smallest unit MUX-FF functions as a 4-to-1 MUX with clock-edge-triggered sampling. A unit MUX-FF saves six latches compared with a conventional pipeline 4-to-1 MUX. With the MUX-FFs, the gate-count reduction of the proposed serializer becomes more significant as the number of serializer inputs increases. The proposed serializer topology with MUX-FFs uses the 2-to-1 CML MUX at the output stage, and retains the same high-speed operation as the conventional pipeline.

This paper is organized as follows. The proposed CML MUX-Latch is presented in Section II. The operation and speed-up analysis of the proposed pipeline structure with MUX-Latches are described in Section III. Section IV describes the serializer circuit design and the analysis of the proposed topology with MUX-FFs. The 4-to-1 MUX-FF and the proposed 8-to-1 serializer are implemented, and the experimental results are shown in Section V. Section VI gives a brief conclusion.

## II. CML MUX-LATCH

The schematics of the conventional CML MUX and CML latch circuits are shown in Fig. 1. A CML MUX is constructed from two input pairs, (M1, M2) and (M3, M4), sharing a common output node and current tail. The lower-level differential pair, (M5, M6), works as a current switch, and the selecting signal S steers the current to one of the two input pairs. The output function of a CML MUX is $(S\&IN0)\,||\,(\overline{S}\&IN1)$. Similarly, a CML latch consists of an input pair (M1, M2) and a storage loop (M3, M4). With 50% duty cycle clock signals, a latch samples the data in the first half period and holds the state in the second half period. Both the latch and the MUX in the conventional CML schematics are controlled by 50% duty cycle complementary signals.

With the level-sampling function of a latch, any switching or noise during the clock-high-level period might cause wrong logic at the output node. Therefore, the stage succeeding a latch is commonly designed to take its data during the holding period. An FF is constructed by cascading two latches, which are controlled



Fig. 2. (a) Circuit diagram, (b) timing diagram, and (c) schematic of MUX-Latch.

by clock and inverse clock signals, respectively. As a result, only the datum sampled at the falling clock edge propagates to the output.

Our proposed MUX-Latch is constructed from a MUX and an additional storage loop connected to a common current tail. A MUX-Latch thus adds the holding function of a latch to a MUX. Fig. 2 shows the circuit diagram, control signals, and schematic of a MUX-Latch. In Fig. 2(a), the current of a MUX-Latch switches as in a conventional CML gate, with only one of $\{(M1, M2), (M3, M4), (M5, M6)\}$ chosen to carry the total current. One MUX-Latch operation is separated into four steps: propagate IN0, hold IN0, propagate IN1, and hold IN1. Three control signals, CLK, P0, and P1, are used for holding and selecting, as shown in Fig. 2(b). Like a latch, the MUX-Latch passes IN0 or IN1 to the output when P0 or P1 is high, respectively. A MUX-Latch performs multiplexing for half of the CLK period and holds the data for the other half, which is equivalent to cascading a MUX and a latch. In other words, a MUX-Latch can replace a MUX and a succeeding latch. The frequency of CLK is twice that of P0 and P1. The normal 50% duty cycle CLK signal holds the logic level while CLK is low. In addition to the CLK signal, two 25% duty cycle pulse signals, P0 and P1, are provided to select the two inputs IN0 and IN1 individually.
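The four-step operation can be illustrated with a logic-level sketch. This is only a behavioral model under simplifying assumptions: time is quantized into half-CLK-period slots, all analog effects are ignored, and the pulses are derived from an ideal half-rate clock `clkh` following the relations P0 = CLK&CLKH and P1 = CLK&$\overline{CLKH}$ given in Section IV-B. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MuxLatch:
    """Logic-level model of a MUX-Latch (illustrative only, no analog behavior)."""
    out: int = 0  # value kept by the storage loop

    def step(self, in0: int, in1: int, p0: int, p1: int, clk: int) -> int:
        # P0 and P1 are non-overlapping and may be high only while CLK is high.
        assert not (p0 and p1), "P0 and P1 must never be high together"
        assert clk or not (p0 or p1), "pulses are only allowed while CLK is high"
        if p0:
            self.out = in0  # propagate IN0
        elif p1:
            self.out = in1  # propagate IN1
        # Otherwise CLK is low and the storage loop holds the previous value.
        return self.out


def control_signals(n_slots: int):
    """Yield ideal (CLK, P0, P1) values for each half-CLK-period slot."""
    for t in range(n_slots):
        clk = 1 if t % 2 == 0 else 0
        clkh = 1 if t % 4 < 2 else 0  # assumed ideal half-rate clock
        yield clk, clk & clkh, clk & (1 - clkh)


def simulate(in0_seq, in1_seq):
    """Run the model over two equally long input sequences, one value per slot."""
    latch = MuxLatch()
    outputs = []
    for (clk, p0, p1), in0, in1 in zip(control_signals(len(in0_seq)), in0_seq, in1_seq):
        outputs.append(latch.step(in0, in1, p0, p1, clk))
    return outputs
```

In this model, input changes that arrive during a hold slot do not reach the output, which is the behavior a MUX followed by a latch would provide.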



Fig. 3. Schemes of conventional pipeline with: (a) latches and (b) FFs.

## **III. MUX-LATCHES PIPELINE STRUCTURE ANALYSIS**

We describe the pipeline behavior of our proposed MUX-Latch. Latches and FFs are most commonly used as the sequential circuits (SCs). The schemes of conventional pipeline with latches and FFs are shown in Fig. 3(a) and (b). The  $CC_i$ in the figure is the *i*th stage combinational circuit (CC) of the pipeline. The "\*" and "\*\*" signs in the figure indicate that the data are modified by the CCs once and twice, respectively. The dotted-edged period in Fig. 3(a) is the propagation period of the latch in the previous stage. The shaded blocks are the operation period of the  $CC_is$  and the operation lengths are called  $T_{CCi}$ . The clock-to-output delay and data-to-output delay of a latch are called  $T_{cqL}$  and  $T_{dqL}$ , respectively. The maximum length of  $T_{CCi}$  is the half clock period minus  $T_{cqL}$  in standard design. However, when the operation time of  $CC_i$  is longer than a half

 TABLE I

 Implementations of the Simple Logic Functions With MUXs



Fig. 4. Scheme of proposed pipeline using MUX-Latches.

clock period, the extra time can be "borrowed" from the succeeding stage. Because the time-borrowed stage does not get the stable input datum when the clock is switched, the delay of the latch should be considered as  $T_{dqL}$ .

The pipeline with FFs, as shown in Fig. 3(b), is a special case of the pipeline with latches. The maximum length of  $T_{CCi}$  is the clock period minus  $(T_{cqL} + T_{dqL})$ .
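Written as inequalities, the two timing budgets described above are as follows, where $T_{clk}$ denotes the clock period (a symbol introduced here for convenience) and the latch case assumes the standard design without time borrowing:

$$T_{CCi} \le \frac{T_{clk}}{2} - T_{cqL} \quad \text{(pipeline with latches)},$$

$$T_{CCi} \le T_{clk} - \left(T_{cqL} + T_{dqL}\right) \quad \text{(pipeline with FFs)}.$$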

The performance enhancement relies on separating the total operation time of the nonpipeline design into smaller pieces in the pipeline scheme. In some cases, the smallest separations of the pipeline are just simple logic functions, such as AND, OR, XOR, and MUX. For pipelines with simple logic functions, the operation time of an SC approximates or even exceeds that of a CC. Accordingly, the timing overhead of an SC occupies a significant part of the clock period and dominates the achievable speed-up.

A MUX can implement AND, OR, and XOR logic functions by connecting the inputs IN0 and IN1, and selecting signal S to specified logics, as shown in Table I. Being a combination of a MUX and a latch, a MUX-Latch can accomplish the simple logic functions and store the data by a single gate.
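As a minimal sketch of how a 2-to-1 MUX can realize these functions, the following uses one possible wiring per function. The wirings are assumptions chosen for illustration and are not necessarily those listed in Table I. The MUX polarity (IN0 selected when S is high) follows the pulse generator description in Section IV-B, and the XOR case assumes the complement of an input is available, as it is in differential logic by swapping the signal pair.

```python
def mux(s: int, in0: int, in1: int) -> int:
    """2-to-1 MUX: passes IN0 when S is high and IN1 when S is low (assumed polarity)."""
    return in0 if s else in1


def and_gate(a: int, b: int) -> int:
    # S = a, IN0 = b, IN1 = logic-0
    return mux(a, b, 0)


def or_gate(a: int, b: int) -> int:
    # S = a, IN0 = logic-1, IN1 = b
    return mux(a, 1, b)


def xor_gate(a: int, b: int) -> int:
    # S = a, IN0 = complement of b, IN1 = b
    return mux(a, 1 - b, b)
```

Replacing `mux` with a MUX-Latch model would add the holding step to each of these functions without an extra gate.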

Fig. 4 shows the proposed pipeline using MUX-Latches.  $T_{cqML}$  is the clock-to-output delay of a MUX-Latch. In contrast to the conventional pipeline with latches, no time borrowing is allowed. All propagations should be finished within a half clock period, and the MUX-Latch switches to the holding operation in the succeeding half clock period. Therefore,  $T_{cqML}$  must be smaller than or equal to a half clock period. Because of the accurate holding period, cascading multiple MUX-Latches with staggered phases guarantees that each stage operates while the output of the previous stage is held and stable.

In the proposed pipeline, the simple logics and the holding function are implemented with MUX-Latches. To compare the same function of a MUX-Latch in the conventional pipeline, a MUX and a latch are used as the CC and the SC, respectively. The operation time of a proposed pipeline MUX-Latch stage



Fig. 5. The clock-edge-triggered sampling function of cascaded MUX-Latches.

is  $T_{cqML}$ . The operation time of a conventional MUX plus a latch is  $T_{cqM} + T_{dqL}$  in contrast. Accordingly, the speed-up ratio of the operation time of the proposed MUX-Latch to the conventional pipeline is

$$\frac{T_{cqM} + T_{dqL}}{T_{cqML}}. \tag{1}$$

Fig. 5 shows the clock-edge-triggered sampling function of the cascaded MUX-Latches. A MUX-Latch is transparent from the input to the output for half of the operation period, and is blocked for the other half. The slash-covered periods in the figure indicate that the datum is held by the first MUX-Latch and the node is opaque to the input. Similarly, the shaded periods in the figure indicate that the datum is held by the second MUX-Latch. In this paper, the clock-edge-triggered sampling function is defined as follows: the output (NODE 2) is opaque to the input, except that the data level switches on the clock edge. The cascaded MUX-Latches in a serializer, as shown in Fig. 5, are controlled by signals of a different frequency at each stage. Because the sampling period of the second MUX-Latch overlaps the opaque period of the first MUX-Latch, NODE 2 is opaque to the input. As a result, the cascaded MUX-Latches perform the clock-edge-triggered sampling function.

## IV. SERIALIZER CIRCUIT DESIGN

## A. CML MUX-FF

Fig. 6 shows three types of topologies for a 4-to-1 multiplexer. In the conventional topology, a 2-to-1 CML multiplexing unit is composed of a latch and a MUX. The latch before a MUX is used to provide a half cycle delay. As a result, the MUX can sample the original datum and the delayed datum at the proper phase with high level and low level of the selecting signal, respectively. The FF between every two stages of 2-to-1 CML multiplexing units is used to provide synchronization between top and bottom output of the multiplexers. Because of the clock-edge-triggered sampling function of the FFs, the variance and uncertainty of combinational circuits can be reduced, and then the data are aligned.



Fig. 6. 4-to-1 multiplexer: (a) conventional topology, (b) MUX-Latch with Latch topology, and (c) the proposed MUX-FF topology.

A MUX-Latch can replace a MUX and a latch from the succeeding FF, as shown in Fig. 6(b). Because the MUX-Latch includes a latching function and is a sequential circuit, the operations of multiplexing and latching can be aligned perfectly. The latch after a MUX-Latch can then be removed, as shown in Fig. 6(c). We call this topology a MUX-FF, which implements a 4-to-1 multiplexing unit using only MUX-Latches and latches. In the MUX-FF, three MUX-Latches are used, and six latches can be removed compared with the conventional CML 4-to-1 multiplexer. As a result, the total gate count is substantially decreased.

Fig. 7 shows the complete schematic of a MUX-FF and the signal timing diagram. There are three clock signals, CLK, CLKH, and CLKHH, and four control signals, P0, P1, P0H, and P1H. The frequency of CLK is two and four times that of CLKH and CLKHH, respectively. P0 and P1 are 25% duty cycle pulse signals at half the frequency of CLK, and P0H and P1H are 25% duty cycle pulse signals at half the frequency of CLKH. In the first input stage, CLKHH in the latches is used to pre-sample the data of INPUT 2 and INPUT 3 and hold them for a half period. The data INPUT 2 and INPUT 3 from NODE 1 and NODE 2 are held for a half period in terms of CLKH. Similarly, CLKH provides a phase delay of a half period to NODE 4. The shaded blocks in the timing diagram are the holding periods of the data. For the MUX-Latches, the propagations are controlled by P0H/P1H and P0/P1, and the holding periods are controlled by CLKH and CLK. Except for P0H, the propagations are designed to overlap the holding periods of the previous stage. The output data rate is the same as the clock rate of CLK.

## B. Clock Pulse Generator

For the noncomplementary selecting of current in the MUX-Latch, a set of clock pulse signals with 25% duty cycle



Fig. 7. (a) Schematic of a MUX-FF. (b) Signal timing diagram.

is needed. The clock pulse generators (CPGs) are the extra overhead of adopting the proposed topology. Fig. 8 shows the schematic of the CPG and its input and output waveforms. Like differential signals, the pulse signals have a 180° phase difference to each other. P0 is equal to (CLK&CLKH), and P1 is equal to  $(CLK\&\overline{CLKH})$ . To implement the pulse signals, MUX gates are used as AND gates. For a MUX, signals from IN0 or IN1 propagate to the output faster than the selecting signal S. The critical switching signal of the AND operation is CLK. Therefore, IN0 and IN1 are connected to CLK and logic-0, respectively, and the selecting input S is connected to CLKH.
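The pulse relations can be checked with a short logic-level sketch. It assumes ideal, sampled clocks with no divider delay. The wiring for P0 follows the text, while the wiring for P1 (IN0 and IN1 swapped) is an assumption made here to obtain CLK&$\overline{CLKH}$. All function names are hypothetical.

```python
def mux(s: int, in0: int, in1: int) -> int:
    """2-to-1 MUX: passes IN0 when S is high and IN1 when S is low."""
    return in0 if s else in1


def clock(t: int, period: int) -> int:
    """Ideal 50% duty-cycle clock sampled at integer time t (high in the first half)."""
    return 1 if (t % period) < period // 2 else 0


def pulse_generator(n_samples: int, clk_period: int = 4):
    """Return sampled P0 and P1 waveforms derived from CLK and the half-rate CLKH."""
    p0, p1 = [], []
    for t in range(n_samples):
        clk = clock(t, clk_period)
        clkh = clock(t, 2 * clk_period)
        p0.append(mux(clkh, clk, 0))  # P0 = CLK & CLKH (wiring from the text)
        p1.append(mux(clkh, 0, clk))  # P1 = CLK & ~CLKH (assumed wiring)
    return p0, p1


def duty_cycle(wave) -> float:
    """Fraction of samples at logic high."""
    return sum(wave) / len(wave)
```

Over a whole number of CLKH periods, both waveforms are high for a quarter of the samples and never overlap, matching the 25% duty cycle and 180° phase relation described above.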

The waveform of CLKH in Fig. 8(b) shows the ideal case (the solid line) and the real case (dotted line) of the timing of CLKH. The clock divider is a flip-flop with inverted feedback, and the falling edge of the clock triggers the divider. As a result, the delay of a divider covers only the CLK-level-0, and keeps the complete overlap of the CLK-level-1 and the CLKH-level-0 or 1 period.



Fig. 8. (a) Schematic and (b) the input and output waveforms of the clock pulse generator.

## C. Serializer Topology With MUX-FFs

Figs. 9 and 10 show the topologies of the conventional 8-to-1 serializer and the proposed 8-to-1 serializer with MUX-FFs, respectively. The proposed 8-to-1 serializer is constructed from two MUX-FFs and a 2-to-1 multiplexing unit as the output stage. The 2-to-1 multiplexing unit is the same as in the conventional topology and consists of a latch and a MUX. Two clock dividers are needed to produce CLKH and CLKHH from CLK. The CPGs are extra circuits compared with the conventional topology. Two CPGs are used to provide the full-speed and half-speed pulse signals, P0/P1 and P0H/P1H.

The parasitic effect from the last-stage MUX to the output load in Fig. 9 is the speed-critical part of half-rate-to-clock serializing. A common way to solve this problem is to add an FF after the last-stage MUX. The FF is operated with a full-data-rate clock. In the proposed approach, a MUX-Latch can be added instead. In this case, a MUX-FF can reduce the parasitic effects on the output load at the cost of operating the MUX-Latch at the full data rate.

The serializer topologies considered in this work use the half-rate-to-clock approach, as shown in Figs. 9 and 10. A MUX-Latch has a level-sampling function. By adopting a MUX-FF, the data sampling becomes clock-edge-triggered. Nevertheless, the output of a MUX-FF is full-rate with respect to the clock. By adding a 2-to-1 multiplexing unit at the outputs of the MUX-FFs, the data rate of the proposed 8-to-1 serializer is doubled while using the same clock signals as in the 4-to-1 MUX-FF. As a result, the proposed topology is recommended for serializers with an input number over eight.

Compared to a conventional 8-to-1 serializer, a proposed 8-to-1 serializer with MUX-FFs uses the same clock dividers, saves 14 pipelining flip-flops, and adds four extra MUXs for pulse signals. The gate count estimations of N-to-1 serializers



Fig. 9. Schematic of a conventional 8-to-1 serializer.

for both the conventional and proposed topologies are given in Table II.

The estimations include the circuit overheads of the clock dividers and the pulse generators. An N-to-1 serializer is constructed from N-1 2-to-1 components, placed into  $\log_2 N$  levels, so  $\log_2 N - 1$  clock dividers are needed, each counted as two latches in Table II. In each level, the number of components is reduced by half and the data speed is doubled compared with the previous level. In a conventional serializer, five latches and one MUX compose a 2-to-1 component. In the proposed topology, however, a latch and a MUX-Latch alone can perform the same function. The gate count of the serializer core in the proposed scheme is about 33% of that of the conventional one. Fig. 11 shows the gate count versus the number of inputs. The gate counts of a conventional 8-to-1 serializer and a proposed 8-to-1 serializer with MUX-FFs, including the clock signal circuits, are 46 and 22, respectively.

The area, power, and data rate of our proposed 8-to-1 serializer are compared with those of the conventional serializer. First, the area of a MUX-Latch is larger than the area of a MUX or a latch. Under the assumption that the MUXs, latches, and MUX-Latches are designed with the same driving strength in this paper, the area ratio of these circuits is similar to the ratio of number of components.

$$Area_{MUX} : Area_{latch} : Area_{MUX-Latch} \approx 2 : 2 : 3. \tag{2}$$

If we multiply the gate count estimations by these factors, we obtain the area estimations of the conventional topology and the proposed topology for the 8-to-1 serializers shown in Figs. 9 and 10. The area estimation ratio of the proposed topology to the conventional topology is 50 : 92, which means that about 46% of the area is saved.
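The gate-count and area figures quoted above can be reproduced from the Table II expressions and the weights in (2). The following is a small sketch of that arithmetic. The function names and dictionary layout are hypothetical, and N is assumed to be a power of two.

```python
from math import log2

# Relative area weights from (2): MUX : latch : MUX-Latch = 2 : 2 : 3.
AREA_WEIGHT = {"latch": 2, "mux": 2, "mux_latch": 3}


def gate_counts(n: int) -> dict:
    """Gate counts per type for an N-to-1 serializer, following Table II."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two and at least 2")
    levels = int(log2(n))
    divider_latches = 2 * (levels - 1)  # latches used in the clock dividers
    cpg_muxes = 2 * (levels - 1)        # MUXs used in the clock pulse generators
    return {
        "conventional": {
            "latch": 5 * (n - 1) + divider_latches,
            "mux": n - 1,
            "mux_latch": 0,
        },
        "proposed": {
            "latch": (n - 1) + divider_latches,
            "mux": 1 + cpg_muxes,
            "mux_latch": n - 2,
        },
    }


def total_gates(counts: dict) -> int:
    """Total number of gates regardless of type."""
    return sum(counts.values())


def area_units(counts: dict) -> int:
    """Area estimate in relative units, weighting each gate type by (2)."""
    return sum(AREA_WEIGHT[kind] * number for kind, number in counts.items())


if __name__ == "__main__":
    c = gate_counts(8)
    for name in ("conventional", "proposed"):
        print(name, total_gates(c[name]), area_units(c[name]))
```

For N = 8 this gives 46 and 22 gates and 92 and 50 area units, that is, reductions of about 52% in gate count and 46% in estimated area.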

Second, the power estimation ratio is similar to the gate count estimation ratio because the MUX-Latches are designed with the same driving current. The presimulation results show that our design, including the serializer, the clock signal circuits, and the buffers, consumes 35.3% less power than the conventional 8-to-1 serializer with the same driving strengths. Third, since all the components in the conventional and proposed serializers are designed with the same driving strength, their data rate and BER performance are similar.

## D. Serializer Delay Analysis

The parasitic capacitance at the output node of a MUX-Latch is larger than that of a latch or a MUX. Both a MUX and a latch are two-component circuits, in which a component is an input pair or a storage loop. Each of the NMOSs in the components connected to the output node increases the parasitic capacitance  $C_{OUT}$ . Assuming that the parasitic capacitance of an NMOS is  $C_p$ , the unit fan-out capacitance is  $C_f$ , and the output resistance is  $R_{OUT}$ , the RC delay approximations of  $T_{cqM}$  and  $T_{dqL}$  are

$$\begin{aligned} T_{cqM} \approx T_{dqL} &\approx R_{OUT} \times C_{OUT} \\ &= R_{OUT} \times (2 \times C_p + N \times C_f), \end{aligned} \tag{3}$$

where N is the normalized fanout-number. More accurately, the RC delay approximation is the delay from input data switching to output switching. In fact, the clock-to-output delay is longer



Fig. 10. Schematic of a proposed 8-to-1 serializer with MUX-FFs.

 
 TABLE II

 N-to-1 Serializer Gate-Count Estimations of Conventional Topology and Proposed Topology

|                   | Conventional topology | Proposed topology    |
|-------------------|-----------------------|----------------------|
| Latch             | 5(N-1)                | N-1                  |
| MUX               | N-1                   | 1                    |
| MUX-Latch         | -                     | N-2                  |
| Latch for divider | $2(\log_2 N - 1)$     | $2(\log_2 N - 1)$    |
| MUX for CPG       | -                     | $2(\log_2 N - 1)$    |
| total             | $6N + 2\log_2 N - 8$  | $2N + 4\log_2 N - 6$ |



Fig. 11. Estimation of serializer gate-count for scalability.

than data-to-output delay. Also, the storage loop in a latch has larger  $C_p$  because of the feedback. Both of these effects are ignored in the delay approximation of  $T_{cqM}$  and  $T_{dqL}$ . Similarly, the RC delay approximation of  $T_{cqML}$  is

$$T_{cqML} \approx R_{OUT} \times (3 \times C_p + N \times C_f) \tag{4}$$



Fig. 12. Environment of the simulation of circuit delay.

because a MUX-Latch is a three-component circuit. The extra parasitic capacitance of a MUX-Latch lengthens the propagation delay and slows down the operation speed. The delay approximations include the fan-out capacitance ( $C_f$ ) at the output node of the MUX-Latch and the NMOS parasitic capacitance, but not the capacitance caused by the feedback loop. The capacitance ( $C_g$ ) of the feedback loop is smaller than the NMOS parasitic capacitances, which are connected together in parallel, so  $C_g$  can be neglected.

Because the operations at every stage must finish within a half clock period, the minimum half clock period is set by the longest operation time. There are two speed bottlenecks in both the conventional and proposed topologies. One of them is the operation time of the last-stage MUX at speed X, which depends on the delay  $T_{cqM}$ , and the other is in the previous stage at speed 1/2X. The delay from the MUX at speed 1/2X to the succeeding latch is the longest in the conventional topology. Define  $T_{MaL}$  as the clock-to-output delay of the cascaded MUX and latch. The RC delay approximation of  $T_{MaL}$  is

$$\begin{aligned} T_{MaL} &\approx R_{OUT} \times (2 \times C_p + C_f) + R_{OUT} \times (2 \times C_p + N \times C_f) \\ &= R_{OUT} \times (4 \times C_p + (N+1) \times C_f). \end{aligned} \tag{5}$$



Fig. 13. (a) Theoretical delay and (b) simulated delay of the speed bottleneck of proposed serializer topology compared with MUX, latch, and the chain of MUX and latch.

For a MUX-FF, the longest propagation time is from the rising edge of P0/P1 to the output data switching,  $T_{cqML}$ . The simulation environment is shown in Fig. 12. The test element can be a latch, a MUX, a MUX followed by a latch, or a MUX-Latch. The output of a test element is loaded with N fan-out capacitances  $(C_f)$ . The input is driven by an ideal signal source followed by a CML buffer. Fig. 13 shows the theoretical delay and the simulated delay of these test elements, and their delay relations are as follows:

$$T_{cqM} \approx T_{dqL} < \underbrace{T_{cqML}}_{\text{proposed}} < \underbrace{T_{MaL}}_{\text{conventional}}. \tag{6}$$

The simulations in Fig. 13(b) include the complete RC model of each transistor, so the impact of the feedback capacitance is included in the results. Fig. 13(a) shows the theoretical results based on the delay estimations in (3)–(5). The simulation results are similar to the results of the theoretical approximations. The delay of the proposed MUX-Latch is about 72% of that of the chain of MUX and latch in the simulations versus 68% in the estimations. In Fig. 13(a), the delay of the latch  $T_{dqL}$  is assumed to be the same as the MUX delay  $T_{cqM}$ .
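The RC approximations in (3)–(5) and the speed-up ratio in (1) are straightforward to evaluate numerically. The sketch below does so with normalized, hypothetical component values (all set to 1), so it illustrates the ordering in (6) only and is not expected to reproduce the 68% and 72% figures, which depend on the actual device parameters. The function names are hypothetical.

```python
def t_mux_or_latch(r_out: float, c_p: float, c_f: float, n: float) -> float:
    """Delay of a MUX or a latch (two-component circuit), as in (3)."""
    return r_out * (2 * c_p + n * c_f)


def t_mux_latch(r_out: float, c_p: float, c_f: float, n: float) -> float:
    """Delay of a MUX-Latch (three-component circuit), as in (4)."""
    return r_out * (3 * c_p + n * c_f)


def t_mux_then_latch(r_out: float, c_p: float, c_f: float, n: float) -> float:
    """Delay of a MUX driving one latch that drives N fan-outs, as in (5)."""
    return r_out * (2 * c_p + c_f) + r_out * (2 * c_p + n * c_f)


def speed_up(r_out: float, c_p: float, c_f: float, n: float) -> float:
    """Speed-up ratio (1), using (3) for both T_cqM and T_dqL."""
    conventional = 2 * t_mux_or_latch(r_out, c_p, c_f, n)
    return conventional / t_mux_latch(r_out, c_p, c_f, n)


if __name__ == "__main__":
    # Normalized, hypothetical values chosen only for illustration.
    R_OUT, C_P, C_F = 1.0, 1.0, 1.0
    for fanout in (1, 2, 3, 4):
        ratio = t_mux_latch(R_OUT, C_P, C_F, fanout) / t_mux_then_latch(R_OUT, C_P, C_F, fanout)
        print(f"N={fanout}: T_cqML / T_MaL = {ratio:.2f}")
```

For any positive $C_p$ and $C_f$, these expressions satisfy the ordering in (6), since (4) exceeds (3) by $R_{OUT} \times C_p$ and (5) exceeds (4) by $R_{OUT} \times (C_p + C_f)$.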

The simulation result shows that the MUX-Latch relieves the bottleneck of the stage at 1/2X speed because  $T_{cqML}$  is less than  $T_{MaL}$ . The delay of the proposed MUX-Latch ( $T_{cqML}$ ) is about 72% of that of the chain of MUX and latch ( $T_{MaL}$ ). With a 2-to-1 MUX at the last stage of the proposed serializer with MUX-FFs, the operation speed remains the same as that of the conventional one. The gate-count estimations in Fig. 11 show that using the MUX-FFs saves over 52% of the gate count. As a result, adopting the proposed topology with MUX-FFs reduces the gate count significantly without slowing down the operation speed.

## V. EXPERIMENT RESULTS

To verify the functions of both the 4-to-1 MUX-FF and the proposed 8-to-1 serializer with MUX-FFs, two chips are implemented. Chip-1 contains a 4-to-1 MUX-FF, two clock dividers, and two CPGs. Fig. 14 shows the chip photograph of Chip-1. The size of Chip-1 is  $0.979 \times 0.950 \text{ mm}^2$  including pads. The MUX-FF and the circuits for clock signals occupy  $0.1 \text{ mm}^2$  and  $0.105 \text{ mm}^2$ , respectively. The total power consumption of Chip-1 is 170 mW including the output buffer. The 4-to-1 MUX-FF can operate at up to 6 Gbits/s, with a 6 GHz clock and four 1.5 Gbits/s PRBS data inputs. The measured eye-diagram of Chip-1 at 6 Gbits/s is shown in Fig. 15, monitored by the oscilloscope Agilent DSO80404B. Because the propagation paths of the serializer are unbalanced, the jitter depends on the propagation path and takes two values. The measured peak-to-peak jitter



Fig. 14. Chip photograph of Chip-1 (4-to-1 MUX-FF).

is 10 ps/14 ps. The bit-error-rate (BER) test is performed with the four input data:

$$\begin{array}{l} A:2^{31}-1\ PRBS \quad (tested\ sequence)\\ B:2^{23}-1\ PRBS\\ C:2^{15}-1\ PRBS\\ D:2^{14}-1\ PRBS\end{array}$$

which are produced by the multichannel signal generator Agilent E8403A. The tested sequence is provided to the tested input, and the error detector Anritsu MP1764C reads out the tested sequence from the serialized output data (...A, B, C, D, A...). In the BER test, Chip-1 is error-free (with  $BER < 10^{-12}$ ) for all four inputs with operation speed 6 Gbits/s.

Chip-2 contains the whole scheme in Fig. 10, which is composed of the proposed 8-to-1 serializer with MUX-FFs and the same clock signal circuits as in Chip-1. Fig. 16 shows the chip photograph of Chip-2. The size of Chip-2 is  $0.812 \times 0.825 \text{ mm}^2$  including pads. The areas of the serializer core and the clock signal circuits are  $0.12 \text{ mm}^2$  and  $0.06 \text{ mm}^2$ , respectively. The area overhead of the clock signal circuits is smaller than that in Chip-1 because the layout in Chip-2 is more compact than that in Chip-1. The power consumption of Chip-2 is 308 mW including the output buffer. Fig. 17 shows the eye-diagrams at 10 Gbits/s and 12 Gbits/s, monitored by the oscilloscope Agilent DCA-J 86110C. Chip-2 can operate at up to 12 Gbits/s, with a 6 GHz clock input and eight 1.5 Gbits/s PRBS data inputs. Similar to Chip-1, Chip-2 is error-free (with  $BER < 10^{-12}$ ) in the BER test for all eight PRBS inputs at both 10 Gbits/s and 12 Gbits/s. The BERs of both designs are below  $10^{-12}$ , which meets the serial link transmission requirement.



Fig. 15. Measured eye-diagram of Chip-1 (4-to-1 MUX-FF) at 6 Gbits/s.



Fig. 16. Chip photograph of Chip-2 (proposed 8-to-1 serializer with MUX-FFs).

Fig. 17(b) shows a voltage shift of the logic-0 level in the eye diagram. The eye distortion is due to the asymmetric charging and discharging paths at the output. For the 4-to-1 MUX-FF in Fig. 14 and the 8-to-1 serializer in Fig. 16, CML output buffers are added after the CML-Latch and the 2-to-1 MUX in the last stage to enlarge the output current. The charging path runs from the voltage source through the resistor, and the discharging path runs through the NMOSs to ground. The impedance of the discharging path is larger than that of the charging path in our design. The imbalance between the charging and discharging paths of the CML buffers causes asymmetric rising and falling delays. During rapid switching, the large parasitic impedance in the discharging path prevents the voltage from reaching the logic-0 level. The top of Fig. 17(b) shows the postsimulation output waveform of the 8-to-1 MUX at 12 Gbits/s. It shows a long discharging time, so the output voltage does not reach the logic-0 level. This phenomenon becomes more pronounced in the measured eye diagram due to the intersymbol interference caused by transmission impairments. We mount our 8-to-1 MUX chip on an evaluation board. The inputs and outputs of the 8-to-1 MUX are brought out to cables with SMA connectors. The measured eye diagram at the bottom of Fig. 17(b) shows the combined effect of the loss of the chip module, evaluation board, and trace lines and the intersymbol interference.

The performance summary of Chip-1 and Chip-2 and comparisons with other implementations are shown in Table III. Chip-1 and Chip-2 are fabricated in a TSMC 90 nm 9-metal CMOS process. The chips are measured on PCBs with gold wire bonding. Compared with the 4-to-1 MUX in [6], which is implemented in the same technology, our design saves about 36% of the core area. In the 8-to-1 MUX design [13], only the last-stage MUX adopts the CML architecture and the other MUXes use the CMOS structure. Therefore, its area and power are smaller.

Fig. 17. Chip-2 (proposed 8-to-1 serializer with MUX-FFs). (a) Measured eye-diagram at 10 Gbits/s and (b) postsimulated and measured eye-diagrams at 12 Gbits/s.

Apart from the conventional topology, five more serializer implementations are compared with Chip-1 and Chip-2. The tree-topology based multiplexer [2] employs a multiphase low-frequency clock, so its power consumption is lower than that of the other approaches. Assuming a factor-of-three area reduction when scaling the 8-to-1 MUX design in [2] from 180 nm to 90 nm, the area saving of our design is around 44%. J.-Y. Song and O.-K. Kwon [17] propose an 8-to-1 multiplexer that uses a pseudo-nMOS configuration with one-stacked switches and reduces the short-circuit current of the gate driver in the multiplexer. Since the pseudo-nMOS architecture is used, its area is the smallest. On-chip inductive peaking is adopted in [18] to implement a 32-to-1 transmitter with more than 400 on-chip inductors and transformers in order to achieve the bandwidth required for the 38.4 Gbits/s operation demonstrated in a 130 nm CMOS process. When a 32-to-1 transmitter is implemented with one of our 8-to-1 and eight of our 4-to-1 multiplexers, the

|                    | Chip-1           | Chip-2                 | Tree MUX<br>[2]     | Serial Link<br>[6]     | MUX<br>[13]           | MUX<br>[17]                    | TX<br>[18]       | TXRX<br>[19], [20]    | TXRX<br>[21]       |
|--------------------|------------------|------------------------|---------------------|------------------------|-----------------------|--------------------------------|------------------|-----------------------|--------------------|
| Technology         | 90 nm            | 90 nm                  | 180 nm              | 90 nm                  | 180 nm                | 180 nm                         | 130 nm           | 65 nm                 | 65 nm              |
| Function           | 4-to-1<br>MUX-FF | 8-to-1 with<br>MUX-FFs | 8-to-1<br>MUX       | 4-to-1<br>MUX          | 8-to-1<br>MUX         | 8-to-1<br>MUX                  | 32-to-1<br>MUX   | 16-to-1<br>MUX        | 16-to-1<br>MUX     |
| Power<br>(mW)      | 170              | 308                    | 30.06<br>(at 5Gbps) | 182                    | 126                   | 56.92                          | 2700             | 635                   | 1600               |
| Core area $(mm^2)$ | †0.1             | †0.12                  | 0.639               | †0.35×0.45<br>(0.1575) | 0.45×0.25<br>(0.1125) | $0.28 \times 0.18$<br>(0.0504) | †2.5×3.6<br>*(9) | †4.9× 5.2<br>*(25.48) | †1.3×0.8<br>(1.04) |
| Clock frequency    | 6 GHz            | 6 GHz                  | 0.875 GHz           | 11.7 GHz               | 5.1 GHz               | 10 GHz                         | 19.4 GHz         | 44.6 GHz              | 22.6 GHz           |
| Output data rate   | 6 Gbits/s        | 12 Gbits/s             | 7 Gbits/s           | 11.7 Gbits/s           | 10.2 Gbits/s          | 10 Gbits/s                     | 38.4 Gbits/s     | 44.6 Gbits/s          | 44.6 Gbits/s       |
| RMS Jitter<br>(ps) | 4.39 / 7.47      | 2.26 / 4.66            | 7.2<br>(at 7Gbps)   | NA                     | NA                    | 6                              | 1.53             | 0.9                   | NA                 |
| p-p Jitter<br>(ps) | 32 / 40          | 28 / 30                | 42.11<br>(at 7Gbps) | NA                     | 28                    | 42                             | 8.1              | 4.7                   | NA                 |
| BER                | $< 10^{-12}$     | $< 10^{-12}$           | NA                  | $< 10^{-13}$           | $< 10^{-9}$           | NA                             | NA               | $< 10^{-12}$          | $< 10^{-11}$       |

TABLE III Performance Summary and Comparisons

†: CMOS CML-based MUX; \*: total chip area.

serializer core is $0.92 \text{ mm}^2$, which is significantly smaller than the area in [18]. A multidata rate 16-to-1 CMOS transmitter is implemented in a 65 nm CMOS technology [19], [20]. The transmitter chip provides reasonable jitter performance with a 40 GHz full-rate clock architecture that alleviates pattern-dependent jitter and eliminates duty cycle dependence. Nikola Nedovic *et al.* present a 40 Gbits/s SerDes using standard CMOS with less inductance to reduce the power [21]. Using a minimal number of inductors and operating at half-rate or quarter-rate helps to reduce the design effort and power consumption. When a 16-to-1 transmitter is implemented with five of our 4-to-1 multiplexers, the serializer core is $0.5 \text{ mm}^2$, which is significantly smaller than the areas in [19]–[21]. Our area is smaller than those of the other CML-based implementations marked in Table III.
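The area figures quoted in this comparison can be reproduced from the core areas in Table III. The short sketch below does so; the pairing of [2] with Chip-2 (both 8-to-1) and of [6] with Chip-1 (both 4-to-1) is inferred from the quoted percentages, and the variable names are illustrative.

```python
CHIP1_CORE_MM2 = 0.10   # 4-to-1 MUX-FF core area (Table III)
CHIP2_CORE_MM2 = 0.12   # 8-to-1 serializer core area (Table III)

def area_saving(ours_mm2, reference_mm2):
    """Fractional area saved relative to a reference design."""
    return 1.0 - ours_mm2 / reference_mm2

# 4-to-1 MUX of [6], same 90 nm node: 0.35 mm x 0.45 mm core.
saving_vs_ref6 = area_saving(CHIP1_CORE_MM2, 0.35 * 0.45)

# 8-to-1 MUX of [2], scaled from 180 nm to 90 nm by the assumed factor of 3.
scaled_ref2_mm2 = 0.639 / 3.0
saving_vs_ref2 = area_saving(CHIP2_CORE_MM2, scaled_ref2_mm2)

# Composite serializers built from the two measured cores.
area_32to1_mm2 = CHIP2_CORE_MM2 + 8 * CHIP1_CORE_MM2   # one 8-to-1 + eight 4-to-1
area_16to1_mm2 = 5 * CHIP1_CORE_MM2                    # five 4-to-1

print(f"saving vs [6]: {saving_vs_ref6:.1%}")
print(f"saving vs scaled [2]: {saving_vs_ref2:.1%}")
print(f"32-to-1 core: {area_32to1_mm2:.2f} mm^2, 16-to-1 core: {area_16to1_mm2:.2f} mm^2")
```

The composite areas count only the serializer cores; the clock circuits, pads, and routing between blocks are not included, so they are lower bounds on the area of a complete transmitter.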

As for the power consumption, the power efficiency per bit of Chip-1 and Chip-2 is lower than that of most of the other designs in Table III. The reason is as follows. All the circuit components implemented in this paper, such as the MUXs, latches, and MUX-Latches, are designed to have the same driving strength. The current sources in the MUX-Latches and latches carry the same current even though these gates operate at different frequencies. In most other works, the gates are sized according to their operating speed. The sizing of our design can be further optimized to reduce the power and enhance the speed.
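To make the power comparison concrete, the energy per bit of each design can be computed from the power and data-rate rows of Table III. The sketch below is a rough comparison only: the reported powers do not all cover the same blocks (for example, ours include the output buffer, and [19]–[21] are transceiver chipsets), and the power of [2] is quoted at 5 Gbits/s, so that rate is used for it.

```python
# (power in mW, data rate in Gbit/s), taken from Table III.
designs = {
    "Chip-1": (170.0, 6.0),
    "Chip-2": (308.0, 12.0),
    "[2]": (30.06, 5.0),        # power is quoted at 5 Gbit/s
    "[6]": (182.0, 11.7),
    "[13]": (126.0, 10.2),
    "[17]": (56.92, 10.0),
    "[18]": (2700.0, 38.4),
    "[19],[20]": (635.0, 44.6),
    "[21]": (1600.0, 44.6),
}

def energy_per_bit_pj(power_mw, rate_gbps):
    """mW / (Gbit/s) = 1e-3 W / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / rate_gbps

energy = {name: energy_per_bit_pj(p, r) for name, (p, r) in designs.items()}

# List the designs from most to least energy-efficient.
for name, pj in sorted(energy.items(), key=lambda item: item[1]):
    print(f"{name:10s} {pj:6.1f} pJ/bit")
```

Chip-1 and Chip-2 come out at roughly 28.3 pJ/bit and 25.7 pJ/bit, respectively, which is higher than the designs in [2], [6], [13], [17], and [19], [20].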

The jitter of our designs is higher than that of the others. In the conventional design shown in Fig. 9, the last-stage MUX samples the data from the last two latches. Those three circuits are controlled by the same clock and inverse clock signals (CLK and CLKB). In contrast, as shown in Fig. 10, the last-stage MUX of the proposed design samples data from a MUX-Latch and a latch. In addition to the clock signal (CLK), the MUX-Latch is controlled by the signals P0/P1, which are derived from the clock (CLK). There are delays between P0/P1 and CLK. As mentioned in Section IV-D, the parasitic capacitance at the output node of a MUX-Latch is larger than that of a latch. The data at the MUX-Latch output therefore has a smaller voltage swing and slower transitions than the data at the latch output. This imbalance between the input channels of the last-stage MUX results in the multivalue jitter, and the larger value is caused by the input from the MUX-Latch.

## VI. CONCLUSION

In this paper, a pipeline topology with MUX-FFs for serializers is proposed. The MUX-FF is composed of cascaded latches and MUX-Latches. With the MUX-FFs, the gate count of the serializer can be reduced by removing flip-flops from the conventional pipeline. The gate-count ratio of the proposed 8-to-1 serializer with MUX-FFs to a conventional 8-to-1 serializer is 0.48, that is, a 52% reduction. The proposed 4-to-1 MUX-FF and 8-to-1 serializer with MUX-FFs are implemented in two chips in the TSMC 90 nm CMOS process. They are verified error-free at 6 Gbits/s and 12 Gbits/s and consume 170 mW and 308 mW, respectively.

#### REFERENCES

- [1] M. Alioto and G. Palumbo, "Interconnect-aware design of fast large fan-in CMOS multiplexers," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 54, no. 6, pp. 484–488, Jun. 2007.
- [2] H. Lu, C. Su, and C.-N. J. Liu, "A tree-topology multiplexer for multiphase clock system," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 1, pp. 124–131, Jan. 2009.
- [3] T. Nakura, K. Ueda, K. Kubo, Y. Matsuda, K. Mashiko, and T. Yoshihara, "A 3.6-Gb/s 340-mW 16:1 pipe-lined multiplexer using 0.18 μm SOI-CMOS technology," *IEEE J. Solid-State Circuits*, vol. 35, no. 5, pp. 751–756, May 2000.
- [4] A. Shinmyo, M. Hashimoto, and H. Onodera, "Design and measurement of 6.4 Gbps 8:1 multiplexer in 0.18 μm CMOS process," in *Proc. IEEE Asia South Pacific Design Autom. Conf. (ASP-DAC)*, Jan. 2005, vol. 2, pp. D9–D10.
- [5] D. Tondo and R. Lopez, "A low-power, high-speed CMOS/CML 16:1 serializer," in *Proc. Argentine School Micro-Nanoelectron., Technol. Appl. (EAMTA)*, Oct. 2009, pp. 81–86.
- [6] S. Rylov, S. Reynolds, D. Storaska, B. Floyd, M. Kapur, T. Zwick, S. Gowda, and M. Sorna, "10+ Gb/s 90-nm CMOS serial link demo in CBGA package," *IEEE J. Solid-State Circuits*, vol. 40, no. 9, pp. 1987–1991, Sep. 2005.
- [7] M. Meghelli, A. Rylyakov, and L. Shan, "50-Gb/s SiGe BiCMOS 4:1 multiplexer and 1:4 demultiplexer for serial communication systems," *IEEE J. Solid-State Circuits*, vol. 37, no. 12, pp. 1790–1794, Dec. 2002.
- [8] T. Suzuki, T. Takahashi, K. Makiyama, K. Sawada, Y. Nakasha, T. Hirose, and M. Takikawa, "Under 0.5 W 50 Gb/s full-rate 4:1 MUX and 1:4 DEMUX in 0.13 μm InP HEMT technology," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, Feb. 2004, vol. 1, pp. 234–235.
- [9] T. Suzuki, Y. Kawano, Y. Nakasha, S. Yamaura, T. Takahashi, K. Makiyama, and T. Hirose, "A 50-Gbit/s 450-mW full-rate 4:1 multiplexer with multiphase clock architecture in 0.13-µm InP HEMT technology," *IEEE J. Solid-State Circuits*, vol. 42, no. 3, pp. 637–646, Mar. 2007.
- [10] D. Kehrer, H.-D. Wohlmuth, M. Wurzer, and H. Knapp, "50 Gbit/s 2:1 multiplexer in 0.13 μm CMOS technology," *Electron. Lett.*, vol. 40, no. 2, pp. 100–101, Jan. 2004.
- [11] J.-C. Chien and L.-H. Lu, "A 15-Gb/s 2:1 multiplexer in 0.18-μm CMOS," *IEEE Microw. Wireless Compon. Lett.*, vol. 16, no. 10, pp. 558–560, Oct. 2006.
- [12] A. Yazdi and M. Green, "A 40 Gb/s full-rate 2:1 MUX in 0.18 μm CMOS," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, Feb. 2009, pp. 362–363.

- [13] A. Tanabe, M. Umetani, I. Fujiwara, T. Ogura, K. Kataoka, M. Okihara, H. Sakuraba, T. Endoh, and F. Masuoka, "0.18-µm CMOS 10-Gb/s multiplexer/demultiplexer ICs using current mode logic with tolerance to threshold voltage fluctuation," *IEEE J. Solid-State Circuits*, vol. 36, no. 6, pp. 988–996, Jun. 2001.
- [14] W.-Y. Tsai, C.-T. Chiu, J.-M. Wu, S.-H. Hsu, and Y.-S. Hsu, "A novel MUX-FF circuit for low power and high speed serial link interfaces," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Jun. 2010, pp. 4305–4308.
- [15] A. Emami-Neyestanak, A. Varzaghani, J. Bulzacchelli, A. Rylyakov, C.-K. Yang, and D. Friedman, "A 6.0-mW 10.0-Gb/s receiver with switched-capacitor summation DFE," *IEEE J. Solid-State Circuits*, vol. 42, no. 4, pp. 889–896, Apr. 2007.
- [16] W.-Y. Tsai, C.-T. Chiu, J.-M. Wu, S. S. Hsu, and Y.-S. Hsu, "A novel low gate-count pipeline topology with multiplexer-flip-flops for serial link," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Jun. 2012.
- [17] J.-Y. Song and O.-K. Kwon, "Low-power 10-Gb/s transmitter for high-speed graphic DRAMs using 0.18-µm CMOS technology," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 58, no. 12, pp. 921–925, Dec. 2011.
- [18] J. Kim, J.-K. Kim, B.-J. Lee, and D.-K. Jeong, "Design optimization of on-chip inductive peaking structures for 0.13-µm CMOS 40-Gb/s transmitter circuits," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 12, pp. 2544–2555, Dec. 2009.
- [19] S. Kaeriyama, Y. Amamiya, H. Noguchi, Z. Yamazaki, T. Yamase, K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H. Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka, A. Tanabe, S. Fujita, and N. Kawahara, "A 40 Gb/s multi-data-rate CMOS transmitter and receiver chipset with SFI-5 interface for optical transmission systems," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3568–3579, Dec. 2009.
- [20] Y. Amamiya, S. Kaeriyama, H. Noguchi, Z. Yamazaki, T. Yamase, K. Hosoya, M. Okamoto, S. Tomari, H. Yamaguchi, H. Shoda, H. Ikeda, S. Tanaka, T. Takahashi, R. Ohhira, A. Noda, K. Hijioka, A. Tanabe, S. Fujita, and N. Kawahara, "A 40 Gb/s multi-data-rate CMOS transceiver chipset with SFI-5 interface for optical transmission systems," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC)*, Feb. 2009, pp. 358–359.
- [21] N. Nedovic, A. Kristensson, S. Parikh, S. Reddy, S. McLeod, N. Tzartzanis, K. Kanda, T. Yamamoto, S. Matsubara, M. Kibune, Y. Doi, S. Ide, Y. Tsunoda, T. Yamabana, T. Shibasaki, Y. Tomita, T. Hamada, M. Sugawara, T. Ikeuchi, N. Kuwata, H. Tamura, J. Ogawa, and W. Walker, "A 3 Watt 39.8–44.6 Gb/s dual-mode SFI5.2 SerDes chip set in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 45, no. 10, pp. 2016–2029, Oct. 2010.



**Wei-Yu Tsai** received the B.S. degree in electrical control engineering from National Chiao Tung University, Taiwan, in 2009, and the M.S. degree in electrical engineering from National Tsing Hua University, Taiwan, in 2011.

His research interests include high-speed SerDes design and analog IC design.



**Ching-Te Chiu** received the B.S. and M.S. degrees from National Taiwan University, Taipei, in 1986 and 1988, respectively, and the Ph.D. degree from University of Maryland, College Park, in 1992, all in electrical engineering.

She was an Associate Professor with National Chung Cheng University, Chia-Yi, Taiwan, from 1993 to 1994. From 1994 to 1996, she was a Member of Technical Staff with AT&T, Murray Hill, NJ; from 1996 to 2000, she was with Lucent Technologies, Murray Hill; and from 2000 to 2003, she was with Agere Systems, Santa Clara, CA. In 2004, she joined the Computer Science Department and Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan, as an Associate Professor. Her research interests include high-speed SerDes design, multichip interconnect, fault tolerance for network-on-chip, high dynamic range image and video processing, high definition television video decoder chip design, and SONET/SDH mapper and framer IC design.

Dr. Chiu won the first prize award, the best advisor award, and the best innovation award of the Golden Silicon Award in 2006. She serves as a TC member of the IEEE Circuits and Systems Society, Nanoelectronics and Gigascale Systems Group, and the IEEE Signal Processing Society, Design and Implementation of Signal Processing Systems group. She is the program chair of the first IEEE Signal Processing Society Summer School at Hsinchu, 2011.



**Jen-Ming Wu** received the B.S. degree from National Taiwan University, Taipei, in 1988, the M.S. degree from Polytechnic Institute of New York University, Brooklyn, in 1991, and the Ph.D. degree from University of Southern California, Los Angeles, in 1998, all in electrical engineering.

From 1998 to 2003, he was with Sun Microsystems Inc., Sunnyvale, CA, as a Senior Member of Technical Staff. Since 2003, he has been on the faculty of the Institute of Communications Engineering, Department of Electrical Engineering, National Tsing Hua University, Taiwan, where he currently holds the position of Associate Professor. He has worked in various fields of electrical engineering, including signal processing for communications, wireless communication transceiver IC design, high-speed chip-to-chip interface IC design, and microprocessor architecture, and has published more than 50 technical papers in IEEE journals and conferences. Currently, his research focuses on high-speed interface technologies, energy-efficient communication systems, MIMO signal processing, and wireless applications for eHealth monitoring.



**Shawn S. H. Hsu** (M'04) was born in Tainan, Taiwan. He received the B.S. degree from National Tsing Hua University, Hsinchu, Taiwan, in 1992, and the M.S. and the Ph.D. degrees from the University of Michigan, Ann Arbor, in 1997 and 2003, respectively.

He is currently a Professor with the Department of Electrical Engineering and Institute of Electronics Engineering, National Tsing Hua University. He serves as a technical program committee member for SSDM (2008–2011) and IEEE A-SSCC (2008–present). His current research interests include the design of MMICs and RFICs using Si/III–V-based devices for low-noise, high-linearity, and high-efficiency system-on-chip (SOC) applications. He is also involved with the design and modeling of high-frequency transistors and interconnects.

Prof. Hsu was the recipient of the Junior Faculty Research Award of National Tsing Hua University in 2007 and the Outstanding Young Electrical Engineer Award of the Chinese Institute of Electrical Engineering in 2009.



**Yar-Sun Hsu** received the B.S. and M.S. degrees in electronics engineering from National Chiao Tung University, Taiwan, and the Ph.D. degree from Rensselaer Polytechnic Institute, Troy, NY.

He was employed by General Electric Company in New York State for three years before joining IBM T. J. Watson Research Center, Yorktown Heights, NY, as a Research Staff Member. Since then, he has been involved in research on computer architecture, parallel and distributed systems, parallel file systems, interconnection networks, and VLSI design. In 1988, he became the manager of a system department involved in the research and design of the IBM Scalable Power Parallel System, the base machine used for IBM Deep Blue. In addition, he led his group working on cache coherence protocols for multiprocessor systems, performance evaluation and visualization for scalable parallel systems, and scalable parallel I/O. In 2002, he joined the Department of Electrical Engineering, National Tsing Hua University, Taiwan, as a Professor.

Dr. Hsu received one IBM Outstanding Technical Achievement Award, three IBM invention plateau awards, two IBM supplemental invention awards for top-rated patents, and three IBM Research Division technical achievement awards. He also received the best system paper award from the ACM SIGMETRICS Conference in 2000, the best paper award from the International Computer Symposium in 2004, and two outstanding teaching awards from National Tsing Hua University in 2006 and 2009.