Power Management Strategies for Serial RapidIO Endpoints in FPGAs

Moritz Schmid, Frank Hannig, and Jürgen Teich

Abstract—We propose a novel data budget-based approach to dynamically control the average power consumption of Serial RapidIO endpoint controllers in FPGAs. The key concept of the approach is to not only perform clock-gating on the FPGA-integrated components of the communication controller, but to disable the multi-gigabit transceivers during idle periods. The clock synchronization, inherent to serial interfaces, enables us to omit the often needed periodic link sensing, and only enable the controller according to a predefined schedule to transmit the allocated amount of data during a specific interval. Following this approach, we are able to reduce the dynamic power consumption by up to 77% on average.

I. INTRODUCTION

The growing demand for high-performance computing has created a need for very fast interconnects in embedded systems. Digital signal processing in particular requires computing platforms that deliver high performance as well as predictable system behavior to guarantee certain service requirements. It has become common to design custom computing platforms composed of heterogeneous hardware components connected by a serial interconnect. Examples of such computing platforms can be found in medical image processing or in surface radar processing stations [1], [2]. A very popular interconnect capable of fulfilling the requirements of such systems is the Serial RapidIO (SRIO) communication standard [3]. Although some research has been conducted on the abilities of SRIO as a high-speed interconnect, to the best of our knowledge, none of these studies has focused on the power requirements of the protocol. As opposed to large-scale packet-switched networks, data networks in embedded systems are commonly overprovisioned and provide much higher data rates than are actually necessary for the operation of the system. As a result, the network controller alternates between periods of data transmission and idle listening. Such periods of idle listening are the main source of energy waste in communication controllers and can be exploited to save energy by disabling the controller for as long as possible.

In this work, we propose a novel budget-based power-management scheme for SRIO endpoints in FPGAs to exploit the clock synchronizing characteristics of serial interconnects and available a priori knowledge of the expected communication rates to schedule active and idle periods. We define a data rate budget to dynamically control the activity schedule of SRIO endpoints to lower the average power consumption of the interconnect.

II. RELATED WORK

The RapidIO Interconnect Specification, released by the RapidIO Trade Association, is an open standard developed to achieve high-performance, low-cost, reliable and scalable system connectivity in embedded systems, networking applications and communication devices. Several works are concerned with the applicability of SRIO to high-performance embedded systems, such as novel imaging systems [4] or spatial radar applications [1]. However, to the best of our knowledge, no research has yet been conducted on using SRIO in a low-power communication scheme. In contrast, most research on low-power communication schemes and protocols has been conducted in the areas of wireless sensor networks and mobile and automotive communication, since these environments often have to cope with a limited energy supply. A key concept in power-aware communication is the introduction of idle periods in which no communication happens, so that the transceiver can be turned off or at least be put into a low-power state [5]. Unterassinger et al., for example, design a power management unit for ultra-low-power wireless sensor networks that is based on multiple power modes for the transceivers [6]. El-Hoiydi et al. design a novel medium access protocol based on polling for downlink data streaming and compare it to other protocols [7]. Their key findings include that minimizing idle listening as well as overhearing may become the main source of energy reduction in communication. In the area of wire-bound communication, a more complex approach based on PCI-Express is presented in [8]. The authors use appropriate line speeds, lane configurations and moreover different power...

III. BACKGROUND

A. Serial RapidIO

The RapidIO specification defines a layered architecture consisting of the logical, transport and physical layers. The logical layer is responsible for the implementation of the transaction concept, providing support for memory operations, atomic operations, unaligned memory transfers, globally shared memory and message passing. Routes for traversing the nodes of the fabric are provided by the transport layer. The physical layer is specified for both a parallel and a serial standard. The serial standard, also referred to as Serial RapidIO, can operate at up to 16 serial duplex links with a maximum line rate of 6.25 Gbaud. Furthermore, the physical layer is responsible for receiving and transmitting packets and for providing transmitter- or receiver-based flow control. The implementation assures packet delivery, since packets remain the responsibility of the transmitter until the receiver has acknowledged their acceptance. In case of insufficient resources at the receiver, the transmitter must retry the packet. After an implementation-dependent number of retries, the protocol defines the recovery from erroneous transmissions. To ensure that latency-sensitive traffic is not delayed by larger packets, the maximum payload for SRIO packets is 256 bytes at an overhead of about 20 bytes, depending on the transaction type. To conduct our studies, we have used the Xilinx SRIO IP core, which will be introduced next.
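As a rough illustration of the payload and overhead figures quoted above, the following sketch computes the share of transmitted bytes that carry user data. The function name and the 8-byte comparison packet are illustrative assumptions; only the 256-byte maximum payload and the roughly 20-byte overhead come from the text, and the exact overhead depends on the transaction type.

```python
def payload_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Return the fraction of transmitted bytes that carry user payload."""
    if payload_bytes < 0 or overhead_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    total = payload_bytes + overhead_bytes
    return payload_bytes / total if total else 0.0


# Maximum-size packet from the text: 256 bytes of payload, about 20 bytes of overhead.
full = payload_efficiency(256, 20)
# Hypothetical small packet with the same assumed overhead, for contrast.
small = payload_efficiency(8, 20)
print(f"256-byte payload: {full:.1%}, 8-byte payload: {small:.1%}")
```

Under these assumptions, a maximum-size packet spends a little over 7% of its bytes on overhead, whereas small packets are dominated by it.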

B. Xilinx SerialRapidIO FPGA IP Core

For use with their range of FPGAs, Xilinx offers two IP cores to implement an SRIO FPGA endpoint. The logical and transport layers, which comprise the logical layer core (LOGIO), are separated from the physical layer core (PHY). In this work, we have used the IP cores in version 5.6, which implements version 2.1 of the RapidIO specification [14]. On top of the Xilinx SRIO IP core, we have implemented a power management unit (PMU) to selectively disable the SRIO endpoint. The individual components of the SRIO architecture are traversed by data in sequence, according to the direction of the transmission. For example, in case of an egress transmission, a packet must first traverse the logical layer before it is held in the buffer between the LOGIO and the PHY until the PHY has successfully transmitted it to the next SRIO device. This is crucial for the implementation of the PMU, which includes clock gating of the IP core components in the order of the traversal. The multi-gigabit transceivers (MGTs) are another part of the SRIO endpoint that can be disabled in idle periods. On a Virtex-6 LXT, the serial GTX MGTs [15] are organized in clusters of four transceivers, sharing two differential reference clocks. Two analog power supplies are used, MGTAVCC and MGTAVTT. MGTAVCC is responsible for the internal analog circuits, including the clock generation circuitry. The GTX supports several power-down modes to facilitate the implementation of a generic power control. It is possible to power down both the transmit and the receive port of the module. The endpoint is functional if all three indicators are asserted.

IV. POWER MANAGEMENT STRATEGIES FOR SRIO

A. Motivation

In Figure 1, we depict the communication channel utilization over time in a fixed bandwidth communication scenario, where the data rate requirements are lower than the actual available data rate. The channel alternates between active periods \( D_{\text{active}} \) of data transmission and idle periods \( D_{\text{idle}} \), where the network adapter does not transmit any user data. Although there is no payload being transmitted during the idle periods, the link is highly active to keep the communication partners synchronized. Hence, such idle periods are prime candidates for optimization of power consumption in serial interconnects. If the link is deactivated during idle periods and reactivated for transmission or reception of new data, the link is not ready right away, but must resynchronize before data can be transmitted. This link synchronization is referred to as the link training delay \( D_{lt} \) and, unfortunately, may take between 100 and 250 ns, depending on the line rate of the interface.

Figure 1. Channel activity at a transmission rate of 1 Gbps at an available data rate of 1 Gbps and 0.5 Gbit of data.
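To make the split between active and idle time concrete, the following minimal sketch divides one scheduling interval into an active part (link training plus transmission) and an idle part. The function name and the 1 s interval are assumptions made for illustration; the 1 Gbps rate and 0.5 Gbit data volume follow the Figure 1 caption, and 250 ns is the upper link training delay quoted in the text.

```python
from typing import Tuple


def split_interval(interval_s: float, data_bits: float,
                   rate_bps: float, link_training_s: float) -> Tuple[float, float]:
    """Return (active, idle) durations in seconds for one interval.

    The active part is the link training delay plus the time needed to
    move data_bits at rate_bps; the idle part is whatever remains.
    """
    if rate_bps <= 0:
        raise ValueError("rate must be positive")
    d_tx = data_bits / rate_bps
    d_active = link_training_s + d_tx
    if d_active > interval_s:
        raise ValueError("data volume does not fit into the interval")
    return d_active, interval_s - d_active


# Assumed 1 s interval, 0.5 Gbit of data, 1 Gbps available rate, 250 ns training.
d_active, d_idle = split_interval(1.0, 0.5e9, 1e9, 250e-9)
print(f"active: {d_active:.9f} s, idle: {d_idle:.9f} s")
```

With these numbers the link training delay is negligible compared to the transmission time; it becomes significant only when the link is woken up very frequently for short transfers.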

Several possibilities exist for the power management of idle periods. A straightforward means of reducing the power consumption of the SRIO endpoint is to increase link utilization by reducing the number of lanes and the maximum lane rate, as depicted in Figure 2. Such an approach is very effective, since it does not suffer from idle periods in which the link is only active for synchronization. However, it imposes a hard limit on the available data rate, so that the link is suitable only for evenly distributed traffic and cannot handle bursty traffic sources.

In contrast, a rather complex strategy is to keep the transceiver initially in the disabled state and only activate it in case of outgoing transmissions or link sensing for incoming communication. For egress data, the situation is very convenient, since the transceiver can be activated as soon as data is available for transmission. Data can then be transferred and the transceiver can be disabled afterwards, which is shown in Figure 3. Such an implementation is especially useful for signal processing applications, where data suffers a high latency due to processing but is delivered at a high throughput. Unfortunately, there exist several problems with such an approach. In Figure 4, we consider the link training delay after enabling the transceiver, which causes additional latency and reduces the link capacity.

Powering down the transceiver during idle periods can reduce the power consumption; however, incoming transmissions cannot be received while it is disabled. If the arrival times of ingress transmissions are not known a priori, the transceiver must be periodically enabled to sense the link for prospective incoming data. Deciding when to wake up and how long to listen is a complex optimization problem, since not only the power consumption of the transceiver but also that of the excess memory necessary to store egress data at the sender until the link partner becomes available must be taken into account.

Complex embedded systems applications often possess the advantage that data streams are known a priori and can therefore be scheduled. Especially if the system must fulfill real-time constraints, the exact scheduling of data streams is a common practice. The advantage of such system behavior over the previously introduced power management strategies is that the transceivers can be activated in fixed intervals, which minimizes idle as well as link sensing periods and can therefore reach a very high level of energy efficiency.

B. Methodology

The process of activating an SRIO communication controller can be subdivided into 5 steps, as depicted in Figure 5.

Figure 5. Transceiver hardware enable states and transitions for SRIO communication events.

These are:

- **SLEEP** – The transceiver lanes as well as the PLL are powered down, the IP core and the user application are clock-gated.
- **MGT_PHY** – The transceivers are powered up and the clock to the physical layer is enabled.
- **LOGIO** – The logical layer is enabled in addition to the MGTs and the PHY.
- **TX/RX** – Either the transmitter or the receiver logic is activated in addition to the MGTs, the PHY and the LOGIO.
- **TRX** – The complete transceiver logic is activated.

The initial state is SLEEP, in which the complete transceiver hardware is disabled. From here, to wake up the transceivers, the next state is MGT_PHY, where the MGTs and the PHY are enabled to start link training. The transceivers will remain in this state for the duration of $D_{lt}$ until the link is synchronized, after which the logical layer can be enabled in state LOGIO. Depending on whether only the transmitter, only the receiver, or both components are needed, the transceivers are then put into the corresponding state TX, RX, or TRX, respectively. The logical layer is enabled in all of these states; however, the logic for transmitting and receiving can be selectively disabled, according to the task to fulfill. Each of the states is associated with a power consumption $P_x$, where $x$ denotes the current state of the hardware:

- $P_S$: Power in state SLEEP
- $P_M$: Power in state MGT_PHY
- $P_L$: Power in state LOGIO
- $P_T$: Power in state TX
- $P_R$: Power in state RX
- $P_{TR}$: Power in state TRX
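The states and the wake-up order described above can be sketched as a small state model. This is only an illustration: the names `State`, `wake_sequence` and `shutdown_sequence` are hypothetical, and the shutdown order is assumed to simply reverse the wake-up order.

```python
from enum import Enum
from typing import List


class State(Enum):
    SLEEP = "SLEEP"      # MGT lanes and PLL powered down, logic clock-gated
    MGT_PHY = "MGT_PHY"  # transceivers and physical layer enabled (link training)
    LOGIO = "LOGIO"      # logical layer enabled in addition
    TX = "TX"            # transmit logic enabled in addition
    RX = "RX"            # receive logic enabled in addition
    TRX = "TRX"          # complete transceiver logic enabled


def wake_sequence(need_tx: bool, need_rx: bool) -> List[State]:
    """Return the states visited when waking up from SLEEP."""
    if not (need_tx or need_rx):
        return [State.SLEEP]  # nothing to do, stay asleep
    if need_tx and need_rx:
        target = State.TRX
    elif need_tx:
        target = State.TX
    else:
        target = State.RX
    return [State.SLEEP, State.MGT_PHY, State.LOGIO, target]


def shutdown_sequence(current: State) -> List[State]:
    """Return the states visited when going back to SLEEP (assumed reverse order)."""
    order = [State.SLEEP, State.MGT_PHY, State.LOGIO]
    if current in order:
        return list(reversed(order[: order.index(current) + 1]))
    return [current, State.LOGIO, State.MGT_PHY, State.SLEEP]


print([s.name for s in wake_sequence(need_tx=True, need_rx=False)])
print([s.name for s in shutdown_sequence(State.TX)])
```

Keeping the sequence in one place makes it straightforward to attach a duration and a power level to every visited state when estimating energy.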

We define the power level $P_S$ during the state SLEEP as the basic power reference level. The change to other states by activating the communication controller elevates the power consumption. We denote $P'_x = P_x - P_S$ as the increase in power consumption in state $x$ compared to the basic level of power consumption. For example, $P'_M = P_M - P_S$ expresses the increase in power consumption if the MGTs and the PHY are activated. In contrast, if no power management is implemented, the communication controller is constantly in the state TRX and its power consumption is elevated by $P'_{TR} = P_{TR} - P_S$. We can achieve an improvement in power consumption if the mean power consumption over a certain period of time $t$, i.e., the time-weighted average of the power levels in the different states, is smaller than the power consumption $P'_{TR}$ of the unmanaged controller over the same period of time. To elaborate on this, we examine the power management strategies presented in the previous section.
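The improvement criterion can be illustrated with a time-weighted average over a sequence of states, compared against the elevation of an endpoint that stays in TRX all the time. The function name and all power and duration values below are made-up placeholders, not measurements.

```python
from typing import Iterable, Tuple


def mean_power_increase(schedule: Iterable[Tuple[float, float]], p_sleep: float) -> float:
    """Time-weighted mean of (P_x - P_S) over (power_W, duration_s) entries."""
    entries = list(schedule)
    total_time = sum(duration for _, duration in entries)
    if total_time <= 0:
        raise ValueError("schedule must cover a positive amount of time")
    energy_above_sleep = sum((power - p_sleep) * duration for power, duration in entries)
    return energy_above_sleep / total_time


# Hypothetical power levels in watts (placeholders only).
P_S, P_M, P_TR = 1.0, 1.2, 1.5

# Hypothetical managed period: 0.6 s SLEEP, 0.1 s MGT_PHY, 0.3 s TRX.
managed = mean_power_increase([(P_S, 0.6), (P_M, 0.1), (P_TR, 0.3)], P_S)
unmanaged = P_TR - P_S  # always in TRX

print(f"managed: {managed:.3f} W above sleep, unmanaged: {unmanaged:.3f} W above sleep")
print("improvement" if managed < unmanaged else "no improvement")
```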

Consider the case when there is no a priori knowledge of the duration of idle periods and data may arrive or must be transmitted at arbitrary points in time. We will first analyze the transmission process, which is depicted in Figure 6. Packets traverse several stages of the endpoint controller, each of which may selectively be enabled or disabled. The first step is to enable the MGTs and the PHY. Until the link is synchronized, the remaining hardware can stay in the disabled state. After synchronization, both the logical layer and the transmit logic are activated in a single step to start the transmission. When the last data packet has been processed by the transmit component, it still has to traverse the logical layer and the PHY, so we disable the components sequentially once activity has ceased. During transmission of the data, we experience an overall higher power consumption, which is due to the higher toggle rate in the individual components. We also depict the power consumption of an unmanaged endpoint controller in Figure 6 for comparison. Here, the power consumption is also elevated during the actual transmission. We can now evaluate the energy savings of this approach as follows: The energy $E_{\text{unmanaged}}$ consumed by the unmanaged endpoint controller is

$$E_{\text{unmanaged}} = P_{TR} \cdot (D_{idle} + D_{lt}) + P_T \cdot D_{tx},$$

whereas the energy $E_{\text{managed}}$ consumed by the managed endpoint is

$$E_{\text{managed}} = P_S \cdot D_{idle} + P_M \cdot D_{lt} + P_T \cdot D_{tx} + P_L \cdot D_{logio} + P_M \cdot D_{phy}.$$

The difference between both is displayed in Figure 6. Obviously, the longer the idle periods can be kept, the more effectively the energy can be reduced.
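The two energy expressions above can be evaluated side by side with a short sketch. The class and function names are illustrative, and every power and duration value is a made-up placeholder chosen only to show how the terms combine; the formulas follow the two equations term by term, with `d_lt` standing for the link training delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Powers:
    p_s: float   # SLEEP
    p_m: float   # MGT_PHY
    p_l: float   # LOGIO
    p_t: float   # TX
    p_tr: float  # TRX


@dataclass(frozen=True)
class Durations:
    d_idle: float   # idle period
    d_lt: float     # link training delay
    d_tx: float     # transmission
    d_logio: float  # logical layer still enabled after transmission
    d_phy: float    # MGTs and PHY still enabled after the logical layer


def energy_unmanaged(p: Powers, d: Durations) -> float:
    return p.p_tr * (d.d_idle + d.d_lt) + p.p_t * d.d_tx


def energy_managed(p: Powers, d: Durations) -> float:
    return (p.p_s * d.d_idle + p.p_m * d.d_lt + p.p_t * d.d_tx
            + p.p_l * d.d_logio + p.p_m * d.d_phy)


# Placeholder values in watts and seconds, not measurements.
powers = Powers(p_s=1.0, p_m=1.2, p_l=1.3, p_t=1.6, p_tr=1.5)
durations = Durations(d_idle=0.7, d_lt=0.001, d_tx=0.29, d_logio=0.001, d_phy=0.001)

e_unmanaged = energy_unmanaged(powers, durations)
e_managed = energy_managed(powers, durations)
print(f"unmanaged: {e_unmanaged:.4f} J, managed: {e_managed:.4f} J, "
      f"saved: {e_unmanaged - e_managed:.4f} J")
```

Because the transmission term is identical in both expressions, the saving in this sketch is driven almost entirely by the idle term, which matches the observation that longer idle periods allow larger reductions.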

An advantage of this approach is that it only increases the delay of a data stream by $D_{lt}$, but does not decrease the maximum available line rate. The same observation holds for receiving packets, where the transceiver must be activated periodically in intervals $D_{int}$ to probe the link for incoming transactions, as illustrated in Figure 7. However, this is only the case if the transceiver is deactivated. During active transmission periods, no extra hardware must be activated for link sensing and receiving. The worst-case scenario occurs if there are no active periods due to transmissions and new ingress data becomes available right after the last sensing period $D_{sense}$ has ended. The periodic interval comprises the idle time $D_{idle}$, the link training time $D_{lt}$ and the sensing time $D_{sense}$. The maximum delay an ingress packet can experience is depicted in Figure 8 and amounts to $D_{max} = D_{idle} + D_{lt}$. The energy $E_{\text{save}}$ that can potentially be saved in comparison to an unmanaged controller during such a cycle is, in the best case, the difference between the power consumed in state TRX during the time $D_{max}$ and the
$D_{rx}, D_{active} = D_{lt} + \max(D_{tx}, D_{rx}) + D_{logio} + D_{phy}$. For example, in case of more data to be transmitted than received, the consumed energy during $D_{active}$ evaluates to

$$E_{active} = P_M \cdot D_{lt} + P_{TR} \cdot D_{rx} + P_T \cdot (D_{tx} - D_{rx}) + P_L \cdot D_{logio} + P_M \cdot D_{phy}.$$

Apart from active periods, the controller can remain in the low-power sleep state during the idle period $D_{idle}$. An illustration of this case is depicted in Figure 9. The energy saved if the endpoint is managed by the budget-based protocol evaluates to

$$E_{save} = P_{TR} \cdot D_{budget} - (E_{active} + P_S \cdot D_{idle}).$$
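The budget-based saving can be worked through numerically for the case in which more data is transmitted than received. In this sketch the function name and all numeric values are placeholders, the idle time is assumed to be the remainder of the budget interval after the active period, and the LOGIO shutdown phase is charged at $P_L$, as in the expression for $E_{\text{managed}}$.

```python
from typing import Tuple


def budget_energy_saving(p_s: float, p_m: float, p_l: float, p_t: float, p_tr: float,
                         d_budget: float, d_lt: float, d_tx: float, d_rx: float,
                         d_logio: float, d_phy: float) -> Tuple[float, float, float]:
    """Return (E_active, D_idle, E_save) for the case d_tx >= d_rx."""
    if d_tx < d_rx:
        raise ValueError("this sketch only covers the case d_tx >= d_rx")
    d_active = d_lt + max(d_tx, d_rx) + d_logio + d_phy
    if d_active > d_budget:
        raise ValueError("active period exceeds the budget interval")
    d_idle = d_budget - d_active  # assumption: the rest of the interval is spent in SLEEP
    e_active = (p_m * d_lt + p_tr * d_rx + p_t * (d_tx - d_rx)
                + p_l * d_logio + p_m * d_phy)
    e_save = p_tr * d_budget - (e_active + p_s * d_idle)
    return e_active, d_idle, e_save


# Placeholder values in watts and seconds, not measurements.
e_active, d_idle, e_save = budget_energy_saving(
    p_s=1.0, p_m=1.2, p_l=1.3, p_t=1.6, p_tr=1.7,
    d_budget=1.0, d_lt=0.001, d_tx=0.3, d_rx=0.1, d_logio=0.001, d_phy=0.001,
)
print(f"E_active = {e_active:.4f} J, D_idle = {d_idle:.3f} s, E_save = {e_save:.4f} J")
```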

V. EXPERIMENTAL RESULTS

In order to evaluate the proposed power management strategies, we have analyzed the PMU for the Xilinx SRIO IP core in version 5.6 using the Xilinx Virtex-6 LXT 240 FPGA on the ML605 evaluation board, which supports single-lane SRIO architectures. Since the ML605 does not include an oscillator to drive the system clock, we have used an ML505 board as an external clock source at 125 MHz. At 125 MHz, we can generate SRIO endpoints at line rates of 1.25, 2.5, 3.125 and 5 Gbaud. Before measuring the power consumption of the proposed power optimization strategy on actual hardware, we have performed an analysis by simulation.
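For orientation, the listed line rates can be converted into raw single-lane data rates. The sketch assumes 8b/10b line coding, so that 8 data bits are carried per 10 transmitted symbols bits, and it ignores packet and protocol overhead; the function name is illustrative.

```python
def lane_data_rate_gbps(line_rate_gbaud: float) -> float:
    """Raw data rate of one lane, assuming 8b/10b line coding."""
    return line_rate_gbaud * 8 / 10


line_rates = [1.25, 2.5, 3.125, 5.0]
data_rates = {rate: lane_data_rate_gbps(rate) for rate in line_rates}
for rate, data in data_rates.items():
    print(f"{rate} Gbaud -> {data} Gbps before packet overhead")
```

Under this assumption, the slowest configuration offers at most 1 Gbps for user data, which bounds the data budget that can be scheduled on it.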

A. Simulation-based Analysis

We have created two different versions of the SRIO endpoint controller, one with the PMU and one without power management. Without the PMU, we have synthesized the endpoint for two different optimization goals, minimum period (Speed) and minimum power (Power). The PMU-based endpoint design was optimized for minimum period (PMU). To estimate the improvement possible due to power management, the designs were analyzed using the Xilinx XPower Analyzer tool [16]. In comparison to the measurement on actual hardware, XPower offers the advantage that it can exactly determine the power consumption of each individual component of the FPGA. The results of the XPower analysis are listed in Table I and were obtained for the commercial temperature grade and typical process settings. As expected, the largest part of the power consumption is due to leakage, which is an FPGA-inherent problem. To visualize the remaining proportions, we have omitted the leakage part in Figure 10. A very large proportion of the remaining power consumption is due to the GTX transceivers, and deactivation of the transceivers in the PMU design can reduce the required energy by up to 75%. Moreover, it can be seen that the PMU designs can also reduce the energy required by the clock tree due to clock gating, which outperforms the automatic approach by up to 35% for the 5 Gbaud design. We also observe that the PMU requires extra logic, especially for the faster designs, which will also be noticeable in the hardware measurements. The downside of the XPower analysis is that the results do not reflect the dynamic behavior of the interconnect at different data budgets very well, which is why power measurements on the actual hardware remain necessary.

B. Hardware-based Measurements

The power supply of the Virtex-6 FPGA is controlled by a Texas Instruments (TI) UCD9240 controller. The power consumption of the FPGA and the MGTs can easily be measured on the ML605 using the PMBus interface to the TI controller and the TI Fusion Digital Power Designer software package [17]. For this study, the relevant measurements are the power consumption on the VCCINT power rail, which drives the internal components of the FPGA, as well as on the MGTAVCC and MGTAVTT power rails, as described in Section III-B. The SRIO endpoint was implemented with and without the PMU. The design without the PMU was used to generate a bit file with minimum period as design goal (Speed) and was furthermore implemented with the Xilinx ISE internal power reduction option enabled (Power). The design with the PMU was optimized for minimum period (PMU). All three designs were implemented for all possible line rates at a 125 MHz system clock speed. Due to space restrictions, we only list the measurement results for 5 Gbaud in Table II.

![Figure 10](image1.png)

**Figure 10.** Power requirements of individual FPGA components of the SRIO endpoint designs for all possible line rates.

![Figure 11](image2.png)

**Figure 11.** Average power consumption of SRIO endpoint designs for 125 MHz system clock and 1.25 Gbaud line rate.

A plot of the measurements for the 1.25 GBaud imple-
Table II

Power rail measurements of the 5 Gbaud design with and without PMU. The design without PMU was optimized for minimum period (speed) and power reduction (power). The PMU design (PMU) was optimized for minimum period.

<table>
<thead>
<tr>
<th>Design / Data budget (Gbps)</th>
<th>Power Rail</th>
<th>Speed</th>
<th>Power</th>
<th>PMU</th>
</tr>
<tr>
<th></th>
<th>VCCINT</th>
<th>MGTAVCC</th>
<th>MGTAVTT</th>
<th>VCCINT</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.0</td>
<td>1.0315</td>
<td>0.9421</td>
<td>0.9515</td>
<td>1.0274</td>
</tr>
<tr>
<td>0.1</td>
<td>1.0350</td>
<td>0.9415</td>
<td>0.9517</td>
<td>1.0274</td>
</tr>
<tr>
<td>0.3</td>
<td>1.0364</td>
<td>0.9418</td>
<td>0.9515</td>
<td>1.0381</td>
</tr>
<tr>
<td>0.5</td>
<td>1.0391</td>
<td>0.9418</td>
<td>0.9515</td>
<td>1.0391</td>
</tr>
<tr>
<td>0.7</td>
<td>1.0416</td>
<td>0.9501</td>
<td>0.9516</td>
<td>1.0416</td>
</tr>
<tr>
<td>0.9</td>
<td>1.0441</td>
<td>0.9516</td>
<td>0.9516</td>
<td>1.0441</td>
</tr>
<tr>
<td>1.1</td>
<td>1.0465</td>
<td>0.9517</td>
<td>0.9517</td>
<td>1.0465</td>
</tr>
<tr>
<td>1.3</td>
<td>1.0480</td>
<td>0.9517</td>
<td>0.9517</td>
<td>1.0480</td>
</tr>
<tr>
<td>1.5</td>
<td>1.0495</td>
<td>0.9517</td>
<td>0.9517</td>
<td>1.0495</td>
</tr>
<tr>
<td>1.7</td>
<td>1.0509</td>
<td>0.9517</td>
<td>0.9517</td>
<td>1.0509</td>
</tr>
<tr>
<td>1.9</td>
<td>1.0524</td>
<td>0.9517</td>
<td>0.9517</td>
<td>1.0524</td>
</tr>
</tbody>
</table>

Figure 12. Average power consumption of the SRIO endpoint for 125 MHz system clock and 2.5 Gbaud line rate.

Figure 13. Average power consumption of the SRIO endpoint for 125 MHz system clock and 3.125 Gbaud line rate.

Figure 14. Average power consumption of the SRIO endpoint for 125 MHz system clock and 5 Gbaud line rate.

C. Discussion and Analysis

The measurements were conducted up to the maximum data budget of 1 Gbps. To keep the rest of the measurements comparable, the individual designs in Figures 12, 13, and 14 were measured up to 1.9 Gbps, although the 3.125 and 5 Gbaud versions are capable of transmitting data at higher rates. The graphs show that disabling the GTX transceivers during idle periods, which affects MGTAVCC and MGTAVTT, is very effective; however, as the data budget approaches the maximum supported line rate, the measurements converge with those of the designs without power management. FPGA-internal clock gating of the communication controller, which only affects VCCINT, is only marginally effective. On the contrary, once the data budget approaches the maximum achievable line rate, the power consumption is actually higher due to the extra logic of the PMU. Furthermore, the results show that although very fine-grained power reduction techniques prove to be effective, manual optimizations, such as disabling the GTX transceivers during idle periods, can outperform such techniques. The best results can of course be observed for idle periods, in which we can lower the combined power consumption of the GTX transceiver on the power rails MGTAVCC and MGTAVTT by up to 200 mW. As specified earlier, we take the values at idle operation in the lowest power state as the basic reference value. For the 5 Gbaud design, which represents the best case, we can achieve a reduction of the power consumption for the MGTs by 77% on average, and for the internal FPGA design by 58%. For the 1.25 Gbaud design, we can still achieve a reduction by 44% on average for the GTX transceivers and by 18% for VCCINT. Another observation concerns the comparison between two options for an underused link: lowering the maximum line rate to avoid idle listening, or raising the line rate to the maximum possible speed and using power management to increase the idle periods in which the controller can be deactivated by a PMU. A comparison for the measurements up to a data budget of 0.9 Gbps is depicted in Figure 15. For communication controllers without power management, reducing the line rate is very effective. However, using power management instead of lowering the maximum line rate can reduce the power consumption to a much higher degree.
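One way to compute an average reduction relative to the sleep-state reference is sketched below. How the reported percentages were averaged is not stated, so the per-budget averaging used here is an assumption, the function name is illustrative, and all input values are made-up placeholders rather than measurements.

```python
from statistics import mean
from typing import Sequence


def average_reduction(unmanaged_w: Sequence[float], managed_w: Sequence[float],
                      baseline_w: float) -> float:
    """Mean relative reduction of the power above the sleep-state baseline.

    Each pair of entries belongs to one data budget. Averaging per budget is an
    assumption of this sketch.
    """
    if len(unmanaged_w) != len(managed_w) or not unmanaged_w:
        raise ValueError("need two equally long, non-empty series")
    reductions = []
    for unmanaged, managed in zip(unmanaged_w, managed_w):
        reference = unmanaged - baseline_w
        if reference <= 0:
            raise ValueError("unmanaged power must exceed the baseline")
        reductions.append((unmanaged - managed) / reference)
    return mean(reductions)


# Placeholder rail powers in watts for three data budgets, not measurements.
reduction = average_reduction([0.4, 0.4, 0.4], [0.2, 0.3, 0.4], baseline_w=0.2)
print(f"average reduction: {reduction:.0%}")
```

In this toy series the managed design reaches the baseline at the lowest budget and converges with the unmanaged design at the highest one, which mirrors the qualitative trend described for the transceiver rails.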

VI. CONCLUSION

We have proposed a novel data budget-based approach to dynamically control the power consumption of SRIO endpoint controllers in FPGAs. The key concept of the approach is to not only perform clock gating on the FPGA-internal components of the communication controller, but also to disable the MGTs during idle periods. The clock synchronization inherent to serial interfaces enables us to omit the often needed periodic link sensing and to enable the controller only according to a predefined schedule to transmit the allocated data budget for the budget interval. Following this approach, we are able to reduce the dynamic power consumption by up to 77% on average. Moreover, we have shown that lowering the line rate on underused links is an effective technique to reduce the power consumption; however, transmitting the data at maximum speed to maximize the idle periods in which the controller can be deactivated can reduce the power consumption even more.

REFERENCES


