Matrix Decomposition Algorithms for MIMO receivers: Flexibility vs. Efficiency Tradeoffs in a Library-based Tool-assisted SDR Development

Matrix Decomposition Algorithms for MIMO

receivers: Flexibility vs. Efﬁciency Tradeoffs in a

Library-based Tool-Assisted SDR Development

Venkates h Ramakrishnan

∗

, Tobias Veerkamp

∗

, Marc Adrat

†

, Gerd Ascheid

∗

and Markus Antweiler

†

∗

Institute for Integrated Signal Processing Systems, RWTH Aachen University, Templergraben 55, 52062 Aachen, Germany

†

Fraunhofer Institute for Communication, Information Processing & Ergonomics (FKIE), Wachtberg, Germany

Abstract—Multiple input and multiple output (MIMO) is one

of the key technologies used in wireless standards like LTE and

WiMaX. Matrix decomposition of the channel matrix in the form

of QR decomposition (QRD) is needed for advanced MIMO

demapping algorithms like sphere decoder. The implementation

of QRD has to be highly efﬁcient due to its computation-

intensive nature. On the other hand, software deﬁ ned radios

(SDRs) require ﬂexibility in several forms, e.g. support for

different algorithms. The contradictory nature of ﬂexibility and

efﬁciency requires tradeoffs to be made between them in SDR

development. In this paper, we analyze tradeoffs by using MMSE-

SQRD as a case study. We have implemented two algorithms for

performing SQRD in four different methods with varying d egree

of portability, efﬁciency and reusability. Focus will given on a

library based SDR development approach i n our investigations

where constraint aware mapping can be performed with tool

assistance at a high abstraction l evel.

I. INTRODUCTION

Multiple input and multiple output (MIMO) systems can

provide high throughput by exploiting multi-path propaga-

tion and diversity. It is one of the en a bling tech nologies in

upcoming wireless standards like LTE and WiMax. One of

the highly comp lex components in a MIMO system is the

MIMO demapper, which separates the superposed received

data stre ams into the multiple transmitted streams. In order

to simplify and perform MIMO demapping efﬁciently, several

matrix decomposition methods are used in conjunction w ith

MIMO detection schemes. Since matrix decomposition repre-

sents a computation intensive c ompon e nt of a MIMO receiver,

its implementation has a direct impact on the over-all system

efﬁciency and therefore has to be highly efﬁcient. This makes

the implementation of matrix decomposition challen ging.

Flexibility, a key feature in software deﬁned radios (SDRs),

is sought in several forms. For example, ﬂexibility in choos-

ing algorithms and their implementation in a SDR hardware

platform can enable the radio to efﬁciently use, e.g. spectrum

resources, according to the different environmental conditions.

Portability, the ease with which an implementation of a wire-

less standard c a n be ported to different hardware platforms, is

another form of ﬂexibility. Reusability is a common form for

This research project was performed in the Ultra High-Speed Mobile

Information and Communication (UMIC) research centre under the support of

the Technical Center for Information Technology and Electronics (WTD-81),

Germany.

enabling ﬂexibility in SDR development. For example, if an

algorithm or implementation can be reused fo r performing a

different fun ctionality, it can (re)conﬁgured/programmed on-

the-ﬂy.

Even though a ﬂexible solution may not always yield the

same performance, e.g. with resp e ct to energy, when comp ared

to a dedicated solution, the difference can be made tolerable

with careful engineering and application speciﬁc optimiza-

tions. A key advantage of reusability is the reduction of the

system development time an d time-to-market. Portability can

be signiﬁcantly increased with reusable components. However,

identifyin g such reu sable components and implementing them

in a ﬂexible and efﬁcient way can be c hallenging.

Efﬁciency, which is also a prerequisite for SDRs, can take

several form s. For example, en ergy efﬁciency is paramount for

increasing battery life in mobile devices. Area efﬁciency has

become important, particularly in hand-held devices, due to

the growing need to ac commodate more hardware elemen ts.

It is well known that fully ﬂexible hardware architectures,

e.g. general purpose proce ssors (GPPs), are expensive in terms

of area and energy consumption. In order to improve energy

efﬁciency and at the same time meet the computation needs

of the application, heterogeneous hardware platfor ms are a

promising solution. Apart from the processing elements (PEs),

the type of implementation on a PE can play a vital role

on efﬁciency as well. For example, hand-coded assembly

implementation that is optimized for the ar chitecture of a

processor can consume less cycles, by a few orders of magni-

tude, when compa red to a generic C implementatio n. However,

implementation in C has high portability w hen compared to

assembly.

From the above discussions, it is clear that ﬂexibility

and efﬁciency are contradictory in nature. Therefore, tr ade-

offs among them have to be made at every stage of SDR

development. However, in order to make the tradeoff decisions

it is important to quantify the differences in performance. This

requires implementation of an algorithm or a component on

a whole range of PEs, spanning G PPs, dedicated application

speciﬁc integrated circuits (ASICs), digital signal processors

(DSPs), ﬁeld programmable gate arrays (FPGAs), etc. using

generic C, C with intrinsics and hand written assembly (fo r

programmable processors) and hardware description language

Proceedings of SDR'11-WInnComm-Europe, 22- 24 Jun 2011

(HDL) (for FPGA an d ASIC). Such a comprehensive investi-

gation is beyond the scope of this paper. One of the main

focus of our investigations is to analyze th e ﬂexibility vs.

efﬁciency tradeoffs while implementing a wireless standard

using reusable algorithmic kernels.

Due to the offer of both portability and efﬁciency for SDR

development, th e library-based Nucleus methodology [1] is

used as the backbone for our investigations. The Nucleus

methodology is based on a standardized library, consisting

of algo rithmic computation intensive kernels, Nuclei. T he

description of a wireless standard is done using the Nuclei

library as the basis. Due to the standardization of the Nuclei

library, vendors can provide efﬁcient implementation s for

Nuclei, known as Flavors, as a part of board support package

(BSP) for a hardware platform. Tools are used for selecting an

implementation from the BSP, that me ets the con stra ints of a

wireless standard like late ncy. This paper can also be seen as

a case study for identifying a Nucleus, which can be reused

in several standards for performing different functionalities.

The rest of the paper is structured as follows. The system

model that is used for our investigations is explained in

Section II. Our contributions and related work are given

in Section III. The algorithms that were implemented for

performing minimum mean squared error (MMSE) sorted

QR decomposition (SQRD) are explained in Section IV. The

different versions of implemen ting the MMSE-SQRD algo-

rithms and their building blocks are described in Section V.

Results and observations are presented in Section VI. Finally,

conclusions are drawn.

Notations: A matrix Q is denoted by a uppercase le tter,

jth column of Q is indicated by a lowercase letter q

. Q

j,k

indicates the matrix element at jth row and kth column. The

subscript in lowercase letter (which itself has a subscript of

upperc a se letter) of a matrix Q

×n

indicates th e dimension

× n

of Q (a square matrix has a single subscript). For

example, I

represents an identity matrix of order n

. A

matrix with a subscr ipt in lowercase letter a indica tes a

submatrix Q

of the augmented matrix

Q. Re{Q

j,k

} and

Im{Q

j,k

} indicate the real and imaginary parts of Q

j,k

respectively. A complex and real nu mber is represented by the

symbol, C and R respectively. The superscript

indicates th e

Hermitian transpose. k.k denotes the Euclidean norm operator.

The superscripts ˆ and

indicate the estimated value and

transpose respectively.

II. SYSTEM MODEL

We consider a MIMO-OFDM system w ith n

transmitting

and n

receiving antennas. The received n

× 1 dimensional

signal vector y can be expressed as

y = Hs + n (1)

where H is the n

× n

channel matrix, s is the n

× 1

transmitted signal vector an d n is the n

× 1 noise vector

with zero mea n and variance

respectively.

MMSE solution, which takes the noise in the system into

account for estimating the transmitted signal vector at the

receiver side, offers better bit error rate (BER) performance

when compared to zer o forcing (ZF) solution (see Figure 1).

Therefore, this pap e r considers a MMSE solution for imple-

mentation. The channel matrix H has to be extended to the so

called complex-valued augmented channel matrix

H, which is

deﬁned as

H =

(2)

where E

is the transmitted signal power. I

denotes a

× n

-dimensional identity matrix. The inverse of the

complex augmented chann e l matr ix leads to the estimation

of the transmitted symbol vector

s =

−1





(3)

0 5 10 15 20 25 30

−6

−5

−4

−3

−2

−1

SNR (dB)

Uncoded BER

ZF−QRD, SIC

ZF−OTSQRD, SIC

ZF−SQRD, SIC

MMSE−OTSQRD, SIC

MMSE−SQRD, SIC

MMSE−SQRD, Sphere Decoder

Fig. 1. Uncoded BER performance of a 4 × 4 MIMO system using 16QAM

modulation for different MIMO preprocessing schemes

A. MIMO Processing

One of the most processing inten sive blo cks in the MIMO

receiver is MI MO processing (also mentioned as MIM O

demapp ing), wh ic h reverses the channel effects in ord e r to

recover the transmitted stream from the re ceived superposed

data streams. Figure 2 illustrates the block diagr am of M IMO

processing. Th e functionality of MIMO processing can be

separated into MIM O preprocessing and MIMO detection. The

functionality of the preprocessing block depends on the MIMO

detection scheme. MIMO detection schemes can vary from

the simple linear detection m e thod to th e advanced methods

like sphe re decoder (SD). Figure 1 illustrates the difference in

BER performa nce between successive interference cancelation

(SIC) and depth ﬁrst, sphere decoding scheme.

Channel

Estimation

Noise

Estimation

MIMO Processing

MIMO

Preprocessing

(QRD)

MIMO

Detection

(SIC)

Estimated

Signal

Vector

Fig. 2. Block diagram of the MIMO processing based on QRD

B. MIMO Preprocessing

For perform ing MIMO detection using the linear technique,

the preprocessor can compute th e matrix inversion of H. Since

advanced MIMO detec tion schemes like SIC and tree-based

SD need the QR dec omposition (QRD) of the channel matrix

H, the preprocessor co mputes QRD while using them .

Implementation of preprocessing block can be critical for

wireless standards tha t support very h igh thro ughput and

therefore should be highly efﬁcient. To illustrate this point,

let us consider two scen a rios in LTE standard. Assuming a

maximum spe e d of the mobile h andset as 350 km/hour and

a carrier frequency o f 2.5 GHz, the coherence time is given

by 1.2 milli-second (ms) [2]. With 1200 data carrying sub-

carriers, time to perform the QRD of H belonging to each

sub-carrier is: 1.2 ms / 1200 = 1 micro-second (µs). For a

slightly relaxed scenario with the speed of mobile handset as

45 km/hour (coherence time of 10 ms) and other conditions

being the sam e as above, the ma ximum time to perform QRD

is: 8.3 µs. In order to meet such low timing, implementatio n

of MIMO preprocessing must be highly efﬁcien t making it

challengin g.

As shown in Figure 1, sorting the channel matrix before

performing QRD improves BER performance signiﬁcantly,

resulting in the so-called SQRD. We have considered only

SQRD algorithms in this paper. As 2 × 2 MIMO is of

reasonable complexity and is proposed for LTE to achieve

a peak downlink download ra te of 150 Mbps, we have not

considered higher order MIMO systems for impleme ntation

in this paper. More over, most of th e observations made from

our results are general in na ture and a pplicable to higher

order MIMO systems as well. It is important to note that

one-time full-column sorted QRD (OTSQRD) technique in a

4 × 4 MIM O system becomes fully sorted technique, referred

as SQRD, in a 2 × 2 MIMO system.

III. CONTRIBUTIONS AND RELATED W ORK

In this paper, we have implemented two algorithms, mod-

iﬁed Gram Schmidt (MGS) and Givens r otation (GR) f or

performing MMSE-SQRD. The GR algorithm is implemented

in several ways that are based on algorithms and components

which are highly reusable. For example, the coordinate rotation

digital computer (CORDIC) algor ithm, which has been used

for impleme nting GR, can be used for realizing several other

functionalities like phase cor rection [3]. Due to the nature of

implementation (without ﬂexibility), MGS can be used as a

dedicated solution in this paper and can be used for comparing

the performance of the ﬂexible implementations of GR. The

Texas Instruments C64x+ DSP is used as the PE for our

analysis. For analy z ing the porta bility vs. efﬁciency trad e-

offs, all the implementation -variants (men tioned as variants

in th is paper) have been realized in ﬂoating-point C, ﬁxed-

point C, ﬁxed-poin t C with intrinsics and hand-optimized

Very Long Instruction Wo rd (VLIW) a ssembly. This includes

both 16-Bit and 32-Bit implementations for analyzing different

input/output (I/O) data-width. Due to the h igh computation al

complexity in performing QRD, it serves as a good case study

for our analyses o n ﬂexibility vs. efﬁciency tradeoffs.

Several meth ods such as GS-, LDL-, LR-, LU-, QR- and

Jacobi singular value decomposition (SVD) can be used for

performing matrix decomposition. Among the m, QRD is a

popular candidate for MIMO preprocessing. This is due to

the necessity for QRD of th e ch a nnel matrix in some ad-

vanced MIMO detection schemes like SD. Three most p opular

algorithm s for implementing a QRD are: the Householder

reﬂections algorithm [4, 5], the MGS algorithm [6, 7] and the

GR algorithm [8, 9]. As we are considering o nly the MGS and

GR algor ithms, our focus is re stricte d to only these alg orithms.

Numerous ASIC solutions have bee n published for im-

plementing QRD [10–13]. Authors in [11, 14] describe a

MIMO processing VLSI architecture on GR based QRD.

Both a rchitectures use the CORDIC algorithm. The tradeoffs

between ﬁnite word length precision, which is determine d

by the number of CORDIC iterations, and latency ha s been

investigated in [11]. Similarly, VLSI architectures ba sed on

MGS for QRD are presented in [6, 15, 16].

Several decomposition methods have been implemente d on

FPGAs as well [17–2 0]. The perform ance differences between

the QRD algorithms, due to the d ifferences in the ﬁxed-

point implementations, can be clearly seen in the mentioned

publications.

QRD implementations on both ﬂoating- and ﬁxed-point

DSPs exist. The authors in [21] have analyzed the performance

differences of numerical a lgebra algorithms between imple-

mentation in a ﬁxed-point DSP, using TI C64x+ and a ﬂoating-

point DSP, using TI C67x+. Several matr ix decomposition

methods like SVD, Cholesky, LU, QRD and Gauss-Jordan

have been implemented on a ﬂoating-point TI 6713 DSP and

analyzed with respect of computational complexity, latency

and memory requirements in [22].

Most of the above work s have predominantly focused on

developing efﬁcient implementations of matrix decomposition

algorithm s for MIMO receivers. Tho ugh algorithms have been

implemented after making complexity and numerical stability

compariso ns, the analysis on ﬂexibility vs. efﬁciency tradeoffs

which is e ssential for SDR development is missing. Further-

more, reusability and portability investigations are missing as

well. This paper tries to ﬁll these gaps by analyzing and

implementing the MGS a nd GR algorithms for performing

MMSE-SQRD. In order to a utomate SDR development, we

lay focus on the aspects that are imp ortant for tool assistance

throughout our investigations.

IV. ALGORITHMS

The M MSE-SQRD of a complex augmented channel matrix

H using the MGS [23] and sequenc e of GRs is presented

in this section along with the com putational complexity for

a 2 × 2 MIMO system. The modiﬁcations for perf orming a

one-time full column sorting b e fore executing QRD which is

equivalent to a full iterative sorting for a 2 × 2 MIMO system

are highlighted.

A. MMSE-SQRD based on modiﬁed Gram-Schmidt

The MMSE-SQRD operation based on the MGS algorithm

is listed in Algo rithm 1. The augmented complex-valued

channel matrix

H is assigne d to

Q matrix, R is initialized

with a zero matrix, 0

and th e permutation matrix P with

. First, the colu mn norm vector α o f ea ch column in

is calculated (lines 1 to 3). Next, a full column sorting is

performed. The column ¯q

with the smallest squared l

-norm

is processed ﬁrst. Finally, the actual QRD is done as shown

from lines 7 to 14 in Algorithm 1.

The diagonal element R

j,j

is calculated as shown in line 8

and is used for dividing the column ¯q

(line 9). The upper

diagona l elements in row j of R are computed according

to R

j,k

= ¯q

¯q

, where k = j + 1, j + 2, ..., n

. In each

sub-iteration, column q

is updated according to line 12 in

Algorithm 1. The MGS algorithm executes the QRD in n

steps, with the result as shown in Equation 4.

H =

QRP





(4)

Input:

Q =

H, R = 0

×n

, P = I

1: for i = 1, 2, ...n

2: α

= k¯q

3: end

4: for i = 1, 2, ...n

5: sort columns in

Q and P, correspon ding to norm α

6: end

7: for j = 1, 2, ...n

8: R

j,j

¯q

9: ¯q

j,j

· ¯q

10: for k = j + 1, j + 2, ..., n

11: R

j,k

= ¯q

¯q

12: ¯q

= ¯q

− R

j,k

¯q

13: end

14: end

Output: Q

aj,i

j,i

with j = 1, 2, ...n

, i = 1, 2, ...n

Alg. 1: MMSE One -Time-Sorted QRD based on MGS

The MGS algorithm is mainly based on multiplication

(C ×C and C×R), inverse and square root operations. Table I

illustrates the count of basic operations in pe rformin g one

MMSE-SQRD operation using MGS algorithm for a 2 × 2

MIMO system.

Operation Count

SQRT 2

INV 2

C × C 9

R × C 5

TABLE I

OP ERATION COUNT FOR PERF ORMING MMSE-SQRD USING MGS

ALGORI THM I N A 2 × 2 MIMO SYSTEM

B. MMSE-SQRD based on Givens rotations

The MMSE-SQRD operation performed using a series of

GR is listed in Algorithm 2. Initialization of Z is d one

accordin g to Equation 5. Z is a compound matrix o f the

augmen te d complex-valued channel matrix

H, an n

× n

dimensional identity matrix and a zero matrix in the lower

right section of matrix Z. The righ t side of the Z matrix is used

for concate nation of the GRs to form the unitary

Q matrix.

The GR a lgorithm upper triangularizes the left half of matrix

Z. Permutation matrix P keep s track of the sorting .

(0)

H I

(5)

where Z has the dimension of (n

+ n

)× (n

+ n

). The

outcome of the GR algorithm after the last iteration N is

(N)



R Q

0 Q



(6)

The column norm vector α of

H is calculated in lines

1 to 3 of Algorithm 2. A full column sorting is executed

between lines 4 and 6. The procedure for QRD is illustrated

from line 7 to 10. The MMSE-SQRD operation using GR

algorithm computes the

Q and R matrices in n

steps. In

each step, a series of GR is computed and multiplied with Z

to eliminate the j +n

, i+n

−1, ..., i+1 rows of z

. In other

words, e ach rotation zeros an element in the subdiagonal of

the given matrix, forming an upper-triangular R matrix. The

concatenation of all applied GR forms the unitary matrix

Q. In

each step, the rotation is only executed on two speciﬁc rows.

The complex-valued matrix processing using GR is an

extension of the re al-valued matrix p rocessing and can be

separated into two operations: C → R and R → 0. C → R

operation tra nsforms a complex matrix entry Z

p,k

from the

complex pla ne into the real plane. The rotation matrix θ

C → R o peration is given by

(p, φ) =



1 0

0 e

−iφ



(7)

with rotation ang le

φ(p, k, Z) = − arctan



Im{Z

p,k

}

Re{Z

p,k

}



(8)

(p, φ) is a 2-dime nsional identity matrix with e

−iφ

Input: Z = Z

(0)

, P = I

1: for i = 1, 2, ...n

2: α

= k

3: end

4: for i = 1, 2, ..., n

5: sort the ﬁrst n

columns in the ﬁrst n

rows of Z

(0)

and P, corresponding to α

6: end

7: for j = 1, 2, ..., n

8: perform a series of G Rs, θ

such that rows j +

1, ..., n

+ n

of column z

become zero

9: Z = (

u=(j−1)n

10: end

Output:

j,i

= Z

j,i

with i, j = 1, 2, ..., n

j,i

= Z

j,i+n

with j = 1, 2, ..., n

, i = 1, 2, ..., n

Alg. 2: MMSE One-Time-Sorte d QRD based on GR

position p, p. The elimination of real valued matrix elements

is performed by R → 0 operation. The rotation matrix for

R → 0 operation is given by

(q, p, φ) =



cos(φ) sin(φ)

− sin(φ) cos(φ)



(9)

q p

with rotatio n angle

φ(q , p, k, Z) = − ar c tan



Re{Z

q,k

}

Re{Z

p,k

}



(10)

Both C → R and R → 0 operations can be further

separated into b asic vecto ring and rotation o perations. The

vectoring operation rotates a complex value into the real

plane and provides the rotatio n angle. In other words, the

vectoring operation performs the computation of the rotation

angle (shown in Equation 8). Furthe rmore, it obtains the angle

needed for rotating the real valued matrix elements (shown in

Equation 10).

The rotation operation rotates one complex value into an-

other b y a speciﬁc an gle. It is importan t to note that the

rotation operation has to be performed, using the rotation

matrices 7 and 9, on the other en tries in the rows of Z which

are affected by C → R and R → 0 opera tions. Table I I lists

the required number of these basic operation s for performing

MMSE-SQRD on a 2 × 2 MIMO system. Note that the GR

algorithm can be easily parallelized, however, it needs a higher

number of opera tions when co mpared to MGS algorithm for

performing MMSE-SQRD.

V. IMPLEMENTATIONS

The building blocks, which have been reused for different

variants of implementation, are ﬁrst presented in this section

followed by the variants themselves.

Basic Operation Count

Vectoring Rotation

C → R 4 7

R → 0 4 16

TABLE II

NUMBER OF BASIC OPERATI ONS FOR PERFORMI NG MMSE-SQRD USING

GR ALGORITHM IN A 2 × 2 MIMO SYSTEM

A. Building Blocks

1) TI IQMath Library: In order to increase the implemen-

tation efﬁciency and decre ase the time-to-market, TI provides

IQmath library [24] with impleme ntations of well-known

mathematical functions for the C64x+ DSP. This library is

a collection of m athematical 32-Bit ﬁxed-point functions.

Table III lists the functions fr om the IQmath library that are

used in ou r MMSE-SQRD im plementations.

Algorithm Name Processing time

in cycles

MGS

IQNsqrt 79

IQNdiv 73

IQNsin 56

IQNcos 54

IQNatan2 118

TABLE III

LIS T OF USED F UNCTIONS FROM THE IQMATH LIBRARY

The same functions were used for both 16-Bit and 32-Bit

implementations, except inverse and square root functions.

Hand written optimized implem entations of these two func-

tions, w hich gave a better perform ance when compared to

the 32- Bit I Qmath librar y f unctions, were used for 16-Bit

implementation.

2) CORDIC: The CORDIC algorithm is versatile algorithm

widely used in digital signal processing applications. Typically,

it is used for performing a vector pseudo-rotation of a given

two-dimensional vector in a sequenc e of micro iterations with

discrete step size. One advantage of the CORDIC algorithm

is that it can be imp lemented by employing, predominantly,

shift and add ope rations.

CORDIC

Kernel

CORDIC

Vectoring

Preprocessing

CORDIC

Vectoring

Postprocessing

Fig. 3. Implementation of CORDIC vectoring

Figures 3 and 4 show the implementation of vectoring and

rotation operations usin g CORDIC. n denotes the number of

applied CORDIC iterations and k represents the scaling factor

with which the ﬁn al result is multiplied. The rotation angle is

marked a s φ.

Note that the CORDIC imp lementation in the rotation mode

can be modiﬁed for computing sine and cosine values of a

CORDIC

Kernel

CORDIC

Rotation

Preprocessing

CORDIC

Rotation

Postprocessing

Fig. 4. Implementation of CORDIC rotation

given rotation angle φ, it is denoted as sincos implemen tation

in this paper.

B. Variants of Implementation

The different variants f or implementing the algorithms are

presented in this section. Highly portable imp le mentations are

done in ﬂoating-p oint and ﬁxed-po int usin g generic C. Both

16-Bit and 32-Bit versions have been implemented in the

ﬁxed-point format. Highly optimized VLIW code has been

hand-written in assembly suiting the DSP architecture. Since

the im plementations of the MMSE-SQRD algorithm using

MGS algorithm are straight forward and follows exactly the

same procedu re as shown in Algorithm 1, it is not discussed

here. MMSE- SQRD implementa tion using GR provides more

variety in imple mentation. T hree such variants are:

1) GR-Plain: Implementations denoted by GR-Plain use

the IQmath library to execute GRs. As shown in Figure 5,

implementations of the tr igonometric functions (I Q Natan2,

IQNsin and IQNcos) from the IQmath library are used to com -

pute the rotation angle for the GR, mentioned in Equations 8

and 10. The rotation o perations are executed using complex

multiplication opera tions.

IQNatan2

IQNsin

IQNcos

Complex

MPY

Complex

MPY

Fig. 5. Implementation of GR using the IQmath library

2) GR-CORDIC: Implementations denoted by GR-

CORDIC use only the CORDIC kernel for the vectoring and

rotation operations in order to perform a GR. As illustrated in

Figure 6, the CORDIC implementa tion in vectoring operation

computes also the rotation angle, whic h is further applied for

performing rotation operations using the CORDIC kernel.

CORDIC

Vectoring

CORDIC

Rotation

CORDIC

Rotation

Fig. 6. Implementation of GR-CORDIC using only the CORDIC kernel

3) GR-Hybrid: In order to avoid the trigonometric opera-

tions which consume several cycles in the GR-Plain implem en-

tation (see Table III) and CORDIC r otations, which consume

more cycles when c ompared to c omplex multiplications (see

Tables V an d VIII), the GR-Hybrid variant is implemented.

The vectoring and sincos building blocks, using CORDIC

kernel, are used for comp uting the rotation angle for G R. Fast

complex multiplications are used for executing the rotation

operation. Fig ure 7, depicts the block diagram of the hybrid

implementation.

CORDIC

Vectoring

CORDIC

SinCos

Complex

MPY

Complex

MPY

Fig. 7. Implementation of GR-Hybrid using the CORDIC kernel and complex

multiplications

VI. RESULTS

In this section, performance of the implementations for

performing MMSE-SQRD in terms of BER and processing

time (using cycle count) are presented.

A. BER

The ﬂoating-point MMSE-SQRD implementation s are used

as a reference for comparing the BER performa nce of ﬁxed-

point implementation s. For all BER simulations, a i.i.d.-type of

channel matrix is used. SNR in decibels (dB) is the ra tio of the

average energy per transmitted constellation symbol vector and

the AWGN noise with zero mean and variance σ

. The channel

matrix entries are distributed according to CN (0, 1). The 2×2

MIMO system employs two spatial separated streams with

QPSK modulation using Gray mapping and non-systematic

{171, 133}

convolutional code with a code rate of r = 1/2.

Each carrier is used for data symbols. Perfect channel and

noise estimation have been assum ed. SIC is used as the MIMO

detection scheme for all simulations.

0 5 10 15 20 25 30

−6

−5

−4

−3

−2

−1

BER

SNR (dB)

Float

GR, 16−Bit

GR, 32−Bit

MGS, 16−Bit

MGS, 32−Bit

uncoded BER

coded BER

Fig. 8. Coded and uncoded BER for MGS and GR-plain implementations

with 16-Bit and 32-Bit I/O data-width

Figure 8 shows the BER results of both 16-Bit and 32 -

Bit im plementations using the MGS and GR-Plain variants.

Figure 9 illustrates the BER performance of the im plementa-

tion variants using CORDIC, namely GR-CORDIC a nd GR-

Hybrid. Note that the Q format

has b een adjusted in the

implementation to ach ieve closer to ﬂoating-point performan ce

for all the variants of implementation. Therefore, performanc e

differences are minimal between these variants. The applied Q

formats for the 16-Bit and 32- Bit variants are the following,

MGS: Q4.11 and Q5.26; GR-Plain: Q2.13 and Q5.26; GR-

CORDIC and GR-Hybrid: Q3.12 and Q3.28 .

0 5 10 15 20 25 30

−6

−5

−4

−3

−2

−1

BER

SNR (dB)

Float

16−Bit, It. 2

16−Bit, It. 4

16−Bit, It. 6

16−Bit, It. 8

32−Bit, It. 2

32−Bit, It. 4

32−Bit, It. 6

32−Bit, It. 8

uncoded BER

coded BER

Fig. 9. Coded and uncoded BER for CORDIC based implementations with

16-Bit and 32-Bit I/O data-width for different CORDIC iterations

B. Processing Time

For all the measurements in this section, the TMS320C6000

C Compiler v6.0 8 with ”-o3 -pm” optio ns has been used.

Program code and data are stored in L1RAM, with cache s en-

abled. Table IV presents the processing time in terms o f cycles

consumed by the ﬂoating -point and ﬁxed-p oint implementa-

tions of the MGS and GR-p la in variants. The cycle counts for

CORDIC based implementation-variants, with varying number

of CORDIC iterations, are presented in Tables VI and VII.

Note that only the best case cycle counts are presented in all

the tables.

Cycle count

Implementation MGS GR-Plain

Float C 11126 89536

C 32 457 1974

C 16 443 1917

ASM 32 410 1906

ASM 16 231 1762

TABLE IV

PROCESSING TIME I N CYCLES FOR PERFORMING MMSE-SQRD USING

THE MGS AND GR-PLAIN I MPLEMENTATI ONS

Q format represents the ﬁxed-point number format where the number of

fractional bits and integer bits is speciﬁed. For example, Q2.13 represents a

16-bit number with a sign bit, 2 integer bits and 13 fractional bits.

Clock cycles taken by the CORDIC implem entations have

been divided into three parts: prep rocessing, CORDIC kernel

and postprocessing. Cyc les taken by the CORDIC kerne l

depends on the number of CORDIC iterations. Total number

of cycles for executing a CORDIC operation can be given by:

V ec, Rot, SinCos

pre

+ l

· n

+ l

post

(11)

V ec

, l

Rot

and l

SinCos

indicate the total number of cycles

consumed fo r performing vectoring, rotation and sincos oper-

ations respectively (described in Section V-A2). n

denotes

the number of CORDIC iterations. l

pre

, l

and l

post

represent

the number of clock cycles for preprocessing, CORDIC kernel

and postprocessing r e spectively. Table V outlines the number

of cycles consumed for performing these operation s.

Kernel Vectoring Rotation Sine Cosine

pre

post

pre

post

pre

post

Float C 493 372 372 348 348 348 496

C 32 8 38 52 38 52 38 52

C 16 8 25 29 24 25 24 26

ASM 32 6 14 16 18 20 18 20

ASM 16 6 12 14 16 15 16 15

TABLE V

PROCESSING TIME I N CYCLES FOR IMPLEMENTING THE BUILDING

BLOCKS DESCRIBED IN SECTION V-A2, SEPARATED I NTO PRE-,

POST-PROCESSING AND CORDIC KERNEL

CORDIC iterations (n

)

Impl. 2 4 6 8 10

Float C 45544 80764 115130 149642 184158

C 32 3243 3739 4235 4731 5227

C 16 2340 2855 3338 3830 4327

ASM 32 1759 2134 2500 2878 3247

ASM 16 1481 1855 2227 2600 2972

TABLE VI

PROCESSING TIME I N CYCLES FOR PERFORMING MMSE-SQRD USING

THE GR-CORDIC I MPLEMENTATION, (l

GR−CORDIC

)

CORDIC iterations (n

)

Impl. 2 4 6 8 10

Float C 33677 51515 68807 86082 103353

C 32 1684 1940 2199 2452 2708

C 16 1374 1632 1886 2142 2398

ASM 32 1137 1330 1521 1718 1904

ASM 16 953 1145 1337 1529 1721

TABLE VII

PROCESSING TIME I N CYCLES FOR PERFORMING MMSE-SQRD USING

THE GR-HYBRID IMPLEMENTATION, (l

GR−Hybrid

)

Equations for estimating the cycle counts while using GR-

CORDIC and GR-Hybrid implementations are given in Equa-

tion 12 and 13 r espectively.

GR−CORDIC

Sort

+ n

V ec

· (l

V ec

) + l

)+ (12)

+ n

Rot

· (l

Rot

) + l

)

GR−Hybrid

Sort

+ n

V ec

· (l

V ec

) + l

+ n

SinCos

· (l

SinCos

) + l

)+ (13)

+ n

CMP Y

· (l

CMP Y

)

Sort

represents the processing cycles for performing one-

time sorting (wh ic h is a full sorting in a 2 × 2 MIMO

system). l

V ec

, l

Rot

and l

SinCos

indicate the cycles consumed

for vectoring, rotation and sincos operations r e spectively.

CMP Y

denotes the average cycle count for performing a

complex multiplication. n

V ec

, n

Rot

, n

SinCos

and n

CMP Y

rep-

resent the number of vectoring, rotation, sincos and complex

multiplication operations. The number of required complex

multiplications is the sum of all rotations listen in Table II.

The factors l

and l

mark the overhead due to the address

and memory & address operations respectively. Note that the

cycle counts for the imp lementations using CORDIC depend

on the number of CORDIC iterations. The processing time

taken by the constant items is presented in Tab le VIII.

Sort

CM P Y

32-Bit 151 6 4 7

16-Bit 50 5 1 9

TABLE VIII

PROCESSING TIME I N CYCLES FOR CONSTANTS

C. Discussion

Moving from a ﬂoating-point C implementation to a highly

optimized assembly code brin gs the most improvement by

a factor of 48 in MGS and the least improvement by a

factor of 31 in GR-CORDIC implementatio n (f or 2 CORDIC

iterations). The maximum and minimum reductions in terms of

cycle count when moving from a ﬂoa ting-point to a ﬁxed-point

implementation are noticed in GR-Plain (by a factor of 45) and

GR-CORDIC (by a factor of 14) (for 2 CORDIC iterations)

respectively. In ﬁxed-point imp lementations, maximum and

minimum difference between C 32 and A SM 16 are found

in M GS (by a factor of 2) and GR-Plain respectively (by a

factor of 1.1), other implementations vary betwe en the factors,

1.6 and 1.7. Even with 4 CORDIC iterations, the minimum

difference between MGS and G R based implementations is

found to be a factor of 5 (while using GR-Hybrid). When we

compare with GR-CORDIC, the difference rises to a factor of

8 for 4 CORDIC iterations and almost by a factor of 10 for

6 CORDIC iterations, whic h is a more fair comparison when

considerin g BER performance. However, one of the key draw-

backs of the SQRD implementation using MGS algorithm is

the ba d numerical stab ility, predom inantly due to the division

operation resulting in a high dynamic range. This results in a

degradation in BER performance at hig her dBs, which can be

noticed in M G S 16-bit implementation (Figure 8). Moreover,

the GR algorithm can be easily parallelized. This makes it

attractive for VLSI imp le mentations.

When the C64x+ DSP is ope rated at the maximum clock

frequency of 584 MHz, 16-Bit and 32-bit assembly imple-

mentations of MGS algorithm need only 0.39 µs and 0.7 µs

respectively for performing a SQRD. Among the GR based 16-

Bit assembly implementations, GR-Plain, GR-CORDIC and

GR-Hybrid (both with 6 CORDIC iterations) need 3 µs, 3.8 µs

and 2.3 µs respectively for on e SQRD operation. From the pro-

cessing time, it can be clearly seen that only the ﬁxed-point C

and assembly implementa tions of the MGS algorithm can meet

the timing constraints while considering the most demanding

scenario of 1 µs f or one SQRD operation in LTE. However,

other 16-Bit assembly implementations become attr a ctive for

other scenarios that are less demanding, e.g. lower number

of data carrying sub-carriers, lower speed of the mobile set,

etc. For example, in the second scenario described for LTE in

Section II-B, CORDIC based implementation s can easily meet

the timing co nstraint of 8.3 µs and can be more attrac tive due

to the reusability.

Our results show th e amount of performance that nee ds

to be sacriﬁced when going for ﬂexible implementations.

Though, the MGS algorithm does not offer the same amount

of ﬂexibility in terms of implementation-variants like GR

algorithm s a nd is n umerically unstable, it is very suitable

for DSP architectures. On the other hand, SQRD implemen-

tation using CORDIC offers full ﬂexibility, where CORDIC

iterations can be varied depending on the need. Moreover,

CORDIC algorithm can be reu sed f or implementing several

other function alities, like shown in [3].

Our an alysis has highlighted that merely increasing porta-

bility is not a solutio n for SDRs, particularly for new standards

that deman d high performance. Performance requirements

and energy efﬁciency have to be considered by all means.

However, this will decrease portability when adopting tradi-

tional approaches for SDR development. This calls for more

sophisticated SDR development approaches.

Library based a pproaches are highly desirable for increasing

reusability in system development. However, in order to in -

crease portability and efﬁciency, the components of the library

have to b e standardized for which efﬁcient implementations

can be provided by vendors for a PE [1]. Moreover, the

components of a library should be independen t of wireless

standards and must be reusable. We have shown that using

algorithm ic kernels, like CORDIC, can enhance reusability and

offer more ﬂexibility in imp le mentations. At the same time,

they can be implemented efﬁciently as well. Enhanced versions

of the implementations using the algorithmic kernel, suiting

architecture of a PE, as highlighted by GR-Hybrid implemen-

tation, can a lways be implemen te d for better performance, if

needed. Still, por ta bility effort is considerably low.

The architectur e of a PE plays a huge role in reducing

processing time and increasin g energy efﬁciency. T his has

been highlighted by more DSP friendly MGS algorithm c om-

pared to GR. T herefore, extreme care must be taken while

”mapping” an algorithm o nto a PE, particularly while using

heteroge neous hardware p la tforms. This in turn highlights the

need for quic k ”mapping” exploration at a high abstra c tion

level, presu mably at compo nent level, for reducing the time

needed for performing the exploration. We have derived

equations that acc urately estimates the cycles which could

indeed can be used by tools to perform mapping exploration

automatically.

VII. CONCLUSIONS AND OUTLOOK

In this pap e r, we have presented the analysis on ﬂexibility

vs. efﬁciency tradeoffs that need to be made while developing

SDRs by using the computation in te nsive MMSE-SQRD,

which is widely used in MIM O receivers, as a case study. Two

algorithm s for performing SQRD were investigated and efﬁ-

ciently implemented in several versions. Flexible versions of

implementations with varying degree of reusability, portability

and efﬁciency have bee n implemented . Though the dedicated

implementations differ in processing time by good margin

when compared to ﬂexible implementations, efﬁcient imple-

mentations with reusable a lgorithms can still provide ﬂexibility

and can be used in scenarios with tight constraints, like

latency. Accurate equations which can be used for performing

constraint-aware mapping at a high abstraction level with

tool-assistance have been derived. As outlook, the analysis

of ﬂexibility vs. efﬁciency tra deoffs can be continued b y

implementing more computation-intensive wirele ss p hysical

layer algorithms on different hardware architectures.

REFERENCES

[1] V. Ramakrishnan et al., “Efﬁcient and portable SDR waveform develop-

ment: The Nucleus concept,” in Proc. IEEE Military Communications

Conf. MILCOM 2009, 2009, pp. 1–7.

[2] G. Ghosh et al., Fundamentals of LTE, 1st ed. Prentice Hall, August

2010.

[3] J. Valls et al., “The use of cordic in software deﬁned radios: a tutorial,”

Communications Magazine, IEEE, vol. 44, no. 9, pp. 46 –50, 2006.

[4] Y. Wang et al., “Parallel MIMO detection algorithm based on house-

holder transformation,” in Proc. Int. Symp. Intelligent Signal Processing

and Communication Systems ISPACS 2007, 2007, pp. 180–183.

[5] K.-L. Chung et al., “The complex Householder transform,” vol. 45, no. 9,

pp. 2374–2376, 1997.

[6] C. K. Singh et al., “VLSI Architecture for Matrix Inversion using

Modiﬁed Gram-Schmidt based QR Decomposition,” in Proc. th Int VLSI

Design Held jointly with 6th Int. Conf. Embedded Systems. Conf, 2007,

pp. 836–841.

[7] C. K. Singh et al., “A Fixed-Point Implementation for QR Decom-

position,” in Proc. IEEE Dallas/CAS Workshop Design, Applications,

Integration and Software, 2006, pp. 75–78.

[8] R.-H. Lai et al., “A modiﬁed sorted-QR decomposition algorithm for

parallel processing in MIMO detection,” in Proc. IEEE Int. Symp.

Circuits and Systems ISCAS 2009, 2009, pp. 1405–1408.

[9] D. Wubben et al., “MMSE extension of V-BLAST based on sorted QR

decomposition,” in P roc. VTC 2003-Fall Vehicular Technology Conf.

2003 IEEE 58th, vol. 1, 2003, pp. 508–512.

[10] N. W. Gabriel L. Nazar, Christina Gimmler, “Implementation Compar-

isons of the QR decomposition for MIMO Detectionson,” 2010.

[11] C. Studer et al., “Matrix Decomposition Architecture for MIMO Sys-

tems: Design and Implementation Trade-offs,” in Conference Record of

the Forty-First Asilomar Conference on Signals, Systems and Computers

ACSSC 2007, P. Blosch, Ed., November 2007, pp. 1986–1990.

[12] I. LaRoche et al., “An efﬁcient regular matrix inversion circuit archi-

tecture for MIMO processing,” in Proc. IEEE Int. Symp. Circuits and

Systems ISCAS 2006, 2006.

[13] Z.-Y. Huang et al., “High-throughput QR decomposition for MIMO

detection in OFDM systems,” in Proceedings of 2010 IEEE International

Symposium on Circuits and Systems (ISCAS), May 2010, pp. 1492 –

1495.

[14] P. Luethi et al., “VLSI Implementation of a H igh-Speed Iterative

Sorted MMSE QR Decomposition,” in IEEE International Symposium

on Circuits and Systems, 2007. ISCAS 2007., May 2007, pp. 1421–1424.

[15] K.-H. Lin et al., “Implementation of QR decomposition for MIMO-

OFDM detection systems,” in 15th IEEE International Conference on

Electronics, Circuits and Systems, 2008. ICECS 2008., 2008, pp. 57–60.

[16] P. Luethi et al., “Gram-Schmidt-based QR decomposition for MIMO

detection: VLSI implementation and comparison,” in Proc. IEEE Asia

Paciﬁc Conf. Circuits and Systems APCCAS 2008, 2008, pp. 830–833.

[17] K. Mohammed et al., “A MIMO Decoder Accelerator for Next Gener-

ation Wireless Communications,” no. 99, p. 1, 2009, early Access.

[18] M. S. Khairy et al., “Efﬁcient FPGA Implementation of MIMO Decoder

for Mobile WiMAX System,” in Proc. IEEE Int. Conf. Communications

ICC ’09, 2009, pp. 1–5.

[19] J. Eilert et al., “Efﬁcient Complex Matrix Inversion for MIMO Software

Deﬁned Radio,” in P roc. IEEE Int. Symp. Circuits and Systems ISCAS

2007, 2007, pp. 2610–2613.

[20] A. Irturk et al., “Architectural Optimization of Decomposition Algo-

rithms for Wireless Communication Systems,” in Proc. IEEE Wireless

Communications and Networking Conf. WCNC 2009, 2009, pp. 1–6.

[21] Z. Nikolic et al., “Design and implementation of numerical linear algebra

algorithms on ﬁxed point DSPs,” EURASIP J. Adv. Signal Process, vol.

2007, no. 2, pp. 13–13, 2007.

[22] T. Haustein et al., “Real-time signal processing for multiantenna sys-

tems: algorithms, optimization, and implementation on an experimental

test-bed,” EURASIP J. Appl. Signal Process., vol. 2006, pp. 136–136,

2005.

[23] A. Bjoerck, “Numerics of Gram-Schmidt orthogonalization,” Linear

Algebra and its Applications, vol. 197-198, pp. 297 – 316, 1994.

[24] TMS320C64x+ IQmath Library User’s Guide, Texas Instuments, De-

cember 2008.