EDA Systems for Hardware/Software Co-Design
The design of systems with IP cores, in particular with processor cores, demands a new level of capability from the software tools used:
• support for design space exploration after allocation of hardware and software;
• support for simultaneous, concurrent development of hardware and software;
• support for verification with different models at different abstraction levels.
Design Space Exploration Tools
The automatic partitioning of a given task into a software part and a hardware part, controlled by a formal executable specification [7.9], is a demanding task and still the subject of intensive research [7.4], [7.7]. These efforts concentrate on complex applications of signal processors (ASIPs), where the algorithms are well defined by a C language model or in a special specification language such as SpecC [7.9], [7.36], [7.37] or SpecCharts [7.39]. The mapping from such a functional description in a specification language to a given architecture may have several solutions and permits a number of variants, which must be evaluated with respect to design effort, silicon area consumption, and operational speed. These variants span the so-called design space, which has to be explored for an optimal solution. Questions concerning the cost functions used and the given constraints are far from trivial. The results of this research are programming systems such as CATHEDRAL [7.4] from CoWare, developed at IMEC, which are available today in commercial versions. PTOLEMY [7.5] and COSSAP are more closely related to the design, refinement, and optimisation of signal processor applications from formal algorithmic specifications. The output of these programs is synthesizable VHDL models and related control schedules (e.g., in the form of C code or microcode). These software tools are able to process and verify very heterogeneous designs, but partitioning and resource allocation is mainly done by human interaction.
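The evaluation of design-space variants against cost functions and constraints, as described above, can be sketched in C. All names and the weighting scheme here are illustrative assumptions, not part of any of the cited tools:

```c
#include <stddef.h>

/* Hypothetical figures of merit for one partitioning variant. */
struct variant {
    double area_mm2;   /* silicon area consumption        */
    double cycles;     /* operational speed, in cycles    */
    double effort_pw;  /* design effort, in person-weeks  */
};

/* Simple weighted cost function over the three figures of merit. */
static double cost(const struct variant *v,
                   double w_area, double w_time, double w_effort)
{
    return w_area * v->area_mm2 + w_time * v->cycles + w_effort * v->effort_pw;
}

/* Exhaustive exploration: return the index of the cheapest variant
 * that satisfies the hard area and speed constraints, or -1 if no
 * variant is feasible. */
static int pick_best(const struct variant *v, size_t n,
                     double max_area, double max_cycles)
{
    int best = -1;
    double best_cost = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (v[i].area_mm2 > max_area || v[i].cycles > max_cycles)
            continue;                       /* constraint violated */
        double c = cost(&v[i], 1.0, 0.001, 5.0);
        if (best < 0 || c < best_cost) {
            best = (int)i;
            best_cost = c;
        }
    }
    return best;
}
```

Real exploration tools search far larger spaces heuristically; the point here is only that every variant is scored by a cost function and filtered by the given constraints.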
Designs in telecommunications are frequently characterized by signals being processed at very different data rates. The primary signal may be processed at a high sampling rate of above 100 k samples/sec, while the control loops for tuning, amplitude regulation, and automatic frequency adjustment have low bandwidth and run at comparatively low rates. These control loops may be irregular and complex, and the associated algorithms have various communication relations to the external world as well as to internal structures.
From the data rates it follows that the primary, closely coupled receiver structures working at a fast rate are preferably implemented in hardware, while the complex but slower control loops, which need thousands of cycles, are better implemented as software on a processor core. These are the first and main criteria for partitioning and allocation.
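This rate-based partitioning criterion can be stated as a small decision rule. The 100 kHz threshold and the 50 % load limit below are assumed example values, not figures from the text:

```c
/* Rule-of-thumb allocation: fast primary signal paths go to hardware,
 * slow control loops to software on the processor core. */
enum target { HARDWARE, SOFTWARE };

static enum target allocate(double sample_rate_hz,
                            double cycles_per_sample,
                            double cpu_clock_hz)
{
    /* Fraction of the core's cycles this block would consume. */
    double load = sample_rate_hz * cycles_per_sample / cpu_clock_hz;

    /* Primary-signal rates, or blocks the core cannot sustain,
     * are implemented in hardware; the rest stays in software. */
    if (sample_rate_hz > 100e3 || load > 0.5)
        return HARDWARE;
    return SOFTWARE;
}
```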
The big differences in processing rate and time constants between the two areas, sometimes more than a factor of 100,000, make simulation and verification difficult. Whoever has tried to simulate a PLL with SPICE knows the problem of large cycle numbers, leading to long simulation times and huge data volumes.
Modern software systems, such as those offered by CoWare, allow one to integrate proven and verified blocks, such as commercial processor cores taken from the SYMPHONY library, and thereby support the demand for re-use of developed and qualified components [7.4].
Compilers for Irregular Target Architectures (Retargetable Compilers)
Today the variety of commercially available processor cores is very large, and there are many associated software development systems. The central programs are an assembler for the selected core, able to generate object code for the target architecture, a simulator/debugger, and a C cross compiler. The C compiler may have constraints concerning the ANSI standard (e.g., lack of routines for floating point processing). The systems offered by IAR Systems [7.33], KEIL [7.34], [7.35], Hewlett Packard, Mentor Graphics, and Cadence run on a PC under WINDOWS or LINUX and can be used with different processor architectures.
The main programming language is C; other languages such as PASCAL are rarely used. Only a low percentage of embedded applications use high level languages at all; assembly programming dominates, especially for driver development. Normally, compiler-generated code is substantially larger than hand-coded programs and usually has poorer speed performance. Because of this handicap [7.10], an acceptable overhead should be below 20 % in memory requirement and a maximum of 30 % in speed; otherwise the use of a compiler would be inefficient. Good C compilers with small overhead and optimised libraries exist in the controller area (IAR, KEIL). In 32 bit systems (ARM) the compiler handicap is accepted because there is no alternative and adequate memory resources are provided, so most projects are programmed in C.
Even if these compilers can be switched according to the target processor, they are not retargetable compilers. A retargetable compiler is one which can be configured for new architectures and instruction sets with little effort. Such compilers permit the development of the above-mentioned ASIP cores with an application-specific instruction set and architecture; this is an important research topic today [7.16], [7.15].
Retargetable compilers (RTCs) may be generated with the classical compiler tools. Irregular architectures, non-symmetric and non-orthogonal register sets, and unusual instructions and addressing modes make it difficult to define a general solution. Limited use of resources, different cycle and pipeline processing, and interdependencies between internal structures require specialists to define the configuration files for an RTC. Especially demanding are SIMD architectures and VLIW processor cores, mainly found in the signal processing area, with more than one CPU, pixel processing cores, and special instructions for multimedia applications. Companies which develop ASIPs nearly all use home-made software tools, which sometimes have their origin in university developments (GNU, CATHEDRAL).
The typical retargetable compiler consists of several layers (fig. 7.12):
• parser;
• syntax analysis;
• operator covering;
• generation of object code;
• binding and linking.
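The layer structure above can be captured in a small sketch. The stage names and the split between generic and target-specific stages follow the description in this section; the code itself is illustrative:

```c
/* The layers of a retargetable compiler.  Only the stages from
 * operator covering onward depend on the target architecture and
 * hence on the configuration file. */
typedef enum {
    STAGE_PARSE,    /* read C source, recognise keywords and tokens */
    STAGE_SYNTAX,   /* check validity, emit the meta-language       */
    STAGE_COVER,    /* cover operators with target instructions     */
    STAGE_CODEGEN,  /* generate object code / assembler mnemonics   */
    STAGE_LINK      /* bind library routines, allocate memory       */
} stage_t;

static int stage_is_target_specific(stage_t s)
{
    /* Parsing and syntax analysis use only the grammar of C and are
     * the same for all target systems. */
    return s >= STAGE_COVER;
}
```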
The parser reads the C source code and scans it for keywords, declarations, variables, etc. In the following syntax analysis the statements are checked for validity and meaning and then transformed into a kind of meta-language. In this meta-language essentially only identifiers and operators exist, combined with control statements such as if-then-else. Up to this point only the grammar of the C language is used; this part of the compiler is independent of the target architecture and is the same for all target systems.
In the next processing step the identifiers are assigned to memory space or registers, depending on their type, and the operators are 'covered' with functions which can be processed by the hardware. There are:
• unary operators with one input value and one output value, e.g., assignment, negation, inversion;
• binary operators such as addition, subtraction, multiplication, division, etc.
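A node type covering both kinds of operator might look as follows; the enum values and field names are illustrative assumptions:

```c
/* A node in the operator tree: leaves are identifiers, inner nodes
 * are unary (one child) or binary (two children) operators. */
enum op { OP_ID, OP_NEG, OP_NOT, OP_ADD, OP_SUB, OP_MUL, OP_DIV };

struct node {
    enum op      op;
    const char  *id;      /* identifier name, valid when op == OP_ID */
    struct node *left;    /* 0 for leaves                            */
    struct node *right;   /* 0 for leaves and unary nodes            */
};

/* Number of input values an operator consumes. */
static int arity(enum op op)
{
    switch (op) {
    case OP_ID:               return 0;  /* leaf: identifier */
    case OP_NEG: case OP_NOT: return 1;  /* unary operator   */
    default:                  return 2;  /* binary operator  */
    }
}
```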
All mathematical expressions, even with parentheses, can be resolved into a tree structure of unary and binary operators; expressions with more than two inputs are mapped onto a sequence of binary operators (fig. 7.13). The resulting tree, or data flow graph, consists of operators at the nodes, which can be processed by hardware functions, and identifiers at the edges, which are mapped to memory locations. The task of the compiler is to cover the tree structure in optimal form with the existing resources of the target processor. The compiler may also find a sequence of operations and memory assignments that achieves the same result with better resource usage; for example, it may be better to keep intermediate data in registers than in memory. To do this the compiler must know the target architecture as well as the costs of usage and the constraints on resources. This information is found in a configuration file, which must be provided.
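The covering step can be sketched as a recursive walk over such a tree that sums instruction costs. The cost figures (one instruction per covered binary operator, an assumed eight for a division expanded into a library call on a core without a divider) are illustrative only:

```c
/* Minimal covering sketch: each binary node costs one instruction if
 * the target has a matching opcode; otherwise it must be expanded,
 * e.g. division into a library call on a core without a divider. */
struct expr {
    char op;                  /* 0 for a leaf, else '+', '-', '*', '/' */
    struct expr *l, *r;
};

static int cover(const struct expr *e, int has_divider)
{
    if (e->op == 0)
        return 0;             /* leaf: operand already in a register */
    int c = (e->op == '/' && !has_divider) ? 8 : 1;
    return c + cover(e->l, has_divider) + cover(e->r, has_divider);
}
```

An expression with three inputs, a + b + c, becomes the two binary nodes (a + b) + c and is covered with two instructions, exactly the mapping fig. 7.13 describes.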
In the last step of the compilation run, object code in assembler mnemonics is generated for the statements found and sequenced. This part of the compiler, the code generator, is very specific to the target architecture and is adjusted to the following assembler and its syntax. A current example of a retargetable compiler, in which all machine-dependent parts are concentrated in a single configuration file, is lcc, created by FRASER and HANSON [7.29]. The associated book [7.30] describes in detail all that is needed to formulate the rules in the configuration file. Further work in this area is being done at the University of Dortmund, Germany, with the LANCE system [7.15], [7.16], [7.31], at the Center for Embedded Computer Systems, University of California, Irvine, with EXPRESSION [7.38], and on the GNU compiler gcc, which has been ported successfully to different target architectures including ARM, although porting gcc requires some effort.
There is a strong interdependence between processor architecture, instruction set, and efficiency of code generation. The C language defines different data types such as char (8 bit), int (16 bit), long (32 bit), float (32 bit), double (64 bit), etc. These data types can be held completely, or only partly, in one register. 32 bit processors can handle large word sizes better than 16 bit or even 8 bit controllers, on which transporting the data needs more than one assembly instruction. Thus a C compiler for 8 bit architectures is generally inefficient. On the other hand, 32 bit cores need significantly larger memory resources, which are not available in single chip SOC designs.
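The penalty on narrow cores can be made concrete: a single C statement adding two 32 bit values turns into a chain of byte-wide add-with-carry instructions on an 8 bit controller. The following sketch models that instruction sequence:

```c
#include <stdint.h>

/* Models what an 8 bit core must do for one 32 bit addition:
 * four byte-wide additions with explicit carry propagation,
 * i.e. one ADD followed by three ADC instructions. */
static void add32_bytewise(const uint8_t a[4], const uint8_t b[4],
                           uint8_t sum[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {          /* little-endian byte order */
        unsigned s = (unsigned)a[i] + b[i] + carry;
        sum[i] = (uint8_t)(s & 0xFF);      /* one byte-wide add       */
        carry  = s >> 8;                   /* carry into next byte    */
    }
}
```

A 32 bit core does the same work in a single instruction, which is why compiled C is so much denser there.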
Fig. 7.13: Tree of operators

For one expression in the meta code several assembler instructions may be inserted during the covering procedure. This may also be the call of a routine from an attached assembler library, optimised for the core. Such libraries contain the more complex operations such as division, floating point processing, and string processing. The use of a library is favourable, since these routines are used several times but stored only once. The binder, or linker, part of the compiling software inserts (in some implementations) only those parts of the library into the object code which are actually called in the application program. The generated assembler statements must be assembled and linked into the executable. The linker binds the routines from the library and allocates memory in the available RAM or ROM architecture; depending on type and programming style, globals are allocated in RAM and constants in ROM. Assembler and linker are sometimes closely integrated, so the different steps are hidden from the user. The whole process of generating an executable from a high level language is designated as compilation. For integrated cores it is preferable to carry out these steps carefully in a definite sequence under the control of the designer; this is especially true if an RTOS or BIOS has to be included [7.3], [7.33], [7.34].
Large and complex functionalities are generally integrated at the C language level. But binding the stdlib of the ANSI C standard already requires a lot of memory space (about 5 ... 10 Kbytes); furthermore, the stdio, string, and math libraries also belong to the standard. C compilation without these libraries is always rudimentary. These libraries form the basis for every higher operating system and must be supported in some way, perhaps with constraints.
For a retargetable compiler at least three data sets must be specified:
• a configuration file with a description of the target architecture, constraints, and cost functions, as well as a description of the data types and memory models;
• a mapping of the meta code to object codes of the target system;
• a library in object code of the target system with elementary functions and definitions, preferably the stdlib library set from the ANSI C standard.
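What the meta-code-to-object-code mapping boils down to can be sketched as a rule table: each meta-code operator is paired with an assembler template and a cost for the target core. The operator names, templates, and costs below are invented examples in the spirit of the rule files described above, not taken from any real configuration:

```c
#include <stddef.h>
#include <string.h>

/* One mapping rule: a meta-code operator, the assembler template
 * that covers it on the target, and its cost in cycles. */
struct rule {
    const char *metaop;
    const char *templ;
    int         cost;
};

static const struct rule rules[] = {
    { "ADDI4", "add %0,%1,%2", 1 },
    { "SUBI4", "sub %0,%1,%2", 1 },
    { "MULI4", "mul %0,%1,%2", 3 },
    { "DIVI4", "bl  __divsi3", 40 },  /* no divider: library call */
};

/* Look up the rule covering a meta-code operator; 0 if the target
 * has no rule for it. */
static const struct rule *lookup(const char *metaop)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (strcmp(rules[i].metaop, metaop) == 0)
            return &rules[i];
    return 0;
}
```

The costs let the covering step choose between alternatives; a missing rule signals that the operation must be synthesized from other rules or taken from the object-code library.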
The use of the C language is effective for architectures with a word size of 16 or 32 bit and enough memory resources. With increasing density in current chips, RISC cores with 32 bit word size, such as the ARM 7, will be used in applications which today are still dominated by 8 bit controllers. It can be expected that high level languages will then be used to program these systems without any compromise.
Integrated Development Environments (IDE)
Even if a software development system for the selected core exists, it is generally not coupled with a hardware simulator at the RTL or VHDL level. Because of the great importance of co-simulation of hardware and software, coupled systems are offered by the big EDA providers. One example is the SEAMLESS Co-Verification Environment from Mentor Graphics Inc.; similar development systems are offered by other providers, too. These are tightly coupled programs for the simulation of embedded code in a software simulator/debugger on the one hand and a gate level simulator (or a VHDL simulator such as ModelSim) on the other. Mentor Graphics reports speeds of about 5000 instructions/sec, which allows the complete simulation of smaller applications. The system can be delivered with compilers and simulators for all important processor cores and may replace dedicated and proprietary development systems of the core suppliers. It may also be coupled with hardware emulator boards on an FPGA basis, where part of the system is represented directly in hardware.
For large application programs this kind of coupled simulation is still too slow, since the VHDL simulator run at the RT level is event controlled, and the number of events is roughly proportional to the number of implemented gates. This behaviour limits the simulation speed. One solution is cycle based simulation, in which behavioural models are generated from the true RT description for each clock cycle; effects of timing are neglected. In a synchronous design, where the maximum time delay must be shorter than the delay given by the critical path through the network, this abstraction is allowed. With this mapping a direct, cycle-true executable code is generated, which shows a speed similar to that of the application software code itself. Such a program is offered by Quickturn Design Systems Inc. under the name SpeedSim. An acceleration of the simulation by a factor of 10 ... 100 is not unusual, so hardware emulation or special hardware accelerators may be avoided. An advantage of this solution is that the model can be derived automatically from existing VHDL code, so that the method can be integrated into a consistent design flow. With hand designed models the risk of inconsistencies and discrepancies is large.
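The essence of cycle based simulation is that the generated model is just a next-state function called once per clock edge, with all intra-cycle timing abstracted away. A minimal hand-written stand-in for such generated code, here assuming a 4-bit counter with enable as the design under test:

```c
#include <stdint.h>

/* Register state of the design under test (an assumed 4-bit counter). */
struct dut { uint8_t count; };

/* Cycle-true next-state function: one call models one clock edge;
 * combinational timing inside the cycle is neglected, as described. */
static void clock_edge(struct dut *d, int enable)
{
    if (enable)
        d->count = (uint8_t)((d->count + 1) & 0x0F);
}

/* Run the model for n clock cycles and return the final state. */
static uint8_t run_cycles(struct dut *d, int enable, int n)
{
    for (int i = 0; i < n; i++)
        clock_edge(d, enable);
    return d->count;
}
```

Because the model executes as plain compiled code, one function call per cycle, its speed is close to that of the application software itself, which is where the reported factor of 10 ... 100 over event-driven RT simulation comes from.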