Hardware/Software Co-Design: Design with Virtual Components and Processor Cores
Design with Virtual Components and Processor Cores
Design for the Target Technology FPGA
When looking at FPGAs as a target technology a special situation exists: companies like ALTERA or XILINX are primarily interested in selling chips. For this reason they do not offer design house activities. Their interest is to spread IP technology, so they provide and distribute cores at low cost. Libraries offered under these circumstances come at a low price; selling them is not a primary commercial objective for these companies. They work as brokers for the distribution of cores developed by customers or of public domain IP, without guarantee, and generally promote the use of IP. Besides standard libraries, which contain logic series and well known building blocks (e.g., counters, registers, adders, etc.), more complex LPM modules are available by using the provided generators. Setting up a structural block diagram with these components is easy and effective. In the case of a language driven design, where the target functions are described within the design language, operators are effectively replaced during synthesis by firm macros. The library from ALTERA [7.1], or a similar one from a competitor, offers a large selection of complex building blocks, from adders and multipliers to complete memory designs. Who really needs more? In addition, companies like MENTOR, SYNOPSYS, and others offer libraries specifically suited for FPGA design. Furthermore, target independent synthesis software for FPGA synthesis (e.g., LEONARDO Spectrum from Exemplar Logic Inc., now MENTOR) is also available.
The PCI interface was one of the first commercially successful complex soft cores; it made the breakthrough for the use of IP technology. This particular core demonstrated the ease with which a design can be performed. Using IP in this way showed the customer a clear benefit: the reduction of development time and risk. The core permitted easy design of PC card interfaces utilizing FPGAs, and as a result a huge market opened up for the FPGA companies.
FPGA cores are usually available as firm IP in code form. They can be completely simulated on the dedicated development systems of the manufacturer before a license is purchased and before the core is integrated into the design. In this situation the customer does not have to carry a design risk. The license is only necessary for programming the design into the target component (OpenCore program, ALTERA).
A remarkably large market for huge FPGAs with more than 100 k gates is the emulation of soft cores and whole designs, which are implemented on FPGA for testing purposes. Significant simulation times can be saved and design accuracy is increased. Emulation is of advantage even if the target frequencies cannot be realized in the FPGA technology. The gain in speed, relative to a classical digital simulation, may be more than a factor of 100, sometimes 10,000 or more, and with the newest FPGA chips real-time performance may be achieved. It will become commonplace to synthesize and verify large digital designs on an FPGA first, even if the internal structures are very distant from the target design. On a 50 k gate FPGA it is possible, without any major complications, to demonstrate the functionality and performance of a 16 bit processor core complete with all its periphery. Mega-gate sized FPGAs are available today; very large designs containing significant memory requirements can be prototyped [7.8]. Evaluation boards for this purpose are on the market from the FPGA companies as well as from small providers filling this market gap. Evaluation, verification, and optimization on FPGA has become the method of choice for prototyping IPs.
Processor Cores in ASICs
A very attractive driver application for virtual components is the possibility of using microprocessor cores as components in an ASIC. In this way designs can be realized which are software controlled and show intelligence, flexibility, and performance similar to a classical discrete microprocessor circuit. Today's complex algorithms and standards (e.g., MPEG) are subject to continuous small modifications. A programmable chip is substantially more flexible with respect to modifications than a hardwired circuit (that is why it is called 'soft'-ware). First choice are processors which are already well known from the discrete market, such as the 8051 family, the 650x family, and, to a smaller extent, some of the larger cores (e.g., the x86 series, the 680xx family, ARM 7, and ARM 9). The ARM family was designed for ASIC integration. Table 7.5 shows available processor cores for ASIC integration.
In modern CMOS technologies a processor core is only a few square millimetres in size. There is still plenty of space for additional customized circuits, see table 7.6.
Only a few design houses have the capability of developing their own processor cores. If the design is too similar to the discrete original, license issues may arise. The industry standard controller 8051 no longer carries this burden; clones are currently available from a number of companies. Compatibility with an existing standard controller has the advantage that all necessary software design programs (e.g., compiler, assembler, and debugger) are already available and the device is well known in the market. A competitor who starts a design from scratch must develop all of these essential tools on his own. For this reason a completely new design is only occasionally introduced. One of these which has shown success is the 32 bit ARM7TDMI.
In contrast to this mainstream scenario of cloning successful discrete processors, there are application specific instruction set processors (ASIPs), which are used in certain large mass applications (e.g., an MPEG decoder). The effort to create the development tools is much higher and tends to outweigh the gain in chip area. Hence only a few applications justify an ASIP: those with complex algorithms where the area must be minimized, or where there is no solution at all with standard processors.
ASIPs play a significant role [7.19] where fixed algorithms (e.g., JPEG or MPEG cores) with high performance requirements are needed in mass applications. Many modern mobile phones contain ASIPs which are derived from classical processors and specifically designed for these algorithms. ASIPs also dominate the set top box scene, computer game consoles and graphic accelerators, as well as multimedia applications such as face-to-face video phones, which demand extreme GOPS performance and low power at the same time [7.19]. These ASIPs combine signal processor characteristics with control functionality. Their internal structures work strongly in parallel, forming so called VLIW architectures 1) with astonishingly low clock frequencies. Table 7.7 lists some current ASIP cores, taken from [7.19].
The use of programmable processor cores allows, in connection with integrated memory, the design of complete systems on a single chip, a so called SOC (system on a chip). Numerous new products in the wireless area are SOCs; one-chip systems are very advantageous in power consumption. Substantial arithmetic performance is demanded in these applications, both in mobile phones, where signal processors are adopted, and in classical arithmetical tasks, such as in GPS receivers. Standard 8 bit controllers such as the 8051 family are not well suited here and show poor performance. Computer game consoles place high demands on a processor, yet to be competitive it must be provided at a low price; often game consoles have more computing and graphics capability than many modern computers. Here we find RISC cores such as the 32 bit ARM 7, although the licensing costs are still high. The ARM cores are available as a large family with a wide selection of performance levels, by now adaptable to nearly all relevant CMOS technologies [7.2]. They are accompanied by high quality software tools, supported by many providers. The cores are now available on FPGA too, in the form of hard cores as well as soft cores in a high level design language. Since the license costs are still high for small companies, ARM is mostly found with big customers.
Smaller processor cores are mostly distributed as soft cores, but the effort for verification is significant and depends strongly on the support of the provider and on the synthesis scripts used. Hard cores are therefore still available and used because of the lower design risk, provided they exist in the target technology. Sometimes re-targeting is done by the supplier.
A substantial problem is still the integration of the building blocks belonging to the processor system, such as serial and parallel interfaces, RAM, and ROM. Because these blocks originate from different suppliers and there is no real standardisation of the bus interconnection, numerous adjustments and critical simulations are needed to verify performance. ARM is here again at the forefront of the technology with its AMBA bus standard, which allows a certain plug-in capability. It is therefore very important to have an exact description of the bus, e.g., in the form of a bus model. There are many suggestions for such bus standards [7.27], but they can be adopted in new designs only.
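The idea of a bus model can be sketched in a few lines of C. The following fragment is only an illustration at the transaction level, with invented names; it is not the AMBA specification, which also defines the signals and the cycle timing in detail.

    #include <stdint.h>

    /* Transaction-level bus model (illustrative sketch).  Each slave
     * block registers its address window and two callbacks; the bus
     * model dispatches read/write transactions to the matching slave. */
    typedef struct {
        uint32_t base, size;                           /* address window    */
        uint32_t (*read)(uint32_t offset);             /* read transaction  */
        void     (*write)(uint32_t offset, uint32_t);  /* write transaction */
    } bus_slave_t;

    #define MAX_SLAVES 8
    static bus_slave_t slaves[MAX_SLAVES];
    static int n_slaves;

    int bus_attach(bus_slave_t s)          /* called once per IP block */
    {
        if (n_slaves == MAX_SLAVES) return -1;
        slaves[n_slaves++] = s;
        return 0;
    }

    uint32_t bus_read(uint32_t addr)
    {
        for (int i = 0; i < n_slaves; i++)
            if (addr - slaves[i].base < slaves[i].size)
                return slaves[i].read(addr - slaves[i].base);
        return 0xDEADBEEF;                 /* no slave decoded         */
    }

    void bus_write(uint32_t addr, uint32_t data)
    {
        for (int i = 0; i < n_slaves; i++)
            if (addr - slaves[i].base < slaves[i].size)
                slaves[i].write(addr - slaves[i].base, data);
    }

A core model and peripheral blocks from different suppliers that agree on such an interface can be simulated together without knowledge of each other's internals, which is exactly the promise of a standardised on-chip bus.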
Small and medium enterprises, for which commercial processor cores are too expensive, may fall back on the freely available public domain cores mentioned before, e.g., on the FHOP16 controller (fig. 7.3) with 1.5 mm2 area and a throughput of 5 MIPS, designed at the institute of the author, which has already been used successfully in several applications. The use of hard IP requires that a semiconductor manufacturer be chosen to provide the chips. This leads to a long term binding agreement with the chosen manufacturer. The decision for a certain processor core is in this way a strategic decision at the management level and binds the company, because of the high and long term investments in licenses, tools, and, not least, in the training of the engineers.
ASIPs require very high personnel and capital expenditure. In addition to the actual core the entire software development system, including the compiler, must be developed. In order to simulate the core and to validate its performance together with the dedicated hardware, special developments have to be made in the CAE area. Ordinarily only large enterprises operating worldwide are able to do that.
Embedded Software
With the use of processor cores a new level of complexity is reached, since the functionality of the ASIC no longer depends on the gate logic, the architecture, or the hardware alone; rather, the software implemented on the chip determines the functionality. This is similar to a freely programmable computer system made of discrete devices. There are many advantages:
• Functionalities can be designed which would not be feasible with hardware alone because of their complexity;
• The software may have many different tasks which can be specified late in the development of the chip, and can be modified flexibly;
• The chip can be used for very different applications and thereby regains a certain degree of universality, which may lead to larger delivery volumes and thus lower costs. At the very least this leads to substantial savings in development;
• If the software is downloaded on activation of the chip or is stored in a FLASH or EEPROM memory, then extreme flexibility is achievable. This level of flexibility is not possible with universal, discrete embedded systems.
Most of these application programs are still developed in assembler because of the small resources in RAM and ROM. Nevertheless it is recommended to apply certain principles of real-time operating systems (RTOS). These principles are:
• Organize the software in layers with their own functionality and well defined interfaces between the layers;
• Separate the communication modules and the hardware related modules from the main application program and define the BIOS. Isolate the hardware from the application software with an abstract interface;
• Use a modular or object oriented style for the application software.
The layer model requires a minimum of three layers:
• The layer with the hardware drivers;
• The operating system layer;
• The application layer.
In many cases a more detailed partitioning makes little sense. The driver layer contains all routines which directly affect the hardware; it drives interfaces, obtains information from the real world (e.g., keyboard readout, display driver, sensor interface, real time clock, a/d converter readout, etc.), and drives special ASIC periphery modules. These software routines are normally small and specifically tied to their I/O addresses. The initialisation routine, which sets up all the hardware after a reset or power-down, may also be counted among the drivers. These routines are generally combined in the BIOS 1), which is placed in the ROM of the chip. To be able to boot correctly, the BIOS instructions must be mapped to those addresses in the ROM where the processor starts execution after a reset. This may be at the bottom of the memory map at 0x0000 or at the top, depending on the processor architecture.
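For illustration, a driver layer of this kind might look as follows in C. The register addresses and routine names are invented for the sketch; in a real design they come from the memory map of the ASIC.

    #include <stdint.h>

    /* Hypothetical memory-mapped peripheral registers of the ASIC;
     * the addresses are invented for this sketch. */
    #define KEYB_DATA   (*(volatile uint8_t  *)0xFF00) /* keyboard scan code */
    #define KEYB_STATUS (*(volatile uint8_t  *)0xFF01) /* bit 0: key ready   */
    #define DISP_DATA   (*(volatile uint8_t  *)0xFF10) /* display character  */

    /* Driver routines: small, hardware specific, part of the BIOS. */
    uint8_t bios_key_read(void)
    {
        while ((KEYB_STATUS & 0x01) == 0)
            ;                       /* busy-wait until a key is pressed */
        return KEYB_DATA;
    }

    void bios_disp_putc(uint8_t c)
    {
        DISP_DATA = c;              /* write one character to the display */
    }

    /* Initialisation routine, reached from the reset vector: sets up
     * all hardware after a reset or power-down. */
    void bios_init(void)
    {
        KEYB_STATUS = 0;            /* clear pending key */
        DISP_DATA   = ' ';          /* blank the display */
    }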
The procedures are called with arguments which may be transferred via the registers of the processor core. If the core implements it, a software interrupt call (SWI call) is preferable to a normal function call. With normal subroutine calls the addresses of the routines have to be known at compilation time, which can be arranged via header files and an added BIOS library. A more independent solution is to use the index mechanism of software interrupts to isolate the BIOS routines, which are provided as ROMware (sometimes called firmware), from the application program, with a common agreement on the interrupt numbers, where each vector stands for a certain routine. BIOS calls are then independent of the version, and no library or header file has to be provided for compilation (e.g., software interrupt 15 may always call a subroutine copying memory from location A to location B, interrupt 18 may provide an output of data on pin number 3, etc.). The index table 2) is loaded during the initialisation phase by the associated BIOS routine and always contains the current call addresses of the subroutines. Because this table is not fixed, these vectors can be exchanged by the application program at run time, exchanging in this way complete routines and functionality. This adds to the flexibility; see fig. 7.7 for more details.
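In C, this indirection can be modelled as a RAM-resident table of function pointers indexed by the interrupt number. The sketch below uses invented names and omits the processor specific trap mechanics of the SWI instruction.

    #include <stdint.h>

    #define N_VECTORS 32

    /* RAM-resident index table: entry n holds the current address of
     * the BIOS routine behind software interrupt n. */
    typedef void (*bios_call_t)(void *args);
    static bios_call_t swi_table[N_VECTORS];

    /* ROM-resident BIOS routines (names invented for the sketch). */
    static void swi_memcopy(void *args) { (void)args; /* copy A -> B   */ }
    static void swi_pin_out(void *args) { (void)args; /* data on pin 3 */ }

    /* Loaded during the initialisation phase; the application may later
     * overwrite single entries to exchange complete routines at run time. */
    void swi_table_init(void)
    {
        swi_table[15] = swi_memcopy;
        swi_table[18] = swi_pin_out;
    }

    /* The hardware SWI trap ends up here, with the interrupt number
     * already extracted from the instruction. */
    void swi_dispatch(unsigned num, void *args)
    {
        if (num < N_VECTORS && swi_table[num])
            swi_table[num](args);
    }

Because the application agrees only on the interrupt numbers, a new BIOS version may move its routines freely in ROM without the application being recompiled.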
The operating system layer accommodates the more complex scheduler tasks. This includes the administration of the interrupts and of the task priorities. The control of the sequence of tasks, called scheduling, and the administration of chip resources (e.g., the RAM) also belong to this layer. One of the most important features is the ability to start tasks and to communicate through communication channels. This may be via simple keyboard or display operations, but more frequently these are complex, protocol driven communications via USB, CAN, or TCP/IP interface modules. The protocol state machines are organized here, as is the central task scheduler; in this case they are implemented as software routines. The operating system layer refers to the underlying driver layer for the driver routines and generates the calls to these subroutines. Further provided in this OS layer are useful routines which extend the features of the processor (e.g., routines for mathematical operations like division, sorting, etc.). If there is a SWI mechanism, this part of the BIOS will be standardised too, by using indirect calls as described before. Otherwise there must be some kind of classical subroutine call, perhaps organized as a jump table.
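A minimal cooperative scheduler of the kind found in this layer can be sketched in a few lines of C; a real RTOS adds preemption, priorities, and communication channels. All names and the fixed task count are invented for the example.

    /* Minimal cooperative round-robin scheduler (illustrative sketch).
     * Each task is a C function that runs to completion when polled. */
    typedef struct {
        void (*run)(void);   /* task body, must return quickly       */
        int  ready;          /* set by interrupts or by other tasks  */
    } task_t;

    #define N_TASKS 4
    static task_t tasks[N_TASKS];

    void scheduler_loop(void)
    {
        for (;;) {                          /* central scheduler loop */
            for (int i = 0; i < N_TASKS; i++) {
                if (tasks[i].ready && tasks[i].run) {
                    tasks[i].ready = 0;
                    tasks[i].run();         /* run one ready task     */
                }
            }
            /* optionally enter a low power mode until the next interrupt */
        }
    }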
The application layer finally contains the code for the tasks which form the actual application. With a structured programming style the application layer consists to a large extent of subroutine calls into the underlying layer. Very complex functionalities can be programmed with this hierarchical architecture. At the same time the code can be well maintained and easily modified, even with little memory. At this level cross compilers can be used effectively, which improves the productivity of the developers substantially, without large concessions in memory consumption.
Because of this isolation, the BIOS and the application program can each be developed further independently. Once developed, the BIOS may be transferred with slight modifications to a new chip design and in that way re-used in more than one application.
The application program can be placed in the ROM, too; the chip can then, however, be used only for the intended purpose. It is more elegant to download (boot) and execute the application program at the start-up of the chip, which may be triggered by the rising slope of the main supply voltage. Such a download may happen from a serial EEPROM as well as from a FLASH ROM, if there is no reprogrammable memory on board the chip. Downloading is carried out by the initialising BIOS routine, which executes automatically after a reset. The required EEPROM is tiny and cheap, see fig. 7.8. The procedure of downloading is comparable to the initialisation of a RAM based FPGA, where the configuration information is stored in an EEPROM and loaded into the chip with the rising of the supply voltage.
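Such a boot routine might look as follows in C; the SPI helper functions, the read command, and the two byte length header are assumptions made for the sketch.

    #include <stdint.h>

    /* Boot download from a serial EEPROM (illustrative sketch). */
    extern uint8_t spi_xfer(uint8_t out);   /* exchange one byte on SPI */
    extern void    spi_select(int on);

    #define EEPROM_READ_CMD 0x03
    #define APP_RAM ((uint8_t *)0x2000)     /* invented load address    */

    void bios_boot(void)
    {
        spi_select(1);
        spi_xfer(EEPROM_READ_CMD);          /* read from address 0x0000 */
        spi_xfer(0x00);
        spi_xfer(0x00);
        uint16_t len = spi_xfer(0xFF);      /* image length, low byte   */
        len |= (uint16_t)spi_xfer(0xFF) << 8;
        for (uint16_t i = 0; i < len; i++)
            APP_RAM[i] = spi_xfer(0xFF);    /* copy the image into RAM  */
        spi_select(0);
        ((void (*)(void))APP_RAM)();        /* jump to the entry point  */
    }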
If we include an upload feature in the BIOS, too, it becomes possible to modify the application program on-line, to update it dynamically, and so provide new features. The application program may be organized in pages, which are loaded by the operating system dynamically in a classical overlay style. The number of loadable pages, and thus the size of the implemented program, is then nearly unlimited. As an alternative to an EEPROM, a FLASH ROM can also be used for booting.
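The overlay mechanism can reuse the same download path. The following sketch, again with invented names and sizes, loads a requested page into a single RAM window before calling into it.

    #include <stdint.h>

    /* Classical overlay loading (sketch): pages of the application are
     * fetched on demand from the EEPROM into one RAM window. */
    #define PAGE_SIZE 1024
    #define PAGE_RAM  ((uint8_t *)0x3000)   /* invented overlay window  */

    extern void eeprom_read(uint32_t addr, uint8_t *dst, uint16_t len);

    static int current_page = -1;

    /* Make page n resident, then call the routine at 'offset'. */
    void overlay_call(int n, uint16_t offset)
    {
        if (n != current_page) {
            eeprom_read((uint32_t)n * PAGE_SIZE, PAGE_RAM, PAGE_SIZE);
            current_page = n;
        }
        ((void (*)(void))(PAGE_RAM + offset))();
    }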
Programming of embedded software is done today using integrated development systems with cross compilers which usually run on PC platforms. As far as the core is compatible with discrete processors, the classical development systems can be used. In practice, however, because of the very limited memory resources and the high performance demands, more than 70 % of the programs are written in assembly language. High level language C compilers still play a subordinate role, not least because of the unacceptable code overhead and the sub-optimal performance [7.19]. This is changing with the new 32 bit processors and larger (external) memories, because there is no alternative.
For widespread discrete processors real-time operating systems (RTOS) are available as complete, configurable programs. For chip applications these are usually too large or have to be substantially modified; the advantage of using such predefined modules is then small. Recently the first RTOS kernels have appeared which can be used in a SOC [7.25]. For the larger RISC processors many real-time operating systems exist which may be adopted (e.g., for the ARM family); according to data given by the manufacturer [7.2], there are more than 20 different providers 1). For ASIPs the manufacturers maintain their own OS. Huge ROM based operating systems like WINDOWS CE, PSION EPOC, NetBSD, or embedded LINUX are available too, but are never integrated into the chip itself.
The programming and validating of the driver routines remains the task of the designer, but in the meantime more support is on the horizon. There are now driver routines for reading and writing FLASH ROM, provided by the big manufacturers INTEL, AMD, and others for the most important processor architectures, sometimes with complete file system features. It is to be expected that other providers of IP or virtual components (VC) will in future deliver the driver routines for their products for the dominating processor architectures, too. This will ease and accelerate development significantly.
Hardware/Software Co-Simulation
With the implementation of processor cores the classical top-down design flow is considerably disturbed. At the beginning of the SOC design cycle the partitioning of the task must be decided. What can be done in hardware and what has to be done in software? Which functionalities are realized by programming and which have to be realized in dedicated hardware blocks? Software is much cheaper, so there is a rule: 'whatever can be done in software shall be programmed!'. Hardware blocks are only designed in cases in which requirements concerning speed, performance, or functionality cannot be fulfilled by a software-only solution. In most cases the decision is easy. Enormous research efforts are going into the field of software/hardware partitioning [7.18]. There are mixed solutions where a close interaction of the processor core with special hardware blocks is needed; sometimes these are termed coprocessors. These solutions require accurate, cycle-true hardware/software co-simulation. Verification of such systems takes much more effort than pure hardware or pure software solutions.
One goal of partitioning is the isolation of the two work areas with the intention of being able to work on both tasks in parallel and as independently as possible. Such concurrent engineering makes sense because of the time needed for software development. A sequential processing of, first, hardware and, then, software development will blow up the overall development time. Frequently software development takes more than 50 % of the overall development effort [7.10]. The dependencies between hardware and software must be bridged by modelling constructs, so that most of the tasks may be designed independently of the counterpart.
For software simulation, so called virtual software components may be used. Modern operating systems like WINDOWS or LINUX allow communication between independent tasks via an OLE or DDE mechanism. Hardware components like keyboard or display may thus be represented by a WINDOWS program, which is then connected via OLE directly to a debugger/simulator. If an application is debugged in the simulator and keyboard entry is required, the appropriate data is typed in and channelled to the program as if there were a hardware connection to a mechanical keyboard. In the same way, data for the display is generated which may be channelled to a window on the screen. The figure shows the virtual components 'keyboard' and 'display' of a pocket calculator system, which is simulated cycle-true on a debugger. In the simulation a detailed analysis and optimisation of the program can now be done with breakpoints, register displays, stepwise execution, and detailed reference to the source code. All blocks which are referenced by the BIOS driver layer may be chosen as virtual software components. Using the file features of the PC, complex scenarios may be recorded and replayed.
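The coupling of such a virtual component can be imitated with any inter-process channel. The sketch below uses a POSIX named pipe in place of the OLE/DDE link described above; all names are invented.

    #include <fcntl.h>
    #include <unistd.h>

    /* Debugger-side stub of a virtual keyboard (sketch): a named pipe
     * stands in for the OLE/DDE link; the keyboard window application
     * writes each keystroke into the FIFO /tmp/vkbd. */
    static int kbd_fd = -1;

    unsigned char virtual_key_read(void)
    {
        unsigned char c;
        if (kbd_fd < 0)
            kbd_fd = open("/tmp/vkbd", O_RDONLY);
        if (kbd_fd >= 0 && read(kbd_fd, &c, 1) == 1)
            return c;    /* behaves like a key press on real hardware */
        return 0;
    }

The simulator calls virtual_key_read() wherever the BIOS driver would poll the keyboard register, so the program under test cannot tell the difference.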
In the usual design flow the hardware development takes place largely decoupled from the software development, whereby the driver programs of the BIOS must be verified in a classical functional digital simulation. Because these driver programs are usually quite short, the time effort is affordable. The content of the ROM must be loaded from the development system into the ROM model with suitable converter programs at the C or VHDL level before simulation. As an intermediate format the widespread INTEL hex format is appropriate, which can be generated by many development systems and is also used for programming EPROM components.
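The conversion is simple enough to be written by the designer. The following C routine parses one record of the INTEL hex format into a ROM image; it is a sketch without the extended address record types and with minimal error handling.

    #include <stdint.h>
    #include <stdlib.h>

    /* Parse one INTEL hex record, e.g.
     *   :10010000214601360121470136007EFE09D21940
     * into the ROM image.  Returns 1 for the end-of-file record,
     * 0 for a correct data record, -1 on a checksum error. */
    static uint8_t hexbyte(const char *s)
    {
        char b[3] = { s[0], s[1], 0 };
        return (uint8_t)strtoul(b, NULL, 16);
    }

    int ihex_record(const char *line, uint8_t *rom)
    {
        uint8_t  len  = hexbyte(line + 1);                 /* byte count  */
        uint16_t addr = (hexbyte(line + 3) << 8) | hexbyte(line + 5);
        uint8_t  type = hexbyte(line + 7);                 /* record type */
        uint8_t  sum  = len + (addr >> 8) + (addr & 0xFF) + type;

        if (type == 0x01)
            return 1;                                      /* end of file */
        for (int i = 0; i < len; i++) {
            uint8_t d = hexbyte(line + 9 + 2 * i);
            rom[addr + i] = d;                             /* store data  */
            sum += d;
        }
        sum += hexbyte(line + 9 + 2 * len);                /* checksum    */
        return (sum == 0) ? 0 : -1;   /* all bytes must sum to 0 mod 256 */
    }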
As the design process proceeds, a total verification of all functions must be done in the next step. This must include all external circuitry and building blocks as well as the application programs. This can be done:
• by classical digital simulation at the gate or register transfer level with loaded BIOS and application program;
• on an integrated hardware/software design system;
• by rapid prototyping and emulation on FPGA.
With the assumption that the application program takes some kilobytes and the BIOS is of the same order of magnitude, a verification effort of easily some million processor cycles can be predicted. Each cycle may be divided again into several timing intervals, which leads to computing times of hours, days, or weeks even on big computers. This is usually not acceptable; some people call it the verification trap: the brute-force method does not work. There are several workarounds to this problem. One can take behavioural models for the core instead of the RTL model, which speeds up simulation significantly. Timing simulation may be omitted, but the verification value is then decreased. The periphery must always be present in RTL style. External interfaces are difficult to simulate, especially if human interaction is included; these inputs are non-deterministic and cannot simply be replaced by pre-programmed behaviour. One can state in summary that a satisfying verification of a SOC with classical digital simulation is not possible.
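A rough estimate with purely illustrative numbers makes the trap concrete: assume 8 kB of code (about 4,000 instructions), each instruction executed 1,000 times on average, 4 clock cycles per instruction, and a gate-level simulator running at about 100 simulated cycles per second. Then

    N_cycles ≈ 4,000 × 1,000 × 4 ≈ 1.6 × 10^7 cycles
    T_sim    ≈ 1.6 × 10^7 / 100 cycles per second ≈ 1.6 × 10^5 s ≈ 2 days

Even these optimistic assumptions lead to days of computing time for a single complete run.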
The most important EDA software providers today deliver combined hardware/software development systems which allow a limited simulation of software controlled SOCs. These systems use behavioural models formulated at a high abstraction level in the C language. The problem of long simulation times and of emulating interaction scenarios is not solved, only mitigated. There is also a possible inconsistency between the core used and the model, since they are derived from different origins. Real-time performance is far away. These systems will be discussed later in detail.
In practice, therefore, the emulation of designs in FPGA, designated 'rapid prototyping', is the preferred way of verification. The processor core may be included in the emulation as a discrete component, as far as one exists. If the core exists as soft IP, it may be synthesized to the FPGA and included in the emulation directly. Large designs may be spread over more than one FPGA; today circuits with up to 10 million gates are available for this purpose.
For pure digital circuits this FPGA emulation allows a realistic simulation; the FPGA can be connected to the real world and fed with real signals. Most of these emulations work in real time. Only at very high clock frequencies is real time not possible; then the clock frequency must be slowed down and the input stimuli have to be delayed. It has to be remembered, however, that the internal architecture of an FPGA is quite different from an ASIC implementation, so a direct comparison of the timing behaviour is not possible. In clock synchronous designs, which are state of the art and generally preferred, the behaviour is defined at the rising clock edge, and this should be identical for FPGA and ASIC. With timing problems the master clock therefore has to be slowed down until this condition is fulfilled. Compared to digital simulation this is still a gigantic improvement and speed-up.
Today FPGA emulation systems are offered on the market by several EDA providers. The standard development systems from ALTERA, XILINX, and others may be used for this purpose as well. Suitable electronic boards with prototype FPGAs are available, onto which a design can be downloaded from the workstation, but in many cases they are suitable only for relatively small projects.
Placement and Routing of ASIC Cores
Only soft IP can be integrated into FPGAs and gate arrays, possibly also firm IP with the associated placement rules. With a standard cell design style, hard IP can be used too, as far as it is available in the selected technology. Hard IP providers port their cores into the target CMOS technology on request, so there are few restrictions on using a block design style for creating big designs.
Processor cores may be placed in one corner, the even larger RAM and ROM blocks in another, interconnected with a power supply ring. The other building blocks are then placed into the remaining gaps. If they do not fit, their hierarchy is dissolved and flattened in order to use the silicon area as optimally as possible. With multiple metal layer routing, very dense designs can be created, although the block style is not optimal with respect to area. Analogue blocks need their own power supply system and are separated by guard rings, with contacts to substrate and supply or ground, isolated as far as possible from the digital parts of the circuit. These blocks are preferably placed at the borders of the chip, in order to obtain short interconnection wiring to the pad cells and to be able to organize their own separate, isolated analogue supply. Analogue cells are not routed as densely as digital cells and need more space on silicon. Fig. 7.11 shows a typical block cell design with analogue and digital building blocks.