In addition to satisfying the demands of the most computationally intensive, real-time signal-processing applications, SHARC processors integrate large memory arrays and application-specific peripherals designed to simplify product development and reduce time to market. Whatever the specific product choice, all SHARC processors provide a common set of features and functionality usable across many signal-processing markets and applications. Second-generation products add a SIMD hardware extension to the first-generation SHARC processors that doubles the number of computational resources available to the system programmer: they contain dual multipliers, ALUs, shifters, and data register files, significantly increasing overall system performance in a variety of applications. This capability is especially relevant in consumer, automotive, and professional audio, where the algorithms related to stereo channel processing can effectively utilize the SIMD architecture.
One of the biggest bottlenecks in executing DSP algorithms is transferring information to and from memory. This includes data, such as samples from the input signal and the filter coefficients, as well as program instructions, the binary codes that go into the program sequencer. For example, suppose we need to multiply two numbers that reside somewhere in memory. To do this, we must fetch three binary values from memory: the two numbers to be multiplied, plus the program instruction describing what to do.
Figure a shows how this seemingly simple task is done in a traditional microprocessor. This is often called a Von Neumann architecture, after the brilliant American mathematician John Von Neumann. Von Neumann guided the mathematics of many important discoveries of the early twentieth century.
His many achievements include developing the concept of a stored-program computer, formalizing the mathematics of quantum mechanics, and work on the atomic bomb. If it was new and exciting, Von Neumann was there! As shown in (a), a Von Neumann architecture contains a single memory and a single bus for transferring data into and out of the central processing unit (CPU). Multiplying two numbers requires at least three clock cycles, one to transfer each of the three numbers over the bus from the memory to the CPU.
We don't count the time to transfer the result back to memory, because we assume that it remains in the CPU for additional manipulation (such as the sum of products in an FIR filter). The Von Neumann design is quite satisfactory when you are content to execute all of the required tasks in serial.
In fact, most computers today are of the Von Neumann design. We only need other architectures when very fast processing is required, and we are willing to pay the price of increased complexity.
This leads us to the Harvard architecture, shown in (b). This is named for the work done at Harvard University under the leadership of Howard Aiken. As shown in this illustration, Aiken insisted on separate memories for data and program instructions, with separate buses for each.
Since the buses operate independently, program instructions and data can be fetched at the same time, improving the speed over the single bus design.
Most present day DSPs use this dual bus architecture. Figure c illustrates the next level of sophistication, the Super Harvard Architecture. The idea is to build upon the Harvard architecture by adding features to improve the throughput. First, let's look at how the instruction cache improves the performance of the Harvard architecture.
A handicap of the basic Harvard design is that the data memory bus is busier than the program memory bus. When two numbers are multiplied, two binary values (the numbers) must be passed over the data memory bus, while only one binary value (the program instruction) is passed over the program memory bus.
To improve upon this situation, we start by relocating part of the "data" to program memory. For instance, we might place the filter coefficients in program memory, while keeping the input signal in data memory. This relocated data is called "secondary data" in the illustration. At first glance, this doesn't seem to help the situation; now we must transfer one value over the data memory bus (the input signal sample), but two values over the program memory bus (the program instruction and the coefficient).
In fact, if we were executing random instructions, this situation would be no better at all. However, DSP algorithms generally spend most of their execution time in loops, such as the loop instructions in the table. This means that the same set of program instructions will continually pass from program memory to the CPU.
The Super Harvard architecture takes advantage of this situation by including an instruction cache in the CPU. This is a small memory that contains about 32 of the most recent program instructions. The first time through a loop, the program instructions must be passed over the program memory bus.
This results in slower operation because of the conflict with the coefficients that must also be fetched along this path. However, on additional executions of the loop, the program instructions can be pulled from the instruction cache. This means that all of the memory to CPU information transfers can be accomplished in a single cycle: the sample from the input signal comes over the data memory bus, the coefficient comes over the program memory bus, and the program instruction comes from the instruction cache.
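The bus traffic described above can be tallied in a small model. This is only a sketch under simplifying assumptions that are not in the text: every bus moves exactly one value per cycle, independent buses run fully in parallel, and the instruction is found in the cache on every pass after the first. The function name `cycles_per_multiply` and the architecture labels are hypothetical.

```python
def cycles_per_multiply(architecture, iteration):
    """Return the memory-transfer cycles needed for one multiply.

    Each multiply needs three values: an input sample, a coefficient,
    and the program instruction. Assumption: a bus carries one value
    per cycle and separate buses work in parallel, so the cost is set
    by the busiest bus.
    """
    if architecture == "von_neumann":
        # One shared bus carries all three values, one after another.
        bus_loads = {"shared": 3}
    elif architecture == "harvard":
        # Both operands use the data bus; the instruction uses the
        # program bus.
        bus_loads = {"data": 2, "program": 1}
    elif architecture == "super_harvard":
        # The coefficient has been relocated to program memory
        # ("secondary data"). On the first pass through the loop the
        # instruction also travels over the program bus; afterwards
        # it is assumed to come from the instruction cache.
        instruction_on_bus = 1 if iteration == 0 else 0
        bus_loads = {"data": 1, "program": 1 + instruction_on_bus}
    else:
        raise ValueError(f"unknown architecture: {architecture}")
    return max(bus_loads.values())


for name in ("von_neumann", "harvard", "super_harvard"):
    first = cycles_per_multiply(name, iteration=0)
    later = cycles_per_multiply(name, iteration=1)
    print(f"{name}: first pass {first}, later passes {later}")
```

In this model only the Super Harvard case drops to a single cycle, and only once the loop has been executed at least once, which matches the argument made in the text.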
In the jargon of the field, this efficient transfer of data is called a high memory-access bandwidth. Data must also get into and out of the chip, and the serial and parallel ports are how the signals enter and exit the system. These are extremely high-speed connections, fast enough to transfer the entire text of this book in only 2 milliseconds! Just as important, dedicated hardware allows these data streams to be transferred directly into memory (Direct Memory Access, or DMA), without having to pass through the CPU's registers.
The main buses (program memory bus and data memory bus) are also accessible from outside the chip, providing an additional interface to off-chip memory and peripherals. The overriding goal is to move the data in, perform the math, and move the data out before the next sample is available.
Everything else is secondary. Some DSPs have on-board analog-to-digital and digital-to-analog converters, a feature called mixed signal.
However, all DSPs can interface with external converters through serial or parallel ports. Now let's look inside the CPU. The data address generators (DAGs) control the addresses sent to the program and data memories, specifying where the information is to be read from or written to. In simpler microprocessors this task is handled as an inherent part of the program sequencer, and is quite transparent to the programmer.
However, DSPs are designed to operate with circular buffers, and benefit from the extra hardware to manage them efficiently. This avoids needing to use precious CPU clock cycles to keep track of how the data are stored. Each DAG holds 32 variables (4 per buffer, which works out to 8 circular buffers), plus the required logic. Why so many circular buffers? Some DSP algorithms are best carried out in stages.
For instance, IIR filters are more stable if implemented as a cascade of biquads (a biquad is a stage containing two poles and up to two zeros). Multiple stages require multiple circular buffers for the fastest operation. The DAGs can also be configured to generate bit-reversed addresses into the circular buffers, a necessary part of the FFT algorithm.
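To make bit-reversed addressing concrete, here is a minimal software sketch of the index sequence such an address generator would need to produce. It does not model the DAG hardware itself; the function names `bit_reverse` and `bit_reversed_order` are hypothetical, and the buffer length is assumed to be a power of two, as in a radix-2 FFT.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    reversed_index = 0
    for _ in range(num_bits):
        # Shift the result left and append the lowest bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(length):
    """Return the bit-reversed visiting order for a buffer of the
    given length, which must be a power of two."""
    num_bits = length.bit_length() - 1
    if length < 1 or (1 << num_bits) != length:
        raise ValueError("length must be a power of two")
    return [bit_reverse(i, num_bits) for i in range(length)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this reordering in software costs a loop per address; generating the addresses in dedicated hardware is what removes that cost from the FFT.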
In addition, an abundance of circular buffers greatly simplifies DSP code generation, both for the human programmer and for high-level language compilers, such as C.
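As a rough illustration of the bookkeeping a circular buffer requires, the sketch below manages one buffer with four variables. The text says only that there are four variables per buffer; the particular choice here (base address, length, step, and current pointer) is an assumption, and the class name `CircularBuffer` is hypothetical. On a DSP this wrap-around logic is what the DAG hardware is described as handling, so it costs the program no extra clock cycles.

```python
class CircularBuffer:
    """Software model of one circular buffer inside a shared memory.

    Assumed per-buffer variables: base, length, step, pointer.
    """

    def __init__(self, memory, base, length, step=1):
        self.memory = memory    # shared memory (a Python list here)
        self.base = base        # address of the first element
        self.length = length    # number of elements in the buffer
        self.step = step        # pointer increment after each access
        self.pointer = base     # address of the next access

    def _advance(self):
        # Move the pointer by one step, wrapping inside the buffer.
        offset = (self.pointer - self.base + self.step) % self.length
        self.pointer = self.base + offset

    def write(self, value):
        """Store value at the pointer, then advance the pointer."""
        self.memory[self.pointer] = value
        self._advance()

    def read(self):
        """Return the value at the pointer, then advance the pointer."""
        value = self.memory[self.pointer]
        self._advance()
        return value


memory = [0] * 16
samples = CircularBuffer(memory, base=4, length=4)
for value in [1, 2, 3, 4, 5, 6]:
    samples.write(value)
print(memory[4:8])       # [5, 6, 3, 4]: the oldest samples were overwritten
print(samples.pointer)   # 6
```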
The data register section of the CPU is used in the same way as in traditional microprocessors. These registers can hold intermediate calculations, prepare data for the math processor, serve as a buffer for data transfer, hold flags for program control, and so on.
If needed, these registers can also be used to control loops and counters; however, the SHARC DSPs have extra hardware registers to carry out many of these functions. The math processing is broken into three sections: a multiplier, an arithmetic logic unit (ALU), and a barrel shifter.
The multiplier takes the values from two registers, multiplies them, and places the result into another register. Elementary binary operations are carried out by the barrel shifter, such as shifting, rotating, extracting and depositing segments, and so on. In a single clock cycle, data from registers can be passed to the multiplier, data from registers can be passed to the ALU, and the two results returned to any of the 16 registers.
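The shifter operations named above (rotating, extracting, and depositing segments) can be written out as plain integer arithmetic. This is a sketch of what the operations mean, not of the SHARC instruction set: the 32-bit word width and the function names `rotate_left`, `extract_field`, and `deposit_field` are assumptions made for illustration.

```python
WORD_BITS = 32                      # assumed word width for this sketch
WORD_MASK = (1 << WORD_BITS) - 1


def rotate_left(word, count):
    """Rotate a word left; bits leaving the top re-enter at the bottom."""
    count %= WORD_BITS
    word &= WORD_MASK
    return ((word << count) | (word >> (WORD_BITS - count))) & WORD_MASK


def extract_field(word, position, width):
    """Return the width-bit segment that starts at bit position."""
    return (word >> position) & ((1 << width) - 1)


def deposit_field(word, position, width, value):
    """Return word with the width-bit segment at position replaced
    by the low bits of value."""
    field_mask = ((1 << width) - 1) << position
    cleared = word & ~field_mask
    return (cleared | ((value << position) & field_mask)) & WORD_MASK


print(hex(rotate_left(0x80000001, 1)))              # 0x3
print(hex(extract_field(0xABCD1234, 8, 8)))         # 0x12
print(hex(deposit_field(0xABCD1234, 8, 8, 0xFF)))   # 0xabcdff34
```

Each of these takes several shifts and masks when spelled out in software, which is why a dedicated barrel shifter is useful.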
There are also many important features of the SHARC family architecture that aren't shown in this simplified illustration. For instance, an 80 bit accumulator is built into the multiplier to reduce the round-off error associated with multiple fixed-point math operations. Another interesting feature is the use of shadow registers. These are duplicate registers that can be switched with their counterparts in a single clock cycle.
They are used for fast context switching , the ability to handle interrupts quickly. When an interrupt occurs in traditional microprocessors, all the internal data must be saved before the interrupt can be handled.
This usually involves pushing all of the occupied registers onto the stack, one at a time. In comparison, an interrupt in the SHARC family is handled by moving the internal data into the shadow registers in a single clock cycle. When the interrupt routine is completed, the registers are just as quickly restored. This feature allows step 4 on our list (managing the sample-ready interrupt) to be handled very quickly and efficiently. Now we come to the critical performance of the architecture: how many of the operations within the loop steps of the table can be carried out at the same time.
Specifically, within a single clock cycle, a SHARC DSP can perform a multiply (step 11), an addition (step 12), two data moves (steps 7 and 9), update two circular buffer pointers (steps 8 and 10), and control the loop (step 6). There will be extra clock cycles associated with beginning and ending the loop (steps 3, 4, 5 and 13, plus moving initial values into place); however, these tasks are also handled very efficiently. If the loop is executed more than a few times, this overhead will be negligible. As an example, suppose you write an efficient FIR filter program.
You can expect it to require about one clock cycle per coefficient for each sample, plus the small loop overhead. This is very impressive; a traditional microprocessor requires many thousands of clock cycles for this algorithm.
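The FIR inner loop discussed above can be sketched in Python to show which operations happen on each pass: two data fetches, a multiply, an add, and a pointer update. This is an illustrative sketch only; Python runs these steps one after another, whereas the text describes the SHARC as fitting them into a single clock cycle per pass. The names `fir_sample` and `fir_filter` are hypothetical.

```python
def fir_sample(coefficients, history, newest, sample):
    """Process one input sample through an FIR filter.

    history is a circular buffer holding the last len(coefficients)
    samples, and newest is the index of the most recent one.
    Returns (output, updated newest index, number of loop passes).
    """
    taps = len(coefficients)
    newest = (newest + 1) % taps       # overwrite the oldest sample
    history[newest] = sample

    accumulator = 0.0
    index = newest
    loop_passes = 0
    for coefficient in coefficients:
        accumulator += coefficient * history[index]  # multiply and add
        index = (index - 1) % taps                   # circular pointer update
        loop_passes += 1
    return accumulator, newest, loop_passes


def fir_filter(coefficients, samples):
    """Filter a whole sequence, starting from an all-zero history."""
    history = [0.0] * len(coefficients)
    newest = len(coefficients) - 1
    outputs = []
    for sample in samples:
        output, newest, _ = fir_sample(coefficients, history, newest, sample)
        outputs.append(output)
    return outputs


print(fir_filter([0.5, 0.25, 0.25], [4.0, 8.0, 0.0]))  # [2.0, 5.0, 3.0]
```

The number of loop passes per sample equals the number of coefficients, which is why the cycle count per sample scales with the filter length once each pass costs a single cycle.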
SHARC Processor Architectural Overview
The original design dates to about January. SHARC processors are or were used because they have offered good floating-point performance per watt. The SHARC is a Harvard architecture, word-addressed VLIW processor; it knows nothing of 8-bit values since each address is used to point to a whole word, not just an octet. Analog Devices chose to avoid the issue by using a word-sized char in their C compiler. The word size differs for instructions, for integers and normal floating-point, and for extended floating-point.
Super Harvard Architecture Single-Chip Computer