Design Features
The growing densities of PLDs are making the chips suitable for increasingly complex designs. Those designs, in turn, require high-performance scratchpad RAM for temporary storage of data that passes through the logic circuitry that the chips implement. Vendors are responding with devices that incorporate memory as well as logic gates.
PLDs ranging from complex, cell-based implementations to FPGAs are no longer merely implementing a few TTL chips' worth of glue logic. Instead, as gate densities move toward 100,000 gates per device and beyond, PLDs are becoming subsystems on a chip. Recognizing this trend, PLD vendors are increasingly incorporating the memory capacity that such subsystems demand.
The goals are increased performance and reduced chip count. A single chip incorporating logic and memory takes up less board real estate than a separate PLD plus RAM chip would. Further, the embedded approach eliminates board-level routing complexities and the attendant delays in chip-to-chip data transfers.
"People buy our devices not for a particular RAM capacity, but because they want a one-chip solution," says Sandeep Vij, director of FPGA marketing at Xilinx.
That's not to say that a PLD with embedded memory is a panacea. "If the majority of the silicon on a programmable-logic chip is devoted to memory functions, your device is in danger of becoming primarily a memory device and of having to compete with memory pricing," says Stan Kopec, marketing director at Lattice Semiconductor. "The memory content of a programmable device must be less than half, let's say one-third, of the silicon area to avoid this situation."
But numerous applications, ranging from data communications and telecommunications to PC add-ons, do fall within Kopec's recommended embedded-memory-to-logic ratio. "Any system requiring some fast buffer memory for data and some custom logic is a suitable application," he says. Bob Beachler, strategic marketing manager at Altera, concurs that embedded-PLD applications are diverse.
"We do 50% of our overall PLD business in telecomm and datacomm, including cellular base stations and PBX switches, and 17% in computers and peripherals, with applications like industrial control taking up the rest," says Beachler. He expects application distribution to stay the same for the Flex 10K, Altera's pioneering family of embedded-array PLDs.
For such applications, the case for embedded PLDs is clear. Less clear is which of the available configurations is best for your application. Table 1 provides an overview of gate count, memory capacity, speed, and other specs. It's difficult to base design decisions on such specs, however, because vendors implement the logic and memory functions in varied ways. For any implementation, performance can depend heavily on the accumulation of microparameters, such as logic delay and address setup time of individual logic cells or memory blocks within a device. Lattice's Kopec elaborates on this point.
"All times we quote for our CPLDs are pin-to-pin worst case, including worst-case programmable-interconnect delays," he says. "FPGA vendors typically quote internal block delays without programmable-interconnect and I/O-buffer delays." Kopec claims this omission is due to the unpredictability of FPGA interconnection, varying from 10 to 30 nsec. You must add worst-case programmable-interconnect and buffer delays to an FPGA's quoted specs, he says, to get real, in-system performance. Lattice's CPLDs offer worst-case interconnect delays of 3 nsec across the entire chip, he says.
Naturally, vendors publish specs, such as those in Table 1, that reflect their devices' performance in the best possible light. Because your application requirements seldom precisely match a vendor's test model, you must carefully consider all of the available embedded-PLD configurations to avoid incurring subtle penalties in performance, real estate, and cost.
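Kopec's advice about FPGA timing reduces to adding delays along a path. The following Python sketch is a minimal illustration, not a vendor timing model: the block delay and I/O-buffer delay are hypothetical placeholders, and only the 10-to-30-nsec interconnect range comes from his remarks.

```python
def in_system_delay_ns(block_ns, interconnect_ns, io_buffer_ns):
    """Pin-to-pin delay: internal block delay plus interconnect plus I/O buffers."""
    return block_ns + interconnect_ns + io_buffer_ns


def max_clock_mhz(delay_ns):
    """Highest clock rate whose period still covers the given delay."""
    return 1000.0 / delay_ns


QUOTED_BLOCK_NS = 5.0               # hypothetical data-sheet block delay
IO_BUFFER_NS = 5.0                  # hypothetical combined I/O-buffer delay
INTERCONNECT_RANGE_NS = (10.0, 30.0)  # range Kopec cites for FPGA routing

best_ns = in_system_delay_ns(QUOTED_BLOCK_NS, INTERCONNECT_RANGE_NS[0], IO_BUFFER_NS)
worst_ns = in_system_delay_ns(QUOTED_BLOCK_NS, INTERCONNECT_RANGE_NS[1], IO_BUFFER_NS)

print(f"best case:  {best_ns} nsec, {max_clock_mhz(best_ns):.0f} MHz")
print(f"worst case: {worst_ns} nsec, {max_clock_mhz(worst_ns):.0f} MHz")
```

With these placeholder numbers, a 5-nsec block spec becomes a 20- to 40-nsec path, which is why a quoted block delay alone says little about in-system speed.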
The range of configurations begins with the tabula rasa of Xilinx's XC4000 devices at one extreme (Figure 1). On the XC4000's "blank slate" of uncommitted logic blocks, you design your choice of synchronous or asynchronous, single- or dual-port memory, and logic functions, distributing each freely throughout the chip.
At the other extreme is the WaferScale Integration Inc (WSI) PSD family (Figure 2). In addition to programmable logic, PSD devices include dedicated functions, such as interrupt controllers, counter/timers, EPROM, and RAM, carved in stone. The PSD devices are less PLDs with embedded memory than dedicated chips with some embedded field-programmable logic.
In fact, the WSI parts extend the capabilities of standard, off-the-shelf microcontrollers from such companies as Intel (Folsom, CA), Motorola (Austin, TX), and Zilog (Campbell, CA). The goal is to let you add a $10 PSD part to an overburdened $3 microcontroller, instead of replacing the $3 microcontroller with a $30 or even $100 one. However, the WSI approach is anathema to traditional PLD manufacturers.
Says Altera's Beachler, "I could put a PCI core on a PLD, but then I could sell it only to the PCI bus people, and that's never been the idea for programmable logic." It doesn't make marketing sense for his company to pursue a niche with only 100 customers, he says: "The goal is a broad customer base, not a narrow one. Within the past 12 months we've shipped to almost 13,000 customers."
The Xilinx approach optimizes design flexibility; WSI aims at optimum price/performance for very specific applications. Altera's PLD architecture stakes out a position between these extremes, as do technologies from companies including Actel, Lattice Semiconductor, and Lucent Technologies.
Yet another approach comes from Chip Express (San Jose, CA). Its laser-programmable gate arrays (LPGAs) aren't user-programmable, so they don't qualify as FPGAs. Nevertheless, the company can offer a one-day turnaround on prototypes, making the devices nearly competitive with field-programmable devices with respect to product-development cycle times. If you need gate counts beyond what FPGAs offer or if you anticipate extremely high production volumes, LPGAs might be your best choice. The company's CX2000 LPGAs provide 30,000 to 200,000 usable gates and as much as 128 kbits of 200-MHz SRAM, configurable as FIFO, single-port, or dual-port memory. An 80,000-gate device costs about $20 in very high volumes; comparable PLDs generally cost more than $100 in production quantities.
All of these companies are attempting to balance design flexibility and speed for logic and memory configurations, and each vendor counts on its approach meeting most of your application needs.
The trade-offs that vendors juggle stem from two inescapable factors. On the one hand, a relatively large, contiguous block of dedicated RAM takes up less silicon and achieves better performance than implementing the same bit-storage capacity in a general-purpose sea of gates. On the other hand, a dedicated memory block is less efficient than distributed logic for applications whose memory needs don't closely match the dedicated block's capacity and organization.
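The mismatch penalty for dedicated blocks can be put in numbers. The sketch below assumes a hypothetical 256×8-bit block (the size is a placeholder, not a specific vendor's part) and computes how many blocks a requested memory consumes and what fraction of the consumed bits it actually uses.

```python
import math


def blocks_needed(depth, width, block_depth, block_width):
    """Blocks consumed when tiling a depth x width memory from fixed-size blocks."""
    return math.ceil(depth / block_depth) * math.ceil(width / block_width)


def utilization(depth, width, block_depth, block_width):
    """Fraction of the consumed block bits that the requested memory really uses."""
    n = blocks_needed(depth, width, block_depth, block_width)
    return (depth * width) / (n * block_depth * block_width)


BLOCK_DEPTH, BLOCK_WIDTH = 256, 8  # hypothetical dedicated-block organization

for depth, width in [(256, 8), (300, 8), (64, 12)]:
    n = blocks_needed(depth, width, BLOCK_DEPTH, BLOCK_WIDTH)
    u = utilization(depth, width, BLOCK_DEPTH, BLOCK_WIDTH)
    print(f"{depth}x{width}: {n} block(s), {u:.0%} of block bits used")
```

A request that matches the block exactly wastes nothing, but a slightly deeper or wider request doubles the block count and leaves much of the dedicated silicon idle.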
Secondary factors come into play as well. Proponents of dedicated embedded RAM make several arguments. For example, they often claim that designers can use any surplus RAM as a look-up table, thus converting the RAM to logic in a way that mirrors the flexibility of distributed-logic architectures.
Further, Warren Miller, Actel's director of product planning and applications, echoes a concern of Lattice's Kopec: a problem with distributed memory, he says, is predictability.
"You can't easily predict the performance of the memory, because the signals going to memory can be spread across the device," says Miller. His company's A3200DX devices include eight to 15 optimized 32×8-bit, dual-port RAM blocks that operate at 100 MHz. These blocks find use as scratchpads, register banks, microcoded state machines, line buffers, cache RAM, status registers, multiprocessing mailboxes, and address pointers for data-structure linked lists in applications ranging from pattern recognition to convolution.
Actel's embedded memory boasts the highest speed among the field-programmable devices that this article covers. However, for relatively small memory-block requirements, memory implementations in distributed logic can be faster than some vendors' block-memory implementations, according to Vij at Xilinx.
"If you need single- or dual-port memory up to 32×32 bits, the XC4000 series is faster" than some competitors' PLDs with blocks of single- and dual-port RAM, he says. "As you move to larger memory, you are more likely to use dual-port RAM, so we optimized the 4000 Series for the dual-port configuration." Vij cites a 12-nsec access time for an XC4000 8×32-bit, dual-port RAM as an example of XC4000 performance. He concedes that XC4000 memory performance gradually declines as memory blocks get larger, whereas block-memory architectures suffer stepwise speed reductions when memory demands increase to levels at which you must cascade remote blocks.
Vij takes exception to claims that unused portions of memory blocks can adequately implement logic functions. "With a block structure, if you don't use RAM, it's wasted. People may claim that you can put logic functions there, and a few, like multipliers, will fit, but most simply will not. With distributed logic, you can trade off as necessary without sacrificing logic performance."
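The dispute is easier to follow with a concrete look-up table. Using a RAM as logic means storing a function's truth table and driving the address lines with the function's inputs. The sketch below is an illustration only, with hypothetical function names; it builds the table for a 4×4-bit unsigned multiplier and measures how much RAM that takes.

```python
def multiplier_table(a_bits, b_bits):
    """RAM contents for a multiplier: the address is {a, b}, the data word is a * b."""
    table = []
    for a in range(1 << a_bits):
        for b in range(1 << b_bits):
            table.append(a * b)
    return table


def lookup(table, a, b, b_bits):
    """Read the product by forming the address from the two operands."""
    return table[(a << b_bits) | b]


table = multiplier_table(4, 4)
depth = len(table)                 # number of addresses
width = max(table).bit_length()    # data bits needed for the largest product

print(f"4x4 multiplier needs a {depth}x{width}-bit RAM ({depth * width} bits)")
print("13 * 11 =", lookup(table, 13, 11, 4))
```

A 4×4-bit multiplier fits in 256×8 bits, but every added input bit doubles the table depth, which is consistent with Vij's point that only a few functions map well onto a fixed-size block.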
Xilinx based its XC4000EX FPGAs on configurable logic blocks (CLBs) distributed uniformly throughout the chip. You can combine these building blocks to implement a variety of logic and memory functions. You can use 72 of them, for example, to build a 256×8-bit, single-port RAM or 48 to build a 32×16-bit, simultaneous-read/write FIFO buffer. On the logic side, you can use one CLB to implement, for instance, a 9-bit parity checker or 400 to implement a 55-MHz, 8-bit, 16-tap parallel FIR filter.
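The 72-CLB figure can be sanity-checked with a little arithmetic. The sketch below assumes 32 bits of RAM per CLB (two 16×1-bit look-up tables, an assumption drawn from the XC4000 architecture rather than from this article) and treats whatever exceeds the raw storage as overhead, presumably address decoding and output multiplexing.

```python
import math

BITS_PER_CLB = 32  # assumption: two 16x1-bit LUT RAMs per CLB


def storage_clbs(depth, width, bits_per_clb=BITS_PER_CLB):
    """Minimum CLBs needed just to hold depth x width bits of single-port RAM."""
    return math.ceil(depth * width / bits_per_clb)


QUOTED_CLBS = 72                   # figure quoted for a 256x8-bit single-port RAM
storage = storage_clbs(256, 8)     # CLBs that hold the data itself
overhead = QUOTED_CLBS - storage   # remainder; its exact use is an inference

print(f"storage: {storage} CLBs, overhead: {overhead} CLBs")
```

Under that assumption, storage accounts for 64 of the 72 CLBs, so the distributed approach spends roughly one CLB in nine on glue for this memory size.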
Lucent Technologies takes a similar approach with its ORCA (Optimized Reconfigurable Cell Array) devices (Figure 3). Lucent's version of the CLB is the programmable function unit (PFU). Each PFU contains four 16-bit look-up tables (LUTs) and four flip-flops. The LUTs can operate in any of three modes. In combinatorial mode, they can realize any four-, five-, or six-input logic functions. Ripple mode adds a high-speed carry capability for implementing arithmetic functions, and in memory mode, the LUTs can implement 16×4-bit, single-port or 16×2-bit, dual-port memory. The Lucent devices come in 3.3V and 5V versions; the 3.3V models have 5V-tolerant I/O buffers.
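The PFU memory figures follow directly from the LUT sizes. The short sketch below checks them against the bits-per-unit values in the second footnote of Table 1 and derives the PFU count that a device's total RAM capacity implies; that count is an inference from the table, not a quoted spec.

```python
LUTS_PER_PFU = 4
BITS_PER_LUT = 16


def memory_bits(depth, width):
    """Total bits in a depth x width memory."""
    return depth * width


pfu_lut_bits = LUTS_PER_PFU * BITS_PER_LUT   # raw LUT storage in one PFU
single_port_bits = memory_bits(16, 4)        # 16x4-bit single-port mode
dual_port_bits = memory_bits(16, 2)          # 16x2-bit dual-port mode


def implied_pfus(total_ram_bits, bits_per_pfu=single_port_bits):
    """PFU count implied by a total RAM capacity (an inference, not a data-sheet value)."""
    return total_ram_bits // bits_per_pfu


print(pfu_lut_bits, single_port_bits, dual_port_bits)
print("PFUs implied by 57,600 bits:", implied_pfus(57_600))
```

Single-port mode uses all 64 LUT bits in a PFU, and dual-port mode yields half that, so a design that relies on dual-port memory has about half the table's RAM capacity available.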
Altera aims to optimize memory performance and maintain design flexibility. Altera's Flex 10K architecture (Figure 4) forgoes a single block that can serve as memory or logic and instead combines two structures on a single chip: one optimized for logic and the other optimized for memory but still usable as logic. Those structures are the logic-array blocks (LABs), which include logic elements for random logic functions, and the embedded array blocks (EABs), which can implement either RAM in 2-kbit blocks or optimized logic functions, such as multipliers, ALUs, and DSP algorithms. If you do not use the EABs for memory, each EAB provides approximately 500 logic-gates' worth of logic.
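The EAB organizations listed in Table 1 can be enumerated from the block size. The sketch below assumes that every organization uses all 2048 bits of a block and that the data widths are the 8, 4, 2, and 1 bits the table lists; the function names are hypothetical.

```python
EAB_BITS = 2048
EAB_WIDTHS = (8, 4, 2, 1)  # per-block data widths listed in Table 1


def eab_organizations(bits=EAB_BITS, widths=EAB_WIDTHS):
    """(depth, width) pairs available from one block, assuming all bits are used."""
    return [(bits // w, w) for w in widths]


def combine(org, n_blocks, direction):
    """Organization obtained by ganging n_blocks side by side or end to end."""
    depth, width = org
    if direction == "wider":
        return (depth, width * n_blocks)
    if direction == "deeper":
        return (depth * n_blocks, width)
    raise ValueError("direction must be 'wider' or 'deeper'")


for depth, width in eab_organizations():
    print(f"{depth}x{width}")

print(combine((256, 8), 2, "wider"))   # two blocks side by side
print(combine((256, 8), 2, "deeper"))  # two blocks end to end
```

Pairing two 256×8-bit blocks this way reproduces the 256×16- and 512×8-bit two-block organizations that the table mentions.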
To connect blocks, Altera provides its FastTrack Interconnect, which comprises rows and columns of wire spanning the chip, thereby providing a routing mechanism that maintains predictable interconnect delays.
Using a dedicated-memory approach, Actel distributes 32×8-bit blocks of dual-port RAM among the combinatorial and sequential logic modules that constitute the datapath circuitry throughout the company's 3200DX chips (Figure 5). Unlike the Xilinx and Lucent parts, the RAM blocks are dedicated to the RAM function. They offer 100-MHz performance but can implement random logic functions only when you use them as LUTs. In addition to RAM, combinatorial, and sequential modules, the Actel devices include address decoder modules that can decode a 35-bit address with array-to-pin delays of 12 nsec.
Lattice Semiconductor combines general-purpose programmable logic with dedicated memory, as Actel does, but adds a dedicated register/counter/timer block (Figure 6). That block features eight banks of 16-bit functions; you can cascade the functions to 128 bits, and the block features independent read-from and write-to ability.
As for memory, Lattice offers three devices, each optimized for FIFO memory or dual- or single-port RAM.
"FIFOs are great for data-rate-matching buffers, where data is being written in at one speed and removed at another," says Kopec. "Dual-port RAMs are great for communication between two microprocessors or subsystems or where multiple intelligent bus masters are used, and single-port RAM is good for general-purpose, single-bus systems."
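Kopec's rate-matching case can be sized with a small simulation. The sketch below is a simplified cycle-based model with hypothetical parameters: a producer writes one word per cycle during a burst and then goes idle, while a consumer removes one word every few cycles. The peak occupancy is the FIFO depth the buffer would need.

```python
from collections import deque


def peak_fifo_depth(burst_len, burst_gap, read_period, n_bursts):
    """Peak number of words queued when bursty writes meet slower, steady reads."""
    fifo = deque()
    peak = 0
    period = burst_len + burst_gap
    for cycle in range(n_bursts * period):
        if cycle % period < burst_len:
            fifo.append(cycle)            # producer writes one word this cycle
        peak = max(peak, len(fifo))
        if cycle % read_period == 0 and fifo:
            fifo.popleft()                # consumer reads one word this cycle
    return peak


# Hypothetical traffic: 8-word bursts, 8 idle cycles, one read every 2 cycles.
print(peak_fifo_depth(burst_len=8, burst_gap=8, read_period=2, n_bursts=3))
```

In this model the average write and read rates are equal, so the queue drains between bursts; if the consumer were slower on average, the occupancy would grow without bound and no FIFO depth would suffice.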
Altera's Beachler calls all of these approaches to embedded PLDs "a logical progression."
Says Xilinx's Vij, whose company introduced the RAM-capable XC4000 in 1991, "RAM in FPGAs is one of the most important functions you can have." The appearance of complex-PLD and other FPGA vendors in the embedded-memory market bears him out.
Altera's Beachler elaborates on the evolution of embedded PLDs: "Before we introduced the embedded PLD in March 1995," he says, "we looked at what the mask-programmable gate-array vendors were doing. They were embedding microprocessor cores and memory and then surrounding that with regular gate-array gates. As field-programmable logic reached higher and higher densities, it made sense to embed memory. In fact, almost 50% of gate-array design starts now use memory to some degree, and the percentage gets higher at higher densities."
Concludes Miller at Actel, "FPGAs with embedded memory are just the first of a new generation of more system-oriented building blocks," the precursors of what he calls system-programmable gate arrays.
Table 1: Representative PLDs with embedded memory
| Company | Device | Usable gates | Datapath delay/speed | Total RAM capacity (bits) | Memory access time/speed | Memory organization |
| Actel Corp, Sunnyvale, CA, (800) 228-3532, fax (408) 739-1540 | A32400DX | 40,000 | 200 MHz | 4096 | 100 MHz (5-nsec SRAM access time) | 16 32×8-bit dual-port RAM blocks can be cascaded to form wider or deeper arrays |
| | A32300DX | 30,000 | 200 MHz | 3072 | 100 MHz (5-nsec SRAM access time) | 12 32×8-bit dual-port RAM blocks can be cascaded to form wider or deeper arrays |
| | A32200DX | 20,000 | 200 MHz | 2560 | 100 MHz (5-nsec SRAM access time) | 10 32×8-bit dual-port RAM blocks can be cascaded to form wider or deeper arrays |
| | A32100DX | 10,000 | 200 MHz | 2048 | 100 MHz (5-nsec SRAM access time) | Eight 32×8-bit dual-port RAM blocks can be cascaded to form wider or deeper arrays |
| Altera Corp, San Jose, CA, (408) 894-7000, fax (408) 944-0952 | EPF10K100 | 62,000 to 158,000 | 19.1 to 24.2 nsec | 24,576 | 10 nsec/70 MHz | 12 2048-bit blocks configurable as ×8, ×4, ×2, and ×1 bits; two blocks can be configured as 256×16 or 512×8 bits |
| | EPF10K70 | 46,000 to 118,000 | 19.1 to 24.2 nsec¹ | 18,432 | 10 nsec/70 MHz | Nine 2048-bit blocks |
| | EPF10K50V | 36,000 to 116,000 | 17.2 to 27 nsec¹ | 20,480 | 10 nsec/70 MHz | 10 2048-bit blocks |
| | EPF10K50 | 36,000 to 116,000 | 17.2 to 27 nsec¹ | 20,480 | 10 nsec/70 MHz | 10 2048-bit blocks |
| | EPF10K40 | 29,000 to 93,000 | 17.2 to 21.1 nsec¹ | 16,384 | 10 nsec/70 MHz | Eight 2048-bit blocks |
| | EPF10K30 | 22,000 to 69,000 | 17.2 to 21.1 nsec¹ | 12,288 | 10 nsec/70 MHz | Six 2048-bit blocks |
| | EPF10K20 | 15,000 to 63,000 | 16.1 to 20 nsec¹ | 12,288 | 10 nsec/70 MHz | Six 2048-bit blocks |
| | EPF10K10 | 7,000 to 31,000 | 16.1 to 20 nsec¹ | 6144 | 10 nsec/70 MHz | Three 2048-bit blocks |
| Lattice Semiconductor, Hillsboro, OR, (503) 681-0118, fax (503) 681-3037 | ispLSI 6192 FF | 25,000 | CPLD: 15 nsec, 70 MHz; register-counter block: 125 MHz | 4608 | 20 nsec/50 MHz | 512×9- or 256×18-bit FIFO buffer |
| | ispLSI 6192 SM | 25,000 | CPLD: 15 nsec, 70 MHz; register-counter block: 125 MHz | 4608 | 20 nsec/50 MHz | 512×9- or 256×18-bit single-port SRAM |
| | ispLSI 6192 DM | 25,000 | CPLD: 15 nsec, 70 MHz; register-counter block: 125 MHz | 4608 | 20 nsec/50 MHz | 512×9- or 256×18-bit dual-port SRAM |
| Lucent Technologies, Allentown, PA, (800) 372-2447, fax (610) 712-4106 | OR2C/T40A | 43,200 (99,400 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 57,600 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T26A | 27,600 (63,600 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 38,864 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T15A | 19,200 (44,200 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 25,600 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T12A | 15,600 (35,800 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 20,736 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T10A | 12,300 (28,300 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 16,384 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T08A | 9400 (21,600 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 12,544 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T06A | 6900 (15,900 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 9,216 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| | OR2C/T04A | 4800 (11,000 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 6,400 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port² |
| WaferScale Integration, Fremont, CA, (510) 656-5400, fax (510) 657-5916 | PSD5XX | As many as 30 macrocells (61 inputs and 140 output product terms), 40 I/O ports, four 16-bit counter/timers, eight-input interrupt controller, 1-Mbit EPROM (×8 or ×16 bits) | Supports host-microcontroller clock speed | 16k | 70 nsec | ×8 or ×16 bits |
| | PSD4XX | As many as 24 macrocells (59 input and 126 output product terms), 40 I/O ports, 1-Mbit EPROM (×8 or ×16 bits) | Supports host-microcontroller clock speed | 16k | 70 nsec | ×8 or ×16 bits |
| | PSD3XX | As many as 40 product terms (16 inputs and 24 outputs), 2-Mbit EPROM (×8 or ×16 bits) | Supports host-microcontroller clock speed | 16k | 70 nsec | ×8 or ×16 bits |
| Xilinx, San Jose, CA, (408) 559-7778, fax (408) 559-7114 | XC40125EX/XL | 125,000 (250,000 with 30% usage as RAM) | 66 MHz³ | 147,968 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4085EX/XL | 85,000 (170,000 with 27.5% usage as RAM) | 66 MHz³ | 100,352 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4062EX/XL | 62,000 (120,000 with 25% usage as RAM) | 66 MHz³ | 73,728 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4052EX/XL | 52,000 (90,000 with 20% usage as RAM) | 66 MHz³ | 61,952 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4044EX/XL | 44,000 (74,000 with 20% usage as RAM) | 66 MHz³ | 51,200 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4036EX/XL | 36,000 (60,000 with 20% usage as RAM) | 66 MHz³ | 41,472 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4028EX/XL | 28,000 (48,000 with 20% usage as RAM) | 66 MHz³ | 32,768 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4025E | 25,000 (48,000 with 20% usage as RAM) | 66 MHz³ | 23,768 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4020E | 20,000 (37,000 with 20% usage as RAM) | 66 MHz³ | 25,088 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4013E/L | 13,000 (27,000 with 20% usage as RAM) | 66 MHz³ | 18,432 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4010E/L | 10,000 (19,000 with 20% usage as RAM) | 66 MHz³ | 12,800 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4008E | 8000 (16,000 with 20% usage as RAM) | 66 MHz³ | 10,386 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4006E | 6000 (12,000 with 20% usage as RAM) | 66 MHz³ | 8,192 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4005E/L | 5000 (10,000 with 20% usage as RAM) | 66 MHz³ | 6,272 | 66 MHz³ | Distributed (convertible from logic) |
| | XC4003E | 3000 (4000 with 20% usage as RAM) | 66 MHz³ | 3,200 | 66 MHz³ | Distributed (convertible from logic) |
¹ Register-to-register delay via four logic elements, three row interconnects, and four local interconnects (maximum for best speed grade).
² Single-port configurations yield 64 bits per programmable-logic unit; dual-port configurations yield 32 bits per programmable-logic unit.
³ Typical system performance; logic is configurable as RAM.
Author's biography: Rick Nelson, former managing editor of EDN, is a free-lance editor specializing in high-technology topics.