Design Features


Embedded memory enhances programmable logic
for complex, compact designs

Rick Nelson, Contributing Editor


The growing densities of PLDs are making the chips suitable for increasingly complex designs. Those designs, in turn, require high-performance scratchpad RAM for temporary storage of data that passes through the logic circuitry that the chips implement. Vendors are responding with devices that incorporate memory as well as logic gates. 

  PLDs ranging from complex, cell-based implementations to FPGAs are no longer merely implementing a few TTL chips' worth of glue logic. Instead, as gate densities move toward 100,000 gates per device and beyond, PLDs are becoming subsystems on a chip. Recognizing this trend, PLD vendors are increasingly incorporating the memory capacity that such subsystems demand.

  The goals are increased performance and reduced chip count. A single chip incorporating logic and memory takes up less board real estate than a separate PLD plus RAM chip would. Further, the embedded approach eliminates board-level routing complexities and the attendant delays in chip-to-chip data transfers.

  "People buy our devices not for a particular RAM capacity, but because they want a one-chip solution," says Sandeep Vij, director of FPGA marketing at Xilinx.

  That's not to say that a PLD with embedded memory is a panacea. "If the majority of the silicon on a programmable-logic chip is devoted to memory functions, your device is in danger of becoming primarily a memory device and of having to compete with memory pricing," says Stan Kopec, marketing director at Lattice Semiconductor. "The memory content of a programmable device must be less than half—let's say one-third—of the silicon area, to avoid this situation."

  But numerous applications, ranging from data communications and telecommunications to PC add-ons, do fall within Kopec's recommended embedded-memory-to-logic ratio. "Any system requiring some fast buffer memory for data and some custom logic is a suitable application," he says. Bob Beachler, strategic marketing manager at Altera, concurs that embedded-PLD applications are diverse.

  "We do 50% of our overall PLD business in telecomm and datacomm, including cellular base stations and PBX switches, and 17% in computers and peripherals, with applications like industrial control taking up the rest," says Beachler. He expects application distribution to stay the same for the Flex 10K, Altera's pioneering family of embedded-array PLDs.

  For such applications, the case for embedded PLDs is clear. Less clear is which of the available configurations is best for your application. Table 1 provides an overview of gate count, memory capacity, speed, and other specs. It's difficult to make design decisions on such specs, however, because of the varied ways in which vendors implement the logic and memory functions. For any implementation, performance can depend heavily on the accumulation of microparameters, such as logic delay and address setup time of individual logic cells or memory blocks within a device. Lattice's Kopec elaborates on this point.

  "All times we quote for our CPLDs are pin-to-pin worst case, including worst-case programmable-interconnect delays," he says. "FPGA vendors typically quote internal block delays without programmable-interconnect and I/O-buffer delays." Kopec claims this omission is due to the unpredictability of FPGA interconnection, varying from 10 to 30 nsec. You must add worst-case programmable-interconnect and buffer delays to an FPGA's quoted specs, he says, to get real, in-system performance. Lattice's CPLDs offer worst-case interconnect delays of 3 nsec across the entire chip, he says.

  Naturally, vendors publish specs, such as those in Table 1, that reflect their devices' performance in the best possible light. Because your application requirements seldom precisely match a vendor's test model, you must carefully consider all of the available embedded-PLD configurations to avoid incurring subtle penalties in performance, real estate, and cost.
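  Kopec's point about quoted specs amounts to simple addition, and a rough sketch makes the spread visible. The block delay and I/O-buffer delay below are hypothetical placeholders, not data-sheet values; only the 10- to 30-nsec interconnect range comes from Kopec's claim above.

```python
def in_system_delay_ns(block_delay_ns, interconnect_ns, io_buffer_ns):
    """Estimate pin-to-pin delay as the sum of the three contributions."""
    return block_delay_ns + interconnect_ns + io_buffer_ns


def max_clock_mhz(delay_ns):
    """Convert a path delay in nsec to the clock rate it would allow."""
    return 1000.0 / delay_ns


# Hypothetical figures for illustration only; not taken from any data sheet.
QUOTED_BLOCK_DELAY_NS = 5.0
IO_BUFFER_NS = 4.0

# Interconnect range that Kopec attributes to FPGAs (10 to 30 nsec).
fpga_best = in_system_delay_ns(QUOTED_BLOCK_DELAY_NS, 10.0, IO_BUFFER_NS)
fpga_worst = in_system_delay_ns(QUOTED_BLOCK_DELAY_NS, 30.0, IO_BUFFER_NS)

print(f"best case:  {fpga_best} nsec, {max_clock_mhz(fpga_best):.1f} MHz")
print(f"worst case: {fpga_worst} nsec, {max_clock_mhz(fpga_worst):.1f} MHz")
```

  With these assumed numbers, the same quoted block delay corresponds to in-system delays that differ by a factor of about two, which is why a bare internal-delay spec says little until you know the routing.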

The embedded-PLD gamut

  The range of configurations begins with the tabula rasa of Xilinx's XC4000 devices at one extreme (Figure 1). On the XC4000's "blank slate" of uncommitted logic blocks, you design your choice of synchronous or asynchronous, single- or dual-port memory, and logic functions, distributing each freely throughout the chip.

  At the other extreme is the WaferScale Integration Inc (WSI) PSD family (Figure 2). In addition to programmable logic, PSD devices include dedicated functions, such as interrupt controllers, counter/timers, EPROM, and RAM, carved in stone. The PSD devices are less PLDs with embedded memory than dedicated chips with some embedded field-programmable logic.

  In fact, the WSI parts extend the capabilities of standard, off-the-shelf microcontrollers from such companies as Intel (Folsom, CA), Motorola (Austin, TX), and Zilog (Campbell, CA). The goal is to let you add a $10 PSD part to an overburdened $3 microcontroller, instead of replacing the $3 microcontroller with a $30 or even $100 one. However, the WSI approach is anathema to traditional PLD manufacturers.

  Says Altera's Beachler, "I could put a PCI core on a PLD, but then I could sell it only to the PCI bus people, and that's never been the idea for programmable logic." It doesn't make marketing sense for his company to pursue a niche with only 100 customers, he says: "The goal is a broad customer base, not a narrow one. Within the past 12 months we've shipped to almost 13,000 customers."

  The Xilinx approach optimizes design flexibility; WSI aims at optimum price/performance for very specific applications. Altera's PLD architecture stakes out a position between these extremes, as do technologies from companies including Actel, Lattice Semiconductor, and Lucent Technologies.

  Yet another approach comes from Chip Express (San Jose, CA). Its laser-programmable gate arrays (LPGAs) aren't user-programmable, so they don't qualify as FPGAs. Nevertheless, the company can offer a one-day turnaround on prototypes, making the devices nearly competitive with field-programmable devices with respect to product-development cycle times. If you need gate counts beyond what FPGAs offer or if you anticipate extremely high production volumes, LPGAs might be your best choice. The company's CX2000 LPGAs provide 30,000 to 200,000 usable gates and as much as 128 kbits of 200-MHz SRAM, configurable as FIFO, single-port, or dual-port memory. An 80,000-gate device costs about $20 in very high volumes; comparable PLDs generally cost more than $100 in production quantities.

  All of these companies are attempting to balance design flexibility and speed for logic and memory configurations, and each vendor counts on its approach meeting most of your application needs.

Performance vs flexibility

  The trade-offs that vendors juggle stem from two inescapable factors. On the one hand, a relatively large, contiguous block of dedicated RAM takes up less silicon and achieves better performance than implementing the same bit-storage capacity in a general-purpose sea of gates. On the other hand, a dedicated memory block is less efficient than distributed logic for applications whose memory needs don't closely match the dedicated block's capacity and organization.

  Secondary factors come into play as well. Proponents of dedicated embedded RAM make several arguments. For example, they often claim that designers can use any surplus RAM as a look-up table, thus converting the RAM to logic in a way that mirrors the flexibility of distributed-logic architectures.

  Further, Warren Miller, Actel's director of product planning and applications, echoes a concern of Lattice's Kopec: a problem with distributed memory, he says, is predictability.

  "You can't easily predict the performance of the memory, because the signals going to memory can be spread across the device," says Miller. His company's A3200DX devices include eight to 15 optimized 32×8-bit, dual-port RAM blocks that operate at 100 MHz. These blocks find use as scratchpads, register banks, microcoded state machines, line buffers, cache RAM, status registers, multiprocessing mailboxes, and address pointers for data-structure linked lists in applications ranging from pattern recognition to convolution.

  Actel's embedded memory is the fastest among the field-programmable devices that this article covers. However, for relatively small memory-block requirements, memory implementations in distributed logic can be faster than some vendors' block-memory implementations, according to Vij at Xilinx.

  "If you need single- or dual-port memory up to 32×32 bits, the XC4000 series is faster" than some competitors' PLDs with blocks of single- and dual-port RAM, he says. "As you move to larger memory, you are more likely to use dual-port RAM, so we optimized the 4000 Series for the dual-port configuration." Vij cites a 12-nsec access time for an XC4000 8×32-bit, dual-port RAM as an example of XC4000 performance. He concedes that XC4000 memory performance gradually declines as memory blocks get larger, and block-memory architectures suffer stepwise speed reductions when memory demands increase to levels at which you must cascade remote blocks.

  Vij takes exception to claims that unused portions of memory blocks can adequately implement logic functions. "With a block structure, if you don't use RAM, it's wasted. People may claim that you can put logic functions there, and a few, like multipliers, will fit, but most simply will not. With distributed logic, you can trade off as necessary without sacrificing logic performance."

Logic units, embedded arrays

  Xilinx based its XC4000EX FPGAs on configurable logic blocks (CLBs) distributed uniformly throughout the chip. You can combine these building blocks to implement a variety of logic and memory functions. You can use 72 of them, for example, to build a 256×8-bit, single-port RAM or 48 to build a 32×16-bit, simultaneous-read/write FIFO buffer. On the logic side, you can use one CLB to implement, for instance, a 9-bit parity checker or 400 to implement a 55-MHz, 8-bit, 16-tap parallel FIR filter.
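  The CLB counts quoted above can be split into storage and overhead with a little arithmetic. The sketch below assumes that a CLB holds 32 bits in single-port mode and 16 bits in dual-port mode; those per-CLB figures are assumptions for illustration, not numbers stated in the text, and the "overhead" is simply whatever remains of the quoted total.

```python
# Assumed storage per CLB; these figures are not stated in the article.
BITS_PER_CLB_SINGLE_PORT = 32
BITS_PER_CLB_DUAL_PORT = 16


def storage_clbs(depth, width, bits_per_clb):
    """CLBs needed just to hold depth x width bits, rounded up."""
    total_bits = depth * width
    return -(-total_bits // bits_per_clb)  # ceiling division


def overhead_clbs(quoted_total, depth, width, bits_per_clb):
    """CLBs left over from the quoted total for decoding, muxing, and control."""
    return quoted_total - storage_clbs(depth, width, bits_per_clb)


# 256x8-bit single-port RAM, quoted at 72 CLBs.
ram_storage = storage_clbs(256, 8, BITS_PER_CLB_SINGLE_PORT)
ram_overhead = overhead_clbs(72, 256, 8, BITS_PER_CLB_SINGLE_PORT)

# 32x16-bit simultaneous-read/write FIFO buffer, quoted at 48 CLBs.
fifo_storage = storage_clbs(32, 16, BITS_PER_CLB_DUAL_PORT)
fifo_overhead = overhead_clbs(48, 32, 16, BITS_PER_CLB_DUAL_PORT)

print(ram_storage, ram_overhead, fifo_storage, fifo_overhead)
```

  Under these assumptions, most of the RAM's CLBs hold data, while a third of the FIFO's CLBs go to pointers and control, which is the price of building memory out of general-purpose blocks.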

  Lucent Technologies takes a similar approach with its ORCA (Optimized Reconfigurable Cell Array) devices (Figure 3). Lucent's version of the CLB is the programmable function unit (PFU). Each PFU contains four 16-bit look-up tables (LUTs) and four flip-flops. The LUTs can operate in any of three modes. In combinatorial mode, they can realize any four-, five-, or six-input logic functions. Ripple mode adds a high-speed carry capability for implementing arithmetic functions, and in memory mode, the LUTs can implement 16×4-bit, single-port or 16×2-bit, dual-port memory. The Lucent devices come in 3.3 and 5V versions; the 3.3V models have 5V-tolerant I/O buffers.
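  The dual role of a LUT is easy to model in software. The following is a minimal behavioral sketch, not a description of Lucent's circuitry: the class and function names are hypothetical, and the model ignores timing, ripple mode, and dual-port operation. It shows the same 16 stored bits acting first as a four-input logic function and then as one bit slice of a 16×4-bit single-port RAM.

```python
class Lut16:
    """A 16-bit look-up table: four address inputs select one stored bit."""

    def __init__(self, bits=0):
        self.bits = bits & 0xFFFF

    def read(self, addr):
        return (self.bits >> (addr & 0xF)) & 1

    def write(self, addr, value):
        addr &= 0xF
        if value & 1:
            self.bits |= 1 << addr
        else:
            self.bits &= ~(1 << addr) & 0xFFFF


def lut_from_function(fn):
    """Combinatorial mode: precompute fn(a, b, c, d) for all 16 input patterns."""
    bits = 0
    for addr in range(16):
        a, b, c, d = (addr & 1), (addr >> 1) & 1, (addr >> 2) & 1, (addr >> 3) & 1
        if fn(a, b, c, d):
            bits |= 1 << addr
    return Lut16(bits)


class Ram16x4:
    """Memory mode: four LUTs side by side form a 16x4-bit single-port RAM."""

    def __init__(self):
        self.luts = [Lut16() for _ in range(4)]

    def write(self, addr, nibble):
        for i, lut in enumerate(self.luts):
            lut.write(addr, (nibble >> i) & 1)

    def read(self, addr):
        return sum(lut.read(addr) << i for i, lut in enumerate(self.luts))


# A four-input parity (XOR) function stored as a truth table.
parity = lut_from_function(lambda a, b, c, d: a ^ b ^ c ^ d)

# The same structure used as scratchpad storage.
ram = Ram16x4()
ram.write(5, 0b1010)
print(parity.read(0b0111), ram.read(5))
```

  The only difference between the two modes in this model is who fills the table: the configuration bit stream for logic, or the running design for memory.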

  Altera aims to optimize memory performance and maintain design flexibility. Altera's Flex 10K architecture (Figure 4) forgoes a single block that can serve as memory or logic and instead combines two structures on a single chip—one optimized for logic and the other optimized for memory but still usable as logic. Those structures are the logic-array blocks (LABs), which include logic elements for random logic functions, and the embedded array blocks (EABs), which can implement either RAM in 2-kbit blocks or optimized logic functions, such as multipliers, ALUs, and DSP algorithms. If you do not use the EABs for memory, each EAB provides approximately 500 logic-gates' worth of logic.
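  A multiplier shows why a memory block can double as logic. The sketch below assumes the 8-, 4-, 2-, and 1-bit widths that Table 1 lists for a 2048-bit block and fills a 256×8 configuration with the products of two 4-bit operands; the function names are hypothetical, and this models behavior only, not Altera's tools or timing.

```python
EAB_BITS = 2048  # one embedded array block, per the text


def eab_configs(widths=(8, 4, 2, 1)):
    """Depth x width organizations of one 2048-bit block for the given widths."""
    return [(EAB_BITS // w, w) for w in widths]


def build_multiplier_lut():
    """256x8 table: address = (a << 4) | b, stored word = a * b."""
    return [(addr >> 4) * (addr & 0xF) for addr in range(256)]


MULT_LUT = build_multiplier_lut()


def multiply_4x4(a, b):
    """Multiply two 4-bit values with a single table read."""
    return MULT_LUT[((a & 0xF) << 4) | (b & 0xF)]


print(eab_configs())
print(multiply_4x4(13, 11))
```

  The largest product, 15 × 15 = 225, fits in 8 bits, so the whole function occupies exactly one 256×8 block. A function with more than eight inputs would not fit a single block, which is consistent with Vij's objection that only a few logic functions map well onto block RAM.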

  To connect blocks, Altera provides its FastTrack Interconnect, which comprises rows and columns of wire spanning the chip, thereby providing a routing mechanism that maintains predictable interconnect delays.

Dedicated memory blocks

  Using a dedicated-memory approach, Actel distributes 32×8-bit blocks of dual-port RAM among the combinatorial and sequential logic modules that constitute the datapath circuitry throughout the company's 3200DX chips (Figure 5). Unlike the Xilinx and Lucent parts, the RAM blocks are dedicated to the RAM function. They offer 100-MHz performance but can implement random logic functions only when you use them as LUTs. In addition to RAM, combinatorial, and sequential modules, the Actel devices include address decoder modules that can decode a 35-bit address with array-to-pin delays of 12 nsec.
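  The efficiency trade-off of fixed blocks can be estimated before you open a data sheet. The sketch below assumes that 32×8-bit blocks cascade on whole-block boundaries in both depth and width, as Table 1 suggests; the function names are hypothetical, and the model ignores any logic needed for address decoding across blocks.

```python
import math

BLOCK_DEPTH = 32  # words per dedicated block
BLOCK_WIDTH = 8   # bits per word


def blocks_needed(depth, width):
    """Blocks consumed when cascading to at least depth x width bits."""
    return math.ceil(depth / BLOCK_DEPTH) * math.ceil(width / BLOCK_WIDTH)


def utilization(depth, width):
    """Fraction of the consumed block bits that the design actually uses."""
    provided = blocks_needed(depth, width) * BLOCK_DEPTH * BLOCK_WIDTH
    return (depth * width) / provided


for depth, width in [(64, 16), (40, 10), (32, 8)]:
    print(depth, width, blocks_needed(depth, width), utilization(depth, width))
```

  A 64×16-bit buffer uses four blocks completely, but a 40×10-bit buffer also consumes four blocks and leaves most of their bits idle. That mismatch is the cost of dedicated memory that the distributed-logic vendors point to.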

  Lattice Semiconductor combines general-purpose programmable logic with dedicated memory, as Actel does, but adds a dedicated register/counter/timer block (Figure 6). That block features eight banks of 16-bit functions; you can cascade the functions to 128 bits, and the block features independent read-from and write-to ability.

  As for memory, Lattice offers three devices, each optimized for FIFO memory or dual- or single-port RAM.

  "FIFOs are great for data-rate-matching buffers, where data is being written in at one speed and removed at another," says Kopec. "Dual-port RAMs are great for communication between two microprocessors or subsystems or where multiple intelligent bus masters are used, and single-port RAM is good for general-purpose, single-bus systems."

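  Kopec's rate-matching case can be sketched as a ring buffer with separate read and write pointers. This is a simplified behavioral model, not a Lattice design: the class and function names are hypothetical, and the producer and consumer rates are arbitrary. It shows how FIFO depth determines whether a burst survives a slower reader.

```python
class Fifo:
    """Fixed-depth FIFO with wrap-around read and write pointers."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.depth = depth
        self.wr = 0
        self.rd = 0
        self.count = 0

    def full(self):
        return self.count == self.depth

    def empty(self):
        return self.count == 0

    def write(self, word):
        """Store a word; return False (word lost) if the FIFO is full."""
        if self.full():
            return False
        self.mem[self.wr] = word
        self.wr = (self.wr + 1) % self.depth
        self.count += 1
        return True

    def read(self):
        """Return the oldest word, or None if the FIFO is empty."""
        if self.empty():
            return None
        word = self.mem[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return word


def run_burst(depth, burst_len, read_every):
    """Write one word per tick for burst_len ticks; read one word every read_every ticks."""
    fifo = Fifo(depth)
    received, dropped, peak = [], 0, 0
    tick = 0
    while tick < burst_len or not fifo.empty():
        if tick < burst_len and not fifo.write(tick):
            dropped += 1
        peak = max(peak, fifo.count)
        if tick % read_every == read_every - 1:
            word = fifo.read()
            if word is not None:
                received.append(word)
        tick += 1
    return received, peak, dropped


received, peak, dropped = run_burst(depth=16, burst_len=16, read_every=2)
print(len(received), peak, dropped)
```

  In this model, a reader running at half the writer's rate needs a buffer a little more than half the burst length; a shallower FIFO drops words once it fills.
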
  Altera's Beachler calls all of these approaches to embedded PLDs "a logical progression."

  Says Xilinx's Vij, whose company introduced the RAM-capable XC4000 in 1991, "RAM in FPGAs is one of the most important functions you can have." The appearance of complex-PLD and other FPGA vendors in the embedded-memory market bears him out.

  Altera's Beachler elaborates on the evolution of embedded PLDs: "Before we introduced the embedded PLD in March 1995," he says, "we looked at what the mask-programmable gate-array vendors were doing. They were embedding microprocessor cores and memory and then surrounding that with regular gate-array gates. As field-programmable logic reached higher and higher densities, it made sense to embed memory. In fact, almost 50% of gate-array design starts now use memory to some degree, and the percentage gets higher at higher densities."

  Concludes Miller at Actel, "FPGAs with embedded memory are just the first of a new generation of more system-oriented building blocks," the precursors of what he calls system-programmable gate arrays.

Table 1—Representative PLDs with embedded memory

Columns: Device | Usable gates | Datapath delay/speed | Total RAM capacity (bits) | Memory access time/speed | Memory organization

Actel Corp, Sunnyvale, CA, (800) 228-3532, fax (408) 739-1540
A32400DX | 40,000 | 200 MHz | 4096 | 100 MHz (5-nsec SRAM access time) | 16 32×8-bit dual-port RAM blocks; can be cascaded to form wider or deeper arrays
A32300DX | 30,000 | 200 MHz | 3072 | 100 MHz (5-nsec SRAM access time) | 12 32×8-bit dual-port RAM blocks; can be cascaded to form wider or deeper arrays
A32200DX | 20,000 | 200 MHz | 2560 | 100 MHz (5-nsec SRAM access time) | 10 32×8-bit dual-port RAM blocks; can be cascaded to form wider or deeper arrays
A32100DX | 10,000 | 200 MHz | 2048 | 100 MHz (5-nsec SRAM access time) | Eight 32×8-bit dual-port RAM blocks; can be cascaded to form wider or deeper arrays

Altera Corp, San Jose, CA, (408) 894-7000, fax (408) 944-0952
EPF10K100 | 62,000 to 158,000 | 19.1 to 24.2 nsec | 24,576 | 10 nsec/70 MHz | 12 2048-bit blocks configurable as ×8, ×4, ×2, and ×1 bits; two blocks can be configured as 256×16 or 512×8 bits
EPF10K70 | 46,000 to 118,000 | 19.1 to 24.2 nsec [1] | 18,432 | 10 nsec/70 MHz | Nine 2048-bit blocks
EPF10K50V | 36,000 to 116,000 | 17.2 to 27 nsec [1] | 20,480 | 10 nsec/70 MHz | 10 2048-bit blocks
EPF10K50 | 36,000 to 116,000 | 17.2 to 27 nsec [1] | 20,480 | 10 nsec/70 MHz | 10 2048-bit blocks
EPF10K40 | 29,000 to 93,000 | 17.2 to 21.1 nsec [1] | 16,384 | 10 nsec/70 MHz | Eight 2048-bit blocks
EPF10K30 | 22,000 to 69,000 | 17.2 to 21.1 nsec [1] | 12,288 | 10 nsec/70 MHz | Six 2048-bit blocks
EPF10K20 | 15,000 to 63,000 | 16.1 to 20 nsec [1] | 12,288 | 10 nsec/70 MHz | Six 2048-bit blocks
EPF10K10 | 7,000 to 31,000 | 16.1 to 20 nsec [1] | 6144 | 10 nsec/70 MHz | Three 2048-bit blocks

Lattice Semiconductor, Hillsboro, OR, (503) 681-0118, fax (503) 681-3037
ispLSI 6192 FF | 25,000 | CPLD: 15 nsec, 70 MHz; register-counter block: 125 MHz | 4608 | 20 nsec/50 MHz | 512×9- or 256×18-bit FIFO buffer
ispLSI 6192 SM | 25,000 | CPLD: 15 nsec, 70 MHz; register-counter block: 125 MHz | 4608 | 20 nsec/50 MHz | 512×9- or 256×18-bit single-port SRAM
ispLSI 6192 DM | 25,000 | CPLD: 15 nsec, 70 MHz; register-counter block: 125 MHz | 4608 | 20 nsec/50 MHz | 512×9- or 256×18-bit dual-port SRAM

Lucent Technologies, Allentown, PA, (800) 372-2447, fax (610) 712-4106
OR2C/T40A | 43,200 (99,400 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 57,600 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T26A | 27,600 (63,600 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 38,864 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T15A | 19,200 (44,200 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 25,600 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T12A | 15,600 (35,800 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 20,736 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T10A | 12,300 (28,300 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 16,384 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T08A | 9400 (21,600 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 12,544 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T06A | 6900 (15,900 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 9,216 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]
OR2C/T04A | 4800 (11,000 with 30% usage as RAM) | 2.1-nsec, four-input-look-up-table delay | 6,400 | 84 MHz | Asynchronous single port, synchronous single port, synchronous dual port [2]

WaferScale Integration, Fremont, CA, (510) 656-5400, fax (510) 657-5916
PSD5XX | As many as 30 macrocells (61 inputs and 140 output product terms), 40 I/O ports, four 16-bit counter/timers, eight-input interrupt controller, 1-Mbit EPROM (×8 or ×16 bits) | Supports host-microcontroller clock speed | 16k | 70 nsec | ×8 or ×16 bits
PSD4XX | As many as 24 macrocells (59 input and 126 output product terms), 40 I/O ports, 1-Mbit EPROM (×8 or ×16 bits) | Supports host-microcontroller clock speed | 16k | 70 nsec | ×8 or ×16 bits
PSD3XX | As many as 40 product terms (16 inputs and 24 outputs), 2-Mbit EPROM (×8 or ×16 bits) | Supports host-microcontroller clock speed | 16k | 70 nsec | ×8 or ×16 bits

Xilinx, San Jose, CA, (408) 559-7778, fax (408) 559-7114
XC40125EX/XL | 125,000 (250,000 with 30% usage as RAM) | 66 MHz [3] | 147,968 | 66 MHz [3] | Distributed (convertible from logic)
XC4085EX/XL | 85,000 (170,000 with 27.5% usage as RAM) | 66 MHz [3] | 100,352 | 66 MHz [3] | Distributed (convertible from logic)
XC4062EX/XL | 62,000 (120,000 with 25% usage as RAM) | 66 MHz [3] | 73,728 | 66 MHz [3] | Distributed (convertible from logic)
XC4052EX/XL | 52,000 (90,000 with 20% usage as RAM) | 66 MHz [3] | 61,952 | 66 MHz [3] | Distributed (convertible from logic)
XC4044EX/XL | 44,000 (74,000 with 20% usage as RAM) | 66 MHz [3] | 51,200 | 66 MHz [3] | Distributed (convertible from logic)
XC4036EX/XL | 36,000 (60,000 with 20% usage as RAM) | 66 MHz [3] | 41,472 | 66 MHz [3] | Distributed (convertible from logic)
XC4028EX/XL | 28,000 (48,000 with 20% usage as RAM) | 66 MHz [3] | 32,768 | 66 MHz [3] | Distributed (convertible from logic)
XC4025E | 25,000 (48,000 with 20% usage as RAM) | 66 MHz [3] | 23,768 | 66 MHz [3] | Distributed (convertible from logic)
XC4020E | 20,000 (37,000 with 20% usage as RAM) | 66 MHz [3] | 25,088 | 66 MHz [3] | Distributed (convertible from logic)
XC4013E/L | 13,000 (27,000 with 20% usage as RAM) | 66 MHz [3] | 18,432 | 66 MHz [3] | Distributed (convertible from logic)
XC4010E/L | 10,000 (19,000 with 20% usage as RAM) | 66 MHz [3] | 12,800 | 66 MHz [3] | Distributed (convertible from logic)
XC4008E | 8000 (16,000 with 20% usage as RAM) | 66 MHz [3] | 10,386 | 66 MHz [3] | Distributed (convertible from logic)
XC4006E | 6000 (12,000 with 20% usage as RAM) | 66 MHz [3] | 8,192 | 66 MHz [3] | Distributed (convertible from logic)
XC4005E/L | 5000 (10,000 with 20% usage as RAM) | 66 MHz [3] | 6,272 | 66 MHz [3] | Distributed (convertible from logic)
XC4003E | 3000 (4000 with 20% usage as RAM) | 66 MHz [3] | 3,200 | 66 MHz [3] | Distributed (convertible from logic)

Notes
[1] Register-to-register delay via four logic elements, three row interconnects, and four local interconnects (maximum for best speed grade).
[2] Single-port configurations yield 64 bits per programmable-logic unit; dual-port configurations yield 32 bits per programmable-logic unit.
[3] Typical system performance; logic is configurable as RAM.

  Author's biography
Rick Nelson, former managing editor of EDN, is a free-lance editor specializing in high-technology topics.


Copyright © 1996 EDN Magazine. EDN is a registered trademark of Reed Properties Inc, used under license.