Lauro on CDNS Palladium-XP2 vs. MENT Veloce 2 vs. SNPS Zebu 3

Source: DeepChip

Subject: Lauro on CDNS Palladium-XP2 vs. MENT Veloce 2 vs. SNPS Zebu 3

> Category 1:
>
>  - Cadence Palladium.  Hats off to Cadence for being pioneers in
>    emulation and sustaining innovation to maintain a very competitive
>    product year-over-year.  
>
>  - Mentor Veloce.  Their revenue numbers show emulation is a growing
>    segment for them.  (See ESNUG 510 #7.)  Clearly Wally and Greg
>    have been investing heavily in emulation.
>
> Category 2:
>
>  - Synopsys EVE Zebu.  This has been the choice for companies and
>    design groups doing mid-size SoCs or blocks for emulation.  It
>    is no secret that Intel was an EVE customer.  (See ESNUG 508 #6.)
>    My expectation is that with the Synopsys acquisition, EVE will now
>    move upstream to challenge Cadence and Mentor at the high end.
>
>        - from http://www.deepchip.com/items/0522-04.html

From: [ Lauro Rizzatti, Emulation Consultant ]

Hi, John,

About 18 months ago, you published a multi-part overview of HW emulators by
Jim Hogan in ESNUG 522 #4.  Hogan's report gave the impression back then
that Cadence Palladium was the undisputed leader in emulation.

Recently, in the DAC'14 #1 engineering survey, 7 readers mentioned MENT
Veloce, 3 cited CDNS Palladium, and only 1 spoke about SNPS EVE ZeBu.

As you know, I was EVE's VP of marketing at one time -- but I no longer work
for Synopsys -- so my analysis doesn't reflect an official Synopsys view.  On
the other hand, over the last few months, I've written articles on emulation
technology, its market and its applications sponsored by Mentor Graphics.

         ----    ----    ----    ----    ----    ----   ----

Based on my knowledge of emulation from careful investigation of published
data, word-of-mouth, common sense, blah, blah, blah... here's my summary of
what each emulation tool has to offer today:

                    Cadence              Mentor               Synopsys EVE
                    Palladium-XP2 (GLX)  Veloce 2             Zebu Server 3
--------------------------------------------------------------------------

Chip Structure      massive array        custom 65nm FPGAs    general purpose
                    of 65nm custom       targeting emulation  Xilinx 28nm FPGAs
                    Boolean processors

Single Cabinet      72 million           1 billion            300 million
Capacity            ASIC-gates           ASIC-gates           ASIC-gates

Total Max           1.1 billion          2.0 billion          2.0 billion
Capacity            ASIC-gates in        ASIC-gates in        ASIC-gates in
                    16 cabinets          2 cabinets           7 cabinets

# of Users          16 users             64 users             5 users
per Cabinet

Max Design          ~2.0 MHz             ~2.0 MHz             ~5.0 MHz
Clock Freq

Compilation         ~70 MG/hour          ~40 MG/hour          ~5 MG/hour
Speed               [single PC]          [server farms]       [server farms]

Design Visibility   full visibility      full visibility      full visibility
w/o Compilation     at high-speed        at high-speed        at low-speed

Single Cabinet      unpublished          ~44.0 kW             ~4.0 kW
Power Use

Cooling System      water cooled         forced air           forced air

Low-Power           CPF                  UPF                  UPF
Analysis

Power               SAIF & FSDB          SAIF & FSDB          SAIF & FSDB
Estimation

Reliability         poor                 good                 excellent

SW Debug            physical or          physical or          physical or
                    virtual JTAG         virtual JTAG         virtual JTAG
                    [intrusive]          [intrusive]          [intrusive]
                                         or
                                         CodeLink
                                         [non-intrusive]

SVA Support?        Yes                  Yes                  Yes

UVM Support?        Yes                  Yes                  No

Save and            Yes                  Yes                  Yes
Restore?

Checkpointing?      Yes                  Yes                  Yes

Functional          Yes                  Yes                  No
Coverage?

Dimensions of       unpublished          unpublished          ~20″x20″x20″
Single Cabinet      [very large]         [very large]

Weight of           unpublished          unpublished          ~154 lbs
Single Cabinet      [very heavy]         [very heavy]

Interconnect        cables               backplane            cables
Network

Best Deployment     excellent            excellent            excellent
                    in ICE               in ICE and in        in TBV
                                         TBX/VirtuaLAB

         ----    ----    ----    ----    ----    ----   ----

CADENCE PALLADIUM:

The original architecture of Palladium was an offspring of an IBM technology
that Quickturn acquired in 1996 and promoted in 1997 under the CoBALT name.
Based on a vast array of Boolean processors, it was sold as an alternative
to Quickturn's standard FPGA-based emulator.

In 1998, Cadence bought Quickturn, and discontinued the FPGA-based approach,
claiming it to be inferior to Palladium CoBALT's custom processor-based
architecture in three main areas:

  - very slow setup-time and compilation time;
  - rather poor debugging capabilities; and
  - a significant drop in execution speed as design size increased.

Over the years, Cadence launched five generations of this custom processor
technology under the name of Palladium.

The 5th and latest implementation, called Palladium-XP2, was introduced in
2013.  It appears to be an improvement of the hardware and software of the
previous Palladium-XP version -- but XP2 is NOT a brand-new emulator based
on a re-spin of its 65nm custom-processor chip.

Palladium-XP2 continues to excel in very fast compilation time; an inherent
benefit of the custom-processor approach.  According to published specs its
compilation speed reaches 70-million gates per hour on a single workstation.

As for maximum design capacity, the benefit of a processor-based emulator is
that instead of a hard limit typical of the FPGA-based emulator, it enjoys
a somewhat soft limit.  A Palladium user can slightly exceed the max capacity
specified by Cadence -- maybe by as much as 10% -- at the expense of a drop
in performance that may be significant.

HEAT, CABLES, AND RELIABILITY

But while overall Palladium design capacity is more than adequate for most
system-on-chip (SoC) designs, Palladiums demand the largest number of boxes
of the three emulators to achieve a comparable capacity.  Just consider that
the maximum Palladium-XP2 capacity of 2.3 billion ASIC-equivalent gates, as
specified in the datasheet, requires a setup of 32 interconnected boxes.
Its interconnection network is a massive collection of cables that affect
the reliability of the Palladium system.  From my research, the largest
Palladium configuration installed today has 16 boxes for about 1.1 billion
ASIC-equivalent gates.  Palladium-XP2 does not scale too well.
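
The box-count argument above is plain arithmetic.  As a rough sketch in
Python, using only the single-cabinet capacities from the summary table
(the helper name cabinets_needed is made up for illustration, and real
configurations depend on achievable utilization, which is ignored here):

```python
import math

# Single-cabinet capacity in millions of ASIC-equivalent gates (MG),
# taken from the summary table in this post.
CABINET_CAPACITY_MG = {
    "Palladium-XP2": 72,
    "Veloce 2": 1000,
    "ZeBu Server 3": 300,
}

def cabinets_needed(design_mg, emulator):
    """Smallest number of cabinets whose combined capacity covers the design."""
    return math.ceil(design_mg / CABINET_CAPACITY_MG[emulator])

# Cabinets required for a 2-billion-gate (2000 MG) design.
for name in CABINET_CAPACITY_MG:
    print(name, cabinets_needed(2000, name))

# Capacity of the 16-box and 32-box Palladium configurations, in MG.
print(16 * CABINET_CAPACITY_MG["Palladium-XP2"])
print(32 * CABINET_CAPACITY_MG["Palladium-XP2"])
```

The last two lines line up with the roughly 1.1 billion and 2.3 billion
gate figures quoted above for 16 and 32 boxes.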

More boxes translate to larger dimensions, heavier weight and more power
consumption.  Palladium-XP2 runs with water cooling.  One negative of its
65nm processor-based technology is that it consumes significantly more
energy than a 65nm or 28nm FPGA-based emulator with equivalent capacity.
This increases the cost-of-ownership and further worsens the emulation
system reliability.

Palladium-XP2 supports CPF (but not UPF) power analysis and it generates
the switching activity for your power estimation tools.

REALLY GOOD AT ICE

In terms of maximum speed of execution, Palladium-XP2 clocks in the range
of 1.5 to 2.0 MHz, i.e., in the same ballpark as Veloce 2, both slower than
ZeBu-Server 3.  However, this speed is reached only in two deployment modes:

                 - In-Circuit Emulation (ICE) mode
                 - targetless mode

In ICE mode, a design-under-test (DUT) is mapped inside the Palladium box.
It uses a socket to reach your external target system, where the rest of
your system (and your test SW) resides.  In targetless mode, your DUT
and your test environment are all included inside the Palladium emulator;
for instance, when the testbench is synthesizable or when your DUT is
crunching on embedded software that has no external dependencies.

Historically, Cadence's (and its predecessor Quickturn's) emulators were
mostly used in ICE mode.  This approach requires a speed adapter -- Cadence
calls them "speed-bridges" -- to accommodate your chip's fast clock rate
(usually hundreds of megahertz or even gigahertz) to the Palladium box's
slow clock rate (one/two megahertz or less).  The long history of Palladium
(CoBALT) has fostered the creation of a large library of speed bridges.  This
is definitely a plus for Cadence.

TRANSACTION BASED VERIFICATION (TBV)

This 1-to-2 MHz speed is NOT achieved in acceleration mode, i.e., when your
DUT is driven by an external software testbench running on the workstation.
This should not be surprising -- if your external testbench talks to your
DUT by way of a programming language interface (PLI) -- but it is a bit
surprising if your interface is based on Direct Programming Interface (DPI)
calls -- typical for transaction-based communication.

Different vendors call it Transaction-Based Acceleration (TBA or TBX) mode
or Transaction-Based Verification (TBV) mode.

Regardless of the name, this verification mode is the emerging trend in the
industry.  It does not require human supervision to plug/unplug speed
adapters when you switch from one design to the next.  As such, TBV is the
mandatory choice for remote access at large emulation datacenters accessible
24/7 from anywhere in the world.  Palladium-XP2 supports TBV, but it is
rumored that its throughput is significantly lower than that of Veloce 2
and ZeBu 3.  Consider that in recent CDNS quarterly earnings calls, Cadence
CEO Lip-Bu Tan has repeatedly claimed "progress in TBA" -- indicating it is
an issue that needs attention.

DEBUG

Hardware emulators are mandatory to clear a large chip of all the residual
bugs that were not uncovered by Verilog/VHDL/SystemVerilog SW runs -- and
before final tape-out.  Needless to say, debug must be efficient: which
means easy to use, effective, and fast.

For general debugging, Palladium supports System Verilog Assertions (SVA),
Universal Verification Methodology (UVM), save/restore, and functional
coverage.

Palladium-XP2 also has "FullVision", defined as "at-speed full visibility of
nets for typically two-million samples during runtime," and "InfiniTrace",
defined as "enables unlimited trace-capture depth and allows users to revert
back to any checkpoint and restart emulation from that point."  Further its
"Dynamic Probes" allow for "fast waveform upload of up to 80 million samples
of selected signals before run."

All of this sounds impressive, but these definitions do NOT clearly state
that extending the timing window from 2 million cycles to 80 million cycles
trades full vision for a partial vision of 50,000 signals that must be
pre-selected at compile time.
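
To put those trace depths in perspective, here is a small Python sketch
that converts samples into emulator run time.  It assumes one sample per
emulation clock cycle and a clock at the ~2.0 MHz top of the range quoted
in this post; both are assumptions for illustration, not Cadence specs.

```python
def window_seconds(samples, clock_hz):
    """Emulator run time covered by a trace, assuming one sample per cycle."""
    return samples / clock_hz

EMU_CLOCK_HZ = 2_000_000  # assumed ~2.0 MHz emulation clock

# Full visibility of all nets: about 2 million samples.
full_vision = window_seconds(2_000_000, EMU_CLOCK_HZ)

# Partial visibility of pre-selected signals: up to 80 million samples.
dynamic_probes = window_seconds(80_000_000, EMU_CLOCK_HZ)

print(f"full visibility window:    {full_vision:.1f} s")
print(f"partial visibility window: {dynamic_probes:.1f} s")
```

Under these assumptions the full-visibility window covers about one second
of emulator run time, while the 40x deeper window applies only to the
pre-selected signals.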

EMBEDDED DEBUG

For embedded software validation, Palladium-XP2 supports software debugging
by way of a physical JTAG connection in ICE mode at full emulation speed.
This is a popular method that requires a HW debug infrastructure embedded in
your DUT.  As an alternative to a physical JTAG connection, Palladium-XP2
can be deployed with a transaction-based virtual JTAG connection.

A virtual JTAG presents several benefits vs. a physical JTAG:

  - Virtual can be used earlier in the design cycle
  - Virtual removes complexities due to physical timing
    dependencies making it simpler/quicker/cheaper to use.
  - Virtual lets you create massive emulation datacenters
    as mentioned earlier.

However, debugging your chip's software by way of a virtual JTAG connection
in a Palladium-XP2 is rather slow.

         ----    ----    ----    ----    ----    ----   ----

With two decades in business, Cadence Palladium easily enjoys the largest
list of customers from all segments of the semiconductor industry.

         ----    ----    ----    ----    ----    ----   ----
         ----    ----    ----    ----    ----    ----   ----
         ----    ----    ----    ----    ----    ----   ----

MENTOR VELOCE:

In 2012, Mentor Graphics launched a new emulator, Veloce-2, an evolution
of their 65nm custom emulator-on-a-chip first introduced with Veloce.
The "emulator-on-chip" concept was architected by merging the custom-FPGA
emulator from Meta Systems, a French startup acquired by Mentor in 1996,
with the Virtual Wire approach implemented in the IKOS VStation emulators
purchased by Mentor in 2002.

Compilation speed of a Veloce-2 stands at about 35 million gates per hour
on a farm of workstations -- a notch below Palladium-XP2.

The Veloce-2 execution speed hovers around 1.5 MHz -- with only a minor drop
as design size increases.  This comes from its scalable architecture, built
on an active backplane that removes the interconnect bottlenecks caused by
cables.

In early 2014, Mentor announced a new operating system called "Veloce OS3"
that makes the emulator a global datacenter.

Veloce-2 doubles the capacity of the original Veloce (launched in 2007) to
2 billion ASIC-equivalent gates with two interconnected Maximus cabinets:
each accommodating 1 billion gates.  From what I can find, the largest
configuration installed today is two Maximus cabinets for almost 2 billion
ASIC gates.  Veloce-2 scales to almost the target maximum capacity.

LESS HEAT, LESS CABLES, MORE RELIABLE

The Maximus cabinet is made up of four units interconnected internally with
a backplane and limited cabling to avoid impacting reliability.  Lots of
cables means lots of failure points.

The Veloce-2 forced air cooling makes it easier and less expensive to
install than a Palladium-XP2.  And not only does air cooling save on the
user's A/C bills, it also increases Veloce-2's reliability as compared to
a water cooled emulator.

Veloce-2 supports UPF (but not CPF) power analysis and generates switching
activity for power estimation tools.

For ICE support, Veloce-2 also has a large library of speed adapters.

TRANSACTION BASED VERIFICATION

Mentor is actively pushing its Transaction-Based-Acceleration (TBX), because
it does not require human supervision to plug/unplug speed adapters
when you switch from one design to the next -- thus enabling large remote
emulation datacenters.  While no vendor publishes specs for their throughput
in acceleration mode, Veloce-2 users claim that they've seen no degradation
in speed while switching from ICE to TBX.  In fact, a few Veloce-2 users
reported higher throughput in TBX than in ICE.

One negative aspect to the TBX acceleration mode is that you need to create
a testbench.  Mentor addressed this by introducing their VirtuaLAB concept;
which is their virtual target system that's functionally equivalent to a
physical target system, but without the need for cables nor speed adapters.

The VirtuaLAB is driven by operating systems, drivers, and stacks of software
running on the emulator.  It eliminates the need to create a testbench -- a
foreign concept for a software developer used to writing software programs.

Veloce-2 provides 100% visibility without compilation.  Its on-board
memories, coupled to the 65nm emulator-on-chip devices, store up to 500 K
samples of "compressed" data -- including registers and memory contents.
The data is uploaded to the host workstation by way of wideband channels,
and a reconstruction mechanism running on the host computer rebuilds the
waveforms of all of your combinational logic nodes.

DEBUG

While some Veloce-2 debug is similar to Palladium-XP2's, Mentor devised a
faster debugging scheme based on the on-demand waveform streaming of a few
selected signals without requiring compilation.  A Veloce-2 debug process
called "back-replay debug" consists of rewinding and re-running a test with
added visibility such as: assertions, monitors, trackers, $display, and
waveform capture.  It removes the need for a testbench and reduces the
amount of data sent to the host, providing a boost in time-to-visibility
in a fully deterministic, repetitive environment.

Like Palladium, it also supports SVAs, UVM, save/restore, and functional
coverage.

EMBEDDED DEBUG

In addition to physical and virtual JTAG connections like a Palladium box,
a Veloce-2 can do tracing software debug by way of Codelink.  When your DUT
does not have any hardware debug infrastructure in place, Codelink traces
the state of your processor by observing signals in and around the RTL code
of the processors in your design -- and it does not interfere with the
operation of your design being run.  With Codelink, the HW developer can
begin debugging earlier in the design cycle and offline.

Another Veloce-2 approach replaces the RTL processing cores in your SoC
design with QEMU-based cores -- and then moves them into the host connected
to Veloce-2 by way of transactors.  The emulator continues to execute the
remaining synthesizable portion of your SoC.  This pushes performance from
the 1 to 3 MIPS seen when your entire SoC is mapped inside Veloce-2 up to
an upper limit of 100 MIPS.  With this, some users are booting Android and
then running applications like Antutu for performance characterization
prior to silicon.
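
To see why that MIPS jump matters for booting an OS, here is a
back-of-the-envelope Python sketch.  The instruction count is a purely
hypothetical placeholder (10 billion instructions), not a measured figure
for Android or any other OS; only the 1, 3 and 100 MIPS rates come from
the numbers quoted above.

```python
def run_seconds(instructions, mips):
    """Wall-clock seconds to execute `instructions` at `mips` million instr/sec."""
    return instructions / (mips * 1_000_000)

# Hypothetical workload size, chosen only to illustrate the scaling.
BOOT_INSTRUCTIONS = 10_000_000_000

for mips in (1, 3, 100):
    minutes = run_seconds(BOOT_INSTRUCTIONS, mips) / 60
    print(f"{mips:3d} MIPS: {minutes:6.1f} minutes")
```

Whatever the real instruction count is, the ratio holds: going from 3 MIPS
to 100 MIPS cuts the run time by a factor of about 33.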

         ----    ----    ----    ----    ----    ----   ----

Today, Mentor Veloce cannot match the sheer number of customers claimed
by Cadence Palladium but, in the past 2 to 3 years, MENT has increased its
own customer base by taking a bite out of the CDNS customer base.

         ----    ----    ----    ----    ----    ----   ----
         ----    ----    ----    ----    ----    ----   ----
         ----    ----    ----    ----    ----    ----   ----

SYNOPSYS EVE ZEBU:

After acquiring EVE in 2012, Synopsys launched the ZeBu Server 3 in 2014.
EVE was an early developer of standard FPGA-based emulators.  The name ZeBu
stands for "zero-bugs", the goal being that your design has no bugs.

ZeBu Server 3 is based on the Xilinx 28nm Virtex7-LX2000T.  While all the
FPGA prototyping companies assume the ~12 million gate capacity of the
V7-LX2000T, Synopsys elected to lower it (i.e. ~50% utilization) to about
6.5 million gates. 

From what I can find, the largest ZeBu configuration installed today is
7 boxes for close to 2 billion ASIC gates.  ZeBu Server 3 scales nicely,
but it may not reach the target of 3 billion ASIC-equivalent gates.

SLOW COMPILES, FAST RUNS

Its horrible 5 M gates/hour design compilation speed puts the ZeBu Server 3
at a disadvantage vis-a-vis the 70 M gates/hour Palladium and 35 M gates/hour
Veloce.  It compiles at 1/14th the speed of a Palladium!

The main compile-time hurdle is in the place-and-route of the Xilinx FPGAs.
Synopsys does not publish data, but it is public knowledge that the P&R of
a Virtex7-LX2000 may take several hours -- even while limiting the resource
utilization to 50% or less.
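
For a feel of what these compile speeds mean in practice, here is a
minimal Python sketch.  It assumes compile time scales linearly with
design size, which is an idealization, and the 100-million-gate design is
a hypothetical example; the MG/hour rates are the ones quoted in the text.

```python
# Compile speeds in millions of gates per hour (MG/hour), as quoted above.
COMPILE_MG_PER_HOUR = {
    "Palladium-XP2": 70,
    "Veloce-2": 35,
    "ZeBu Server 3": 5,
}

def compile_hours(design_mg, emulator):
    """Idealized compile time, assuming linear scaling with design size."""
    return design_mg / COMPILE_MG_PER_HOUR[emulator]

DESIGN_MG = 100  # hypothetical 100-million-gate design

for name in COMPILE_MG_PER_HOUR:
    print(f"{name:14s} {compile_hours(DESIGN_MG, name):5.1f} hours")

# Ratio behind the "1/14th" remark: Palladium rate over ZeBu rate.
print(COMPILE_MG_PER_HOUR["Palladium-XP2"] / COMPILE_MG_PER_HOUR["ZeBu Server 3"])
```

Under these assumptions the same design takes roughly 1.4 hours on
Palladium, 2.9 hours on Veloce, and 20 hours on ZeBu.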

The ZeBu Server 3 leads the pack with the highest clock speed, bordering
the performance of 28nm FPGA prototyping for designs of 100+ million gates.
But this performance drops significantly when multiple ZeBu boxes are used
due to the massive interconnecting cabling.

ZeBu Server 3 supports UPF (but not CPF) power analysis and it generates
switching activity for power estimation tools.

Compared to the Palladium XP2 and the Veloce 2, the forced air cooling of
the ZeBu 3, plus its small physical dimensions, light weight, and low
power consumption, give the ZeBu 3 relatively high reliability.
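
One way to compare power figures across boxes of very different capacity
is to normalize per gate.  The Python sketch below does that with the
single-cabinet numbers from the summary table; the per-gate normalization
is my own framing, and Palladium-XP2 is left out because its power use is
unpublished.

```python
# (power in kW, capacity in millions of ASIC gates) for a single cabinet,
# taken from the summary table in this post.
CABINETS = {
    "Veloce 2": (44.0, 1000),
    "ZeBu Server 3": (4.0, 300),
}

def watts_per_mgate(power_kw, capacity_mg):
    """Cabinet power divided by capacity, in watts per million gates."""
    return power_kw * 1000.0 / capacity_mg

for name, (power_kw, capacity_mg) in CABINETS.items():
    print(f"{name:14s} {watts_per_mgate(power_kw, capacity_mg):5.1f} W per M gates")
```

With these table values, Veloce 2 comes out near 44 W per million gates
and ZeBu Server 3 near 13 W per million gates.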

LESS ICE, MORE TBV

Apparently Synopsys continues EVE's approach of not actively promoting ICE,
and instead supports TBV.  Just like the Veloce 2, the ZeBu Server 3
performs TBV at speeds in the same ballpark as its ICE.

DEBUG

Design debug is 100% visibility via dynamic probing, a feature that takes
advantage of the built-in scan chain in the Xilinx Virtex FPGAs.  While
dynamic probing does not require compilation, it comes with a drawback:
retrieving data takes a long time, proceeding at only a few 10's of hertz.

The sequential data activity is not stored in on-board memories.  Rather,
it is sent directly to the host server where the combinational activity is
recreated via a proprietary mechanism.  EVE/Synopsys points out that the
overall performance of doing: 

          - data retrieval via dynamic probing,
          - data transfer to the server, and
          - reconstruction of combinational data

is comparable to Palladium or Veloce when doing these same tasks combined.

The ZeBu Server 3 also supports SVAs and save/restore, but there is NO
mention of UVM or functional coverage.

         ----    ----    ----    ----    ----    ----   ----

By all indications, the ZeBu is trailing behind Palladium and Veloce in
the number of customers.  A monitoring of the quarterly earnings calls
would reveal that Mentor and Cadence boast success after success in the
emulation space.  Not so for Synopsys.  This may be company policy, or it
may reflect the difficulty in reporting a sales "win".

         ----    ----    ----    ----    ----    ----   ----
         ----    ----    ----    ----    ----    ----   ----
         ----    ----    ----    ----    ----    ----   ----

CONCLUSION:

Palladium-XP2 offers the fastest compilation speed combined with excellent
HW debug capabilities.  Fast 2.0 MHz execution speed.  For its 1.1 B gate
max capacity, scalability is questionable when design sizes approach/exceed
those billion gates.  Palladium is very strong in ICE, but has noticeably
slower TBV compared to its rivals.  CPF.

Palladium's large physical footprint, energy consumption, water cooling,
and reliability are not the best.

Veloce-2 does fast compilation and excellent debug but it also has added
stuff like: on-demand waveform streaming of a few selected signals without
requiring compilation, less need to write testbenches, QEMU-based cores,
and tracing software.  Fast 2.0 MHz execution speed.  It runs both TBV and
ICE equally fast.  Large 2.0 B gate max capacity, and scalable.  UPF.

Having a backplane with much less cabling, being air cooled, and using
less energy give the Veloce-2 box good reliability.

ZeBu Server 3 leads the pack with the highest 5.0 MHz execution speed, and
large 2.0 B gate max capacity.  Ideal for software debug in TBV.  But no
mention of ICE, no UVM, no functional coverage.  And its compilation speed
is 1/14th that of Palladium.  Data retrieval is painfully slow at 10's of
hertz.  UPF.

Small size, low power, and air cooling give ZeBu Server 3 high reliability.

Today, all three emulator choices can do the job, some better than others.

    - Lauro Rizzatti, emulation consultant