# **ALL PROGRAMMABLE**

# Soft FIT Reliability Zynq-7000 AP SoC

Austin Lesea, Wojciech Koszek, Glenn Steiner, Gary Swift, and Dagan White

Xilinx, Inc.

#### April 2014

© Copyright 2014 Xilinx

# Zynq-7000 in 1 picture: ARM CPU + FPGA fabric





#### XILINX > ALL PROGRAMMABLE.



- Zynq SEU Testing Methodology
- FIT Results Based on Radiation Measurements
- Possible Configurations for Reduced FIT
- Summary






# **Testing Methodology**


# **Objective**

Obtain soft FIT data based on SEU proton beam measurements while the processor is executing code representative of a typical user application.






### How Did We Test?






# What Was the Test Code and What Did it Cover?

#### > Xilinx proprietary System Level Testing OS

- Proven test suite used for testing PowerPC/MicroBlaze/ARM CPUs
- Used across Xilinx for all silicon processors since 2008

#### > Very aggressive testing and error identification

- Continuous result checks rather than checks at the end of a sequence of operations
- Executes in symmetric multiprocessing (SMP) mode

#### > Test application code mix is representative of typical code mix

- Loads/Stores
- Branches
- Conditionals
- Integer / Floating Point / NEON
- Exceptions / Interrupts
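As an illustration of the difference between continuous result checks and a single check at the end of a sequence, the following minimal Python sketch compares each step's result against a known-good value as soon as it is produced. It is not the Xilinx System Level Testing OS; the names `Step`, `run_with_continuous_checks`, and `check_at_end`, and the sample steps, are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (step name, operation, known-good result).
Step = Tuple[str, Callable[[], int], int]


def run_with_continuous_checks(steps: List[Step]) -> List[str]:
    """Run every step and compare its result immediately.

    Returns the names of all steps whose result did not match.
    """
    failures = []
    for name, operation, expected in steps:
        if operation() != expected:
            failures.append(name)
    return failures


def check_at_end(steps: List[Step]) -> List[str]:
    """Contrast case: run every step but compare only the last result."""
    result = None
    for _, operation, _ in steps:
        result = operation()
    last_name, _, last_expected = steps[-1]
    return [] if result == last_expected else [last_name]


# Illustrative sequence in which the middle step returns a wrong value,
# standing in for an upset that does not propagate to the final result.
example_steps: List[Step] = [
    ("load", lambda: 7, 7),
    ("branch", lambda: 1, 0),   # simulated upset: returns 1, expected 0
    ("store", lambda: 42, 42),
]

print(run_with_continuous_checks(example_steps))  # ['branch']
print(check_at_end(example_steps))                # []
```

In this toy sequence the per-step comparison reports the faulty step, while the end-of-sequence comparison reports nothing because the final result happens to be correct.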





# What Was Tested?

#### > APU

- Core 0 and 1
  - A9-MPCore
  - I Cache
  - D Cache
  - NEON/FPU
- Snoop Control
- L2 Cache
- OCM
- OCM Interconnect
- MMU
- GIC

#### > IOP

- CAN w/ DMA
- Ethernet w/ DMA
- I2C
- SD/SDIO
- UART
- NOR & SRAM Interfaces
- GPIO
- IO MUX & MIO

#### > PS / Other

- DDR/DMA Controller Logic
- Central Switch
- Device Configuration
- XADC
- System clocks


# Logging

#### > All processor exceptions (interrupts) including:

- Parity errors
- Invalid instructions
- Data and Prefetch Aborts (invalid memory access)
- SCU Errors
- MMU Errors
- TAG RAM Errors
- Secure mode exceptions
- Invalid or Unexpected Interrupts

#### > System hangs

- One observed (0.67 FIT)

#### > Software data result compare errors

- Zero observed





## **Test Facility**



- The cyclotron can accelerate protons from 1 MeV to 68 MeV
- On this beam line the beam spot can be up to 6 cm in diameter, but we used a smaller one
  - Just large enough to cover the Zynq die
- Experiment runs in the beam were fully automated to eliminate operator error





# **Testing**

#### Target platform

- ZC702 board with a Zynq 7020 device
- Over 25 hours of testing over 3 days
- More than 500 experiments performed
- More than 5,000 upsets documented



- Equivalent to 175,000 years of terrestrial radiation exposure
  - New York City
  - Correlated via Xilinx Rosetta Experiments over 10 years and 6 generations of products
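FIT is conventionally defined as failures per 10^9 device-hours, so an event count and an equivalent exposure time convert to FIT directly. The sketch below is a minimal illustration, assuming an 8,760-hour year; the function name `fit_rate` is hypothetical, and this is not necessarily the exact calculation behind the reported numbers.

```python
HOURS_PER_YEAR = 8760          # assumption: 365-day year
FIT_DEVICE_HOURS = 1e9         # FIT = failures per 10^9 device-hours


def fit_rate(failures: int, equivalent_years: float) -> float:
    """Convert an event count over an equivalent exposure into FIT."""
    if equivalent_years <= 0:
        raise ValueError("equivalent_years must be positive")
    device_hours = equivalent_years * HOURS_PER_YEAR
    return failures * FIT_DEVICE_HOURS / device_hours


# One event over the 175,000 equivalent years quoted above.
print(f"{fit_rate(1, 175_000):.2f} FIT")  # 0.65 FIT
```

Under these assumptions, one event in 175,000 equivalent years comes out near 0.65 FIT, in line with the 0.67 FIT reported for the single observed system hang; the small difference may reflect rounding of the exposure figure.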




# **FIT Results Based on Radiation Measurements**


# Measured Results Better Than Predicted by a Factor of 2

- Zynq data is based on TSMC 28nm process node
- Prior estimates were based on ARM and TSMC data

#### > ARM made certain assumptions on design implementation

- Xilinx design implementation exceeded these assumptions

#### > All ARM and TSMC data is covered by their NDAs with Xilinx

- Xilinx is prohibited from sharing their numbers





# **Proton Beam Testing Derived FIT**

- Xilinx has a significant lead in SEU measurement, reporting, and mitigation
  - Xilinx performs accelerated soft error testing and in-situ testing
  - Xilinx UG116 documents FIT/Mb data for its devices
  - Xilinx FPGAs outperform competing FPGAs in calculated FIT
    - On average, 2X better FIT for equivalent FPGA density/functionality
  - Xilinx offers mitigation IP (SEM) to improve FIT in the PL
    - Xilinx is presenting more on this topic on the second day, 15:15 to 16:30, Session XII: Errors in Memories

Results available under NDA

<sup>\*</sup> Xilinx beam test results for a typical embedded application.

<sup>\*\*</sup> Measurement accuracy ±15% with a 95% confidence interval.

<sup>\*\*\*</sup> Result based on typical design using a significant amount of logic fabric.
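For context on the 95% confidence figure: when a rate is derived from a count of independent events, counting statistics alone set a floor on the uncertainty. The sketch below uses the normal approximation to the Poisson distribution, which is only reasonable for large counts. It is an illustration, not the method used for the quoted accuracy; the source does not state how that figure was derived, and it may include other contributions such as beam dosimetry (an assumption here). The function names are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def relative_half_width_95(event_count: int) -> float:
    """Approximate 95% relative half-width for a Poisson count."""
    if event_count <= 0:
        raise ValueError("event_count must be positive")
    return Z_95 / math.sqrt(event_count)


def events_needed(relative_half_width: float) -> int:
    """Smallest count whose approximate 95% half-width meets the target."""
    if relative_half_width <= 0:
        raise ValueError("relative_half_width must be positive")
    return math.ceil((Z_95 / relative_half_width) ** 2)


# Counting-statistics contribution for a count of 5,000 events.
print(f"{relative_half_width_95(5000):.1%}")  # 2.8%
# Count needed for counting statistics alone to reach 15%.
print(events_needed(0.15))                    # 171
```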

# Silent Data Corruption (SDC)

- SDC occurs when the result of a process is not correct
- No interrupt, no exception, no error flag, and no timeout is raised
- BUT the arithmetic result is wrong!
- Failure rate is less than 15 FIT in these tests






# **Possible Solutions to Reduce FIT (Sample Configurations)**

# Scenario: Error Detection Is Required, Reboot Is Acceptable

- PS monitors parity errors/exceptions
- Reboot when an error is detected

#### **FIT = 0.67**

- Corresponds to the one undetected error observed (a processor hang)

#### > Zero undetected errors with a watchdog timer in the PS and a backup watchdog timer in the PL
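The two-watchdog arrangement can be pictured with a small tick-based simulation. This is a sketch under stated assumptions, not Zynq driver code: the `Watchdog` class, the `simulate` function, the timeout values, and the idea that the PS watchdog might itself fail to fire are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Watchdog:
    """Counts down once per tick and expires unless kicked in time."""
    timeout_ticks: int
    remaining: int = field(init=False)

    def __post_init__(self) -> None:
        self.remaining = self.timeout_ticks

    def kick(self) -> None:
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one tick; return True once the countdown reaches zero."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def simulate(hang_at_tick: int, ps_wdt_works: bool,
             total_ticks: int = 20) -> Optional[Tuple[str, int]]:
    """Return (which watchdog forced the reboot, tick), or None if no reboot."""
    ps_wdt = Watchdog(timeout_ticks=3)   # primary watchdog in the PS
    pl_wdt = Watchdog(timeout_ticks=6)   # backup in the PL, longer timeout
    for tick in range(total_ticks):
        if tick < hang_at_tick:          # healthy software kicks both timers
            ps_wdt.kick()
            pl_wdt.kick()
        ps_expired = ps_wdt.tick()
        pl_expired = pl_wdt.tick()
        if ps_expired and ps_wdt_works:
            return ("PS", tick)
        if pl_expired:
            return ("PL", tick)
    return None


print(simulate(hang_at_tick=5, ps_wdt_works=True))    # ('PS', 6)
print(simulate(hang_at_tick=5, ps_wdt_works=False))   # ('PL', 9)
print(simulate(hang_at_tick=100, ps_wdt_works=True))  # None
```

Giving the backup a longer timeout lets the primary watchdog handle the common case, while a hang that also disables the primary is still caught a few ticks later.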





# Scenario: Reduce FIT by 67% (3X Time to Fail), Some Performance Is Compromised

#### > L2 Cache Disabled

- Performance impact:
  - Ranges from zero for small arrays and "localized" code to 70% for large random arrays

#### > OCM not used, or BRAM with ECC is used

#### > Parity and watchdogs per previous scenario

Performance impact is based on read, write, and read/write tests, using both random and varying memory strides
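The link between a FIT reduction and the quoted time-to-fail multiplier follows from mean time to fail being inversely proportional to failure rate. The sketch below assumes a constant failure rate; the function name `time_to_fail_multiplier` is hypothetical.

```python
def time_to_fail_multiplier(fit_fraction_remaining: float) -> float:
    """Mean time to fail scales as 1 / FIT for a constant failure rate."""
    if not 0 < fit_fraction_remaining <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / fit_fraction_remaining


# A 67% reduction leaves 33% of the original FIT.
print(round(time_to_fail_multiplier(1 - 0.67), 2))  # 3.03
# Reducing FIT to one quarter of the original.
print(time_to_fail_multiplier(0.25))                # 4.0
```

This matches the "3X" figure for a 67% reduction and the "4X" figure for a reduction to one quarter.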





# Scenario: Reduce FIT to ~1/4 (4X Time to Fail), Performance Is Compromised

#### > Single core used

#### > L2 Cache Disabled

- Performance impact:
  - Ranges from zero for small arrays and "localized" code to 70% for large random arrays

#### > OCM not used, or BRAM with ECC used instead

#### > Parity and watchdogs per previous scenario





## **Summary**

#### > Zynq FIT is half of TSMC/ARM predictions

#### > Xilinx is confident in its results

- Large number of tests
- Test process proven over multiple generations of chips
- Measurement accuracy ±15% with a 95% confidence interval

#### > Techniques exist enabling reduced FIT implementations

- Placing watchdogs in the PL leads to zero undetected events



