IN-HOUSE EDA PART 5

DFT: The Crucial Gap in Open-Source Chip Design

Synthesis has Yosys. Place-and-route has OpenROAD. DFT has nothing. As we neared tapeout, we had no choice — we built the missing infrastructure ourselves.

WIOWIZ Technologies • February 2026 • 18 min read

1. The Gap That Blocks Tapeout

As we neared tapeout, a hard reality set in. Our RTL was verified. Synthesis ran clean through Yosys. Place-and-route was handled by OpenROAD.

But none of that gets you a testable chip.

Without scan insertion, every internal flip-flop is buried behind sequential logic — unreachable from chip I/O pins. Without fault simulation and ATPG, there are no test patterns. Without fault coverage numbers, no production flow can sign off. Without STIL export, no ATE machine can load your vectors. Without JTAG, there is no debug access to the die.

The open-source RTL-to-GDSII flow has a complete hole where manufacturing testability should be. Yosys, OpenROAD, Verilator, SymbiYosys — none of them address DFT. Not partially. Not at all. This isn't a gap in quality. It's a gap in existence.

eFabless, ChipFoundry, and Open MPW shuttles accept designs without DFT. The chips come back untested. For a test chip, maybe acceptable. For anything headed toward production, it's a non-starter.

We needed DFT for our tapeout. It didn't exist. So we built the pipeline in-house — scan insertion, fault simulation, ATPG, pattern export, JTAG, MBIST, and everything beyond — learning every step the hard way.

JTAG: Control port, the front door to everything on die

Scan + ATPG: The actual test. This is where 80% of DFT engineering effort lives

BIST: Self-test hardware for memories (MBIST) and logic (LBIST)

Fault sim: Validates everything. Proves your patterns actually detect real faults

STIL export: The bridge to silicon. Translates patterns into an ATE-loadable format

Most engineers think DFT = JTAG. Reality: JTAG is ~5% of the work. Scan + ATPG is 80%.

2. The Complete DFT Pipeline

DFT is not a single tool. It's a pipeline. Each step feeds the next. Skip any step and the chain breaks. What we built — DFT — covers the full flow from RTL to ATE-ready test vectors, with JTAG and BIST as parallel infrastructure layers.

DFT: Full Pipeline Architecture (diagram)

Pipeline stages:
RTL Design: Verilog / SystemVerilog
Scan Insertion: DFT FF + chain stitching (dfxtp → sdfrtp)
Fault Simulation: stuck-at / transition, 86 Sky130 gate models
ATE Export: STIL / WGL for Teradyne · Advantest
Coverage Analysis: Detected / Redundant / Untested, target ≥ 95%
ATPG: Random + Exhaustive + PODEM

Parallel infrastructure:
JTAG (IEEE 1149.1): TAP FSM · IR · BYPASS · IDCODE · Boundary Scan · Scan Access
BIST: MBIST March C- (4× 256×32); LBIST PRPG + MISR on-chip

Stages exchange JSON artifacts: RTL netlist, fault lists, tester vectors, boundary-scan data, and test results.

DFT: 10 tools, 1 unified runner, 1 HTML signoff report

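Scan Insertion converts every flip-flop into a scannable element. A mux on each flip-flop's D input selects between functional data and the previous flip-flop's output, so in test mode the flip-flops form one shift register. Without this, internal state is unreachable from chip pins.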
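To make the mux chaining concrete, here is a minimal Python sketch of a chain of mux-D scan flip-flops. The function names and list-based state are illustrative assumptions, not the tool's internals: with scan enable high every flop takes its neighbour's value, so n shift cycles can load any bit pattern; with scan enable low the flops capture functional data.

```python
from typing import List

def scan_clock(state: List[int], d: List[int], scan_enable: int, scan_in: int) -> List[int]:
    """One clock edge on a chain of mux-D scan flip-flops.

    Each flop's D pin is fed by a mux: functional data when scan_enable is 0,
    the previous flop's Q when it is 1. state[0] sits next to scan_in and
    state[-1] drives scan_out.
    """
    if scan_enable:
        return [scan_in] + state[:-1]   # shift: every flop takes its neighbour's value
    return list(d)                      # capture: every flop loads its functional D

def load_chain(n: int, bits: List[int]) -> List[int]:
    """Shift `bits` in, first bit first; after n shifts the first bit sits in state[-1]."""
    state = [0] * n
    for b in bits:
        state = scan_clock(state, [0] * n, scan_enable=1, scan_in=b)
    return state
```

To load a target state, shift it in last-flop-first: `load_chain(4, target[::-1])` returns `target`.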

Fault Simulation models manufacturing defects. Every net can be stuck-at-0 or stuck-at-1, so the fault list holds two faults per net: the 273,000-gate NVDLA netlist produces over 580,000 stuck-at faults. Each must be simulated individually.
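A minimal serial fault simulator shows what simulating each fault individually means. It assumes a hypothetical toy netlist format (output net mapped to a gate type and input nets, in topological order) and simple two-input gates rather than Sky130 cells; a fault counts as detected when any primary output differs from its fault-free value.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy netlist: output net -> (gate, input nets), in topological order.
Netlist = Dict[str, Tuple[str, List[str]]]
Fault = Tuple[str, int]                       # (net, stuck value)
GATES = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "XOR": lambda a, b: a ^ b}

def simulate(nl: Netlist, pattern: Dict[str, int], fault: Optional[Fault] = None) -> Dict[str, int]:
    vals = dict(pattern)
    if fault is not None and fault[0] in vals:   # stuck primary input
        vals[fault[0]] = fault[1]
    for net, (gate, ins) in nl.items():
        vals[net] = GATES[gate](*(vals[i] for i in ins))
        if fault is not None and net == fault[0]:
            vals[net] = fault[1]                 # stuck gate output, seen by all fanout
    return vals

def serial_fault_sim(nl: Netlist, pis: List[str], pos: List[str],
                     patterns: List[Dict[str, int]]) -> Tuple[List[Fault], Set[Fault]]:
    faults = [(n, v) for n in list(pis) + list(nl) for v in (0, 1)]  # SA0 + SA1 per net
    detected: Set[Fault] = set()
    for f in faults:                             # one fault at a time
        for p in patterns:
            good, bad = simulate(nl, p), simulate(nl, p, f)
            if any(good[o] != bad[o] for o in pos):
                detected.add(f)
                break                            # fault dropping: first detection is enough
    return faults, detected
```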

ATPG generates the test vectors that detect those faults. Random patterns catch the easy ones. Exhaustive enumeration handles small logic cones. PODEM targets the structurally hard faults.

STIL/WGL Export translates patterns into STIL (IEEE 1450) and WGL, formats that Teradyne and Advantest ATE flows consume. Without this, patterns never reach silicon.

3. Scan Insertion — Every Flip-Flop Converted

Normal mode:  logic → D → [FF] → Q → logic

Scan mode:    scan_in → [FF₁] → [FF₂] → ... → [FFₙ] → scan_out
              (shift register: load ANY value into ANY flip-flop)

We ran scan insertion on three open-source designs of increasing complexity, all synthesized to Sky130:

Design	Gates	DFFs	Scan FFs
IBEX (RISC-V core)	10,987	1,969	1,969
PicoSoC (RISC-V SoC)	24,976	10,723	10,723
NVDLA SDP (NVIDIA DLA)	273,240	29,061	29,061

Every flip-flop converted. Zero left behind. Sky130 has specific scan cell variants (sdfrtp, sedfxtp), and the insertion engine maps each non-scan cell to its correct scan equivalent, including edge cases like the edfxtp and dfstp families that the PDK documentation doesn't always make obvious.
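The cell substitution itself can be sketched as a name rewrite that keeps the library prefix and drive strength. Only dfxtp → sdfrtp comes from the article; the other table entries are assumptions based on Sky130 naming and should be checked against the PDK's liberty files, as should the existence of each drive strength.

```python
import re

# Illustrative non-scan -> scan base-cell map for sky130_fd_sc_hd.
# dfxtp -> sdfrtp is from the article (presumably with RESET_B tied inactive);
# the other entries are assumptions to verify against the PDK liberty files.
SCAN_EQUIV = {
    "dfxtp": "sdfrtp",
    "dfrtp": "sdfrtp",
    "dfstp": "sdfstp",
    "edfxtp": "sedfxtp",
}

# library prefix, base cell, drive strength, e.g. sky130_fd_sc_hd__dfxtp_1
CELL_RE = re.compile(r"^(sky130_fd_sc_\w+?)__([a-z0-9]+)_(\d+)$")

def to_scan_cell(cell: str) -> str:
    """Map a non-scan flop to its scan equivalent, keeping library and drive strength."""
    m = CELL_RE.match(cell)
    if m is None:
        raise ValueError(f"not a Sky130 standard-cell name: {cell}")
    lib, base, drive = m.groups()
    if base not in SCAN_EQUIV:
        raise KeyError(f"no scan equivalent recorded for {base}")
    return f"{lib}__{SCAN_EQUIV[base]}_{drive}"
```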

4. Gate Models — Where Coverage Really Comes From

We model 86 Sky130 standard cell types. Not just and2 and or2: complex compound cells like a221oi (AND-AND-OR-Invert), nand4bb (NAND with two inverted inputs), o2bb2ai, and mux4. Each has specific pin names, inversion behavior, and boolean equations.

This is the part nobody talks about. Everyone focuses on the ATPG algorithm — PODEM, D-algorithm, FAN. But coverage comes from the gate models. If your boolean model for a221oi evaluates pin A1 where it should evaluate pin B1, every fault simulation through that gate produces wrong results. And the dangerous part: wrong results look plausible. You get 36% coverage and think the ATPG needs work. The ATPG is fine. Your gates are lying to you.

We discovered this after coverage plateaued at 36.27% on IBEX. The diagnosis: 3,595 compound gates — 33% of the circuit — were being evaluated with wrong boolean functions. Not slightly wrong. Fundamentally wrong. o21ai, a21oi, a22oi, maj3 — all treated as simple OR gates.
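For reference, here are Python models of several compound cells named above, written from the Sky130 cell functions (pin names follow the PDK), next to the "treat it as OR" bug. Treat them as a sketch to verify against the liberty files, not as the tool's own models.

```python
# Sky130 compound-cell functions; inputs and outputs are 0/1 ints.
def a21oi(A1, A2, B1):            # Y = !((A1 & A2) | B1)
    return 1 - ((A1 & A2) | B1)

def o21ai(A1, A2, B1):            # Y = !((A1 | A2) & B1)
    return 1 - ((A1 | A2) & B1)

def a22oi(A1, A2, B1, B2):        # Y = !((A1 & A2) | (B1 & B2))
    return 1 - ((A1 & A2) | (B1 & B2))

def a221oi(A1, A2, B1, B2, C1):   # Y = !((A1 & A2) | (B1 & B2) | C1)
    return 1 - ((A1 & A2) | (B1 & B2) | C1)

def maj3(A, B, C):                # X = 1 when at least two inputs are 1
    return (A & B) | (A & C) | (B & C)

def nand4bb(A_N, B_N, C, D):      # Y = !(!A_N & !B_N & C & D)
    return 1 - ((1 - A_N) & (1 - B_N) & C & D)

def o2bb2ai(A1_N, A2_N, B1, B2):  # Y = !(!(A1_N & A2_N) & (B1 | B2))
    return 1 - ((1 - (A1_N & A2_N)) & (B1 | B2))

def as_plain_or(*ins):            # the bug described above: compound cell treated as OR
    return int(any(ins))
```

Over all eight input combinations, `a21oi` and `as_plain_or` disagree on six, so every fault path through such a cell is simulated against the wrong function.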

The fix was exhaustive: test every gate model individually. Every input combination. Python reference against C implementation:

Sky130 cell types:     86
Python truth tables:   826/826 pass    (10 were broken → fixed)
C extension models:     86/86  match   (52 had wrong pin mapping → rewritten)
Full circuit match:    100%   C vs Python on complete netlist
False positive test:   ZERO   no-fault baseline, all 3 designs
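The per-cell check is small enough to sketch: compare a reference model with a candidate over all 2^n input combinations. The swapped-pin a21oi below is a hypothetical stand-in for the kind of pin-mapping bug described above.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

def mismatches(ref: Callable[..., int], impl: Callable[..., int], n_inputs: int) -> List[Tuple[int, ...]]:
    """Every input combination (all 2**n_inputs of them) on which impl disagrees with ref."""
    return [v for v in product((0, 1), repeat=n_inputs) if ref(*v) != impl(*v)]

def a21oi_ref(A1, A2, B1):        # Sky130 a21oi: Y = !((A1 & A2) | B1)
    return 1 - ((A1 & A2) | B1)

def a21oi_swapped(A1, A2, B1):    # hypothetical bug: reads B1 where it should read A2
    return 1 - ((A1 & B1) | A2)

def check_library(pairs: Dict[str, tuple]) -> Dict[str, List[Tuple[int, ...]]]:
    """pairs: {cell: (reference, candidate, n_inputs)} -> {cell: mismatching inputs}."""
    report = {}
    for cell, (ref, cand, n) in pairs.items():
        bad = mismatches(ref, cand, n)
        if bad:
            report[cell] = bad
    return report
```

For this swap the two models disagree on only 2 of 8 input combinations, which is one way a wrong model can still produce plausible-looking coverage at system level.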

Then full-circuit validation on the complete netlist: C simulator versus Python, net by net. This caught three more bugs at system level: constant nets (1'b0, 1'b1) not propagated correctly, flip-flop outputs not treated as pseudo-primary inputs during scan mode, and virtual inverter nets inflating the fault count.
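The system-level pieces can be sketched together: levelize the combinational logic with Kahn's algorithm over a fanout adjacency list (linear in gates plus pins), seed constant nets explicitly, and treat flip-flop Q outputs as pseudo-primary inputs. The netlist format, with flip-flops already cut out, is a hypothetical simplification.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical: combinational gates only (flip-flops cut out); out net -> (fn, inputs).
Netlist = Dict[str, Tuple[str, List[str]]]
FUNCS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "NOT": lambda a: 1 - a}

def levelize(nl: Netlist) -> List[str]:
    """Kahn's algorithm: O(gates + pins), no repeated scans of the gate list."""
    indeg = {n: sum(i in nl for i in ins) for n, (_, ins) in nl.items()}
    fanout: Dict[str, List[str]] = {}
    for n, (_, ins) in nl.items():
        for i in ins:
            if i in nl:
                fanout.setdefault(i, []).append(n)
    ready = deque(n for n, d in indeg.items() if d == 0)
    order: List[str] = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in fanout.get(n, []):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nl):
        raise ValueError("combinational loop: netlist is not acyclic")
    return order

def evaluate(nl: Netlist, order: List[str], pi_vals: Dict[str, int],
             ppi_vals: Dict[str, int]) -> Dict[str, int]:
    vals = {"1'b0": 0, "1'b1": 1}   # constant nets must be seeded, not left undefined
    vals.update(pi_vals)
    vals.update(ppi_vals)           # scan-loaded flip-flop Q outputs act as inputs
    for n in order:
        fn, ins = nl[n]
        vals[n] = FUNCS[fn](*(vals[i] for i in ins))
    return vals
```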

After these fixes, IBEX went from 36.27% to past 95%. The ATPG algorithm didn't change. The gate models became correct.

5. C-Accelerated Fault Simulation

Pure Python fault simulation: 18 minutes for 100 patterns on IBEX (11,000 gates). That's 400K gate-evaluations per second. NVDLA has 273,000 gates. At Python speed, NVDLA ATPG would take days per run — iteration becomes impossible.

We built a C extension — compiled as a shared library, loaded via ctypes. Every gate type has a dedicated C function. No interpretation overhead. No Python object allocation per gate evaluation.

Python fault sim:         ~400K ops/sec
C-accelerated fault sim:  601M ops/sec

Speedup: 1,500×

But the first C extension was wrong. 52 out of 86 gate models had incorrect pin mapping. The C code would produce 40.94% coverage where Python produced 54.05%. We spent a session building the accelerator and another session discovering it was broken — because we tested at system level instead of per-gate.

The lesson was expensive but simple: validate every C gate function against its Python equivalent, input by input, before running any design through it. We did this. 826 tests. Only then did the C extension become trustworthy.

The other critical requirement is fault persistence. After injecting a stuck-at fault on a net, the simulator must hold that value throughout forward propagation. If the engine recomputes from inputs, it overwrites the injected value — and produces ghost detections. We validate with a no-fault baseline: inject no fault, confirm zero detections. Proven across all three designs.
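Fault persistence matters most in engines that may evaluate a net more than once. A minimal sketch, assuming a hypothetical toy netlist and an engine that simply re-evaluates gates until nothing changes: every write goes through one function that clamps the fault site, so no re-evaluation can overwrite the injected value. The same `detects` helper runs the no-fault baseline.

```python
from typing import Dict, List, Optional, Tuple

Netlist = Dict[str, Tuple[str, List[str]]]   # hypothetical: out net -> (fn, inputs), any order
GATES = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "NOT": lambda a: 1 - a}

def settle(nl: Netlist, pattern: Dict[str, int],
           fault: Optional[Tuple[str, int]] = None) -> Dict[str, int]:
    """Re-evaluate gates in any order until no net changes (acyclic netlist assumed)."""
    vals: Dict[str, int] = {}

    def write(net: str, v: int) -> bool:
        if fault is not None and net == fault[0]:
            v = fault[1]                     # persistence: the stuck value always wins
        changed = vals.get(net) != v
        vals[net] = v
        return changed

    for pi, v in pattern.items():
        write(pi, v)
    changed = True
    while changed:
        changed = False
        for net, (fn, ins) in nl.items():
            if all(i in vals for i in ins):
                changed |= write(net, GATES[fn](*(vals[i] for i in ins)))
    return vals

def detects(nl: Netlist, pos: List[str], pattern: Dict[str, int],
            fault: Optional[Tuple[str, int]]) -> bool:
    good, bad = settle(nl, pattern), settle(nl, pattern, fault)
    return any(good[o] != bad[o] for o in pos)
```

With `fault=None`, `detects` must return False for every pattern; any other result means the faulty-machine path is producing detections on its own.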

6. ATPG — Three Phases

Pattern generation runs in three phases. Each targets a different class of faults based on logic cone size.

Phase 1: Random Patterns

500 random patterns applied through scan chains. The key insight: more scan FFs means more controllability. IBEX with 1,969 scan FFs hits ~56% random coverage. PicoSoC with 10,723 FFs reaches 86%. The same algorithm gives dramatically different results, because controllability scales with the number of scan flip-flops.

NVDLA SDP: 581,006 faults, 500 patterns, 31 min → 93.37% coverage (542,459 detected)
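Phase 1 can be sketched as random patterns with fault dropping: each pattern is simulated against the faults still undetected, and a pattern is kept only if it detects something new. This assumes a hypothetical toy netlist and treats scan flip-flop outputs as ordinary entries in `pis`; it is not the tool's implementation.

```python
import random
from typing import Dict, List, Optional, Tuple

Netlist = Dict[str, Tuple[str, List[str]]]   # hypothetical: out net -> (gate, inputs), topological
GATES = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "XOR": lambda a, b: a ^ b}

def sim(nl: Netlist, p: Dict[str, int], fault: Optional[Tuple[str, int]] = None) -> Dict[str, int]:
    vals = dict(p)
    if fault is not None and fault[0] in vals:
        vals[fault[0]] = fault[1]
    for net, (g, ins) in nl.items():
        hit = fault is not None and net == fault[0]
        vals[net] = fault[1] if hit else GATES[g](*(vals[i] for i in ins))
    return vals

def random_phase(nl: Netlist, pis: List[str], pos: List[str], n_patterns: int = 500, seed: int = 1):
    """pis should include scan flip-flop outputs (pseudo-primary inputs)."""
    rng = random.Random(seed)
    remaining = [(n, v) for n in list(pis) + list(nl) for v in (0, 1)]
    total, kept = len(remaining), []
    for _ in range(n_patterns):
        if not remaining:
            break
        p = {pi: rng.getrandbits(1) for pi in pis}
        good = sim(nl, p)
        still = [f for f in remaining if all(sim(nl, p, f)[o] == good[o] for o in pos)]
        if len(still) < len(remaining):
            kept.append(p)                   # pattern detected at least one new fault
        remaining = still
    return kept, 1 - len(remaining) / total   # kept patterns, raw fault coverage
```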

Phase 2: Exhaustive Small Cones

Faults whose logic cone has ≤16 primary inputs: enumerate every combination. This either detects the fault or proves it redundant — mathematically untestable because the circuit structure physically prevents the required value combination. No ambiguity.

Industry standard: proven-redundant faults are excluded from the coverage denominator. Effective coverage = detected / (total − redundant). Commercial DFT tools typically report this figure as test coverage, as distinct from raw fault coverage (detected / total).
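Phase 2 is a direct enumeration over the cone inputs. In this sketch the cone is given as two hypothetical Python functions for the good and faulty observation point; it assumes all cone inputs are independently controllable (true for primary inputs and scan flip-flops) and a single observation point. The effective-coverage formula is included as a helper.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Pattern = Dict[str, int]

def exhaustive_cone(cone_inputs: List[str],
                    good_out: Callable[[Pattern], int],
                    faulty_out: Callable[[Pattern], int],
                    max_inputs: int = 16) -> Tuple[str, Optional[Pattern]]:
    """Enumerate every assignment of a small cone's inputs.

    Returns ("detected", pattern) at the first assignment where the good and
    faulty observation point differ, ("redundant", None) if none exists, or
    ("too_large", None) if the fault should go to the next phase instead.
    """
    if len(cone_inputs) > max_inputs:
        return "too_large", None
    for bits in product((0, 1), repeat=len(cone_inputs)):
        p = dict(zip(cone_inputs, bits))
        if good_out(p) != faulty_out(p):
            return "detected", p
    return "redundant", None

def effective_coverage(detected: int, total: int, redundant: int) -> float:
    """detected / (total - redundant), with proven-redundant faults excluded."""
    return detected / (total - redundant)
```

For example, in y = a | (a & b) the AND term stuck-at-0 comes back `"redundant"` (y equals a either way), while the same term stuck-at-1 is detected by a = 0.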

Phase 3: PODEM, Targeting the Hard Faults

Medium-cone faults (17–64 PIs). Too large for exhaustive enumeration, structured enough for guided backtracing. PODEM picks an objective (set the fault site, propagate to an observation point), backtraces through the circuit to find PI assignments, simulates forward, and backtracks with randomized alternatives if the first attempt fails.
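A compact PODEM sketch under stated simplifications: simple AND/OR/NAND/NOR/NOT gates in a hypothetical toy netlist, three-valued simulation of the good and faulty machines, the first D-frontier gate as the propagation objective, and no X-path check. It backtracks by complementing the last primary-input decision, the classic scheme, rather than with the randomized alternatives described above.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy netlist: output net -> (gate type, input nets), topologically ordered.
Netlist = Dict[str, Tuple[str, List[str]]]
Val = Optional[int]                        # 0, 1, or None for X
INVERTING = {"NAND", "NOR", "NOT"}
NONCONTROLLING = {"AND": 1, "NAND": 1, "OR": 0, "NOR": 0}

def eval3(kind: str, vals: List[Val]) -> Val:
    """Three-valued gate evaluation."""
    if kind == "NOT":
        return None if vals[0] is None else 1 - vals[0]
    if kind in ("AND", "NAND"):
        out = 0 if 0 in vals else (1 if all(v == 1 for v in vals) else None)
    else:                                  # OR / NOR
        out = 1 if 1 in vals else (0 if all(v == 0 for v in vals) else None)
    return 1 - out if (out is not None and kind in INVERTING) else out

def simulate(nl: Netlist, pis: List[str], assign: Dict[str, int], fault: Tuple[str, int]):
    """Good and faulty machines side by side; the fault site is held at its stuck value."""
    fnet, sv = fault
    good: Dict[str, Val] = {p: assign.get(p) for p in pis}
    bad: Dict[str, Val] = {p: (sv if p == fnet else good[p]) for p in pis}
    for net, (kind, ins) in nl.items():
        good[net] = eval3(kind, [good[i] for i in ins])
        bad[net] = sv if net == fnet else eval3(kind, [bad[i] for i in ins])
    return good, bad

def podem(nl: Netlist, pis: List[str], pos: List[str], fault: Tuple[str, int],
          assign: Optional[Dict[str, int]] = None) -> Optional[Dict[str, int]]:
    """Return a PI assignment that detects `fault`, or None once the search is exhausted."""
    assign = {} if assign is None else assign
    good, bad = simulate(nl, pis, assign, fault)
    is_d = lambda n: good[n] is not None and bad[n] is not None and good[n] != bad[n]
    is_x = lambda n: good[n] is None or bad[n] is None
    if any(is_d(o) for o in pos):
        return dict(assign)                # fault effect reached an output
    fnet, sv = fault
    if good[fnet] == sv:
        return None                        # fault site can no longer be activated
    if good[fnet] is None:
        objective = (fnet, 1 - sv)         # objective 1: activate the fault
    else:
        frontier = [n for n, (_, ins) in nl.items() if is_x(n) and any(is_d(i) for i in ins)]
        if not frontier:
            return None                    # no gate left that can propagate D / D-bar
        kind, ins = nl[frontier[0]]
        objective = (next(i for i in ins if is_x(i)), NONCONTROLLING[kind])  # objective 2
    net, val = objective
    while net in nl:                       # backtrace along X inputs to an unassigned PI
        kind, ins = nl[net]
        if kind in INVERTING:
            val = 1 - val
        net = next(i for i in ins if is_x(i))
    for v in (val, 1 - val):               # try the backtraced value, then its complement
        assign[net] = v
        result = podem(nl, pis, pos, fault, assign)
        if result is not None:
            return result
    del assign[net]
    return None
```

On a = NOT-free toy such as out = OR(AND(a, b), NOT(c)), the AND output stuck-at-0 yields a = b = c = 1; on y = AND(a, NOT(a)), y stuck-at-0 exhausts both values of a and returns None, i.e. redundant.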

On NVDLA, PODEM ran on 38,547 remaining targets after random. It found 140 additional detections before diminishing returns set in. The remaining undetected faults are in reconvergent fanout structures or proven structurally redundant.

7. The NVDLA Problem — Where Scale Breaks Everything

IBEX at 11,000 gates is a teaching-scale design. PicoSoC at 25,000 gates is a real SoC. NVDLA at 273,000 gates is where everything breaks.

The first NVDLA run produced 50% coverage. That's not a tuning problem — that's a fundamental problem. Diagnosis: the design was synthesized from NVIDIA's HLS flow, which loses hierarchy information. When Yosys flattened the netlist, 49,401 nets became undriven — signals that should have been connected to the wider NVDLA subsystem but were now dangling.

The fix: promote every undriven net to a primary input. This gives the ATPG engine control over signals that would otherwise be permanently stuck. After promotion, the circuit came alive. Random coverage jumped from 50% to 93.37% in 31 minutes.
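Finding and promoting undriven nets is a set difference between nets that are read and nets that have a driver. A minimal sketch over a hypothetical toy netlist; a real flow would also have to emit the new ports in the netlist itself.

```python
from typing import Dict, Iterable, List, Set, Tuple

Netlist = Dict[str, Tuple[str, List[str]]]   # hypothetical: out net -> (cell, input nets)
CONSTANTS = {"1'b0", "1'b1"}

def undriven_nets(pis: Iterable[str], nl: Netlist) -> Set[str]:
    """Nets that some gate reads but nothing drives (no PI, no gate output, no constant)."""
    driven = set(pis) | set(nl) | CONSTANTS
    read = {i for _, ins in nl.values() for i in ins}
    return read - driven

def promote_to_inputs(pis: Iterable[str], nl: Netlist) -> Tuple[List[str], List[str]]:
    """Return (new primary-input list, promoted nets) so ATPG can control every floating net."""
    pis = set(pis)
    floating = undriven_nets(pis, nl)
    return sorted(pis | floating), sorted(floating)
```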

A second problem: escaped net identifiers. Yosys produces names like \cpu.reg[3] with inconsistent whitespace after the backslash. 191,378 net names needed normalization before fault simulation could correctly match nets between the good circuit and the faulted circuit.
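In Verilog, an escaped identifier starts with a backslash and ends at the next whitespace; neither is part of the name, and IEEE 1364 treats \cpu3 and cpu3 as the same identifier. A normalization sketch under that rule (function names are hypothetical; Verilog keywords are not handled):

```python
import re

# A simple (unescaped) Verilog identifier: letter or underscore, then letters, digits, _ or $.
SIMPLE_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*\Z")

def canonical_net(name: str) -> str:
    """Canonical key for a net name, escaped or not.

    The leading backslash and the terminating whitespace are dropped. An escaped
    name such as cpu.reg[3] is one identifier, not a bit-select of cpu.reg.
    """
    name = name.strip()
    return name[1:] if name.startswith("\\") else name

def verilog_ref(name: str) -> str:
    """Write a canonical name back out, escaping it only when it needs escaping."""
    return name if SIMPLE_ID.match(name) else "\\" + name + " "
```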

These aren't algorithmic problems. They're engineering problems. And they only surface on real-world netlists at real-world scale.

8. Coverage Results — Three Designs, Three Fault Models

All numbers produced by actual tool runs on actual scan-inserted netlists. Timestamps on disk. JSON results archived.

DFT — Complete Coverage Scorecard (26)

Design	Stuck-At ATPG	Transition ATPG (LOS)	Path Delay ATPG
IBEX (10K gates, 1,969 DFFs)	95.71%	88.18%	95.18%
PicoSoC (24K gates, 10,723 DFFs)	97.35%	93.68%	32.65%
NVDLA (266K gates, 29,061 DFFs)	97.55%*	97.55%	26.32%

* Transition ATPG on 533,384 faults. Stuck-at verified via unified pipeline.

Path delay at 25–32% on PicoSoC/NVDLA is expected: this is the hardest fault model. Commercial tools achieve 60–85% with full SAT solvers.

Industry production target: ≥ 95% stuck-at. DFT meets it on all three designs.

9. Tapeout Enablers — STIL, JTAG, MBIST, Compression

Fault coverage alone doesn't close DFT. The patterns need to reach ATE machines. The die needs a debug port. Embedded memories need self-test. And with STIL files reaching 9 MB, scan data needs compression before ATE time becomes affordable.

STIL / WGL Pattern Export

ATPG results converted to IEEE 1450 STIL — the format Teradyne and Advantest ATE machines consume. Complete signal declarations, scan chain definitions, timing waveforms, and pattern blocks.
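For orientation, here is a Python sketch that emits a minimal single-chain STIL 1.0 file: signals, a scan chain, one waveform table, a load_unload procedure, and patterns that overlap the unload of one test with the load of the next. Signal names, timing values, and structure are illustrative assumptions; the output has not been validated against a STIL parser or any specific ATE.

```python
from typing import List, Tuple

def stil_skeleton(chain_len: int, patterns: List[Tuple[str, str]]) -> str:
    """Minimal single-chain STIL 1.0 text.

    patterns: (scan-in data as 0/1, expected scan-out data as H/L/X) per test.
    Signal names clk/se/si/so and all timing values are placeholders.
    """
    head = f"""STIL 1.0;
Signals {{ clk In; se In; si In; so Out; }}
SignalGroups {{ ins = 'se+si'; }}
ScanStructures {{
  ScanChain c0 {{ ScanLength {chain_len}; ScanIn si; ScanOut so; }}
}}
Timing {{
  WaveformTable wft {{
    Period '100ns';
    Waveforms {{
      ins {{ 01 {{ '0ns' D/U; }} }}
      clk {{ 0P {{ '0ns' D; '50ns' D/U; '80ns' D; }} }}
      so {{ LHX {{ '90ns' L/H/X; }} }}
    }}
  }}
}}
PatternBurst burst {{ PatList {{ scan_test; }} }}
PatternExec {{ PatternBurst burst; }}
Procedures {{
  load_unload {{
    W wft;
    V {{ se = 1; si = 0; clk = 0; so = X; }}
    Shift {{ V {{ si = #; so = #; clk = P; }} }}
  }}
}}
"""
    body = ["Pattern scan_test {", "  W wft;"]
    unload = "X" * chain_len                    # nothing to compare on the first load
    for si_bits, so_bits in patterns:
        body.append(f"  Call load_unload {{ si = {si_bits}; so = {unload}; }}")
        body.append("  V { se = 0; clk = P; so = X; }")   # capture cycle
        unload = so_bits                        # compared while the next test shifts in
    body.append(f"  Call load_unload {{ si = {'0' * chain_len}; so = {unload}; }}")
    body.append("}")
    return head + "\n".join(body) + "\n"
```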

Design	STIL Size	WGL Size	Scan Chain
PicoSoC	2.9 MB	40 KB	1 × 10,723 bits
IBEX	1.0 MB	41 KB	4 × ~493 bits
NVDLA SDP	9.0 MB	50 KB	1 × 29,061 bits

Scan Compression — 127× Test Time Reduction

Uncompressed, NVDLA needs 29,061 serial shift cycles per pattern, which is expensive at ATE rates. An LFSR decompressor driving many short internal chains, with an XOR compactor on the outputs, reduces it dramatically:

Design	DFFs	Channels (scan-in → internal chains)	Chain Ratio	Shift Cycles per Pattern (reduction)
IBEX	1,969	8→32	4.0×	1,969 → 62 (31.8×)
PicoSoC	10,723	32→128	4.0×	10,723 → 84 (127.7×)
NVDLA SDP	29,061	32→128	4.0×	29,061 → 228 (127.5×)

NVDLA goes from 29,061 serial cycles to 228 cycles. That's the difference between a $2 test cost and a $0.02 test cost per chip at ATE.
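The cycle-reduction column follows from simple arithmetic, assuming balanced internal chains: shift cycles per pattern = ceil(DFFs / internal chains), and the reduction is measured against a single serial chain. This sketch reproduces the table's figures; it does not model the LFSR decompressor or compactor themselves.

```python
from math import ceil

def compressed_shift_cycles(scan_ffs: int, internal_chains: int) -> int:
    """Shift cycles per pattern once the flops are split across balanced internal chains."""
    return ceil(scan_ffs / internal_chains)

def reduction(scan_ffs: int, internal_chains: int) -> float:
    """Cycle reduction versus shifting every flop through one serial chain."""
    return scan_ffs / compressed_shift_cycles(scan_ffs, internal_chains)

for name, ffs, chains in [("IBEX", 1969, 32), ("PicoSoC", 10723, 128), ("NVDLA SDP", 29061, 128)]:
    print(f"{name}: {ffs} -> {compressed_shift_cycles(ffs, chains)} cycles ({reduction(ffs, chains):.1f}x)")
# IBEX: 1969 -> 62 cycles (31.8x)
# PicoSoC: 10723 -> 84 cycles (127.7x)
# NVDLA SDP: 29061 -> 228 cycles (127.5x)
```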

JTAG TAP Controller (IEEE 1149.1)

The front door to every DFT feature on die. TAP state machine, instruction register, bypass, IDCODE, boundary scan register, and scan chain access — all generated as synthesizable Verilog.

JTAG TAP — Simulation (Icarus Verilog)
TEST 1: IDCODE Read                    PASS  (0x149A0001)
TEST 2: BYPASS Register 1-bit delay    PASS
TEST 3: TAP Reset via TMS              PASS
TEST 4: IDCODE after reset             PASS

4/4 PASS — IEEE 1149.1 compliant
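The TAP controller is a 16-state machine clocked by TCK and steered only by TMS. A Python sketch of the IEEE 1149.1 transition table (state names spelled out for readability) shows why the reset test works: five TMS=1 clocks reach Test-Logic-Reset from any state.

```python
from typing import Iterable

# IEEE 1149.1 TAP controller: state -> (next state if TMS=0, next state if TMS=1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def walk(state: str, tms_bits: Iterable[int]) -> str:
    """Apply one TMS value per TCK rising edge and return the final state."""
    for tms in tms_bits:
        state = TAP[state][tms]
    return state
```

IEEE 1149.1 also requires bit 0 of IDCODE to be 1, which 0x149A0001 satisfies.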

MBIST — Memory Built-In Self-Test

March C- algorithm: write and read every cell in both ascending and descending address order, with both 0 and 1 data, for 10 operations per address. Catches stuck-at, transition, and coupling faults in memory arrays.

MBIST March C- — Simulation (Icarus Verilog)
Memory 1:  256×32 single-port     PASS
Memory 2:  256×32 single-port     PASS
Memory 3:  256×32 single-port     PASS
Memory 4:  256×32 single-port     PASS

4/4 memories — ALL PASS
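A bit-oriented March C- sketch in Python. The memory model, its single stuck-at cell, and the function names are hypothetical; the 256×32 memories above are word-oriented, where a comparable test writes all-0 and all-1 words instead of single bits.

```python
from typing import List, Optional, Tuple

# March C-: {up/down(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); up/down(r0)} = 10n ops
MARCH_C_MINUS = [
    ("up",   [("w", 0)]),
    ("up",   [("r", 0), ("w", 1)]),
    ("up",   [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up",   [("r", 0)]),
]

class Memory:
    """Bit-oriented memory with an optional stuck-at cell (hypothetical fault model)."""
    def __init__(self, size: int, stuck: Optional[Tuple[int, int]] = None):
        self.cells = [0] * size
        self.stuck = stuck                       # (address, stuck value)

    def write(self, addr: int, v: int) -> None:
        self.cells[addr] = self.stuck[1] if self.stuck and self.stuck[0] == addr else v

    def read(self, addr: int) -> int:
        return self.cells[addr]

def run_march(mem: Memory, size: int) -> List[Tuple[int, int, int]]:
    """Apply March C-; return (address, expected, read) for every failing read."""
    fails = []
    for order, ops in MARCH_C_MINUS:
        addrs = range(size) if order == "up" else range(size - 1, -1, -1)
        for a in addrs:
            for op, v in ops:
                if op == "w":
                    mem.write(a, v)
                elif mem.read(a) != v:
                    fails.append((a, v, mem.read(a)))
    return fails
```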

LBIST — Logic Built-In Self-Test

On-chip PRPG (pseudo-random pattern generator) + MISR (multiple-input signature register). The chip tests itself, no ATE needed. Widely used to meet ISO 26262 in-field test requirements in automotive parts. Deterministic signatures verified across all three designs.

LBIST Signatures — Deterministic, Verified
IBEX:    0xa3cf6393  (32 chains,  1000 patterns)   ✅ MATCH
PicoSoC: 0x8faa72c7  (128 chains, 1000 patterns)   ✅ MATCH
NVDLA:   0xd3670e99  (128 chains,  500 patterns)   ✅ MATCH
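LBIST's two halves can be sketched with one LFSR: a PRPG that produces pseudo-random scan-in words and a MISR that folds response words into a signature. The 16-bit polynomial (x^16 + x^14 + x^13 + x^11 + 1, Galois mask 0xB400) and the toy "circuit" in the usage are assumptions for illustration; the signatures above are not reproduced here.

```python
from typing import Iterator, List

TAPS = 0xB400   # x^16 + x^14 + x^13 + x^11 + 1: maximal-length 16-bit LFSR, Galois form

def lfsr_step(state: int) -> int:
    """One step of a 16-bit Galois LFSR."""
    return (state >> 1) ^ (TAPS if state & 1 else 0)

def prpg(seed: int, n: int, width: int) -> Iterator[int]:
    """Pseudo-random scan-in words, one width-bit word per shift cycle (seed must be nonzero)."""
    state = seed
    for _ in range(n):
        state = lfsr_step(state)
        yield state & ((1 << width) - 1)

def misr(responses: List[int], seed: int = 0) -> int:
    """Compact response words into a 16-bit signature."""
    sig = seed
    for r in responses:
        sig = lfsr_step(sig) ^ (r & 0xFFFF)
    return sig
```

Because the MISR update is linear and invertible, a single flipped response bit always changes the final signature; multi-bit errors can alias, with probability on the order of 2^-16 for this width.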

10. What We Learned Building This

Every row in this table cost us time. We're sharing them because anyone building DFT infrastructure will hit the same walls.

Lesson	What It Cost
Gate models must be verified per-cell, not at system level	Weeks of plausible-looking wrong coverage
C extension models must match Python — every pin, every inversion	52 out of 86 C models had wrong pin mapping
Fault persistence during forward propagation is non-negotiable	An engine that once reported detections on unfaulted circuits
O(n²) topological sort works on 11K gates. Hangs on 25K.	First PicoSoC run: infinite hang → O(n) adjacency list
More scan FFs = dramatically better random coverage	IBEX ~56% random. PicoSoC ~86%. Same algorithm.
Undriven nets from HLS must be promoted to input ports	NVDLA stuck at 50% until 49,401 nets promoted
Net name normalization — escaped identifiers break silently	191,378 mismatched names from Yosys flattening
Path delay launch point must be explicitly forced	0% coverage until V1/V2 transition was pinned per-path
Never trust coverage without a zero-fault baseline test	The only way to prove your simulator isn't lying to you

11. Current Status — What's Actually Built

This is the honest, disk-backed answer. Every item below ran through the unified DFT pipeline and produced JSON results with timestamps.

✓ Built and Verified

Scan insertion (dfxtp → sdfrtp, balanced chains)
86 Sky130 gate models (826 individual tests)
C-accelerated fault simulation (1,500× speedup)
Stuck-at ATPG (Random + Exhaustive + PODEM)
Transition fault ATPG (STR + STF, LOS method)
Path delay ATPG (SDR + SDF, targeted sensitization)
STIL / WGL export (IEEE 1450, ATE-ready)
Scan compression (LFSR+XOR, 32–128× reduction)
LBIST (PRPG + MISR, deterministic signatures)
JTAG TAP controller (IEEE 1149.1, 4/4 tests pass)
MBIST March C- (4 memories, ALL PASS)
DFT DRC (62 rules, 9 categories)
Unified runner + HTML signoff report

→ Next

Boundary scan automated insertion flow
Multi-clock / CDC scan handling (OCC, lockup latches)
PODEM-based path delay (close the 25–32% gap)
Low-power DFT (power-aware ATPG)
Cell-aware fault models
Parallel fault sim (multi-core)
SAT-based ATPG for large cones

12. Why This Matters

Today, a test chip designed on Sky130 through eFabless or ChipFoundry can be synthesized with Yosys, placed and routed with OpenROAD, verified with Verilator — and then hit a wall at DFT. The scan chains don't exist. The test patterns don't exist. The fault coverage is unknown. The chip goes to fab untested.

We built this because we needed it for our own work. The infrastructure didn't exist, so we created it — 10 tools, one unified pipeline, one signoff report. Validated on real designs from 11K gates to 273K gates on Sky130. Coverage numbers that meet industry thresholds, verified with zero-fault baseline and disk timestamps.

The gap in open-source silicon infrastructure is real. We're working to bridge it.

WioWiz In-House EDA Series

01 Why In-House EDA and Chip Flow Intelligence Are No Longer Optional

02 EQWAVE — Timestamps Lie. Behavior Doesn't.

03 VHE — Million-Gate Simulation at Zero License Cost

04 Parallel Region-Based Routing on OpenROAD

05 DFT — The Crucial Gap in Open-Source Chip Design ←

#semiconductor #EDA #DFT #ATPG #scan-insertion #JTAG #MBIST #LBIST #scan-compression #Sky130 #tapeout #open-silicon