DFT: The Crucial Gap in Open-Source Chip Design
1. The Gap That Blocks Tapeout
As we neared tapeout, a hard reality set in. Our RTL was verified. Synthesis ran clean through Yosys. Place-and-route was handled by OpenROAD.
But none of that gets you a testable chip.
Without scan insertion, every internal flip-flop is buried behind sequential logic — unreachable from chip I/O pins. Without fault simulation and ATPG, there are no test patterns. Without fault coverage numbers, no production flow can sign off. Without STIL export, no ATE machine can load your vectors. Without JTAG, there is no debug access to the die.
The open-source RTL-to-GDSII flow has a complete hole where manufacturing testability should be. Yosys, OpenROAD, Verilator, SymbiYosys — none of them address DFT. Not partially. Not at all. This isn't a gap in quality. It's a gap in existence.
eFabless, ChipFoundry, and Open MPW shuttles accept designs without DFT. The chips come back untested. For a test chip, maybe acceptable. For anything headed toward production, it's a non-starter.
We needed DFT for our tapeout. It didn't exist. So we built the pipeline in-house — scan insertion, fault simulation, ATPG, pattern export, JTAG, MBIST, and everything beyond — learning every step the hard way.
Most engineers think DFT = JTAG. Reality: JTAG is ~5% of the work. Scan + ATPG is 80%.
2. The Complete DFT Pipeline
DFT is not a single tool. It's a pipeline. Each step feeds the next. Skip any step and the chain breaks. The DFT pipeline we built covers the full flow from RTL to ATE-ready test vectors, with JTAG and BIST as parallel infrastructure layers.
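[Figure: DFT pipeline diagram. Surviving labels: scan insertion (dfxtp → sdfrtp), fault simulation (86 Sky130 gate models), ATPG (+ PODEM), coverage (target: ≥ 95%), pattern export (Teradyne · Advantest), JTAG (boundary scan · scan access), LBIST (PRPG + MISR on-chip).]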
Scan Insertion converts every flip-flop into a scannable element. A mux chains them into a shift register during test mode. Without this, internal state is unreachable from chip pins.
Fault Simulation models manufacturing defects. Every net can be stuck-at-0 or stuck-at-1. A design with 273,000 gates produces over 580,000 possible fault sites. Each must be simulated individually.
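As a minimal sketch of the single stuck-at fault model described above, the fault list is just two entries per net. The function name and the net names are illustrative assumptions, not taken from the pipeline itself.

```python
def stuck_at_faults(nets):
    """Return every single stuck-at fault as a (net, stuck_value) pair."""
    return [(net, value) for net in nets for value in (0, 1)]


# Hypothetical four-net circuit: two faults per net.
faults = stuck_at_faults(["a", "b", "n1", "y"])
print(len(faults), faults[:2])
```

Each entry in this list is then simulated separately against the fault-free circuit, which is why the fault count drives runtime.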
ATPG generates the test vectors that detect those faults. Random patterns catch the easy ones. Exhaustive enumeration handles small logic cones. PODEM targets the structurally hard faults.
STIL/WGL Export translates patterns into STIL (IEEE 1450) and WGL, the formats that Teradyne and Advantest ATE machines consume. Without this, patterns never reach silicon.
3. Scan Insertion — Every Flip-Flop Converted
Normal mode: logic → D → [FF] → Q → logic
Scan mode: scan_in → [FF₁] → [FF₂] → ... → [FFₙ] → scan_out
(shift register: load ANY value into ANY flip-flop)
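The two modes above can be modelled in a few lines. This is a behavioural sketch only, assuming a single chain and a mux in front of every flip-flop; `clock_edge` and `scan_load` are hypothetical names.

```python
def clock_edge(state, d_inputs, scan_enable, scan_in):
    """Return the flip-flop values after one clock edge.

    Each scan flip-flop has a 2:1 mux on its D pin: in normal mode it
    captures its logic input, in scan mode it captures the previous flop's Q.
    """
    if scan_enable:
        return [scan_in] + state[:-1]
    return list(d_inputs)


def scan_load(state, target):
    """Shift `target` into the chain so that state[i] == target[i]."""
    # The first bit shifted in ends up in the last flop, so feed in reverse.
    for bit in reversed(target):
        state = clock_edge(state, state, scan_enable=True, scan_in=bit)
    return state


chain = scan_load([0, 0, 0, 0], [1, 0, 1, 1])
print(chain)
```

Loading an n-bit chain takes n shift cycles, which is the cost that scan compression later attacks.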
We ran scan insertion on three open-source designs of increasing complexity, all synthesized to Sky130:
| Design | Gates | DFFs | Scan FFs | Left |
|---|---|---|---|---|
| IBEX (RISC-V core) | 10,987 | 1,969 | 1,969 | 0 |
| PicoSoC (RISC-V SoC) | 24,976 | 10,723 | 10,723 | 0 |
| NVDLA SDP (NVIDIA DLA) | 273,240 | 29,061 | 29,061 | 0 |
Every flip-flop converted. Zero left behind. Sky130 has specific scan cell variants (sdfrtp, sedfxtp), and the insertion engine maps each non-scan cell to its correct scan equivalent, including edge cases like edfxtp and dfstp families that commercial PDK documentation doesn't always make obvious.
4. Gate Models — Where Coverage Really Comes From
Sky130 has 86 standard cell types. Not just and2 and or2, but complex compound cells like a221oi (AND-AND-OR-Invert), nand4bb (NAND with two inverted inputs), o2bb2ai, and mux4. Each has specific pin names, inversion behavior, and boolean equations.
This is the part nobody talks about. Everyone focuses on the ATPG algorithm: PODEM, D-algorithm, FAN. But coverage comes from the gate models. If your boolean model for a221oi evaluates pin A1 where it should evaluate pin B1, every fault simulation through that gate produces wrong results. And the dangerous part: wrong results look plausible. You get 36% coverage and think the ATPG needs work. The ATPG is fine. Your gates are lying to you.
We discovered this after coverage plateaued at 36.27% on IBEX. The diagnosis: 3,595 compound gates, 33% of the circuit, were being evaluated with wrong boolean functions. Not slightly wrong. Fundamentally wrong. o21ai, a21oi, a22oi, maj3: all treated as simple OR gates.
The fix was exhaustive: test every gate model individually. Every input combination. Python reference against C implementation:
Sky130 cell types: 86
Python truth tables: 826/826 pass (10 were broken → fixed)
C extension models: 86/86 match (52 had wrong pin mapping → rewritten)
Full circuit match: 100% C vs Python on complete netlist
False positive test: ZERO no-fault baseline, all 3 designs
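A per-cell check of this kind can be sketched as an exhaustive truth-table comparison. The reference below uses the a221oi function Y = !((A1 & A2) | (B1 & B2) | C1); the "wrong" model mimics the simple-OR failure described above. Function names are illustrative assumptions, and the real flow compares Python against the C extension rather than two Python functions.

```python
from itertools import product


def a221oi_reference(A1, A2, B1, B2, C1):
    """a221oi: two 2-input ANDs and one plain input feeding a 3-input NOR."""
    return int(not ((A1 and A2) or (B1 and B2) or C1))


def a221oi_as_plain_or(A1, A2, B1, B2, C1):
    """Deliberately wrong model: the compound cell treated as a simple OR."""
    return int(A1 or A2 or B1 or B2 or C1)


def mismatches(model, reference, n_inputs):
    """Return every input combination on which the two models disagree."""
    return [bits for bits in product((0, 1), repeat=n_inputs)
            if model(*bits) != reference(*bits)]


print(len(mismatches(a221oi_reference, a221oi_reference, 5)))
print(len(mismatches(a221oi_as_plain_or, a221oi_reference, 5)))
```

Because the widest cells have only a handful of inputs, enumerating every combination per cell is cheap, and a mismatch points at one cell instead of a coverage number.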
Then full-circuit validation on the complete netlist. C simulator versus Python, net by net. This caught three more bugs at system level: constant nets (1'b0, 1'b1) not propagated correctly, flip-flop outputs not treated as pseudo-primary-inputs during scan mode, and virtual inverter nets inflating the fault count.
After these fixes, IBEX went from 36.27% to past 95%. The ATPG algorithm didn't change. The gate models became correct.
5. C-Accelerated Fault Simulation
Pure Python fault simulation: 18 minutes for 100 patterns on IBEX (11,000 gates). That's 400K gate-evaluations per second. NVDLA has 273,000 gates. At Python speed, NVDLA ATPG would take days per run — iteration becomes impossible.
We built a C extension — compiled as a shared library, loaded via ctypes. Every gate type has a dedicated C function. No interpretation overhead. No Python object allocation per gate evaluation.
Python fault sim: ~400K ops/sec
C-accelerated fault sim: 601M ops/sec
Speedup: 1,500×
But the first C extension was wrong. 52 out of 86 gate models had incorrect pin mapping. The C code would produce 40.94% coverage where Python produced 54.05%. We spent a session building the accelerator and another session discovering it was broken — because we tested at system level instead of per-gate.
The lesson was expensive but simple: validate every C gate function against its Python equivalent, input by input, before running any design through it. We did this. 826 tests. Only then did the C extension become trustworthy.
The other critical requirement is fault persistence. After injecting a stuck-at fault on a net, the simulator must hold that value throughout forward propagation. If the engine recomputes from inputs, it overwrites the injected value — and produces ghost detections. We validate with a no-fault baseline: inject no fault, confirm zero detections. Proven across all three designs.
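Fault persistence and the no-fault baseline can be illustrated with a toy simulator. This is a sketch under simple assumptions (gates already in topological order, Python callables as gate functions, hypothetical names throughout), not the C engine.

```python
def simulate(gates, inputs, fault=None):
    """Evaluate `gates` in order and return every net value."""
    values = {}

    def write(net, value):
        # Fault persistence: the stuck value wins on every write to that net.
        if fault is not None and net == fault[0]:
            value = fault[1]
        values[net] = value

    for net, value in inputs.items():
        write(net, value)
    for out, fn, ins in gates:  # gates must be in topological order
        write(out, fn(*(values[i] for i in ins)))
    return values


def count_detections(gates, outputs, patterns, fault):
    """Count patterns whose outputs differ between good and faulty circuit."""
    hits = 0
    for pattern in patterns:
        good = simulate(gates, pattern)
        faulty = simulate(gates, pattern, fault)
        if any(good[o] != faulty[o] for o in outputs):
            hits += 1
    return hits


# Toy circuit: y = (a AND b) OR c
GATES = [("n1", lambda a, b: a & b, ("a", "b")),
         ("y", lambda n, c: n | c, ("n1", "c"))]
PATTERNS = [{"a": a, "b": b, "c": c}
            for a in (0, 1) for b in (0, 1) for c in (0, 1)]

print(count_detections(GATES, ["y"], PATTERNS, None))       # no-fault baseline
print(count_detections(GATES, ["y"], PATTERNS, ("n1", 0)))  # n1 stuck-at-0
```

If the baseline call ever returns a non-zero count, the good and faulty simulations disagree for reasons other than the fault, and every coverage number from that engine is suspect.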
6. ATPG — Three Phases
Pattern generation runs in three phases. Each targets a different class of faults based on logic cone size.
Phase 1: Random Patterns
500 random patterns applied through scan chains. The key insight: more scan FFs means more controllability and observability. IBEX with 1,969 scan FFs hits ~56% random coverage. PicoSoC with 10,723 FFs reaches ~86%. The same algorithm gives dramatically different results, because every scan flip-flop is a directly controllable and observable point, and PicoSoC has far more of them per gate (about 0.43 versus 0.18, from the table above).
Phase 2: Exhaustive Small Cones
Faults whose logic cone has ≤16 primary inputs: enumerate every combination. This either detects the fault or proves it redundant — mathematically untestable because the circuit structure physically prevents the required value combination. No ambiguity.
Industry standard: proven-redundant faults are excluded from the coverage denominator. Effective coverage = detected / (total − redundant). Commercial ATPG tools commonly report this figure as test coverage, alongside raw fault coverage (detected / total).
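The exhaustive phase and the coverage formula can be sketched together. The cone below, y = a OR (a AND b), is a textbook example chosen for illustration; the function names and the numbers passed to `effective_coverage` are hypothetical, not results from the designs above.

```python
from itertools import product


def classify_fault(good_fn, faulty_fn, n_inputs):
    """Enumerate every input combination of a small cone.

    Returns ("detected", pattern) for the first pattern that exposes the
    fault, or ("redundant", None) if no input combination can.
    """
    for bits in product((0, 1), repeat=n_inputs):
        if good_fn(*bits) != faulty_fn(*bits):
            return "detected", bits
    return "redundant", None


def effective_coverage(detected, total, redundant):
    """Coverage in percent with proven-redundant faults removed."""
    return 100.0 * detected / (total - redundant)


def good(a, b):
    return a | (a & b)

def b_stuck_at_0(a, b):
    return a | (a & 0)

def y_stuck_at_0(a, b):
    return 0


print(classify_fault(good, b_stuck_at_0, 2))
print(classify_fault(good, y_stuck_at_0, 2))
print(effective_coverage(detected=95, total=110, redundant=10))
```

The stuck-at-0 on `b` is redundant because the OR gate already sees `a` directly, so the AND branch can never change the output. With at most 16 inputs the loop runs at most 65,536 iterations per fault, which is why the cone-size cutoff matters.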
Phase 3: PODEM, Targeting the Hard Faults
Medium-cone faults (17–64 PIs). Too large for exhaustive enumeration, structured enough for guided backtracing. PODEM picks an objective (set the fault site, propagate to an observation point), backtraces through the circuit to find PI assignments, simulates forward, and backtracks with randomized alternatives if the first attempt fails.
On NVDLA, PODEM ran on 38,547 remaining targets after random. It found 140 additional detections before diminishing returns set in. The remaining undetected faults are in reconvergent fanout structures or proven structurally redundant.
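The objective, backtrace, simulate, backtrack loop described above can be sketched in compact form. This is a simplified textbook-style PODEM under stated assumptions: combinational gates limited to AND/OR/NOT/NAND/NOR, listed in topological order, deterministic rather than randomized backtracking, and no X-path check. It is not the production engine, and all names are illustrative.

```python
INVERTING = {"NOT", "NAND", "NOR"}


def evaluate(kind, ins):
    """Three-valued gate evaluation; None stands for unknown (X)."""
    if kind == "NOT":
        return None if ins[0] is None else 1 - ins[0]
    ctrl = 0 if kind in ("AND", "NAND") else 1  # controlling input value
    if ctrl in ins:
        out = ctrl
    elif None in ins:
        return None
    else:
        out = 1 - ctrl
    return 1 - out if kind in INVERTING else out


def simulate(gates, pis, assign, fault=None):
    vals = {}

    def write(net, v):
        vals[net] = fault[1] if fault and net == fault[0] else v

    for pi in pis:
        write(pi, assign.get(pi))
    for out, kind, ins in gates:
        write(out, evaluate(kind, [vals[i] for i in ins]))
    return vals


def podem(gates, pis, pos, fault, assign=None):
    """Return a PI assignment that detects `fault`, or None if none exists."""
    assign = {} if assign is None else assign
    good = simulate(gates, pis, assign)
    bad = simulate(gates, pis, assign, fault)
    known = lambda n: good[n] is not None and bad[n] is not None
    if any(known(o) and good[o] != bad[o] for o in pos):
        return dict(assign)                  # fault effect reached an output
    net, stuck = fault
    if good[net] == stuck:
        return None                          # fault cannot be excited
    if good[net] is None:
        objective = (net, 1 - stuck)         # objective 1: excite the fault
    else:
        objective = None
        for out, kind, ins in gates:         # objective 2: advance D-frontier
            has_d = any(known(i) and good[i] != bad[i] for i in ins)
            if has_d and not known(out):
                x_in = next(i for i in ins if not known(i))
                objective = (x_in, 1 if kind in ("AND", "NAND") else 0)
                break
        if objective is None:
            return None                      # nothing left to propagate through
    # Backtrace: walk from the objective net back to an unassigned PI.
    drivers = {out: (kind, ins) for out, kind, ins in gates}
    n, v = objective
    while n in drivers:
        kind, ins = drivers[n]
        if kind in INVERTING:
            v = 1 - v
        n = next(i for i in ins if not known(i))
    for value in (v, 1 - v):                 # try the guess, then its opposite
        assign[n] = value
        result = podem(gates, pis, pos, fault, assign)
        if result is not None:
            return result
        del assign[n]
    return None


# y = (a AND b) OR (NOT c); target: n1 stuck-at-0
TESTABLE = [("n1", "AND", ("a", "b")), ("n2", "NOT", ("c",)),
            ("y", "OR", ("n1", "n2"))]
# y = a OR (a AND b); target: b stuck-at-0, which is redundant
REDUNDANT = [("n1", "AND", ("a", "b")), ("y", "OR", ("a", "n1"))]

print(podem(TESTABLE, ["a", "b", "c"], ["y"], ("n1", 0)))
print(podem(REDUNDANT, ["a", "b"], ["y"], ("b", 0)))
```

Because decisions are made only on primary inputs and both values are tried before giving up, a `None` result on a small cone is a proof of redundancy; on larger cones a real engine would add a backtrack limit.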
7. The NVDLA Problem — Where Scale Breaks Everything
IBEX at 11,000 gates is a teaching-scale design. PicoSoC at 25,000 gates is a real SoC. NVDLA at 273,000 gates is where everything breaks.
The first NVDLA run produced 50% coverage. That's not a tuning problem — that's a fundamental problem. Diagnosis: the design was synthesized from NVIDIA's HLS flow, which loses hierarchy information. When Yosys flattened the netlist, 49,401 nets became undriven — signals that should have been connected to the wider NVDLA subsystem but were now dangling.
The fix: promote every undriven net to a primary input. This gives the ATPG engine control over signals that would otherwise be permanently stuck. After promotion, the circuit came alive. Random coverage jumped from 50% to 93.37% in 31 minutes.
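Finding the nets to promote is a set difference over the flattened netlist. A minimal sketch, assuming the netlist is available as `(output_net, input_nets)` pairs; the function name and the example net names are hypothetical.

```python
def promote_undriven(gates, primary_inputs):
    """Return (new_primary_inputs, promoted_nets).

    A net is undriven if some gate reads it but no gate output and no
    existing primary input drives it.
    """
    driven = {out for out, _ins in gates} | set(primary_inputs)
    used = {net for _out, ins in gates for net in ins}
    promoted = sorted(used - driven)
    return list(primary_inputs) + promoted, promoted


gates = [("n1", ("a", "ext_valid")), ("y", ("n1", "ext_data"))]
new_pis, promoted = promote_undriven(gates, ["a"])
print(promoted)
```

The promoted nets become controllable by ATPG, which only makes sense for a block-level run where the surrounding subsystem is genuinely absent.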
A second problem: escaped net identifiers. Yosys produces names like \cpu.reg[3] with inconsistent whitespace after the backslash. 191,378 net names needed normalization before fault simulation could correctly match nets between the good circuit and the faulted circuit.
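One way to normalize such names is to strip the leading backslash and any surrounding whitespace before using the name as a dictionary key. This is a sketch of that idea with a hypothetical helper name; the exact rules the pipeline applies are not shown in this article.

```python
def normalize_net(name):
    """Canonical form of a Verilog net name for good/faulty matching."""
    name = name.strip()
    if name.startswith("\\"):
        # Escaped identifier: drop the backslash and any padding around it.
        name = name[1:].strip()
    return name


raw = ["\\cpu.reg[3] ", "\\cpu.reg[3]", "\\ cpu.reg[3]", "cpu.reg[3]"]
print({normalize_net(n) for n in raw})
```

All four spellings collapse to one key, so a lookup in the faulted circuit finds the same net as in the good circuit.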
These aren't algorithmic problems. They're engineering problems. And they only surface on real-world netlists at real-world scale.
8. Coverage Results — Three Designs, Three Fault Models
All numbers produced by actual tool runs on actual scan-inserted netlists. Timestamps on disk. JSON results archived.
Path delay at 25–32% on PicoSoC/NVDLA is expected — this is the hardest fault model. Commercial tools achieve 60–85% with full SAT solvers.
Industry production target: ≥ 95% stuck-at. DFT meets it on all three designs.
9. Tapeout Enablers — STIL, JTAG, MBIST, Compression
Fault coverage alone doesn't close DFT. The patterns need to reach ATE machines. The die needs a debug port. Embedded memories need self-test. And 9 MB STIL files need compression before ATE time becomes affordable.
STIL / WGL Pattern Export
ATPG results converted to IEEE 1450 STIL — the format Teradyne and Advantest ATE machines consume. Complete signal declarations, scan chain definitions, timing waveforms, and pattern blocks.
| Design | STIL Size | WGL Size | Scan Chain |
|---|---|---|---|
| PicoSoC | 2.9 MB | 40 KB | 1 × 10,723 bits |
| IBEX | 1.0 MB | 41 KB | 4 × ~493 bits |
| NVDLA SDP | 9.0 MB | 50 KB | 1 × 29,061 bits |
Scan Compression — 127× Test Time Reduction
Uncompressed, NVDLA needs 29,061 serial shift cycles per pattern. At ATE time costs, that's expensive. LFSR decompressor + XOR compactor reduces it dramatically:
| Design | DFFs | Channels | Ratio | Cycle Reduction |
|---|---|---|---|---|
| IBEX | 1,969 | 8→32 | 4.0× | 1,969 → 62 (31.8×) |
| PicoSoC | 10,723 | 32→128 | 4.0× | 10,723 → 84 (127.7×) |
| NVDLA SDP | 29,061 | 32→128 | 4.0× | 29,061 → 228 (127.5×) |
NVDLA goes from 29,061 serial cycles to 228 cycles. That's the difference between a $2 test cost and a $0.02 test cost per chip at ATE.
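The cycle counts in the table follow from splitting the flip-flops across balanced internal chains: shift cycles per pattern equal the length of the longest chain. A short sketch that reproduces the table's figures, assuming perfectly balanced chains (the function name is illustrative):

```python
from math import ceil


def shift_cycles(scan_ffs, internal_chains):
    """Shift cycles per pattern when FFs are spread over balanced chains."""
    return ceil(scan_ffs / internal_chains)


designs = {"IBEX": (1969, 32), "PicoSoC": (10723, 128), "NVDLA SDP": (29061, 128)}
for name, (ffs, chains) in designs.items():
    cycles = shift_cycles(ffs, chains)
    print(name, cycles, round(ffs / cycles, 1))
```

The reduction factor tracks the number of internal chains, so it is the decompressor's fan-out (8→32 or 32→128) that sets the ceiling on test-time savings.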
JTAG TAP Controller (IEEE 1149.1)
The front door to every DFT feature on die. TAP state machine, instruction register, bypass, IDCODE, boundary scan register, and scan chain access — all generated as synthesizable Verilog.
JTAG TAP — Simulation (Icarus Verilog)
TEST 1: IDCODE Read PASS (0x149A0001)
TEST 2: BYPASS Register 1-bit delay PASS
TEST 3: TAP Reset via TMS PASS
TEST 4: IDCODE after reset PASS
4/4 PASS — IEEE 1149.1 compliant
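The TAP state machine at the heart of these tests is the 16-state controller defined by IEEE 1149.1. The sketch below is a Python model of its transition table, not the generated Verilog; state names are spelled in an illustrative style. It checks the property behind "TAP Reset via TMS": five clocks with TMS high reach Test-Logic-Reset from any state.

```python
# state: (next state when TMS=0, next state when TMS=1)
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR"),
    "SELECT_DR":        ("CAPTURE_DR", "SELECT_IR"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR"),
    "SELECT_IR":        ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR"),
}


def tap_walk(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK rising edge."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


print(all(tap_walk(s, [1] * 5) == "TEST_LOGIC_RESET" for s in TAP_NEXT))
print(tap_walk("TEST_LOGIC_RESET", [0, 1, 0, 0]))
```

A table like this is handy as a reference model when checking the RTL state register in simulation.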
MBIST — Memory Built-In Self-Test
March C- algorithm: six march elements that write and read both 0 and 1 at every address, sweeping the array in ascending and descending order (10 operations per cell). Catches stuck-at, transition, and coupling faults in memory arrays.
MBIST March C- — Simulation (Icarus Verilog)
Memory 1: 256×32 single-port PASS
Memory 2: 256×32 single-port PASS
Memory 3: 256×32 single-port PASS
Memory 4: 256×32 single-port PASS
4/4 memories — ALL PASS
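For reference, March C- consists of six elements: ⇕(w0); ⇑(r0,w1); ⇑(r1,w0); ⇓(r0,w1); ⇓(r1,w0); ⇕(r0). The sketch below runs that sequence against a software memory model. The memory model, the injected stuck-at-0 bit, and all function names are assumptions for illustration; the MBIST controller itself is hardware.

```python
MARCH_C_MINUS = [
    ("any",  [("w", 0)]),
    ("up",   [("r", 0), ("w", 1)]),
    ("up",   [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("any",  [("r", 0)]),
]


def march_test(read, write, depth, ones):
    """Run March C-; `ones` is the all-ones data word. True means PASS."""
    for order, ops in MARCH_C_MINUS:
        addrs = range(depth - 1, -1, -1) if order == "down" else range(depth)
        for addr in addrs:
            for op, bit in ops:
                data = ones if bit else 0
                if op == "w":
                    write(addr, data)
                elif read(addr) != data:
                    return False
    return True


def make_memory(depth, stuck_zero=None):
    """Memory model; `stuck_zero` is an optional (address, bit) defect."""
    cells = [0] * depth

    def read(addr):
        return cells[addr]

    def write(addr, data):
        if stuck_zero is not None and addr == stuck_zero[0]:
            data &= ~(1 << stuck_zero[1])
        cells[addr] = data

    return read, write


print(march_test(*make_memory(256), depth=256, ones=0xFFFFFFFF))
print(march_test(*make_memory(256, stuck_zero=(17, 5)), depth=256, ones=0xFFFFFFFF))
```

The 256-deep, 32-bit-wide shape mirrors the memories in the log above.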
LBIST — Logic Built-In Self-Test
On-chip PRPG (pseudo-random pattern generator) + MISR (multiple-input signature register). The chip tests itself, with no ATE needed. Commonly used to meet ISO 26262 in-field test requirements in automotive parts. Deterministic signatures verified across all three designs.
LBIST Signatures — Deterministic, Verified
IBEX: 0xa3cf6393 (32 chains, 1000 patterns) ✅ MATCH
PicoSoC: 0x8faa72c7 (128 chains, 1000 patterns) ✅ MATCH
NVDLA: 0xd3670e99 (128 chains, 500 patterns) ✅ MATCH
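How a PRPG and MISR yield a deterministic signature can be shown with a small model. Everything here is an assumption for illustration: a 16-bit Galois LFSR with the common 0xB400 tap mask stands in for both registers, and `circuit` is an arbitrary stand-in for the logic under test. The real design uses its own widths and polynomials, which this article does not state.

```python
def lfsr_step(state):
    """One step of a 16-bit Galois LFSR (tap mask 0xB400)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def lbist_signature(circuit, seed, n_patterns):
    """Drive `circuit` with PRPG patterns and compact responses in a MISR."""
    prpg, misr = seed, 0
    for _ in range(n_patterns):
        prpg = lfsr_step(prpg)                 # next pseudo-random pattern
        response = circuit(prpg) & 0xFFFF      # capture the circuit response
        misr = lfsr_step(misr) ^ response      # fold it into the signature
    return misr


def good_circuit(x):
    return (x ^ (x >> 3)) & 0xFFFF


def faulty_circuit(x):
    # Responds wrongly to exactly one pattern: the first one the PRPG emits.
    return good_circuit(x) ^ (1 if x == lfsr_step(0xACE1) else 0)


golden = lbist_signature(good_circuit, 0xACE1, 1000)
print(golden == lbist_signature(good_circuit, 0xACE1, 1000))
print(golden == lbist_signature(faulty_circuit, 0xACE1, 1000))
```

Same seed and same pattern count always give the same signature, which is what makes a stored golden value usable for pass/fail in the field.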
10. What We Learned Building This
Every row in this table cost us time. We're sharing them because anyone building DFT infrastructure will hit the same walls.
| Lesson | What It Cost |
|---|---|
| Gate models must be verified per-cell, not at system level | Weeks of plausible-looking wrong coverage |
| C extension models must match Python — every pin, every inversion | 52 out of 86 C models had wrong pin mapping |
| Fault persistence during forward propagation is non-negotiable | An engine that once reported detections on unfaulted circuits |
| O(n²) topological sort works on 11K gates. Hangs on 25K. | First PicoSoC run: infinite hang → O(n) adjacency list |
| More scan FFs = dramatically better random coverage | IBEX ~56% random. PicoSoC ~86%. Same algorithm. |
| Undriven nets from HLS must be promoted to input ports | NVDLA stuck at 50% until 49,401 nets promoted |
| Net name normalization — escaped identifiers break silently | 191,378 mismatched names from Yosys flattening |
| Path delay launch point must be explicitly forced | 0% coverage until V1/V2 transition was pinned per-path |
| Never trust coverage without a zero-fault baseline test | The only way to prove your simulator isn't lying to you |
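The topological-sort row is worth a concrete sketch. Kahn's algorithm with adjacency lists orders gates in time linear in gates plus connections. The netlist representation and names below are assumptions, and this is not necessarily the exact replacement used in the pipeline.

```python
from collections import defaultdict, deque


def topo_order(gates):
    """Order gate outputs so every gate comes after the gates driving it.

    `gates` is a list of (output_net, input_nets); nets with no driving
    gate (primary inputs, scan flop outputs) need no ordering.
    """
    gate_outputs = {out for out, _ins in gates}
    fanout = defaultdict(list)   # net -> gates that read it
    pending = {}                 # gate output -> unresolved gate inputs
    for out, ins in gates:
        deps = [net for net in ins if net in gate_outputs]
        pending[out] = len(deps)
        for net in deps:
            fanout[net].append(out)

    ready = deque(out for out, count in pending.items() if count == 0)
    order = []
    while ready:
        net = ready.popleft()
        order.append(net)
        for successor in fanout[net]:
            pending[successor] -= 1
            if pending[successor] == 0:
                ready.append(successor)
    if len(order) != len(gates):
        raise ValueError("combinational loop detected")
    return order


print(topo_order([("y", ("n1", "n2")), ("n1", ("a", "b")), ("n2", ("n1", "c"))]))
```

Each gate enters the queue once and each connection is visited once, so the 25K-gate case that hung under the quadratic version stays proportional to netlist size.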
11. Current Status — What's Actually Built
This is the honest, disk-backed answer. Every item below ran through the unified DFT pipeline and produced JSON results with timestamps.
✓ Built and Verified
- Scan insertion (dfxtp → sdfrtp, balanced chains)
- 86 Sky130 gate models (826 individual tests)
- C-accelerated fault simulation (1,500× speedup)
- Stuck-at ATPG (Random + Exhaustive + PODEM)
- Transition fault ATPG (STR + STF, LOS method)
- Path delay ATPG (SDR + SDF, targeted sensitization)
- STIL / WGL export (IEEE 1450, ATE-ready)
- Scan compression (LFSR+XOR, 32–128× reduction)
- LBIST (PRPG + MISR, deterministic signatures)
- JTAG TAP controller (IEEE 1149.1, 4/4 tests pass)
- MBIST March C- (4 memories, ALL PASS)
- DFT DRC (62 rules, 9 categories)
- Unified runner + HTML signoff report
→ Next
- Boundary scan automated insertion flow
- Multi-clock / CDC scan handling (OCC, lockup latches)
- PODEM-based path delay (close the 25–32% gap)
- Low-power DFT (power-aware ATPG)
- Cell-aware fault models
- Parallel fault sim (multi-core)
- SAT-based ATPG for large cones
12. Why This Matters
Today, a test chip designed on Sky130 through eFabless or ChipFoundry can be synthesized with Yosys, placed and routed with OpenROAD, verified with Verilator — and then hit a wall at DFT. The scan chains don't exist. The test patterns don't exist. The fault coverage is unknown. The chip goes to fab untested.
We built this because we needed it for our own work. The infrastructure didn't exist, so we created it — 10 tools, one unified pipeline, one signoff report. Validated on real designs from 11K gates to 273K gates on Sky130. Coverage numbers that meet industry thresholds, verified with zero-fault baseline and disk timestamps.
The gap in open-source silicon infrastructure is real. We're working to bridge it.