VHE — Virtual Hardware Emulator
1. The Challenge
We were verifying a 1.4-million-gate NPU design. The simulation started. Six hours later, we had a 56 GB trace file, 139 billion cycles queued, and a killed process.
This wasn't a bug. This was a fundamental limitation. Open-source simulators are excellent for small-to-medium designs. But when you cross into million-gate territory, the rules change.
Hardware emulators that handle this scale require significant investment — often beyond reach for early-stage teams. We couldn't wait for funding. We couldn't wait for subsidies. We needed to verify our chip.
So we asked a different question: What if we built our own?
2. The Gap We Saw
We're not claiming to replace enterprise tools. We built what we needed — a research-grade simulator for teams in similar situations.
3. What is VHE?
VHE (Virtual Hardware Emulator) is a GPU-accelerated gate-level simulator. It takes synthesized netlists and simulates them on CUDA-capable GPUs, achieving throughput that would be impractical on CPUs alone.
Gate-level simulation is highly parallel within a logic level. Gates at the same level do not depend on each other, so thousands of them can be evaluated simultaneously. GPUs have thousands of cores. The match is natural.
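To make the parallelism concrete, here is a minimal CPU-side sketch in Python (not VHE's CUDA code). It assumes a netlist that has already been split into levels. The gate set, the tuple layout, and the `FULL_ADDER` example are all illustrative assumptions. Each level's outputs are computed from a snapshot of earlier values, so the order of gates inside a level cannot matter. That is the property a GPU exploits by giving each gate its own thread.

```python
import random

# Two-input gate functions on 0/1 values (illustrative subset, not VHE's cell library).
GATE_FUNCS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}


def evaluate_levels(levels, inputs, shuffle=False, seed=0):
    """Evaluate a levelized netlist one level at a time.

    levels: list of levels; each level is a list of
            (output_net, gate_type, input_net_a, input_net_b) tuples.
    inputs: dict mapping primary-input net names to 0 or 1.
    """
    rng = random.Random(seed)
    values = dict(inputs)
    for level in levels:
        gates = list(level)
        if shuffle:
            rng.shuffle(gates)  # any order within a level must give the same result
        # Compute every output of this level from values of earlier levels only,
        # then publish them all at once (the per-level barrier on a GPU).
        results = {out: GATE_FUNCS[kind](values[a], values[b])
                   for out, kind, a, b in gates}
        values.update(results)
    return values


# Hypothetical example: a full adder split into three levels.
FULL_ADDER = [
    [("axb", "XOR", "a", "b"), ("ab", "AND", "a", "b")],
    [("sum", "XOR", "axb", "cin"), ("c2", "AND", "axb", "cin")],
    [("cout", "OR", "ab", "c2")],
]

if __name__ == "__main__":
    out = evaluate_levels(FULL_ADDER, {"a": 1, "b": 1, "cin": 0})
    print(out["sum"], out["cout"])
```

Levels still run one after another. Only the work inside a level is parallel, which is why the number of levels matters as much as the number of gates.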
Key characteristics:
- Two-Phase Architecture: Phase 1 for zero-delay functional simulation, Phase 2 for timing
- Levelization: Gates organized into dependency levels for parallel evaluation
- CUDA Backend: Native GPU kernels, not emulation
- Yosys Integration: Reads JSON netlists from Yosys synthesis
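As a rough illustration of the levelization step, the sketch below reads one module from a Yosys `write_json` netlist and assigns each combinational cell a level with a standard worklist (Kahn-style) traversal. It is not VHE's implementation. The `levelize` helper, the rule that cell types starting with `$_DFF` are flip-flops, and the tiny `EXAMPLE` netlist are assumptions made for the example. Flip-flop outputs are treated as level-0 sources, like primary inputs.

```python
import json
from collections import defaultdict, deque


def is_sequential(cell):
    # Simplifying assumption: Yosys gate-level flip-flops have types like $_DFF_P_.
    return cell["type"].startswith("$_DFF")


def levelize(module):
    """Return {cell_name: level} for the combinational cells of one module."""
    comb = {n: c for n, c in module["cells"].items() if not is_sequential(c)}

    # Map each net bit to the combinational cell that drives it.
    driver = {}
    for name, cell in comb.items():
        for port, bits in cell["connections"].items():
            if cell["port_directions"][port] == "output":
                for bit in bits:
                    driver[bit] = name

    # Build cell-to-cell dependencies through the input ports.
    fanout = defaultdict(list)
    pending = {}
    for name, cell in comb.items():
        preds = set()
        for port, bits in cell["connections"].items():
            if cell["port_directions"][port] == "input":
                preds.update(driver[b] for b in bits if b in driver)
        pending[name] = len(preds)
        for p in preds:
            fanout[p].append(name)

    # Worklist traversal: a cell is final once all of its predecessors are final.
    level = {name: 0 for name in comb}
    queue = deque(n for n, k in pending.items() if k == 0)
    done = 0
    while queue:
        n = queue.popleft()
        done += 1
        for succ in fanout[n]:
            level[succ] = max(level[succ], level[n] + 1)
            pending[succ] -= 1
            if pending[succ] == 0:
                queue.append(succ)
    if done != len(comb):
        raise ValueError("combinational loop detected")
    return level


# Hypothetical three-gate netlist with one flip-flop, in Yosys JSON layout.
EXAMPLE = json.loads("""
{"modules": {"top": {"cells": {
  "g_and": {"type": "$_AND_",
            "port_directions": {"A": "input", "B": "input", "Y": "output"},
            "connections": {"A": [2], "B": [3], "Y": [5]}},
  "g_xor": {"type": "$_XOR_",
            "port_directions": {"A": "input", "B": "input", "Y": "output"},
            "connections": {"A": [5], "B": [4], "Y": [6]}},
  "g_not": {"type": "$_NOT_",
            "port_directions": {"A": "input", "Y": "output"},
            "connections": {"A": [6], "Y": [7]}},
  "ff":    {"type": "$_DFF_P_",
            "port_directions": {"C": "input", "D": "input", "Q": "output"},
            "connections": {"C": [8], "D": [7], "Q": [4]}}
}}}}
""")

if __name__ == "__main__":
    print(levelize(EXAMPLE["modules"]["top"]))
```

Cells that share a level number can then be grouped and evaluated together, one group per level.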
4. Architecture
5. The Journey — Real Numbers
We didn't start with 6.7 million gates. We started with 8,000. Here's the progression:
| Design | Gates | FFs | Levels | VHE Speed |
|---|---|---|---|---|
| PicoRV32 (RISC-V CPU) | 8,000 | 1,300 | ~50 | 11,063 cyc/s |
| mor1kx (OpenRISC SoC) | 1,250,000 | 596,000 | ~200 | 2,941 cyc/s |
| GEMMX (AI/ML NPU) | 1,400,000 | 10,866 | 403 | 1,465 cyc/s |
| WZ-NPU (16-tile NPU) | 6,747,799 | 41,771 | 447 | 3,444 cyc/s |
23 billion gate-cycles per second
6.7M gates × 3,444 cycles/sec ≈ 23B gate-cycles/sec
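The gate-cycles figure is the product of gate count and simulated cycles per second. The short Python sketch below applies that product to every row of the table, taking the published numbers at face value. The `DESIGNS` list and the function name are only illustrative.

```python
# (design, gates, VHE speed in cycles/s), copied from the table above.
DESIGNS = [
    ("PicoRV32", 8_000, 11_063),
    ("mor1kx", 1_250_000, 2_941),
    ("GEMMX", 1_400_000, 1_465),
    ("WZ-NPU", 6_747_799, 3_444),
]


def gate_cycles_per_second(gates, cycles_per_second):
    """Gate evaluations per wall-clock second, assuming every gate is evaluated each cycle."""
    return gates * cycles_per_second


if __name__ == "__main__":
    for name, gates, speed in DESIGNS:
        print(f"{name:10s} {gate_cycles_per_second(gates, speed):.3e} gate-cycles/s")
```

Comparing designs this way shows that cycles per second alone understates the work done on the larger netlists.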
The 6.7-million-gate WZ-NPU was the real test. It took 8 minutes to load and levelize (447 levels, 100 iterations). But once on the GPU, it ran at 3,444 cycles per second, fast enough for meaningful verification.
📦 WZ-NPU: Open Source NPU
The 6.7M gate design verified with VHE is now open source.
16 tiles • 8,192 MACs • Gate-level verified
github.com/wiowiz-tech/wz-npu
6. Challenges & How We Solved Them
Challenge 1: Levelization at Scale
Our levelization algorithm hit a 100-iteration cap on large designs. The WZ-NPU with 447 logic levels was pushing limits.
Solution: We optimized the DAG traversal and increased iteration limits dynamically based on design complexity.
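The source does not describe the algorithm itself, so the following Python sketch is one plausible reading of how a fixed pass cap interacts with logic depth, not VHE's code. It assumes levelization by repeated relaxation passes and a cap derived from the gate count. All names here are hypothetical. In the worst visiting order, each pass settles only one more level, so a 447-level chain cannot converge within 100 passes.

```python
def levelize_by_relaxation(fanin, max_iters):
    """Assign levels by repeated passes over all gates.

    fanin: {gate: [predecessor gates]} describing a DAG.
    Returns (levels, passes_used, converged).
    """
    level = {g: 0 for g in fanin}
    for passes in range(1, max_iters + 1):
        changed = False
        for g, preds in fanin.items():
            # A gate sits one level above its deepest predecessor (level 0 if none).
            want = 1 + max((level[p] for p in preds), default=-1)
            if want > level[g]:
                level[g] = want
                changed = True
        if not changed:
            return level, passes, True
    return level, max_iters, False


def dynamic_cap(fanin):
    # A DAG with N gates has at most N levels, so N + 1 passes always suffice
    # (the extra pass confirms that nothing changed).
    return len(fanin) + 1


def worst_case_chain(depth):
    """A chain g0 -> g1 -> ... visited deepest-first, the worst order for relaxation."""
    return {f"g{i}": ([f"g{i - 1}"] if i else []) for i in reversed(range(depth))}


if __name__ == "__main__":
    chain = worst_case_chain(447)
    print(levelize_by_relaxation(chain, 100)[2])                  # fixed cap
    print(levelize_by_relaxation(chain, dynamic_cap(chain))[1:])  # design-based cap
```

A cap is still useful as a guard, since a netlist with a combinational loop would never converge. Visiting gates in topological order avoids the repeated passes altogether.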
Challenge 2: Memory Management
A 6.7-million-gate design needs about 7 GB of GPU memory, more than many GPUs provide.
Solution: Batched evaluation and careful memory layout. We also document minimum requirements clearly.
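A minimal sketch of what batching a level might look like, in Python. The source gives no details of VHE's memory layout, so the per-gate estimate (about 1 KB, obtained by dividing 7 GB by 6.7 million gates), the assumption that memory scales linearly with gate count, and both helper names are illustrative assumptions.

```python
def estimate_gpu_bytes(gates, bytes_per_gate=7e9 / 6_747_799):
    """Very rough footprint estimate; assumes memory grows linearly with gate count."""
    return gates * bytes_per_gate


def level_batches(level_sizes, max_batch):
    """Yield (level, start, stop) slices so that no batch exceeds max_batch gates.

    Batches never span two levels, because a level must finish
    before the next one starts.
    """
    for level, size in enumerate(level_sizes):
        for start in range(0, size, max_batch):
            yield level, start, min(start + max_batch, size)


if __name__ == "__main__":
    print(f"{estimate_gpu_bytes(1_400_000) / 1e9:.2f} GB (rough estimate)")
    for batch in level_batches([5, 2], max_batch=2):
        print(batch)
```

Smaller batches lower the peak working set at the cost of more kernel launches per level.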
Challenge 3: SystemVerilog Parsing
Many modern designs use SystemVerilog. Standard Yosys struggles with some constructs.
Solution: We use KAALAIDE (our enhanced Yosys fork with Synlig integration) for SystemVerilog parsing before VHE synthesis.
7. Current Status
VHE is research-grade and continuously improving.
8. Part of the WIOWIZ EDA Stack
VHE is one component of our in-house EDA infrastructure.
Each tool was built because we needed it. No grand plan — just practical necessity driving development.