GPUPerft is the umbrella repo for exact GPU perft(), move-generation experiments, and frontier export. Its job is not only to count nodes quickly. Its job is to count them exactly, preserve enough structure to reuse the result later, and expose where the hardware or execution model is doing the real work.
Alpha status: Work in progress. These are research builds for controlled lab use, not finished releases. Command-line surface, file formats, and performance characteristics may change. Run them at your own risk.
The repo currently spans three active lanes:
- CDP1, the historical CUDA line
- CDP2, the current main CUDA line
- OpenCL, the portable lane
It also carries the export work that bridges exact counting to downstream corpus storage.
Release Map
All current downloads here are Windows x64 lab builds.
| Area | Executable | Status | Downloads |
|---|---|---|---|
| CDP1 | GPUPerftCDP1.exe | historical CUDA line; important reference point; not supported on Blackwell / CUDA 13+ | exe / sha256 |
| CDP2 | GPUPerftCDP2.exe | current main CUDA path | exe / sha256 |
| OpenCL | GPUPerftOpenCL.exe | portable OpenCL lane | exe / sha256 |
Headline Results
| Lane | Measured headline |
|---|---|
| CDP1 | The historical 3090-era reference recorded roughly 5.79T wall NPS / 5.98T GPU NPS on perft(12). That is the original 6T-class milestone for this project. |
| CDP2 | The current RTX 5090 pure-perft fork validated perft(11) = 2,097,651,003,696,806 in 226.716s, 9.252T wall NPS / 9.260T GPU NPS. |
| OpenCL | The portable OpenCL lane validated start-position perft(11) in 2377.963s, 882.1B wall NPS, on RTX 5090. |
Why The Repo Exists
perft() is valuable here because it is deterministic. A wrong result does not hide behind evaluation noise or search heuristics. That makes it a useful test bed for:
- move generation
- kernel launch strategy
- transposition-table design
- duplicate collapse
- frontier export
- storage formats for very large exact corpora
That is why GPUPerft grew into more than a benchmark executable. It became the exact-count side of a larger data pipeline.
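Because the count is deterministic, validation reduces to an exact integer comparison against a reference table. The sketch below illustrates that idea using the widely published start-position perft totals for depths 1 through 12. The names `START_POSITION_PERFT` and `validate` are hypothetical and are not taken from the GPUPerft sources; the depth-10 and depth-11 values match the results reported in this README.

```python
# Published start-position perft totals, depths 1-12.
# Hypothetical helper for illustration; not the repo's actual verifier.
START_POSITION_PERFT = {
    1: 20,
    2: 400,
    3: 8_902,
    4: 197_281,
    5: 4_865_609,
    6: 119_060_324,
    7: 3_195_901_860,
    8: 84_998_978_956,
    9: 2_439_530_234_167,
    10: 69_352_859_712_417,
    11: 2_097_651_003_696_806,
    12: 62_854_969_236_701_747,
}


def validate(depth: int, reported: int) -> bool:
    """Return True only if the reported count matches the reference exactly."""
    if depth not in START_POSITION_PERFT:
        raise KeyError(f"no reference value for depth {depth}")
    # No tolerance: a perft total is either right or wrong, never "close".
    return reported == START_POSITION_PERFT[depth]


if __name__ == "__main__":
    print(validate(11, 2_097_651_003_696_806))  # matching count
    print(validate(11, 2_097_651_003_696_807))  # off by one node
```

An off-by-one total fails the same way as a wildly wrong one, which is what makes perft useful as a correctness gate for the items listed above.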
CDP1
GPUPerftCDP1.exe is the historical CUDA line packaged for download. It matters because it proved that exact GPU perft could reach the multi-trillion-nodes-per-second range on older hardware.
The key point is not nostalgia. The point is that CDP1 established the performance bar the later work had to beat. The rebuild notes are explicit about it: the old 3090-era path was already in the 6T class, so any modern replacement that merely matched that number would be underperforming the hardware generation change.
Today, CDP1 is primarily a reference line:
- it explains the original dynamic-parallelism path
- it anchors the early performance history
- it is not the recommended build on current Blackwell-era hardware
Modern build notes also make the limitation clear: the branch rejects compute capability 9.0+ devices, so this is not the deployable path for the latest NVIDIA cards.
Command-Line Shape
.\GPUPerftCDP1.exe 6 --stats --gpus 1
.\GPUPerftCDP1.exe --verify --stats --gpus 1
.\GPUPerftCDP1.exe "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 6 --gpus 1
CDP2
GPUPerftCDP2.exe is the current CUDA line and the default recommendation on modern hardware. It is where the active exact-count work now lives.
The important change in the recent work is not a cosmetic retune. It is that the CDP2 path finally cleared the old CDP1 class and moved into its own performance tier on RTX 5090 hardware.
Current Pure-Perft Breakthrough
Validated default results from the current RTX 5090 CDP2 pure-perft fork:
| Depth | Result | Wall | Wall NPS | GPU NPS |
|---|---|---|---|---|
| 10 | 69,352,859,712,417 | 8.799s | 7.882T | 8.057T |
| 11 | 2,097,651,003,696,806 | 226.716s | 9.252T | 9.260T |
That depth-11 result is the current headline because it is exact, validated, and already past the original 8T target for the 5090-class path.
The older rebuild checkpoint still matters because it shows what changed. Early in the same engineering cycle, CDP2 was validating at about 5.228T wall NPS on d11. The later owner-band and bounded-prefix work is what moved the line into the 9T class.
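The wall NPS figures can be reproduced from the node totals and wall times reported above. This is a minimal sketch that assumes wall NPS is simply the node total divided by wall seconds. The helper name `wall_nps` is illustrative and is not part of the tool.

```python
def wall_nps(nodes: int, wall_seconds: float) -> float:
    """Nodes per second over wall-clock time (assumed definition of wall NPS)."""
    return nodes / wall_seconds


# Node totals and wall times as reported for the RTX 5090 CDP2 runs.
d10 = wall_nps(69_352_859_712_417, 8.799)
d11 = wall_nps(2_097_651_003_696_806, 226.716)

# Earlier rebuild checkpoint on depth 11, as quoted in the text.
EARLY_CHECKPOINT_NPS = 5.228e12

if __name__ == "__main__":
    print(f"depth 10 wall NPS: {d10 / 1e12:.3f}T")
    print(f"depth 11 wall NPS: {d11 / 1e12:.3f}T")
    print(f"depth 11 vs early checkpoint: {d11 / EARLY_CHECKPOINT_NPS:.2f}x")
```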
Command-Line Shape
.\GPUPerftCDP2.exe 10 --engine cdp2 --gpus 1
.\GPUPerftCDP2.exe 11 --engine cdp2 --gpus 1
.\GPUPerftCDP2.exe export --output D:\gp_out --to-depth 6 --gpus 1
Export And Canonical Storage
The export lane is one of the most important parts of the repo because it turns exact counting into reusable data.
The design goal is straightforward:
- generate exact frontier output
- preserve multiplicity so logical node totals are not lost
- hand that corpus to downstream canonical storage instead of forcing later tools to reconstruct the tree
In practical terms, that means the exported records carry enough information for deduplicated corpus materialization: one unique position row plus the count information needed to preserve the full logical workload. The target is not just to save positions. The target is to store the full perft output in canonical form without expanding duplicate branches back into separate rows.
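The idea of one row per unique position plus a multiplicity can be sketched on a toy game. The example below uses tic-tac-toe instead of chess so that it stays self-contained, and it is not the NybbleBoard record format or the exporter's algorithm. The functions `successors` and `frontier` are hypothetical names. Depth is kept below 5 so that no game can have ended and win detection is unnecessary.

```python
from collections import Counter


def successors(board: str) -> list[str]:
    """All boards reachable by placing one mark; X moves first."""
    # An odd number of empty squares means it is X's turn (9 empty at the start).
    mark = "X" if board.count(".") % 2 == 1 else "O"
    return [
        board[:i] + mark + board[i + 1:]
        for i, cell in enumerate(board)
        if cell == "."
    ]


def frontier(depth: int) -> Counter:
    """Map each unique position at `depth` to the number of paths reaching it."""
    layer = Counter({"." * 9: 1})
    for _ in range(depth):
        next_layer = Counter()
        for board, multiplicity in layer.items():
            for child in successors(board):
                # Duplicates collapse into one row; the path count is carried along.
                next_layer[child] += multiplicity
        layer = next_layer
    return layer


if __name__ == "__main__":
    rows = frontier(3)
    print("unique rows:", len(rows))
    print("logical nodes:", sum(rows.values()))
```

The sum of the multiplicities equals the perft total for that depth, so the logical workload survives even though duplicate branches are never expanded into separate rows.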
Current Export Baseline
The current validated NybbleBoard export baseline for the start position at depth 9 is:
| Metric | Result |
|---|---|
| Wall time | 242.802s |
| perftNps | 10.047B |
| storedPosNps | 159.904M |
| Known multiplicity total | 2,439,530,234,167 |
| Validation | passed |
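The baseline figures can be cross-checked with a little arithmetic. The sketch below assumes that perftNps is the multiplicity total divided by wall time and that storedPosNps is stored rows divided by the same wall time. Both definitions are assumptions, not documented behaviour. Under them, the ratio of the two rates gives the average multiplicity per stored row.

```python
# Figures from the depth-9 export baseline table.
WALL_SECONDS = 242.802
MULTIPLICITY_TOTAL = 2_439_530_234_167
STORED_POS_NPS = 159.904e6  # reported storedPosNps

# Assumed definition: logical nodes per wall-clock second.
perft_nps = MULTIPLICITY_TOTAL / WALL_SECONDS

# Implied values under the assumed definitions above; not reported by the tool.
implied_stored_rows = STORED_POS_NPS * WALL_SECONDS
avg_multiplicity = MULTIPLICITY_TOTAL / implied_stored_rows

if __name__ == "__main__":
    print(f"perftNps: {perft_nps / 1e9:.3f}B")
    print(f"implied stored rows: {implied_stored_rows / 1e9:.2f}B")
    print(f"average multiplicity per row: {avg_multiplicity:.1f}")
```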
The raw exporter compression sweep also shows the storage side is already meaningful on its own. In the measured zstd:4 configuration, the current exporter reached about 23.37 bits per stored position.
The deeper point, though, is what happens downstream. The exported frontier is designed to feed canonical CCCF / XCCF-style materialization rather than remain a one-off diagnostic artifact.
That downstream path is already visible in the broader toolchain. FENMaster’s validated canonical fen -> cccf matrix run reduced 67,108,864 input positions to 16,508,408 canonical output records at 5.783 bits per input FEN. That is the storage story this export lane is meant to serve.
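Those FENMaster figures imply a few derived numbers that make the storage claim concrete. This is a rough sketch that assumes the 5.783 bits-per-input-FEN figure covers the entire canonical output. The derived values are arithmetic consequences of that assumption, not measurements from the run.

```python
# Figures quoted from the FENMaster fen -> cccf matrix run.
INPUT_POSITIONS = 67_108_864
OUTPUT_RECORDS = 16_508_408
BITS_PER_INPUT_FEN = 5.783

# Total output size implied by the per-input cost (assumption: it covers all output).
total_bits = INPUT_POSITIONS * BITS_PER_INPUT_FEN
total_mib = total_bits / 8 / (1024 ** 2)

# How many input positions collapse into each canonical record, on average.
collapse_ratio = INPUT_POSITIONS / OUTPUT_RECORDS

# The same total expressed per canonical output record.
bits_per_output_record = total_bits / OUTPUT_RECORDS

if __name__ == "__main__":
    print(f"implied total size: {total_mib:.3f} MiB")
    print(f"inputs per canonical record: {collapse_ratio:.3f}")
    print(f"bits per canonical record: {bits_per_output_record:.2f}")
```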
OpenCL
GPUPerftOpenCL.exe is the portable lane. It matters because portability is not free, and it is useful to have a line that is not tied to CUDA-specific assumptions.
Current Validation Snapshot
RTX 5090 validation snapshot:
| Depth | Status | Wall | Wall NPS |
|---|---|---|---|
| 8 | PASS | 0.694s | 122.4B |
| 9 | PASS | 6.834s | 357.0B |
| 10 | PASS | 87.078s | 796.4B |
| 11 | PASS | 2377.963s | 882.1B |
Command-Line Shape
.\GPUPerftOpenCL.exe --probe
.\GPUPerftOpenCL.exe --self-test
.\GPUPerftOpenCL.exe --verify --max-depth 7
.\GPUPerftOpenCL.exe 9 --cdp2-shape --metrics --json D:\gpcl\cdp2_shape\d9.json
Practical Guidance
If the goal is to run the current main CUDA path on modern NVIDIA hardware, start with GPUPerftCDP2.exe.
If the goal is to understand the performance history, the original 6T-class CDP1 milestone still matters.
If the goal is portability, use GPUPerftOpenCL.exe.
If the goal is to turn exact counts into durable corpora, the export lane is the important part: it is what connects GPU perft to canonical storage, later statistics work, and the rest of the data pipeline.