GPUPerft – AI Chess Staging

GPUPerft is the umbrella repo for exact GPU perft(), move-generation experiments, and frontier export. Its job is not only to count nodes quickly. Its job is to count them exactly, preserve enough structure to reuse the result later, and expose where the hardware or execution model is doing the real work.

Alpha status: Work in progress. These are research builds for controlled lab use, not finished releases. Command-line surface, file formats, and performance characteristics may change. Run them at your own risk.

The repo currently spans three lanes:

CDP1, the historical CUDA line
CDP2, the current main CUDA line, currently staged as 0.2.3 build 001
OpenCL, the portable lane

It also carries the export work that bridges exact counting to downstream corpus storage.

Release Map

All current downloads here are Windows x64 lab builds.

Area	Executable	Status	Downloads
`CDP1`	`GPUPerftCDP1.exe`	historical CUDA line; important reference point; not supported on Blackwell / CUDA 13+	exe / sha256
`CDP2`	`GPUPerftCDP2.exe`	current main CUDA path; `0.2.3 build 001`; standard and stats verification passed `19/19`	exe / sha256
`OpenCL`	`GPUPerftOpenCL.exe`	portable OpenCL lane	exe / sha256
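Each download is listed with a matching sha256 file, so a build can be checked before it is run. The following is a minimal sketch of that check, not a tool shipped with the repo. It assumes the sha256 file is a text file whose first whitespace-separated token is the hex digest (the usual `sha256sum` layout); the actual sidecar format is not documented here, and the function names `sha256_of` and `matches_sidecar` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sidecar(exe_path, sidecar_path):
    """Compare a file's digest with the first token of its sha256 sidecar.

    Assumption: the sidecar is UTF-8 text and starts with the hex digest.
    """
    tokens = Path(sidecar_path).read_text(encoding="utf-8-sig").split()
    if not tokens:
        return False  # empty sidecar: nothing to compare against
    return sha256_of(exe_path) == tokens[0].lower()


# Example call (file names are placeholders for whatever was downloaded):
# matches_sidecar("GPUPerftCDP2.exe", "GPUPerftCDP2.exe.sha256")
```

If the check returns False, the safest reading is that the download is incomplete or the sidecar belongs to a different build.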

Headline Results

Lane	Measured headline
`CDP1`	The historical 3090-era reference recorded roughly `5.79T` wall NPS / `5.98T` GPU NPS on `perft(12)`. That is the original 6T-class milestone for this project.
`CDP2`	Staged `0.2.3 build 001` passes both standard verification and `--verify --stats` on the 19-position corpus. The start-position d11 reference completed in `34.887s`, about `60.127T` nodes/sec, with the expected `2,097,651,003,696,806` result.
`OpenCL`	The portable OpenCL lane validated start-position `perft(11)` in `2377.963s` (`882.1B` wall NPS) on an RTX 5090.

Why The Repo Exists

perft() is valuable here because it is deterministic. A wrong result does not hide behind evaluation noise or search heuristics. That makes it a useful test bed for:

move generation
kernel launch strategy
transposition-table design
duplicate collapse
frontier export
storage formats for very large exact corpora

That is why GPUPerft grew into more than a benchmark executable. It became the exact-count side of a larger data pipeline.
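To make the determinism point concrete, here is a minimal sketch of the perft recursion itself. It is not GPUPerft code and does not implement chess rules: tic-tac-toe stands in as a toy game so the example stays short and runnable. The names `children`, `has_winner`, and `perft` are illustrative only.

```python
# Toy board: a tuple of 9 cells, each "X", "O", or "." (empty).
START = (".",) * 9

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def has_winner(board):
    """True if any row, column, or diagonal holds three equal marks."""
    return any(board[a] != "." and board[a] == board[b] == board[c]
               for a, b, c in LINES)


def children(board):
    """Yield every position reachable in one move; none once the game is won."""
    if has_winner(board):
        return
    side = "X" if board.count(".") % 2 == 1 else "O"  # X moves on odd empties
    for i, cell in enumerate(board):
        if cell == ".":
            yield board[:i] + (side,) + board[i + 1:]


def perft(board, depth):
    """Count the leaf nodes exactly `depth` plies below `board`."""
    if depth == 0:
        return 1
    return sum(perft(child, depth - 1) for child in children(board))


for d in range(1, 4):
    print(d, perft(START, d))
```

Because the count is a pure function of the rules and the depth, any move-generation bug changes the number, which is what makes perft useful as an exactness check.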

CDP1

GPUPerftCDP1.exe is the historical CUDA line packaged for download. It matters because it proved that exact GPU perft could reach the multi-trillion-nodes-per-second range on older hardware.

The key point is not nostalgia. The point is that CDP1 established the performance bar the later work had to beat. The rebuild notes are explicit about it: the old 3090-era path was already in the 6T class, so any modern replacement that merely matched that number would be underperforming relative to the newer hardware generation.

Today, CDP1 is primarily a reference line:

it explains the original dynamic-parallelism path
it anchors the early performance history
it is not the recommended build on current Blackwell-era hardware

Modern build notes also make the limitation clear: the branch rejects compute capability 9.0+ devices, so this is not the deployable path for the latest NVIDIA cards.

Command-Line Shape

.\GPUPerftCDP1.exe 6 --stats --gpus 1
.\GPUPerftCDP1.exe --verify --stats --gpus 1
.\GPUPerftCDP1.exe "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 6 --gpus 1

CDP2

GPUPerftCDP2.exe is the current CUDA line and the default recommendation on modern hardware. It is where the active exact-count work now lives.

The important change in the recent work is not a cosmetic retune. It is that the CDP2 path finally cleared the old CDP1 class and moved into its own performance tier on RTX 5090 hardware.

Current Release Snapshot

The current staged CDP2 build is 0.2.3 build 001. Both standard verification and --verify --stats passed the full 19-position corpus. The start-position rows from the standard verification run are:

Depth	Result	Wall	Throughput	Status
9	`2,439,530,234,167`	`0.619s`	`3.941T` nodes/sec	PASS
10	`69,352,859,712,417`	`3.253s`	`21.323T` nodes/sec	PASS
11	`2,097,651,003,696,806`	`34.887s`	`60.127T` nodes/sec	PASS

That d11 run is the current staging headline because it is exact, repeatable through the built-in verification path, and tied to a named release artifact rather than an ad hoc benchmark binary.
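The throughput column can be rechecked from the other two columns: throughput is the node count divided by wall time. The sketch below recomputes it for the three start-position rows and allows for the fact that wall time is printed to the millisecond. The tolerance model (half a millisecond on wall time, half a unit in the last printed throughput digit) is an assumption about how the figures were rounded, not something the build reports.

```python
# (depth, exact node count, wall seconds, reported T nodes/sec) from the table.
ROWS = [
    (9, 2_439_530_234_167, 0.619, 3.941),
    (10, 69_352_859_712_417, 3.253, 21.323),
    (11, 2_097_651_003_696_806, 34.887, 60.127),
]


def throughput_tnps(nodes, wall_seconds):
    """Nodes per second, expressed in trillions."""
    return nodes / wall_seconds / 1e12


def consistent(nodes, wall, reported, wall_step=0.001, rate_step=0.001):
    """True if `reported` fits the count and wall time within print rounding."""
    low = throughput_tnps(nodes, wall + wall_step / 2)
    high = throughput_tnps(nodes, wall - wall_step / 2)
    return low - rate_step / 2 <= reported <= high + rate_step / 2


for depth, nodes, wall, reported in ROWS:
    rate = throughput_tnps(nodes, wall)
    status = "ok" if consistent(nodes, wall, reported) else "mismatch"
    print(f"d{depth}: recomputed {rate:.3f}T vs reported {reported:.3f}T ({status})")
```

A recomputed value can differ from the printed one in the last digit, as it does at depth 10, because the wall time shown is itself rounded.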

Command-Line Shape

.\GPUPerftCDP2.exe 10 --engine cdp2 --gpus 1
.\GPUPerftCDP2.exe 11 --engine cdp2 --gpus 1
.\GPUPerftCDP2.exe export --output D:\gp_out --to-depth 6 --gpus 1

Export And Canonical Storage

The export lane is one of the most important parts of the repo because it turns exact counting into reusable data.

The design goal is straightforward:

generate exact frontier output
preserve multiplicity so logical node totals are not lost
hand that corpus to downstream canonical storage instead of forcing later tools to reconstruct the tree

In practical terms, that means the exported records carry enough information for deduplicated corpus materialization: one unique position row plus the count information needed to preserve the full logical workload. The target is not just to save positions. The target is to store the full perft output in canonical form without expanding duplicate branches back into separate rows.
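The idea of one row per unique position plus a count can be sketched on a toy move graph. This is an illustration of duplicate collapse with multiplicity, not the exporter's record format or algorithm: positions here are lattice points `(x, y)` with two moves each, standing in for chess positions, and `export_frontier`, `toy_children`, and `naive_perft` are hypothetical names.

```python
from collections import Counter


def toy_children(pos):
    """Toy move generator: step right or step up on a grid."""
    x, y = pos
    return [(x + 1, y), (x, y + 1)]


def export_frontier(start, depth, children):
    """Expand level by level, merging equal positions and summing their counts.

    Returns a Counter mapping each unique frontier position to its
    multiplicity, i.e. the number of distinct move paths that reach it.
    """
    frontier = Counter({start: 1})
    for _ in range(depth):
        next_level = Counter()
        for pos, multiplicity in frontier.items():
            for child in children(pos):
                next_level[child] += multiplicity
        frontier = next_level
    return frontier


def naive_perft(start, depth, children):
    """Reference count that walks every path without collapsing duplicates."""
    if depth == 0:
        return 1
    return sum(naive_perft(c, depth - 1, children) for c in children(start))


frontier = export_frontier((0, 0), 4, toy_children)
print(len(frontier), "unique rows,", sum(frontier.values()), "logical nodes")
```

The sum of the multiplicities equals the naive path count, so the collapsed rows still carry the full logical workload while storing far fewer records.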

Current Export Baseline

The current validated NybbleBoard export baseline for the start position at depth 9 is:

Metric	Result
Wall time	`242.802s`
`perftNps`	`10.047B`
`storedPosNps`	`159.904M`
Known multiplicity total	`2,439,530,234,167`
Validation	passed

The raw exporter compression sweep shows that the storage side already pays off on its own: in the measured zstd:4 configuration, the current exporter reached about 23.37 bits per stored position.

The deeper point, though, is what happens downstream. The exported frontier is designed to feed canonical CCCF / XCCF-style materialization rather than remain a one-off diagnostic artifact.

That downstream path is already visible in the broader toolchain. FENMaster’s validated canonical fen -> cccf matrix run reduced 67,108,864 input positions to 16,508,408 canonical output records at 5.783 bits per input FEN. That is the storage story this export lane is meant to serve.
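The FENMaster figures can be unpacked with a few lines of arithmetic. The sketch below uses only the three numbers quoted above and assumes that "bits per input FEN" means total encoded size divided by the number of input positions; that reading is an assumption, and the derived values are estimates rather than measured results.

```python
INPUT_POSITIONS = 67_108_864       # input FENs in the matrix run
OUTPUT_RECORDS = 16_508_408        # canonical records after reduction
BITS_PER_INPUT_FEN = 5.783         # reported storage cost per input FEN

# Assumed definition: total bits = inputs * bits per input FEN.
total_bits = INPUT_POSITIONS * BITS_PER_INPUT_FEN
total_megabytes = total_bits / 8 / 1e6

# How many input positions each canonical record stands for, on average.
reduction_ratio = INPUT_POSITIONS / OUTPUT_RECORDS

# The same total size, expressed per canonical output record.
bits_per_record = total_bits / OUTPUT_RECORDS

print(f"estimated total size: {total_megabytes:.1f} MB")
print(f"inputs per canonical record: {reduction_ratio:.3f}")
print(f"bits per canonical record: {bits_per_record:.2f}")
```

Under that assumption, the per-record cost comes out in the same range as the exporter's zstd:4 figure, though the two numbers come from different runs and are not directly comparable.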

OpenCL

GPUPerftOpenCL.exe is the portable lane. It matters because portability is not free, and it is useful to have a line that is not tied to CUDA-specific assumptions.

Current Validation Snapshot

RTX 5090 validation snapshot:

Depth	Status	Wall	Wall NPS
8	PASS	`0.694s`	`122.4B`
9	PASS	`6.834s`	`357.0B`
10	PASS	`87.078s`	`796.4B`
11	PASS	`2377.963s`	`882.1B`
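To put a number on the cost of portability, the wall times above can be set against the CDP2 start-position rows at the depths both tables share. This is a rough sketch: it assumes both sets of timings come from comparable RTX 5090 runs, which the source implies but does not state for every CDP2 row, so the ratios are indicative only.

```python
# Start-position wall times in seconds, copied from the two validation tables.
CDP2_WALL = {9: 0.619, 10: 3.253, 11: 34.887}
OPENCL_WALL = {9: 6.834, 10: 87.078, 11: 2377.963}


def wall_ratio(depth):
    """How many times longer the OpenCL lane took than CDP2 at this depth."""
    return OPENCL_WALL[depth] / CDP2_WALL[depth]


for depth in sorted(CDP2_WALL):
    print(f"d{depth}: OpenCL wall time is about {wall_ratio(depth):.1f}x CDP2")
```

On these figures the gap widens with depth, which fits the pattern in the two tables: OpenCL wall NPS levels off below 1T while CDP2 throughput keeps climbing through d11.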

Command-Line Shape

.\GPUPerftOpenCL.exe --probe
.\GPUPerftOpenCL.exe --self-test
.\GPUPerftOpenCL.exe --verify --max-depth 7
.\GPUPerftOpenCL.exe 9 --cdp2-shape --metrics --json D:\gpcl\cdp2_shape\d9.json

Practical Guidance

If the goal is to run the current main CUDA path on modern NVIDIA hardware, start with GPUPerftCDP2.exe.

If the goal is to understand the performance history, the original 6T-class CDP1 milestone still matters.

If the goal is portability, use GPUPerftOpenCL.exe.

If the goal is to turn exact counts into durable corpora, the export lane is the important part: it is what connects GPU perft to canonical storage, later statistics work, and the rest of the data pipeline.