AI Chess Lab AI Chess Software, research, lab notes, and durable public pages.


GPUPerft

GPUPerft is the umbrella repo for exact GPU perft(), move-generation experiments, and frontier export. Its job is not only to count nodes quickly. Its job is to count them exactly, preserve enough structure to reuse the result later, and expose where the hardware or execution model is doing the real work.

Alpha status: Work in progress. These are research builds for controlled lab use, not finished releases. Command-line surface, file formats, and performance characteristics may change. Run them at your own risk.

The repo currently spans three lanes, one historical and two active:

  • CDP1, the historical CUDA line
  • CDP2, the current main CUDA line
  • OpenCL, the portable lane

It also carries the export work that bridges exact counting to downstream corpus storage.

Release Map

All current downloads here are Windows x64 lab builds.

Area | Executable | Status | Downloads
CDP1 | GPUPerftCDP1.exe | historical CUDA line; important reference point; not supported on Blackwell / CUDA 13+ | exe / sha256
CDP2 | GPUPerftCDP2.exe | current main CUDA path | exe / sha256
OpenCL | GPUPerftOpenCL.exe | portable OpenCL lane | exe / sha256

Headline Results

Lane | Measured headline
CDP1 | The historical 3090-era reference recorded roughly 5.79T wall NPS / 5.98T GPU NPS on perft(12). That is the original 6T-class milestone for this project.
CDP2 | The current RTX 5090 pure-perft fork validated perft(11) = 2,097,651,003,696,806 in 226.716s, 9.252T wall NPS / 9.260T GPU NPS.
OpenCL | The portable OpenCL lane validated start-position perft(11) in 2377.963s, 882.1B wall NPS, on RTX 5090.

Why The Repo Exists

perft() is valuable here because it is deterministic. A wrong result does not hide behind evaluation noise or search heuristics. That makes it a useful test bed for:

  • move generation
  • kernel launch strategy
  • transposition-table design
  • duplicate collapse
  • frontier export
  • storage formats for very large exact corpora

That is why GPUPerft grew into more than a benchmark executable. It became the exact-count side of a larger data pipeline.
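Because the result is deterministic, validation reduces to an exact integer comparison against a reference value. The following is a minimal Python sketch of that idea, not GPUPerft's own `--verify` implementation. The table holds the standard published start-position perft counts, and the names `KNOWN_STARTPOS_PERFT` and `check_perft` are hypothetical, introduced only for illustration.

```python
# Standard published perft counts for the chess start position, depths 1-11.
# The names below are illustrative assumptions, not identifiers from GPUPerft.
KNOWN_STARTPOS_PERFT = {
    1: 20,
    2: 400,
    3: 8_902,
    4: 197_281,
    5: 4_865_609,
    6: 119_060_324,
    7: 3_195_901_860,
    8: 84_998_978_956,
    9: 2_439_530_234_167,
    10: 69_352_859_712_417,
    11: 2_097_651_003_696_806,
}


def check_perft(depth: int, reported: int) -> bool:
    """Return True only when the reported count matches the reference exactly."""
    expected = KNOWN_STARTPOS_PERFT.get(depth)
    if expected is None:
        raise ValueError(f"no reference value stored for depth {depth}")
    # No tolerance: an off-by-one node count is a failure.
    return reported == expected
```

The comparison has no tolerance on purpose. A count that is off by a single node still signals a real bug somewhere in move generation or counting.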

CDP1

GPUPerftCDP1.exe is the historical CUDA line packaged for download. It matters because it proved that exact GPU perft could reach the multi-trillion-nodes-per-second range on older hardware.

The key point is not nostalgia. The point is that CDP1 established the performance bar the later work had to beat. The rebuild notes are explicit about it: the old 3090-era path was already in the 6T class, so any modern replacement that merely matched that number would be underperforming the hardware generation change.

Today, CDP1 is primarily a reference line:

  • it explains the original dynamic-parallelism path
  • it anchors the early performance history
  • it is not the recommended build on current Blackwell-era hardware

Modern build notes also make the limitation clear: the branch rejects compute capability 9.0+ devices, so this is not the deployable path for the latest NVIDIA cards.

Command-Line Shape

.\GPUPerftCDP1.exe 6 --stats --gpus 1
.\GPUPerftCDP1.exe --verify --stats --gpus 1
.\GPUPerftCDP1.exe "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 6 --gpus 1

CDP2

GPUPerftCDP2.exe is the current CUDA line and the default recommendation on modern hardware. It is where the active exact-count work now lives.

The important change in the recent work is not a cosmetic retune. It is that the CDP2 path finally cleared the old CDP1 class and moved into its own performance tier on RTX 5090 hardware.

Current Pure-Perft Breakthrough

Validated default results from the current RTX 5090 CDP2 pure-perft fork:

Depth | Result | Wall | Wall NPS | GPU NPS
10 | 69,352,859,712,417 | 8.799s | 7.882T | 8.057T
11 | 2,097,651,003,696,806 | 226.716s | 9.252T | 9.260T

That depth-11 result is the current headline because it is exact, validated, and already past the original 8T target for the 5090-class path.
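The wall NPS figures follow directly from the node count and the wall time. A short Python sketch of that arithmetic, using only the values reported above, is given here as a cross-check. The helper names `wall_nps` and `format_tera` are hypothetical. GPU NPS is not recomputed because the GPU-only time is not listed.

```python
def wall_nps(nodes: int, wall_seconds: float) -> float:
    """Nodes per second over the full wall-clock time of the run."""
    return nodes / wall_seconds


def format_tera(nps: float) -> str:
    """Format a rate in trillions of nodes per second with three decimals."""
    return f"{nps / 1e12:.3f}T"


# Values copied from the CDP2 results reported for the RTX 5090.
d10 = format_tera(wall_nps(69_352_859_712_417, 8.799))
d11 = format_tera(wall_nps(2_097_651_003_696_806, 226.716))
print(d10, d11)
```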

The older rebuild checkpoint still matters because it shows what changed. Early in the same engineering cycle, CDP2 was validating at about 5.228T wall NPS on d11. The later owner-band and bounded-prefix work is what moved the line into the 9T class.

Command-Line Shape

.\GPUPerftCDP2.exe 10 --engine cdp2 --gpus 1
.\GPUPerftCDP2.exe 11 --engine cdp2 --gpus 1
.\GPUPerftCDP2.exe export --output D:\gp_out --to-depth 6 --gpus 1

Export And Canonical Storage

The export lane is one of the most important parts of the repo because it turns exact counting into reusable data.

The design goal is straightforward:

  • generate exact frontier output
  • preserve multiplicity so logical node totals are not lost
  • hand that corpus to downstream canonical storage instead of forcing later tools to reconstruct the tree

In practical terms, that means the exported records carry enough information for deduplicated corpus materialization: one unique position row plus the count information needed to preserve the full logical workload. The target is not just to save positions. The target is to store the full perft output in canonical form without expanding duplicate branches back into separate rows.
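The idea of one row per unique position plus a multiplicity can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the exporter's record layout: the graph `TOY_GRAPH`, the string position keys, and the functions `expand` and `frontier` are all hypothetical stand-ins for real move generation and real position encoding.

```python
from collections import Counter

# Hypothetical toy move graph: each key maps to the positions reachable in one ply.
# "x" is reachable through both "a" and "b", which models a transposition.
TOY_GRAPH = {
    "root": ["a", "b"],
    "a": ["x", "y"],
    "b": ["x", "z"],
}


def expand(rows: Counter, graph: dict) -> Counter:
    """Advance one ply, carrying each parent's multiplicity into its children."""
    next_rows = Counter()
    for position, multiplicity in rows.items():
        for child in graph.get(position, []):
            next_rows[child] += multiplicity
    return next_rows


def frontier(graph: dict, root: str, depth: int) -> Counter:
    """Return the deduplicated frontier at `depth` as {position: multiplicity}."""
    rows = Counter({root: 1})
    for _ in range(depth):
        rows = expand(rows, graph)
    return rows


rows = frontier(TOY_GRAPH, "root", 2)
unique_rows = len(rows)             # rows actually stored
logical_total = sum(rows.values())  # logical node count preserved by multiplicity
```

In this toy graph the frontier stores three rows while the multiplicities still sum to four logical nodes. That property is the one the export format is described as preserving.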

Current Export Baseline

The current validated NybbleBoard export baseline for the start position at depth 9 is:

Metric | Result
Wall time | 242.802s
perftNps | 10.047B
storedPosNps | 159.904M
Known multiplicity total | 2,439,530,234,167
Validation | passed
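These figures can be cross-checked with simple arithmetic. The multiplicity total equals the known start-position perft(9) count, and dividing it by the wall time reproduces the reported perftNps. The Python sketch below also derives a rough stored-position estimate. That estimate is an assumption, not a reported figure: it treats storedPosNps as an average over the full wall time.

```python
# Values copied from the depth-9 export baseline.
WALL_SECONDS = 242.802
MULTIPLICITY_TOTAL = 2_439_530_234_167
STORED_POS_NPS = 159.904e6

# Standard published perft(9) for the start position.
KNOWN_PERFT_9 = 2_439_530_234_167

# Logical nodes per second: matches the reported perftNps when rounded.
perft_nps = MULTIPLICITY_TOTAL / WALL_SECONDS

# ASSUMPTION: storedPosNps is averaged over the whole wall time, so this is
# only an estimate of how many position rows were written.
estimated_stored_positions = STORED_POS_NPS * WALL_SECONDS

# Rough logical nodes represented per stored row, under the same assumption.
estimated_avg_multiplicity = MULTIPLICITY_TOTAL / estimated_stored_positions

print(f"perftNps ~ {perft_nps / 1e9:.3f}B")
```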

The raw exporter compression sweep also shows the storage side is already meaningful on its own. In the measured zstd:4 configuration, the current exporter reached about 23.37 bits per stored position.

The deeper point, though, is what happens downstream. The exported frontier is designed to feed canonical CCCF / XCCF-style materialization rather than remain a one-off diagnostic artifact.

That downstream path is already visible in the broader toolchain. FENMaster’s validated canonical fen -> cccf matrix run reduced 67,108,864 input positions to 16,508,408 canonical output records at 5.783 bits per input FEN. That is the storage story this export lane is meant to serve.
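The FENMaster figures imply a few derived quantities that are easy to compute. The Python sketch below uses only the three reported numbers. The function `summarize` is a hypothetical helper, and the byte total is approximate because the bits-per-input figure is rounded.

```python
# Values copied from the reported FENMaster fen -> cccf run.
INPUT_POSITIONS = 67_108_864
CANONICAL_RECORDS = 16_508_408
BITS_PER_INPUT_FEN = 5.783


def summarize(inputs: int, records: int, bits_per_input: float) -> dict:
    """Derive ratio and size figures from the three reported numbers."""
    total_bits = inputs * bits_per_input
    return {
        # How many input positions collapse into one canonical record on average.
        "inputs_per_record": inputs / records,
        # Approximate output size implied by the rounded bits-per-input figure.
        "total_bytes": total_bits / 8,
        # The same storage cost expressed per canonical record.
        "bits_per_record": total_bits / records,
    }


summary = summarize(INPUT_POSITIONS, CANONICAL_RECORDS, BITS_PER_INPUT_FEN)
for name, value in summary.items():
    print(f"{name}: {value:,.3f}")
```

On these inputs, roughly four input positions collapse into each canonical record, and the whole output is on the order of tens of megabytes.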

OpenCL

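GPUPerftOpenCL.exe is the portable lane. It matters because portability is not free: on the same RTX 5090, the OpenCL depth-11 run reaches 882.1B wall NPS against 9.252T for CDP2, roughly a tenth of the throughput. It is still useful to have a line that is not tied to CUDA-specific assumptions.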

Current Validation Snapshot

RTX 5090 validation snapshot:

Depth | Status | Wall | Wall NPS
8 | PASS | 0.694s | 122.4B
9 | PASS | 6.834s | 357.0B
10 | PASS | 87.078s | 796.4B
11 | PASS | 2377.963s | 882.1B

Command-Line Shape

.\GPUPerftOpenCL.exe --probe
.\GPUPerftOpenCL.exe --self-test
.\GPUPerftOpenCL.exe --verify --max-depth 7
.\GPUPerftOpenCL.exe 9 --cdp2-shape --metrics --json D:\gpcl\cdp2_shape\d9.json

Practical Guidance

If the goal is to run the current main CUDA path on modern NVIDIA hardware, start with GPUPerftCDP2.exe.

If the goal is to understand the performance history, the original 6T-class CDP1 milestone still matters.

If the goal is portability, use GPUPerftOpenCL.exe.

If the goal is to turn exact counts into durable corpora, the export lane is the important part: it is what connects GPU perft to canonical storage, later statistics work, and the rest of the data pipeline.