AI Chess Lab AI Chess Software, research, lab notes, and durable public pages.


FENMaster

FENMaster is the corpus compiler for the AI Chess lab. It is responsible for converting, filtering, validating, sharding, and replaying large chess-position archives without losing the metadata that later analysis depends on.

That makes it more than a converter. It is the part of the stack that turns raw position files and frontier exports into something that can be trusted, queried, and stored economically.

Alpha status: Work in progress. FENMaster is still an alpha corpus toolchain. Formats, switches, and validation surfaces may evolve as the pipeline hardens. Run it at your own risk.

Current Release

The current downloadable build is a Windows x64 release binary:

Executable | Purpose | Downloads
FENMaster.exe | Convert, filter, validate, shard, and benchmark large position corpora. | exe / sha256
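
Because the binary ships with a SHA-256 file, it is worth checking the download before running an alpha build. The following is a minimal Python sketch, not part of FENMaster. It assumes the published file starts with the hex digest as its first whitespace-separated token (the layout sha256sum-style tools write); the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(binary_path, digest_path):
    """Compare the computed digest with the first token of the published file."""
    # Assumption: the digest file begins with the hex digest, optionally
    # followed by a file name.
    published = Path(digest_path).read_text().split()[0].lower()
    return sha256_of(binary_path) == published
```

A call such as matches_published_digest("FENMaster.exe", "FENMaster.exe.sha256") returns True only when the two digests agree; the file names here are placeholders for wherever the downloads were saved.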

Headline Results

Area | Measured headline
Routine throughput | 1,073,741,824 FENs in 16.781 s, about 63.986M FEN/s, with full verify-shards passing.
Canonical CCCF density | fen -> cccf reduced 67,108,864 input positions to 16,508,408 canonical output records at 5.783 bits per input FEN.
Full replay corpus | Canonical CCCF sink replay verified 84,998,978,956 logical records collapsing to 988,187,354 unique at about 24.60 bits per stored position.
Frontier-aware filtering | Frontier multiplicity filtering is validated, including EPD output that preserves requested multiplicity when the target format can store it.

What It Does

The current codebase covers four jobs that matter to the rest of the lab:

  • convert between position formats and archive layouts
  • filter large corpora before they become downstream noise
  • validate shard routing and round-trips so the output can be trusted
  • materialize canonical corpora for replay, export, and later statistics work

Supported Formats

Direction | Formats
Source | fen, epd, pgn, sfen, cccf, frontier
Output | fen, epd, sfen, cccf

The frontier source support matters because it lets GPU-exported perft corpora flow into the same toolchain as ordinary position archives.

Filtering

Filtering is a real feature here, not an afterthought.

The branch history and current code together show three useful classes of filtering:

  • structural filters such as side to move, castling rights, piece placement, empty squares, and piece counts
  • numeric filters over fields such as hm (halfmove clock), fm (fullmove number), depth, and mult (multiplicity)
  • reproducible subset generation for benchmark corpora through sample-fens

Examples from the documented filter language include:

  • stm=b
  • K@e1,(ep!=none|castle=none)
  • count([Pp])>=10
  • depth=8
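
To make the filter examples concrete, here is a small Python sketch of what such predicates test against a FEN record. It is not the FENMaster parser. Reading the comma as AND, the bar as OR, and "none" as the FEN "-" placeholder are assumptions, and all helper names are illustrative.

```python
def parse_fen(fen):
    """Split a FEN string into its six standard fields."""
    placement, stm, castle, ep, hm, fm = fen.split()
    return {"placement": placement, "stm": stm, "castle": castle,
            "ep": ep, "hm": int(hm), "fm": int(fm)}


def piece_at(placement, square):
    """Return the piece letter on a square such as 'e1', or None if it is empty."""
    file_index = ord(square[0]) - ord("a")
    row = placement.split("/")[8 - int(square[1])]  # FEN lists rank 8 first
    column = 0
    for ch in row:
        if ch.isdigit():
            column += int(ch)  # a digit is a run of empty squares
        else:
            if column == file_index:
                return ch
            column += 1
        if column > file_index:
            return None
    return None


def count_pieces(placement, letters):
    """Count pieces whose FEN letter is in `letters`, e.g. 'Pp' for all pawns."""
    return sum(1 for ch in placement if ch in letters)


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
f = parse_fen(start)

matches = {
    "stm=b": f["stm"] == "b",
    # Assumed reading: king on e1 AND (en passant available OR no castling rights)
    "K@e1,(ep!=none|castle=none)": piece_at(f["placement"], "e1") == "K"
    and (f["ep"] != "-" or f["castle"] == "-"),
    "count([Pp])>=10": count_pieces(f["placement"], "Pp") >= 10,
}
print(matches)
```

Under these assumed semantics the starting position passes only the pawn-count filter: it is White to move, and although the king is on e1 there is no en passant square and castling rights are still present.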

The more important detail is what happens after the filter matches. Current filter-output work preserves requested values when the destination format can store them. For example, if a filter depends on depth or mult, EPD output can retain those fields so the saved records still carry the information the filter used.

That behavior is tested explicitly in the repo for frontier multiplicity filtering and EPD output.
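
The reason EPD can retain these fields while plain FEN cannot is that an EPD line carries a list of "opcode operand;" operations after the four position fields. The sketch below illustrates the idea in Python. The hmvc and fmvn opcodes are the standard EPD move-counter opcodes; the opcode names depth and mult are assumptions made for illustration and may not match what FENMaster writes.

```python
def to_epd(fen, extras):
    """Build an EPD line: four position fields followed by 'opcode value;' operations."""
    placement, stm, castle, ep, hm, fm = fen.split()
    # Standard EPD opcodes for the halfmove clock and fullmove number.
    operations = [("hmvc", hm), ("fmvn", fm)]
    # Hypothetical opcodes (e.g. depth, mult) carrying the values a filter used.
    operations += sorted(extras.items())
    tail = " ".join(f"{name} {value};" for name, value in operations)
    return f"{placement} {stm} {castle} {ep} {tail}"


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(to_epd(start, {"depth": 8, "mult": 3}))
```

A FEN line has no slot for such extras, which is why the retention only applies when the destination format can store them.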

Validation And Determinism

The repo treats validation as part of the product, not as cleanup work:

  • verify-shards decodes output and checks shard routing
  • invalid input lines are rejected cleanly instead of poisoning the whole run
  • reject logs are supported
  • deterministic sharding is based on the first byte of a canonical occupancy map
  • staged and parallel runs emit manifests and runtime telemetry

That is why the throughput numbers matter. The fast runs are paired with validation instead of being presented in isolation.
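
The deterministic sharding rule can be illustrated with a short Python sketch. The exact layout of FENMaster's canonical occupancy map is not described here, so the bit order below (one byte per rank in FEN order, file a in the high bit) is an assumption, as are the function names and the modulo step for shard counts below 256.

```python
def occupancy_bytes(placement):
    """Return 8 bytes, one per rank in FEN order; bit 7 is file a, bit 0 is file h."""
    out = bytearray()
    for row in placement.split("/"):
        bits = 0
        column = 0
        for ch in row:
            if ch.isdigit():
                column += int(ch)          # skip a run of empty squares
            else:
                bits |= 1 << (7 - column)  # mark the square as occupied
                column += 1
        out.append(bits)
    return bytes(out)


def shard_for(fen, shards=256):
    """Route a position by the first occupancy byte, which takes values 0..255."""
    first_byte = occupancy_bytes(fen.split()[0])[0]
    return first_byte % shards  # assumed handling when fewer than 256 shards are used


print(shard_for("4k3/8/8/8/8/8/8/4K3 w - - 0 1"))
print(shard_for("4k3/8/8/8/8/8/8/4K3 b - - 37 90"))
```

Because the shard depends only on which squares are occupied, the same position always lands in the same shard regardless of thread scheduling, side to move, or move counters, which is what lets verify-shards recheck routing after the fact.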

CCCF And Canonical Storage

CCCF is one of the most important reasons FENMaster exists.

The canonical-storage problem is not just “compress the file.” It is:

  • collapse duplicates without losing multiplicity
  • preserve exact logical totals
  • make the result deterministic enough to replay and verify
  • store very large position corpora, including full perft-derived corpora, at densities that are worth keeping

The repo history is explicit about this. Canonical CCCF work added multiplicity-aware verification, canonical deduplication, universal sink stages, and replay tooling for large sort-input datasets.
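
The first two requirements, collapsing duplicates while preserving exact logical totals, can be sketched in a few lines of Python. This is an illustration of the invariant, not the CCCF encoder. Treating the four position fields as the canonical key (ignoring move counters) is an assumption, and the function names are illustrative.

```python
from collections import Counter


def canonical_key(fen):
    """Assumed canonical key: placement, side to move, castling, en passant."""
    return " ".join(fen.split()[:4])


def collapse(records):
    """Collapse (fen, multiplicity) records into sorted unique keys with summed counts."""
    totals = Counter()
    for fen, mult in records:
        totals[canonical_key(fen)] += mult
    return sorted(totals.items())  # sorting makes the output order deterministic


def totals_match(records, collapsed):
    """Multiplicity-aware check: logical record count must survive deduplication."""
    return sum(m for _, m in records) == sum(m for _, m in collapsed)


records = [
    ("4k3/8/8/8/8/8/8/4K3 w - - 0 1", 2),
    ("4k3/8/8/8/8/8/8/4K3 w - - 4 3", 5),   # same position, different counters
    ("4k3/8/8/8/8/8/8/3K4 b - - 1 1", 1),
]
unique = collapse(records)
print(len(unique), totals_match(records, unique))
```

Three input records carrying eight logical positions collapse to two unique keys here, and the summed multiplicities still total eight, which is the property a multiplicity-aware verifier has to confirm.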

Two measured checkpoints matter most:

  1. On the validated 512k matrix, canonical fen -> cccf ran at 18.037M FEN/s and stored the result at 5.783 bits per input FEN.
  2. On the larger canonical sink replay workload, the pipeline verified 84,998,978,956 logical records collapsing to 988,187,354 unique at about 24.60 bits per stored position.

That is the storage story that turns GPU frontier export and large corpus work into something sustainable.
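
The density figures are easier to judge after a little arithmetic. The Python snippet below uses only numbers quoted on this page; the variable names and the derived quantities are illustrative, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Full replay corpus figures quoted above.
logical_records = 84_998_978_956
unique_records = 988_187_354
bits_per_stored = 24.60

dedup_factor = logical_records / unique_records
stored_bytes = unique_records * bits_per_stored / 8
bits_per_logical = unique_records * bits_per_stored / logical_records

# Canonical fen -> cccf headline figures quoted above.
input_fens = 67_108_864
output_records = 16_508_408
bits_per_input = 5.783

cccf_bytes = input_fens * bits_per_input / 8
bits_per_output = input_fens * bits_per_input / output_records

print(f"dedup factor: {dedup_factor:.2f}")
print(f"replay corpus size: {stored_bytes / 1e9:.2f} GB")
print(f"bits per logical record: {bits_per_logical:.3f}")
print(f"fen -> cccf size: {cccf_bytes / 1e6:.1f} MB")
print(f"bits per canonical record: {bits_per_output:.2f}")
```

On these figures the replay corpus is roughly an 86-fold collapse that fits in about 3 GB, or under 0.3 bits per logical record once multiplicity is accounted for.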

Routine Dataset And Sampling

The sample-fens command exists for a reason. Large benchmark claims are only useful if the input set can be recreated.

The current routine dataset flow generates reproducible .fen.zst subsets from the depth-8 corpus. That matters because it gives the repo a standard workload for performance checks instead of relying on ad hoc samples that cannot be compared later.
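
The selection rule sample-fens actually applies is not described on this page. As an illustration of what "reproducible" requires, the Python sketch below picks lines by a hash of their content, so the choice never depends on random state or thread order, and it produces a digest that a manifest could record. Everything in it, including the function names and the keep-one-in-N rule, is an assumption.

```python
import hashlib


def keep_line(line, keep_one_in):
    """Deterministic choice that depends only on the line's bytes."""
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % keep_one_in == 0


def sample(lines, keep_one_in, max_lines):
    """Return up to max_lines selected lines plus a digest identifying the subset."""
    chosen = []
    fingerprint = hashlib.sha256()
    for line in lines:
        if len(chosen) >= max_lines:
            break
        if keep_line(line, keep_one_in):
            chosen.append(line)
            fingerprint.update(line.encode("utf-8") + b"\n")
    return chosen, fingerprint.hexdigest()


# Synthetic stand-in lines; a real run would stream FEN lines from the corpus.
lines = [f"position-{i}" for i in range(1000)]
subset, digest = sample(lines, keep_one_in=4, max_lines=100)
print(len(subset), digest[:16])
```

Rerunning the same call on the same input yields the same subset and the same digest, which is the property that lets two benchmark runs months apart claim they measured the same workload.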

Command-Line Shape

Convert

N:\Chess\repos\FENmaster\build\bin\Release\FENMaster.exe convert `
  G:\FENmaster-work\inputs\depth8_dev_16k `
  G:\FENmaster-work\outputs\depth8_dev_16k_staged `
  --staged `
  --max-threads 128 `
  --shards 256 `
  --compression-level 8 `
  --records-per-flush 4096 `
  --verify-output

Verify

N:\Chess\repos\FENmaster\build\bin\Release\FENMaster.exe verify-shards `
  G:\FENmaster-work\outputs\depth8_dev_16k_staged `
  --shards 256

Sample A Reproducible Benchmark Corpus

N:\Chess\repos\FENmaster\build\bin\Release\FENMaster.exe sample-fens `
  R:\FEN\depth_8_parts `
  G:\FENmaster-work\inputs\depth8_routine_8m `
  --max-lines-per-file 8388608 `
  --compression-level 8 `
  --line-batch-size 8192 `
  --manifest G:\FENmaster-work\manifests\depth8_routine_8m.txt

Why It Matters

Without FENMaster, the rest of the lab would keep generating more data than it can reliably organize.

With it, the workflow becomes much cleaner:

  • exact GPU work can export multiplicity-preserving corpora
  • those corpora can be filtered, canonicalized, and replayed
  • benchmark datasets can be reproduced
  • later statistics and distributed analysis can start from verified artifacts instead of one-off files

That is why FENMaster belongs in the software section alongside the executables that generate the data in the first place.