Pretrain (and scale)
Train the masked-language model — on CPU, on one GPU, or across many — with auto-config, gradient accumulation, and multi-node DDP.
Pretraining masks event values and reconstructs them, learning a representation with no labels.
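To make the objective concrete, here is a minimal sketch of value masking using only the standard library. It is an illustration, not pragmatiq's implementation: the `MASK` token, the `mask_values` helper, and the 30% masking rate are all assumptions chosen for the example.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not pragmatiq's actual vocabulary


def mask_values(values, mask_prob, rng):
    """Hide a random subset of values and keep the originals as targets.

    Returns (inputs, targets). targets[i] is None where nothing was masked,
    so a loss would only be computed at the masked positions.
    """
    inputs, targets = [], []
    for v in values:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # the model sees the mask...
            targets.append(v)     # ...and must reconstruct the original value
        else:
            inputs.append(v)
            targets.append(None)
    return inputs, targets


events = ["login", "view", "add_to_cart", "purchase", "logout"]
inputs, targets = mask_values(events, mask_prob=0.3, rng=random.Random(0))
print(inputs)
print(targets)
```

The targets come from the data itself, which is why no labels are needed.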
Train
From the command line:

    pragmatiq pretrain data/tok --name demo --model-size small --config configs/pretrain.yaml

From Python:

    from pragmatiq import api
    s = api.pretrain("data/tok", "demo", model_size="small",
                     config={"max_steps": 4000, "token_budget": 16384})
    print(s["last_metrics"])

The trainer auto-detects CUDA (bf16-mixed on GPU, fp32 on CPU), so the same command works on both.
A one-line heartbeat reports step, loss, mlm_acc, tokens/s, ETA.
Scale to large books, hands-off
Pass config="auto" (--config auto on the command line) and pragmatiq sizes the batch and schedule from the data and the device:
    pragmatiq pretrain data/tok --name big --model-size medium --config auto

The three levers it sets are also usable by hand (see Configuration):
- grad_accum_steps: micro-batches per optimizer step. Effective batch = token_budget × grad_accum_steps × num_nodes × devices, so you reach a large, stable batch on a memory-bound GPU without raising token_budget. Setting it to 1 is byte-identical to no accumulation.
- devices / num_nodes: Fabric DDP across GPUs and hosts; the rank sampler shards the data per global rank with a per-rank masking seed.
- truncation caps (on the tokenizer) keep heavy-tailed real histories tractable.
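The effective-batch arithmetic and the per-rank sharding idea can be sketched in a few lines of plain Python. Only the formula comes from this page. The helper names, the strided sharding scheme, and the `base_seed + global_rank` seed derivation are assumptions made for illustration, and pragmatiq's rank sampler may work differently.

```python
def effective_batch_tokens(token_budget, grad_accum_steps=1, num_nodes=1, devices=1):
    """Tokens contributing to one optimizer step:
    token_budget x grad_accum_steps x num_nodes x devices."""
    return token_budget * grad_accum_steps * num_nodes * devices


def shard_for_rank(indices, global_rank, world_size):
    """Hypothetical strided shard: each global rank takes every world_size-th item."""
    return indices[global_rank::world_size]


def masking_seed_for_rank(base_seed, global_rank):
    """Hypothetical per-rank seed derivation, so ranks draw different masks."""
    return base_seed + global_rank


# One GPU, no accumulation: the effective batch is just token_budget.
single = effective_batch_tokens(16384)

# Same per-device budget, 4 micro-batches per step, 2 nodes with 8 GPUs each.
scaled = effective_batch_tokens(16384, grad_accum_steps=4, num_nodes=2, devices=8)

world_size = 4
shards = [shard_for_rank(list(range(10)), r, world_size) for r in range(world_size)]
seeds = [masking_seed_for_rank(1234, r) for r in range(world_size)]
print(single, scaled)
print(shards)
print(seeds)
```

In this example the per-device token_budget stays at 16384 while the effective batch grows 64-fold, which is the point of reaching a larger batch through accumulation and more devices rather than through more memory per GPU.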
Resume & reproduce
    pragmatiq pretrain data/tok --name demo --resume auto

Checkpoints capture the model, both optimizers, the LR scheduler, the sampler position, all
RNG states, the tokenizer hash, and the resolved config — so an interrupted run resumed
mid-flight reproduces the exact batch and masking stream of an uninterrupted one (tested
bit-exactly). From a fixed seed, CPU runs are byte-identical; for a reproducible GPU run set
deterministic: true (see Reproducibility).
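Why saving RNG state matters can be shown with the standard library alone. The sketch below is a toy, not pragmatiq's checkpoint format: `masking_stream` is a hypothetical stand-in for the random draws a trainer makes, and the saved state plays the role of one field in a checkpoint.

```python
import random


def masking_stream(rng, n):
    """Stand-in for the random draws a trainer makes (masks, shuffles, ...)."""
    return [rng.random() for _ in range(n)]


# Reference: one uninterrupted run of 10 draws from a fixed seed.
uninterrupted = masking_stream(random.Random(1234), 10)

# Interrupted run: stop after 4 draws and persist the RNG state.
rng = random.Random(1234)
first_part = masking_stream(rng, 4)
saved_state = rng.getstate()  # what a checkpoint would need to store

# Resume in a fresh generator: restore the state, then continue.
resumed_rng = random.Random()
resumed_rng.setstate(saved_state)
second_part = masking_stream(resumed_rng, 6)

print(first_part + second_part == uninterrupted)
```

Re-seeding on resume instead of restoring the state would replay the first draws again, so the resumed run would diverge from the uninterrupted one.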
Renting a GPU
scripts/runpod_launch.py is a turnkey path to validate on a rented A100/H100: it provisions a pod, syncs the repo over
SSH, installs, and runs the GPU end-to-end pipeline (including auto-config, the Nemotron
variant, serving, and the full-scale gates).
Next: Embed & evaluate.