Pretrain (and scale)

Train the masked-language model on CPU, on one GPU, or across many GPUs, with auto-config, gradient accumulation, and multi-node DDP.

Pretraining masks event values and reconstructs them, learning a representation with no labels.
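As a rough illustration of the objective, the sketch below masks a fraction of token ids and keeps the originals as reconstruction targets. It is not pragmatiq's implementation: the names mask_tokens, mlm_accuracy and IGNORE_INDEX, the 15% mask rate, and the use of -100 as the "do not score" label are all assumptions made for the example.

```python
import random

# Label for positions that were not masked and are therefore not scored.
# -100 is a common convention; it is an assumption here, not a pragmatiq constant.
IGNORE_INDEX = -100


def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with mask_id.

    Returns (inputs, labels): inputs is the corrupted sequence, labels holds
    the original token at masked positions and IGNORE_INDEX elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_id)        # hide the value from the model
            labels.append(tok)            # the model must reconstruct it
        else:
            inputs.append(tok)            # leave the value visible
            labels.append(IGNORE_INDEX)   # no loss at this position
    return inputs, labels


def mlm_accuracy(predictions, labels):
    """Fraction of masked positions where the prediction equals the original token."""
    scored = [(p, y) for p, y in zip(predictions, labels) if y != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)
```

Only masked positions contribute to the loss and to an accuracy of this kind, which is why no labels are needed: the data supplies its own targets.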

Train

pragmatiq pretrain data/tok --name demo --model-size small --config configs/pretrain.yaml

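from pragmatiq import api

# Pretrain on the tokenized data in data/tok and store the run under the name "demo".
s = api.pretrain("data/tok", "demo", model_size="small",
                 config={"max_steps": 4000, "token_budget": 16384})
# Print the metrics recorded in the returned summary.
print(s["last_metrics"])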

The trainer auto-detects CUDA and picks the precision to match (bf16-mixed on GPU, fp32 on CPU), so the same command works on both. A one-line heartbeat reports step, loss, mlm_acc, tokens/s, and ETA.

Scale to large books, hands-off

Pass config="auto" (--config auto on the command line) and pragmatiq sizes the batch and schedule from the data and the device:

pragmatiq pretrain data/tok --name big --model-size medium --config auto

The three levers it sets are also usable by hand (see Configuration):

grad_accum_steps: micro-batches per optimizer step. Effective batch = token_budget × grad_accum_steps × num_nodes × devices, so you reach a large, stable batch on a memory-bound GPU without raising token_budget. Setting it to 1 is byte-identical to running without accumulation.
devices / num_nodes: Fabric DDP across GPUs and hosts; the rank sampler shards the data per global rank with a per-rank masking seed.
truncation caps (on the tokenizer) keep heavy-tailed real histories tractable.
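The arithmetic behind the first two levers can be sketched as follows. The helper names are hypothetical, the strided sharding and the base_seed + rank rule are assumptions about how a rank sampler might behave, and the example treats token_budget as the per-device token count of one micro-batch. None of this is taken from pragmatiq's source.

```python
def effective_batch_tokens(token_budget, grad_accum_steps=1, devices=1, num_nodes=1):
    """Tokens contributing to one optimizer step across all ranks."""
    return token_budget * grad_accum_steps * devices * num_nodes


def shard_for_rank(sample_indices, global_rank, world_size):
    """Give each global rank a disjoint, strided slice of the sample indices.

    Strided sharding is one common scheme; the real sampler may differ.
    """
    return sample_indices[global_rank::world_size]


def masking_seed_for_rank(base_seed, global_rank):
    """Derive a distinct masking seed per rank (assumed rule: base_seed + rank)."""
    return base_seed + global_rank


# 2 nodes x 4 GPUs, 8 micro-batches of 16384 tokens each per optimizer step.
print(effective_batch_tokens(16384, grad_accum_steps=8, devices=4, num_nodes=2))
# → 1048576

# Rank 1 of 4 sees every fourth sample, starting at index 1.
print(shard_for_rank(list(range(10)), global_rank=1, world_size=4))
# → [1, 5, 9]
```

Raising grad_accum_steps grows the effective batch without adding per-step memory, since each micro-batch is still bounded by token_budget.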

Resume & reproduce

pragmatiq pretrain data/tok --name demo --resume auto

Checkpoints capture the model, both optimizers, the LR scheduler, the sampler position, all RNG states, the tokenizer hash, and the resolved config. An interrupted run that is resumed mid-flight therefore reproduces the exact batch and masking stream of an uninterrupted one (tested bit-exactly). From a fixed seed, CPU runs are byte-identical; for a reproducible GPU run, set deterministic: true (see Reproducibility).
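Why saving the RNG state matters can be shown with a small, self-contained sketch. It uses only Python's standard random module as a stand-in for the masking stream. The functions save_checkpoint and load_checkpoint and the dictionary layout are hypothetical, and a real checkpoint would also need the framework and device generators.

```python
import random


def masking_stream(rng, n):
    """Draw n values from the generator; stands in for n masking decisions."""
    return [rng.random() for _ in range(n)]


def save_checkpoint(step, rng):
    """Capture the step counter and the exact generator state (hypothetical layout)."""
    return {"step": step, "rng_state": rng.getstate()}


def load_checkpoint(ckpt):
    """Rebuild a generator that continues exactly where the saved one stopped."""
    rng = random.Random()
    rng.setstate(ckpt["rng_state"])
    return ckpt["step"], rng


TOTAL_STEPS = 10

# Reference: one uninterrupted run from a fixed seed.
reference = masking_stream(random.Random(1234), TOTAL_STEPS)

# Interrupted run: stop after 4 steps, checkpoint, then resume from the checkpoint.
rng = random.Random(1234)
before = masking_stream(rng, 4)
ckpt = save_checkpoint(4, rng)

step, resumed_rng = load_checkpoint(ckpt)
after = masking_stream(resumed_rng, TOTAL_STEPS - step)

# The stitched stream matches the uninterrupted one exactly.
assert before + after == reference
```

Re-seeding on resume instead of restoring the state would restart the stream from the beginning, so the resumed run would repeat the first masking draws rather than continue from the point of interruption.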

Renting a GPU

scripts/runpod_launch.py is a turnkey path to validate on a rented A100/H100: it provisions a pod, syncs the repo over SSH, installs, and runs the GPU end-to-end pipeline (including auto-config, the Nemotron variant, serving, and the full-scale gates).

Next: Embed & evaluate.
