Model sizes
The nano / small / medium / large presets and how to pick one.
ModelConfig.preset defines four sizes. Depths are (profile / event / history) block
counts; every block uses ffn = 4·d and dropout 0.1. The table is rendered from the code:
| Preset | dim | heads | depth (profile / event / history) | Nominal |
|---|---|---|---|---|
| nano | 64 | 2 | 1 / 2 / 1 | ~1M · CPU/CI |
| small | 192 | 3 | 1 / 5 / 2 | 10M |
| medium | 512 | 8 | 3 / 16 / 6 | 100M |
| large | 1024 | 16 | 9 / 45 / 18 | 1B |
small / medium / large correspond to the paper's 10M / 100M / 1B parameter sizes.
nano is pragmatiq's own addition, included so that the gates
and pragmatiq quickstart run end-to-end on a CPU in minutes.
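As a rough sanity check on the nominal sizes, the sketch below estimates the weights in the transformer blocks alone, using the table values and ffn = 4·d. It is not pragmatiq's counting code: it assumes standard blocks with about 4·d² attention weights and 8·d² feed-forward weights, and it ignores embeddings, biases, layer norms, the MLM head, and anything else outside the three listed encoders. The names `PRESETS` and `estimate_block_params` are made up for this illustration.

```python
# Back-of-the-envelope estimate; NOT pragmatiq's actual parameter count.
# Assumption: each block has 4*d^2 attention weights (Q, K, V, output
# projections) plus 2 * d * ffn feed-forward weights with ffn = 4*d,
# i.e. about 12*d^2 weights per block. Everything else is ignored.

PRESETS = {
    # name: (dim, heads, (profile, event, history) depths), copied from the table
    "nano": (64, 2, (1, 2, 1)),
    "small": (192, 3, (1, 5, 2)),
    "medium": (512, 8, (3, 16, 6)),
    "large": (1024, 16, (9, 45, 18)),
}


def estimate_block_params(dim: int, depths: tuple) -> int:
    """Approximate weight count of all blocks across the three encoders."""
    attention = 4 * dim * dim
    feed_forward = 2 * dim * (4 * dim)
    return (attention + feed_forward) * sum(depths)


for name, (dim, heads, depths) in PRESETS.items():
    est = estimate_block_params(dim, depths)
    print(
        f"{name:>6}: head_dim={dim // heads:>2}  blocks={sum(depths):>2}  "
        f"~{est / 1e6:.1f}M block weights"
    )
```

Under these assumptions the blocks account for roughly 0.2M, 3.5M, 79M, and 906M weights for nano, small, medium, and large. The two larger presets land close to their nominal sizes; for the smaller ones, most of the nominal count would have to come from components this estimate leaves out.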
Picking a size
- nano: CPU smoke tests, CI, notebooks, and the quickstart.
- small: the default; a strong baseline that trains comfortably on a single GPU.
- medium / large: scale up when you have the data and multiple GPUs; pair with config="auto", gradient accumulation, and multi-node DDP (see Configuration and the Pretrain tutorial).

```python
api.pretrain("data/tok", "demo", model_size="medium")
```

Any architecture field (e.g. rope_base, dropout) can be overridden on top of a preset by
passing it in the pretrain config. The test suite checks the model and MLM-head parameter
counts against the nominal sizes, so the presets stay honest.
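
The idea of overriding a field on top of a preset can be illustrated with a small standalone sketch. It does not use pragmatiq's ModelConfig or its pretrain config format; `SketchConfig`, `SKETCH_PRESETS`, `from_preset`, and the `rope_base` default shown here are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a model config; not pragmatiq's ModelConfig."""

    dim: int
    heads: int
    depths: tuple  # (profile, event, history) block counts
    dropout: float = 0.1
    rope_base: float = 10000.0  # assumed value, for illustration only


# Two presets copied from the table above.
SKETCH_PRESETS = {
    "small": SketchConfig(dim=192, heads=3, depths=(1, 5, 2)),
    "medium": SketchConfig(dim=512, heads=8, depths=(3, 16, 6)),
}


def from_preset(name: str, **overrides) -> SketchConfig:
    """Start from a preset and replace only the fields named in overrides."""
    # dataclasses.replace raises TypeError for unknown field names,
    # so a misspelled override fails loudly instead of being ignored.
    return replace(SKETCH_PRESETS[name], **overrides)


cfg = from_preset("medium", dropout=0.0)
print(cfg)
```

Only the named field changes; dim, heads, and depths keep their preset values.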