Model sizes

The nano / small / medium / large presets and how to pick one.

ModelConfig.preset defines four sizes. Depths are block counts, listed in the order profile / event / history; every block uses a feed-forward width of ffn = 4·d (where d is dim) and dropout 0.1. The table is rendered from the code:

Preset   dim    heads   depth (profile / event / history)   Nominal
nano     64     2       1 / 2 / 1                           ~1M · CPU/CI
small    192    3       1 / 5 / 2                           10M
medium   512    8       3 / 16 / 6                          100M
large    1024   16      9 / 45 / 18                         1B

small / medium / large correspond to the paper's 10M / 100M / 1B parameter sizes. nano is pragmatiq's own addition, so that the gates and the pragmatiq quickstart run end-to-end on a CPU in minutes.
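As a rough sanity check on those nominal sizes, the sketch below estimates the block weights of each preset from dim and the depths alone. It assumes a standard transformer block (four d×d attention projections plus a feed-forward layer with ffn = 4·d), which the page does not confirm. PRESETS, approx_block_params, and approx_total_params are hypothetical names for illustration, not part of pragmatiq. The preset values are copied from the table above.

```python
# Hypothetical mirror of the preset table; not pragmatiq's actual ModelConfig.
PRESETS = {
    # name: (dim, heads, (profile, event, history) depths)
    "nano": (64, 2, (1, 2, 1)),
    "small": (192, 3, (1, 5, 2)),
    "medium": (512, 8, (3, 16, 6)),
    "large": (1024, 16, (9, 45, 18)),
}


def approx_block_params(dim: int) -> int:
    """Rough weight count for one block, ASSUMING a standard transformer layout.

    Attention: four dim x dim projections (Q, K, V, output) -> 4 * dim^2.
    Feed-forward: dim -> 4*dim -> dim                        -> 8 * dim^2.
    Biases, norms, embeddings, and the MLM head are ignored.
    """
    ffn = 4 * dim
    return 4 * dim * dim + 2 * dim * ffn


def approx_total_params(name: str) -> int:
    """Sum the per-block estimate over all profile, event, and history blocks."""
    dim, _heads, depths = PRESETS[name]
    return sum(depths) * approx_block_params(dim)


for name in PRESETS:
    print(f"{name:<7} ~{approx_total_params(name) / 1e6:.1f}M block weights")
```

Under these assumptions the block weights alone come to about 0.2M, 3.5M, 79M, and 906M. The estimate leaves out embeddings and heads on purpose, so it is only a lower-bound style check and not the count that the test suite verifies.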

Picking a size

  • nano — CPU smoke tests, CI, notebooks, and the quickstart.
  • small — the default; a strong baseline that trains comfortably on a single GPU.
  • medium / large — scale up when you have the data and multiple GPUs; pair with config="auto", gradient accumulation, and multi-node DDP (see Configuration and the Pretrain tutorial).

For example, to pretrain with the medium preset:

api.pretrain("data/tok", "demo", model_size="medium")

Any architecture field (e.g. rope_base, dropout) can be overridden on top of a preset by passing it in the pretrain config. The test suite checks the model and MLM-head parameter counts against the nominal sizes, so the presets stay honest.
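The idea of overriding fields on top of a preset can be illustrated with a plain dataclass. This is a minimal sketch, not pragmatiq's implementation: SketchConfig, SMALL, and with_overrides are hypothetical names, the rope_base default is an assumed placeholder, and the page does not show how the real pretrain config merges overrides.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a preset; field names follow the text above."""

    dim: int
    heads: int
    dropout: float = 0.1
    rope_base: float = 10000.0  # assumed placeholder, not taken from pragmatiq


# Values for the small preset, copied from the table on this page.
SMALL = SketchConfig(dim=192, heads=3)


def with_overrides(preset: SketchConfig, **overrides) -> SketchConfig:
    """Return a copy of the preset with selected fields replaced.

    dataclasses.replace raises TypeError for an unknown field name,
    so a typo in an override fails loudly instead of being ignored.
    """
    return replace(preset, **overrides)


cfg = with_overrides(SMALL, dropout=0.0, rope_base=50000.0)
print(cfg)
```

The preset itself stays untouched because the dataclass is frozen; each override produces a new object, which keeps the named sizes stable while individual runs vary.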
