Configuration

Every tokenizer and training knob, with its default — rendered straight from the code so it never drifts.

These tables are generated from the dataclasses in the library (via scripts/docs_facts.py), so the defaults shown here are exactly the code's defaults. Any knob can be overridden in a YAML config (--config) or programmatically (api.pretrain(..., key=value)).
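
As a rough illustration of the override model described above, the sketch below uses a stand-in dataclass and applies keyword overrides on top of its defaults, loosely mirroring the api.pretrain(..., key=value) form. TrainConfigSketch and with_overrides are hypothetical names, not part of pragmatiq; only the three field defaults are copied from the Training table on this page, and how the library merges overrides internally is an assumption.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TrainConfigSketch:
    """Stand-in for illustration only; NOT pragmatiq's real TrainConfig."""

    max_steps: int = 1000       # default taken from the Training table
    token_budget: int = 16384   # default taken from the Training table
    lr_muon: float = 0.003      # default taken from the Training table


def with_overrides(cfg, **overrides):
    """Return a copy of cfg with overrides applied, rejecting unknown knobs."""
    known = {f.name for f in fields(cfg)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown config knob(s): {sorted(unknown)}")
    return replace(cfg, **overrides)


defaults = TrainConfigSketch()
custom = with_overrides(defaults, max_steps=5000, token_budget=8192)
print(custom)
```

Anything not overridden (lr_muon here) keeps its default, and the original defaults object is left unchanged because the dataclass is frozen.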

Tokenizer (TokenizerConfig)

Passed to pragmatiq tokenize --config / api.tokenize(config=...). See also configs/data/tokenizer.yaml.

Field                    Type             Default
n_buckets                int              64
categorical_threshold    int              1000
target_vocab             int              28000
bpe_min_frequency        int              2
lowercase_text           bool             False
max_numeric_sample       int              100000
numeric_min_cardinality  int | None       None
force_categorical        tuple[str, ...]  ()
force_numeric            tuple[str, ...]  ()
seed                     int              0
value_encoder            str              percentile_binner
calendar_tz              str              UTC
max_event_tokens         int | None       24
max_profile_tokens       int | None       200
max_events_per_user      int | None       6500
text_value_mode          str              bpe
text_encoder             str              hash
text_encoder_dim         int              64

The last three rows (text_value_mode, text_encoder, text_encoder_dim) are the PRAGMA+Nemotron variant switch. Setting text_value_mode="embed" and text_encoder="nemotron", as the configs/data/tokenizer_nemotron.yaml preset does, makes pretrain auto-wire the matching model branch.

Training (TrainConfig)

Passed to pragmatiq pretrain --config / api.pretrain(config=...). See also configs/pretrain.yaml.

Field                  Type       Default
max_steps              int        1000
token_budget           int        16384
grad_accum_steps       int        1
lr_muon                float      0.003
lr_adamw               float      0.0003
weight_decay           float      0.01
warmup_steps           int        100
grad_clip              float      1
checkpoint_every_min   float      15
log_every              int        10
seed                   int        0
deterministic          bool       False
nan_skip               bool       True
max_consecutive_skips  int        50
verbose                bool       True
wandb                  bool       False
wandb_project          str        pragmatiq
masker                 str        pragma
p_token                float      0.15
p_event                float      0.1
p_key                  float      0.1
p_unk                  float      0.1
text_loss_weight       float      1
devices                int | str  auto
num_nodes              int        1

A few that matter most for scale and reproducibility:

  • token_budget: tokens per packed forward pass (the per-device memory knob).
  • grad_accum_steps: micro-batches per optimizer step. The effective batch, in tokens per optimizer step, is token_budget × grad_accum_steps × num_nodes × devices. A value of 1 is byte-identical to running without accumulation.
  • devices / num_nodes: Fabric DDP across GPUs and hosts.
  • deterministic: opt-in reproducible CUDA path (fp32; see Determinism).
  • seed: the same seed gives byte-identical CPU output.
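
The effective-batch arithmetic in the list above can be checked with a few lines of Python. This is only a sketch of the stated product; effective_tokens_per_step is a hypothetical helper, not a pragmatiq function, and it assumes devices has already been resolved to an integer count per node (the library default is "auto").

```python
def effective_tokens_per_step(token_budget, grad_accum_steps=1, devices=1, num_nodes=1):
    """Tokens consumed per optimizer step across all devices and hosts.

    Hypothetical helper: `devices` must be an int here (GPUs per node),
    not the "auto" value accepted by the real config.
    """
    return token_budget * grad_accum_steps * devices * num_nodes


# Table defaults on a single device: 16384 tokens per optimizer step.
single = effective_tokens_per_step(16384)

# Illustrative scale-up: 4 micro-batches, 8 GPUs per node, 2 nodes.
scaled = effective_tokens_per_step(16384, grad_accum_steps=4, devices=8, num_nodes=2)

print(single)  # 16384
print(scaled)  # 1048576
```

Raising grad_accum_steps grows the effective batch without raising per-device memory, which is governed by token_budget alone.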

Auto-config

Pass config="auto" to pretrain and pragmatiq sizes token_budget, grad_accum_steps, max_steps, and warmup_steps from the shard index and the target device. Explicit overrides still win.
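
The precedence rule ("explicit overrides still win") can be sketched as a simple merge. Everything here is illustrative: resolve is a hypothetical helper, the auto-sized numbers are made-up placeholders rather than values pragmatiq would compute, and the library's actual sizing logic is not shown on this page.

```python
# The four knobs this page says auto-config sizes.
AUTO_SIZED_KEYS = ("token_budget", "grad_accum_steps", "max_steps", "warmup_steps")


def resolve(auto_sized, explicit):
    """Start from auto-sized values, then let explicit overrides win."""
    resolved = {key: auto_sized[key] for key in AUTO_SIZED_KEYS}
    resolved.update(explicit)  # explicit values replace auto-sized ones
    return resolved


# Placeholder numbers standing in for whatever auto-config would choose.
auto_sized = {
    "token_budget": 32768,
    "grad_accum_steps": 2,
    "max_steps": 4000,
    "warmup_steps": 200,
}

final = resolve(auto_sized, {"max_steps": 10000})
print(final)
```

Only max_steps changes in the result; the other three knobs keep their auto-sized values.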
