Configuration
Every tokenizer and training knob, with its default — rendered straight from the code so it never drifts.
These tables are generated from the dataclasses in the library (via
scripts/docs_facts.py), so the defaults shown here are exactly the code's defaults. Any
knob can be overridden in a YAML config (--config) or programmatically
(api.pretrain(..., key=value)).
Tokenizer (TokenizerConfig)
Passed to pragmatiq tokenize --config / api.tokenize(config=...). See also
configs/data/tokenizer.yaml.
| Field | Type | Default |
|---|---|---|
| n_buckets | int | 64 |
| categorical_threshold | int | 1000 |
| target_vocab | int | 28000 |
| bpe_min_frequency | int | 2 |
| lowercase_text | bool | False |
| max_numeric_sample | int | 100000 |
| numeric_min_cardinality | int | None | None |
| force_categorical | tuple[str, ...] | () |
| force_numeric | tuple[str, ...] | () |
| seed | int | 0 |
| value_encoder | str | percentile_binner |
| calendar_tz | str | UTC |
| max_event_tokens | int | None | 24 |
| max_profile_tokens | int | None | 200 |
| max_events_per_user | int | None | 6500 |
| text_value_mode | str | bpe |
| text_encoder | str | hash |
| text_encoder_dim | int | 64 |
The last three rows are the PRAGMA+Nemotron variant switch:
set text_value_mode="embed" and text_encoder="nemotron" (the
configs/data/tokenizer_nemotron.yaml
preset) and pretrain auto-wires the matching model branch.
Training (TrainConfig)
Passed to pragmatiq pretrain --config / api.pretrain(config=...). See also
configs/pretrain.yaml.
| Field | Type | Default |
|---|---|---|
| max_steps | int | 1000 |
| token_budget | int | 16384 |
| grad_accum_steps | int | 1 |
| lr_muon | float | 0.003 |
| lr_adamw | float | 0.0003 |
| weight_decay | float | 0.01 |
| warmup_steps | int | 100 |
| grad_clip | float | 1 |
| checkpoint_every_min | float | 15 |
| log_every | int | 10 |
| seed | int | 0 |
| deterministic | bool | False |
| nan_skip | bool | True |
| max_consecutive_skips | int | 50 |
| verbose | bool | True |
| wandb | bool | False |
| wandb_project | str | pragmatiq |
| masker | str | pragma |
| p_token | float | 0.15 |
| p_event | float | 0.1 |
| p_key | float | 0.1 |
| p_unk | float | 0.1 |
| text_loss_weight | float | 1 |
| devices | int | str | auto |
| num_nodes | int | 1 |
A few that matter most for scale and reproducibility:
token_budget— tokens per packed forward (the per-device memory knob).grad_accum_steps— micro-batches per optimizer step; effective batch istoken_budget × grad_accum × num_nodes·devices.1is byte-identical to no accumulation.devices/num_nodes— Fabric DDP across GPUs and hosts.deterministic— opt-in reproducible CUDA path (fp32, see Determinism).seed— same seed → byte-identical CPU output.
Auto-config
Pass config="auto" to pretrain and pragmatiq sizes token_budget, grad_accum_steps,
max_steps, and warmup_steps from the shard index and the target device. Explicit
overrides still win.