Configuration

Every tokenizer and training knob, with its default — rendered straight from the code so it never drifts.

These tables are generated from the dataclasses in the library (via scripts/docs_facts.py), so the defaults shown here are exactly the code's defaults. Any knob can be overridden in a YAML config (--config) or programmatically (api.pretrain(..., key=value)).

Tokenizer (`TokenizerConfig`)

Passed to pragmatiq tokenize --config / api.tokenize(config=...). See also configs/data/tokenizer.yaml.

Field	Type	Default
n_buckets	int	`64`
categorical_threshold	int	`1000`
target_vocab	int	`28000`
bpe_min_frequency	int	`2`
lowercase_text	bool	`False`
max_numeric_sample	int	`100000`
numeric_min_cardinality	int \| None	`None`
force_categorical	tuple[str, ...]	`()`
force_numeric	tuple[str, ...]	`()`
seed	int	`0`
value_encoder	str	`percentile_binner`
calendar_tz	str	`UTC`
max_event_tokens	int \| None	`24`
max_profile_tokens	int \| None	`200`
max_events_per_user	int \| None	`6500`
text_value_mode	str	`bpe`
text_encoder	str	`hash`
text_encoder_dim	int	`64`

The last three rows are the PRAGMA+Nemotron variant switch: set text_value_mode="embed" and text_encoder="nemotron" (the configs/data/tokenizer_nemotron.yaml preset) and pretrain auto-wires the matching model branch.

Training (`TrainConfig`)

Passed to pragmatiq pretrain --config / api.pretrain(config=...). See also configs/pretrain.yaml.

Field	Type	Default
max_steps	int	`1000`
token_budget	int	`16384`
grad_accum_steps	int	`1`
lr_muon	float	`0.003`
lr_adamw	float	`0.0003`
weight_decay	float	`0.01`
warmup_steps	int	`100`
grad_clip	float	`1`
checkpoint_every_min	float	`15`
log_every	int	`10`
seed	int	`0`
deterministic	bool	`False`
nan_skip	bool	`True`
max_consecutive_skips	int	`50`
verbose	bool	`True`
wandb	bool	`False`
wandb_project	str	`pragmatiq`
masker	str	`pragma`
p_token	float	`0.15`
p_event	float	`0.1`
p_key	float	`0.1`
p_unk	float	`0.1`
text_loss_weight	float	`1`
devices	int \| str	`auto`
num_nodes	int	`1`

A few that matter most for scale and reproducibility:

token_budget — tokens per packed forward (the per-device memory knob).
grad_accum_steps — micro-batches per optimizer step; effective batch is token_budget × grad_accum × num_nodes·devices. 1 is byte-identical to no accumulation.
devices / num_nodes — Fabric DDP across GPUs and hosts.
deterministic — opt-in reproducible CUDA path (fp32, see Determinism).
seed — same seed → byte-identical CPU output.

Auto-config

Pass config="auto" to pretrain and pragmatiq sizes token_budget, grad_accum_steps, max_steps, and warmup_steps from the shard index and the target device. Explicit overrides still win.

Tokenizer (TokenizerConfig)

Training (TrainConfig)

On this page

Tokenizer (`TokenizerConfig`)

Training (`TrainConfig`)