Decisions log

The non-obvious choices pragmatiq made where the paper is silent — each with its rationale and the alternative considered.

Where the PRAGMA paper leaves an engineering detail unspecified, pragmatiq picks a default, exposes it in config, and records the reasoning here. Exact default values live in the Configuration reference (rendered from the code); this page records the reasoning behind them. Source-level guesses are marked # GUESS in the code.

Decision

Two optimizers: Muon for 2-D weights, AdamW for the rest

Why: Muon's orthogonalized updates suit the large 2-D hidden weight matrices; embeddings (although also 2-D), norms, and biases train better under AdamW. Gradient clipping is applied separately for each optimizer.

Alternative considered: A single AdamW over everything. Simpler, but it gives up the hidden-weight conditioning that Muon provides.
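The split described above can be sketched as a partition over parameter names and shapes. This is a minimal, stdlib-only illustration, not pragmatiq's actual code: the function names, the name hints ("embed", "norm", "bias"), and the example parameter names are all hypothetical, and a real setup would pass the two groups to actual Muon and AdamW optimizer objects.

```python
import math


def split_params(param_shapes, adamw_name_hints=("embed", "norm", "bias")):
    """Partition parameter names into a Muon group and an AdamW group.

    param_shapes maps a parameter name to its shape tuple.
    The name hints are an assumption made for this sketch.
    """
    muon, adamw = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) == 2
        hinted = any(h in name for h in adamw_name_hints)
        # Only 2-D hidden weights go to Muon; embeddings are 2-D but stay on AdamW.
        (muon if is_matrix and not hinted else adamw).append(name)
    return muon, adamw


def clip_group(grads, max_norm):
    """Scale one optimizer's gradients so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]


# Hypothetical parameter names and shapes.
shapes = {
    "tok_embed.weight": (32000, 512),
    "blocks.0.attn.qkv.weight": (1536, 512),
    "blocks.0.mlp.fc1.weight": (2048, 512),
    "blocks.0.mlp.fc1.bias": (2048,),
    "blocks.0.norm1.weight": (512,),
}
muon_names, adamw_names = split_params(shapes)
```

Calling clip_group once per group, rather than once over all gradients, is what "per-optimizer" clipping means in this sketch: a large gradient norm in one group does not shrink the updates of the other.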

Decision

Gradient boosting as the default probe head

Why: A learned embedding has non-linear structure a linear head misses; HistGradientBoostingClassifier captures it, and PR-AUC is reported alongside ROC-AUC for the low-prevalence risk tasks. The raw-count baseline uses the SAME classifier, so the gap is about the representation, not the model family.

Alternative considered: Logistic regression (still selectable via probe_model='logistic') — a cleaner linear read, but a weaker ceiling on a rich embedding.
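To see why PR-AUC is worth reporting next to ROC-AUC on low-prevalence tasks, the following stdlib-only sketch computes both for a toy ranking with a single positive among twenty examples. It is an illustration under simplifying assumptions (binary labels, no tied scores for average precision), not pragmatiq's metric code; in practice scikit-learn's roc_auc_score and average_precision_score would be the usual choice.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive appears.

    Assumes no tied scores.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data: 20 examples, one positive, ranked third from the top.
scores = list(range(20))
y_true = [1 if s == 17 else 0 for s in scores]

roc = roc_auc(y_true, scores)           # 17 of 19 negatives are outranked
ap = average_precision(y_true, scores)  # precision at rank 3 is 1/3
```

Here the ROC-AUC is about 0.89 while the average precision is about 0.33: the same ranking looks strong under one metric and modest under the other, which is the gap the paired reporting is meant to expose.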

Decision

Percentile bucketing with a dedicated zero bucket

Why: Banking magnitudes are heavy-tailed and zero is semantically special (no balance / no spend) and over-represented; percentile edges + a separate zero bucket keep both ends meaningful, and out-of-range values clip rather than fail.

Alternative considered: Fixed-width bins (poor on heavy tails) or standardized continuous inputs (loses the robustness of binning to outliers).
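A minimal sketch of this scheme, assuming non-negative magnitudes and using only the standard library. The function names and the choice to fit the percentile edges on non-zero values only are assumptions of the sketch; pragmatiq's own bucketing code may differ in both.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_edges(values, n_buckets):
    """Percentile cut points fitted on the non-zero values only.

    Returns n_buckets - 1 interior edges.
    """
    nonzero = sorted(v for v in values if v != 0)
    return quantiles(nonzero, n=n_buckets, method="inclusive")


def bucketize(value, edges):
    """Bucket 0 is reserved for exact zeros; buckets 1..len(edges)+1 are percentile bins."""
    if value == 0:
        return 0
    # Values below the first edge or above the last edge fall into the
    # first or last bucket, so out-of-range inputs clip rather than fail.
    return 1 + bisect_right(edges, value)


# Toy history: many zeros plus the amounts 1..9.
history = [0, 0, 0, 0] + list(range(1, 10))
edges = fit_edges(history, n_buckets=4)

zero_id = bucketize(0, edges)        # dedicated zero bucket
tiny_id = bucketize(0.001, edges)    # smaller than anything seen in fitting
huge_id = bucketize(1e9, edges)      # larger than anything seen in fitting
```

Fitting the edges without the zeros keeps the over-represented zero mass from collapsing several percentile edges onto the same value.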

Decision

Pre-training truncation caps (24 / 200 / 6500), on by default

Why: Real histories are heavy-tailed; capping per-event tokens, profile tokens, and events-per-user keeps training tractable and the batch shapes bounded. At synthetic scale none of the caps bind, so the default output is unchanged.

Alternative considered: No caps — simpler, but a few pathological users dominate memory and batch time at real scale.
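The caps can be sketched as three slices. This assumes the numbers in the heading map, in order, to per-event tokens (24), profile tokens (200), and events per user (6500); it also assumes that the most recent events and the leading tokens are the ones kept, which the text above does not state. The constant and function names are hypothetical.

```python
MAX_EVENT_TOKENS = 24       # assumed: cap on tokens per event
MAX_PROFILE_TOKENS = 200    # assumed: cap on profile tokens
MAX_EVENTS_PER_USER = 6500  # assumed: cap on events per user


def truncate_user(profile_tokens, events):
    """Apply the three caps to one user's profile and event history.

    events is a list of token lists, oldest first.
    """
    profile = profile_tokens[:MAX_PROFILE_TOKENS]
    # Assumption: keep the most recent events when a history is too long.
    recent = events[-MAX_EVENTS_PER_USER:]
    return profile, [event[:MAX_EVENT_TOKENS] for event in recent]


# A small user is returned unchanged; a pathological one is bounded.
small = truncate_user(["age=30"], [["amt=5", "mcc=1"]])
big_profile, big_events = truncate_user(["p"] * 250, [["t"] * 30] * 7000)
```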

Decision

RoPE frequency base is a tunable GUESS (10000)

Why: The paper does not specify the RoPE base for the continuous time axis; 10000 is the conventional default and is exposed as rope_base so it can be tuned per dataset.

Alternative considered: Hard-coding it — fine until someone's time scales differ and they need to tune it.
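To make the role of rope_base concrete, here is a small stdlib-only sketch of rotary embedding over a continuous timestamp, using the standard RoPE frequency schedule base^(-2i/d). The helper names and the time units are assumptions; this is not pragmatiq's implementation.

```python
import math


def rope_angles(t, dim, rope_base=10000.0):
    """One rotation angle per feature pair: t * rope_base ** (-2i / dim)."""
    return [t * rope_base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rotate(vec, t, rope_base=10000.0):
    """Rotate consecutive feature pairs of vec by the angles for time t."""
    out = []
    for i, angle in enumerate(rope_angles(t, len(vec), rope_base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# The attention score depends only on the time difference (2.25 here),
# not on the absolute timestamps.
near = dot(rotate(q, 3.5), rotate(k, 1.25))
far = dot(rotate(q, 103.5), rotate(k, 101.25))
```

A larger rope_base lowers the frequency of the later feature pairs, so the same rotation spans a longer stretch of time. That is why the right value depends on the time scale of the dataset.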

Decision

Masking 15 / 10 / 10 with 10% [UNK]-as-dropout

Why: The token/event/key union matches the paper; the 10% of selected positions replaced by [UNK] and excluded from the loss act as input dropout, so the model can't lean on a token that is sometimes simply absent.

Alternative considered: Token-only masking — simpler, but misses the event- and key-level structure that makes the objective informative for whole-field prediction.
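One possible reading of this objective, sketched with the standard library only. Several details are assumptions rather than facts from this page: that 15 / 10 / 10 are the token, event, and key rates in that order; that key-level masking hides every token of one field (one key within one event); and that selected positions not replaced by [UNK] are replaced by a [MASK] token and predicted. All function and token names are hypothetical.

```python
import random

MASK, UNK = "[MASK]", "[UNK]"


def select_positions(tokens, rng, p_token=0.15, p_event=0.10, p_key=0.10):
    """tokens is a list of (event_id, key, token). Returns the selected indices.

    A position is selected if its event is hit, its field is hit,
    or it is hit individually (the union of the three levels).
    """
    events = sorted({e for e, _, _ in tokens})
    fields = sorted({(e, k) for e, k, _ in tokens})
    hit_events = {e for e in events if rng.random() < p_event}
    hit_fields = {f for f in fields if rng.random() < p_key}
    return {
        i
        for i, (e, k, _) in enumerate(tokens)
        if e in hit_events or (e, k) in hit_fields or rng.random() < p_token
    }


def apply_masking(tokens, rng, p_unk=0.10, **rates):
    """Return (inputs, in_loss) after masking."""
    selected = select_positions(tokens, rng, **rates)
    inputs = [tok for _, _, tok in tokens]
    in_loss = [False] * len(tokens)
    for i in sorted(selected):
        if rng.random() < p_unk:
            inputs[i] = UNK    # input dropout: hidden, but not a loss target
        else:
            inputs[i] = MASK   # hidden and predicted
            in_loss[i] = True
    return inputs, in_loss


# Toy sequence: 50 events with three single-token fields each.
tokens = [(e, k, f"{k}={e}") for e in range(50) for k in ("amount", "mcc", "channel")]
inputs, in_loss = apply_masking(tokens, random.Random(0))
```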

For the larger architectural choices (continuous-time RoPE, the zero bucket, Nemotron being off by default), see the Architecture and Nemotron pages, which carry their own decision cards inline.