Architecture
How pragmatiq turns a key–value–time event history into a user embedding — the four-encoder stack, temporal encoding, and the pretraining objective.
This is the "how it works" tour: from a raw banking event history to a single dense
user embedding, and the reasoning behind each choice. Every claim here is implemented
in the library (pragmatiq/models/, pragmatiq/data/).
The stack at a glance
pragmatiq compresses one customer's entire history with a four-encoder stack on top of
a tokenizer that turns every field into a (key, value, time) triple.
Every field becomes a (key, value, time) token. One shared table embeds both keys and values; a sinusoidal within-field position is added for multi-piece text.
- x = E(key) + E(value) + sinusoidal(position)
- The table is tied to the MLM output projection
- Numeric → percentile bucket, categorical → id, text → BPE pieces (or one frozen-embedding sentinel in the Nemotron variant)
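The embedding rule above can be sketched in a few lines of NumPy. This is an illustration only, not the library's code: the toy sizes, the `sinusoidal` and `embed` helper names, and the base of 10000 in the position encoding are assumptions made for the example.

```python
import numpy as np

def sinusoidal(position: int, dim: int) -> np.ndarray:
    """Classic sin/cos position vector of width `dim` (dim must be even)."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000.0 ** (2 * i / dim))   # ASSUMPTION: base 10000
    angles = position * freq
    out = np.empty(dim)
    out[0::2] = np.sin(angles)
    out[1::2] = np.cos(angles)
    return out

rng = np.random.default_rng(0)
vocab_size, dim = 32, 8                        # toy sizes, not library defaults
table = rng.normal(size=(vocab_size, dim))     # ONE table shared by keys and values

def embed(key_id: int, value_id: int, position: int) -> np.ndarray:
    # x = E(key) + E(value) + sinusoidal(position)
    return table[key_id] + table[value_id] + sinusoidal(position, dim)

x = embed(key_id=3, value_id=17, position=0)

# Weight tying: the output head scores a hidden vector against the same table.
hidden = rng.normal(size=dim)
logits = table @ hidden                        # shape (vocab_size,)
```

Because keys and values index the same table, the tied output projection can score a masked position against every key and value id at once.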
Tokenization: key–value–time
The tokenizer fits one vocabulary over the whole dataset, then turns each user's history into flat token arrays. It emits one key token per field, paired with a value representation chosen by the field's kind; keys and values share a single embedding space.
The real tokenizer fits the vocabulary and bucket edges from your data. In Nemotron text mode, high-cardinality text collapses from BPE pieces to a single sentinel that a frozen encoder maps to a vector.
- Numeric values are percentile-binned (with a dedicated zero bucket), so an unseen magnitude clips into the end buckets rather than failing.
- Low-cardinality strings become one categorical token per value.
- High-cardinality text is byte-level BPE by default — or, in the Nemotron variant, a single sentinel token whose raw string a frozen text model maps to a vector.
A dedicated zero bucket for numeric fields
Why: Zero is semantically special in banking (no balance, no spend) and is over-represented; giving it its own bucket keeps the percentile edges meaningful for the non-zero mass.
Alternative considered: Folding zero into the lowest percentile bucket, which blurs 'absent' with 'very small'.
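A minimal sketch of percentile binning with a reserved zero bucket, assuming the edges are fitted on the non-zero values only. The function names `fit_edges` and `bucketize`, the bucket count, and the convention that bucket 0 means "exactly zero" are hypothetical choices for illustration, not the library's API.

```python
import numpy as np

def fit_edges(values, n_buckets: int) -> np.ndarray:
    """Percentile edges fitted on the non-zero values only."""
    nonzero = np.asarray([v for v in values if v != 0], dtype=float)
    qs = np.linspace(0, 100, n_buckets + 1)[1:-1]   # interior percentiles
    return np.percentile(nonzero, qs)

def bucketize(value: float, edges: np.ndarray) -> int:
    """Bucket 0 is reserved for exact zero; other values map to 1..n_buckets."""
    if value == 0:
        return 0
    # searchsorted returns 0..len(edges), so unseen magnitudes
    # clip into the first or last non-zero bucket instead of failing.
    return 1 + int(np.searchsorted(edges, value, side="right"))

train = [0, 0, 0, 5, 10, 20, 40, 80, 160, 320, 640]
edges = fit_edges(train, n_buckets=4)

zero_bucket = bucketize(0, edges)
small_bucket = bucketize(0.01, edges)     # tiny but non-zero: not the zero bucket
huge_bucket = bucketize(1e9, edges)       # unseen magnitude: clips to the top bucket
```

Keeping the zeros out of the fit means the three over-represented zeros in `train` do not drag the percentile edges toward zero.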
Temporal encoding
Time is the spine of the model. The core transform compresses raw elapsed seconds with a log curve, so recent and distant events both stay informative.
The resulting log-seconds value is then used as a continuous position for rotary embeddings (TimeRoPE), not an integer index. A token at log-seconds $\tau$ rotates frequency pair $i$ by the angle $\tau\,\theta_i$, where $\theta_i$ is that pair's rotary frequency, so attention scores encode the elapsed time between two events rather than their ordinal distance.
RoPE over a continuous time position, not token index
Why: Banking events are wildly irregular — seconds apart, then months apart. Continuous-time RoPE makes 'one second ago' and 'one month ago' genuinely different relative rotations, which integer positions cannot express.
Alternative considered: Standard integer-position RoPE or learned absolute positions, which treat 'the previous event' identically regardless of elapsed time.
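The idea can be sketched with NumPy. Two things here are assumptions rather than facts about the library: `log1p` stands in for the log curve, and the frequencies use the common rotary base of 10000. The helper names are hypothetical. The point of the sketch is the property itself: the score between a rotated query and key depends only on the difference of their continuous positions.

```python
import numpy as np

def log_time(seconds):
    # ASSUMPTION: log1p as the log curve; the library's exact transform may differ.
    return np.log1p(np.asarray(seconds, dtype=float))

def time_rope(x: np.ndarray, tau: float, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of a 1-D vector by angle tau * theta_i (tau is continuous)."""
    half = x.shape[-1] // 2
    theta = base ** (-np.arange(half) / half)   # one frequency per pair
    angle = tau * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

def score(tau_q: float, tau_k: float) -> float:
    return float(time_rope(q, tau_q) @ time_rope(k, tau_k))

one_second = float(log_time(1.0))
one_month = float(log_time(30 * 24 * 3600.0))
```

Shifting both positions by the same amount leaves `score` unchanged, while `one_second` and `one_month` land at clearly different positions, which an integer "previous event" index cannot distinguish.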
Two axes use it: the profile encoder positions each lifelong milestone by log-seconds since it occurred; the history encoder positions each event by log-seconds to the most recent event. Calendar features (hour, day-of-week, day-of-month) take a separate sin/cos → MLP path and are added to each event vector, so day/night and weekday/weekend structure is available independently of elapsed time.
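As a rough illustration of the calendar path, the sketch below builds the cyclic sin/cos pairs that would feed the MLP. The function name, the feature order, and the period of 31 for day-of-month are assumptions; the MLP itself is omitted.

```python
import math
from datetime import datetime

def calendar_features(ts: datetime) -> list[float]:
    """Cyclic sin/cos pairs for hour, day-of-week and day-of-month."""
    # ASSUMPTION: periods of 24, 7 and 31; the library may normalize differently.
    parts = [(ts.hour, 24), (ts.weekday(), 7), (ts.day - 1, 31)]
    feats: list[float] = []
    for value, period in parts:
        angle = 2 * math.pi * value / period
        feats += [math.sin(angle), math.cos(angle)]
    return feats

midnight = calendar_features(datetime(2024, 1, 1, 0, 0))    # a Monday, first of the month
late_evening = calendar_features(datetime(2024, 1, 1, 23, 0))
```

The cyclic encoding keeps 23:00 close to 00:00 in feature space, which a raw hour number would place at opposite ends of the range.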
The encoder stack
Tokenization and embedding give per-token vectors. Then:
- The event encoder encodes each event independently (within-event attention, a prepended [EVT] marker); the [EVT] output plus calendar features is the per-event vector.
- The profile-state encoder runs over each user's profile tokens under a [USR] marker → the profile state.
- The history encoder runs over the user-level sequence built from these vectors; the [USR] slot output is the user embedding.
All three encoders are bidirectional and pre-norm, with GELU activations and dropout 0.1.
The whole batch is packed without padding using cu_seqlens, so the CPU SDPA fallback
matches a flash-attn varlen forward exactly.
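For readers unfamiliar with padding-free packing, here is a small NumPy sketch of the idea: sequences are concatenated, and `cu_seqlens` holds the cumulative offsets that mark where each one starts and ends. The `pack` and `segment_attention` helpers are hypothetical reference code (single head, q = k = v), not the library's attention path.

```python
import numpy as np

def pack(sequences):
    """Concatenate variable-length sequences; return tokens and cu_seqlens offsets."""
    lengths = [len(s) for s in sequences]
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int32)
    packed = np.concatenate(sequences, axis=0)
    return packed, cu_seqlens

def segment_attention(x, cu_seqlens):
    """Reference bidirectional self-attention applied independently per segment."""
    out = np.empty_like(x)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        seg = x[start:end]
        scores = seg @ seg.T / np.sqrt(seg.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ seg        # tokens never attend across segments
    return out

rng = np.random.default_rng(0)
seqs = [rng.normal(size=(n, 4)) for n in (3, 1, 5)]
packed, cu_seqlens = pack(seqs)
out = segment_attention(packed, cu_seqlens)
```

No padding tokens exist in `packed`, so no attention mask is needed beyond the segment boundaries themselves.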
Pretraining objective
For each masked token the MLM head concatenates three views of that token, projects the
concatenation from width 3d back to d, and scores the result against the tied embedding
table with cross-entropy and label smoothing 0.1. Masking unions three modes: 15% per token,
10% per whole event, 10% per (user, key); of the selected positions, 10% become [UNK] and
are excluded from the loss.
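The union of the three masking modes can be sketched as follows for a single user's flat token arrays. The function name, argument layout, and return values are hypothetical; what the remaining selected positions are replaced with is not covered here.

```python
import numpy as np

def build_masks(event_ids, key_ids, rng,
                p_token=0.15, p_event=0.10, p_key=0.10, p_unk=0.10):
    """Masks for ONE user's tokens, so a key id alone identifies a (user, key) pair."""
    event_ids, key_ids = np.asarray(event_ids), np.asarray(key_ids)
    n = len(event_ids)

    token_mask = rng.random(n) < p_token                    # independent per token

    events = np.unique(event_ids)
    picked_events = events[rng.random(len(events)) < p_event]
    event_mask = np.isin(event_ids, picked_events)          # whole events at once

    keys = np.unique(key_ids)
    picked_keys = keys[rng.random(len(keys)) < p_key]
    key_mask = np.isin(key_ids, picked_keys)                # every token of a key

    selected = token_mask | event_mask | key_mask           # union of the three modes
    to_unk = selected & (rng.random(n) < p_unk)             # become [UNK], no loss
    in_loss = selected & ~to_unk
    return selected, to_unk, in_loss

rng = np.random.default_rng(0)
event_ids = [0, 0, 1, 1, 2]     # which event each token belongs to
key_ids = [5, 6, 5, 6, 5]       # which key each token carries
selected, to_unk, in_loss = build_masks(event_ids, key_ids, rng)
```

Event-level and key-level masking hide correlated tokens together, so the model cannot recover a masked value simply by copying a sibling token from the same event or field.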
In the PRAGMA+Nemotron variant, text tokens carry a frozen embedding instead of a vocab id, so a masked text token is reconstructed with MSE against that frozen vector. The loss then combines the cross-entropy term with this MSE term under a configurable weight that has a default value. The variant is off by default, so the BPE path stays byte-identical.
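One plausible shape for the two-term objective is sketched below. The additive form `ce + text_weight * mse`, the function names, and the uniform label-smoothing convention are all assumptions; the weight is left as a required argument because its default is not stated here.

```python
import numpy as np

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target (uniform smoothing over the vocab)."""
    logp = logits - logits.max()
    logp = logp - np.log(np.exp(logp).sum())        # log-softmax
    soft = np.full(len(logits), smoothing / len(logits))
    soft[target] += 1.0 - smoothing
    return float(-(soft * logp).sum())

def combined_loss(ce_terms, mse_terms, text_weight):
    """Mean CE over vocab tokens plus a weighted mean MSE over frozen-embedding text tokens."""
    # ASSUMPTION: a simple weighted sum; the library's exact reduction may differ.
    ce = float(np.mean(ce_terms)) if len(ce_terms) else 0.0
    mse = float(np.mean(mse_terms)) if len(mse_terms) else 0.0
    return ce + text_weight * mse

uniform_ce = smoothed_cross_entropy(np.zeros(4), target=0)
bpe_only = combined_loss([1.0, 3.0], [], text_weight=0.5)      # variant off: no MSE terms
with_text = combined_loss([1.0], [2.0, 4.0], text_weight=0.5)
```

With no frozen-embedding tokens in the batch the MSE term vanishes, so the loss reduces to the plain cross-entropy path.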
Pre-training caps
Real histories are heavy-tailed, so the tokenizer caps them (the paper defaults below). At synthetic scale none of these bind, so default output is unchanged:
| Field | Type | Default | Notes |
|---|---|---|---|
| max_event_tokens | int \| None | 24 | Tokens kept per event (first N). |
| max_profile_tokens | int \| None | 200 | Tokens kept for the profile state (first whole items fitting N). |
| max_events_per_user | int \| None | 6500 | A longer history keeps only its most recent N events. |
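The three caps can be read as the following truncation rules. This is a sketch under stated assumptions: the helper names are hypothetical, `None` is taken to disable a cap, events are assumed to be in chronological order, and "first whole items fitting N" is interpreted as stopping at the first item that would overflow the budget.

```python
def cap_event(tokens, max_event_tokens=24):
    """Keep the first N tokens of one event (None disables the cap)."""
    return tokens if max_event_tokens is None else tokens[:max_event_tokens]

def cap_profile(items, max_profile_tokens=200):
    """Keep leading whole items while their total token count fits N."""
    if max_profile_tokens is None:
        return items
    kept, used = [], 0
    for item in items:                       # each item is a list of tokens
        if used + len(item) > max_profile_tokens:
            break                            # ASSUMPTION: stop at the first overflow
        kept.append(item)
        used += len(item)
    return kept

def cap_history(events, max_events_per_user=6500):
    """Keep only the most recent N events of a chronologically ordered history."""
    if max_events_per_user is None:
        return events
    return events[max(0, len(events) - max_events_per_user):]

short_event = cap_event(list(range(30)))
profile = cap_profile([[1, 2], [3, 4, 5], [6]], max_profile_tokens=4)
recent = cap_history(list(range(10)), max_events_per_user=3)
```

A history shorter than every cap passes through unchanged, which matches the note that none of the caps bind at synthetic scale.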
What is ours, not the paper
The core representation above — key–value–time tokenization, the time transform, continuous
TimeRoPE, the encoder stack, the 3d MLM head, the masking scheme, and the Nemotron variant —
follows the PRAGMA paper. The AML transfer-graph GraphSAGE work is pragmatiq's own
extension, built standalone on top of these embeddings (see AML over the transfer graph);
so are the synthetic data generator and the nano size.
pragmatiq is an independent implementation inspired by the PRAGMA paper (arXiv 2604.08649) and is not affiliated with or endorsed by Revolut.