Generate & tokenize
Produce a synthetic dataset (or bring your own), then fit the key–value–time tokenizer and write tokenized training shards.
The first two stages turn raw histories into tokenized parquet shards the model trains on.
Generate synthetic data
pragmatiq synth generate --out data/synth --config configs/data/synthetic.yaml
# or inline:
pragmatiq synth generate --out data/synth --n-users 50000 --seed 0 --n-workers 8
The generator is deterministic: the same seed produces byte-identical output for any
worker count. It writes events.parquet, profiles.parquet, transfers.parquet, and
labels/*.parquet (the data contract) plus a realism report.
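One way to check the byte-identical claim yourself is to generate twice with the same seed but different --n-workers values and compare a digest of each output tree. The Python sketch below is a minimal helper for that comparison; it is not part of pragmatiq, and the names tree_digest and same_output (like the directory names in the usage note) are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Return one SHA-256 covering every file under root (relative path, size, bytes)."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort the paths so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Prefix the size so file boundaries are unambiguous in the hashed stream.
        digest.update(path.stat().st_size.to_bytes(8, "big"))
        with path.open("rb") as fh:
            # Read in 1 MiB chunks so large parquet files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def same_output(dir_a, dir_b):
    """True when both trees hold the same files with the same bytes."""
    return tree_digest(dir_a) == tree_digest(dir_b)
```

For example, after generating into two hypothetical directories, data/synth_w1 with --n-workers 1 and data/synth_w8 with --n-workers 8 (same --seed and --n-users), same_output("data/synth_w1", "data/synth_w8") should return True if the outputs are byte-identical.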
Fit to your bank's statistics, no raw data
pragmatiq synth calibrate --stats configs/data/aggregates.example.yaml fits the generator's
priors to aggregate statistics a bank can share — so the synthetic book resembles yours
without moving any raw records.
Tokenize
pragmatiq tokenize data/synth --out data/tok --n-workers 8
This fits the key–value–time vocabulary (numeric percentile buckets, categorical ids, text
BPE) in one streaming pass, then writes tokenized parquet shards plus an LMDB user-index.
The tokenizer fit is worker-count-invariant — byte-identical regardless of --n-workers.
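To make "numeric percentile buckets" concrete, here is a minimal Python sketch of the general technique: cut points are chosen so each bucket holds roughly the same share of the observed values, and a number is then encoded as the index of the bucket it falls in. This is an illustration under assumptions, not pragmatiq's tokenizer; the function names are hypothetical, and the sketch holds all values in memory rather than attempting the one-pass streaming fit.

```python
import bisect
import statistics


def fit_percentile_edges(values, n_buckets=10):
    """Return the n_buckets - 1 cut points that split values into equal-mass buckets."""
    # statistics.quantiles returns n - 1 cut points for n equal-probability intervals.
    return statistics.quantiles(values, n=n_buckets, method="inclusive")


def bucket_id(value, edges):
    """Map a number to a bucket index in the range 0..len(edges)."""
    # Values below the first edge land in bucket 0, values above the last
    # edge land in the final bucket, so unseen extremes are still encodable.
    return bisect.bisect_right(edges, value)


# Illustrative data only: fit four buckets on the integers 1..100.
edges = fit_percentile_edges(list(range(1, 101)), n_buckets=4)
small, large = bucket_id(3, edges), bucket_id(10_000, edges)
```

Because the edges follow the data's own distribution, a heavy-tailed field such as a transaction amount gets fine resolution where values are dense and coarse resolution in the tail.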
Customize it with a config (see Configuration):
pragmatiq tokenize data/synth --out data/tok --config configs/data/tokenizer.yaml
For the PRAGMA+Nemotron variant, tokenize with
configs/data/tokenizer_nemotron.yaml instead.
What you get
data/tok/ holds the tokenized shards, the LMDB user-index, and the saved tokenizer (with a content hash). The hash is
embedded in every checkpoint; from_pretrained refuses to run a model against a mismatched
tokenizer. Next: Pretrain.
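The tokenizer-hash guard can be pictured with a short Python sketch: hash a canonical serialization of the tokenizer state, store that hash alongside the checkpoint, and refuse to load when the two disagree. This shows the general pattern only and is not pragmatiq's from_pretrained; content_hash, check_tokenizer, and the "tokenizer_hash" metadata key are hypothetical names.

```python
import hashlib
import json


def content_hash(tokenizer_state):
    """SHA-256 over a canonical JSON dump, so key order cannot change the hash."""
    canonical = json.dumps(tokenizer_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_tokenizer(checkpoint_meta, tokenizer_state):
    """Raise ValueError if the checkpoint was trained against a different tokenizer."""
    expected = checkpoint_meta["tokenizer_hash"]  # hypothetical metadata key
    actual = content_hash(tokenizer_state)
    if expected != actual:
        raise ValueError(
            f"tokenizer hash mismatch: checkpoint expects {expected[:12]}, got {actual[:12]}"
        )
```

Failing fast here matters because a model run against a different vocabulary would still produce output, just silently wrong output, since the token ids would no longer mean what they meant during training.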