Bring your own data
Point pragmatiq at real bank records — produce the parquet contract, validate, then run the same pipeline.
pragmatiq trains on the same parquet contract whether the data is synthetic or yours. The only new work is producing the four contract files; everything downstream is identical.
Produce the four files
Write events.parquet, profiles.parquet, and (optionally) transfers.parquet and
labels/*.parquet, using the exact dtypes in the data contract.
The key idea: event fields and profile attributes are map<string,string> — pass raw
string values; the tokenizer infers numeric / categorical / text per field. Do not pre-bin or
pre-encode.
import pyarrow as pa, pyarrow.parquet as pq
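events = pa.table({
    "user_id": [...],
    "ts": pa.array([...], type=pa.timestamp("us")),
    "source": [...],  # e.g. "transaction", "app", "transfer"
    # map<string,string>: one list of (key, value) string pairs per event.
    # The type is explicit because pyarrow infers a struct from plain dicts.
    "fields": pa.array([...], type=pa.map_(pa.string(), pa.string())),
})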
pq.write_table(events, "data/mybank/events.parquet")
Validate, then run the pipeline
pragmatiq validate data/mybank # fails loudly on dtype / integrity issues
pragmatiq tokenize data/mybank --out data/tok --n-workers 8
pragmatiq pretrain data/tok --name mybank --model-size small --config auto
pragmatiq probe data/tok --run runs/mybank --label data/mybank/labels/<task>.parquet
At 1M–26M records, --config auto sizes the batch/schedule for you, and the truncation caps keep heavy-tailed histories tractable (see Pretrain).
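If you want to script the four steps, the sketch below chains them from Python and stops at the first failure. It uses only the commands and flags shown above. The helper names build_commands and run_pipeline are hypothetical, and two things are assumptions: that each command exits non-zero on failure, and that the run directory is runs/<name>, as in the example.

```python
import shlex
import subprocess


def build_commands(data_dir, tok_dir, name, label_path, n_workers=8):
    """Return the four CLI invocations from this section, in order."""
    return [
        ["pragmatiq", "validate", data_dir],
        ["pragmatiq", "tokenize", data_dir, "--out", tok_dir,
         "--n-workers", str(n_workers)],
        ["pragmatiq", "pretrain", tok_dir, "--name", name,
         "--model-size", "small", "--config", "auto"],
        # Assumption: the run directory is runs/<name>, mirroring the example.
        ["pragmatiq", "probe", tok_dir, "--run", f"runs/{name}",
         "--label", label_path],
    ]


def run_pipeline(commands, dry_run=False):
    """Print each command, then run it; check=True aborts on a non-zero exit."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: prints the commands without executing anything.
commands = build_commands(
    "data/mybank", "data/tok", "mybank", "data/mybank/labels/<task>.parquet"
)
run_pipeline(commands, dry_run=True)
```

Replace <task> with your label file before a real run; with dry_run=False a failed validate step raises subprocess.CalledProcessError, so tokenize never starts.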
Things to know
- Time zones: calendar features default to UTC; set calendar_tz (e.g. Europe/London) if your timestamps are UTC instants but day/night and payday structure is local.
- Unseen values at inference map to [UNK] with a logged warning, never a KeyError. Refit the tokenizer when your value distribution shifts materially.
- Labels need an eval_ts per user so histories are truncated before embedding (forecast, not hindcast).
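The three points above can be illustrated with a small standard-library sketch. None of this is pragmatiq's implementation: local_calendar, encode, and truncate_history are hypothetical helpers, the toy vocabulary is invented, and the strict "before eval_ts" cutoff is an assumption.

```python
import logging
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

log = logging.getLogger("byod_sketch")


def local_calendar(ts_utc, calendar_tz="UTC"):
    """Return (hour, weekday) of a UTC instant as seen in calendar_tz."""
    local = ts_utc.astimezone(ZoneInfo(calendar_tz))
    return local.hour, local.weekday()


def encode(value, vocab, unk="[UNK]"):
    """Look up a value; unseen values fall back to unk with a warning, no KeyError."""
    if value not in vocab:
        log.warning("unseen value %r mapped to %s", value, unk)
        return vocab[unk]
    return vocab[value]


def truncate_history(events, eval_ts):
    """Keep only events strictly before eval_ts (assumed cutoff rule)."""
    return [e for e in events if e["ts"] < eval_ts]


# 23:30 UTC on a Monday is already 00:30 on Tuesday in London during summer time.
ts = datetime(2024, 7, 1, 23, 30, tzinfo=timezone.utc)
print(local_calendar(ts))                   # (23, 0)
print(local_calendar(ts, "Europe/London"))  # (0, 1)

vocab = {"[UNK]": 0, "5411": 1}
print(encode("9999", vocab))                # 0, with a logged warning

history = [
    {"ts": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"ts": datetime(2024, 8, 1, tzinfo=timezone.utc)},
]
print(len(truncate_history(history, ts)))   # 1
```

The time zone case shows why the setting matters: the same instant lands on a different hour and a different weekday once it is viewed in local time.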
Data never leaves your environment
Everything runs locally, CPU first. The synthetic generator and
synth calibrate exist so you can develop and benchmark against a realistic book without
moving raw records.