
Data contract

The four-file parquet contract pragmatiq trains on — exact columns and dtypes, enforced by `pragmatiq validate`.

pragmatiq trains on a small parquet contract: four files with strict dtypes, defined in `pragmatiq/data/schema.py` and checked by `pragmatiq validate`. The synthetic data already conforms; this page describes what you must produce to bring your own data.

Files

| File | Required columns |
| --- | --- |
| events.parquet | user_id (string), ts (timestamp[us]), source (string), fields (map<string,string>) |
| profiles.parquet | user_id (string), as_of (timestamp[us]), attributes (map<string,string>), lifelong (list<struct<key, ts>>) |
| transfers.parquet | optional, for the AML GNN: from_user, to_user (string), ts (timestamp[us]), amount (float64) |
| labels/*.parquet | optional task tables: user_id (string), eval_ts (timestamp[us]), label (int8) |

Everything is a string map

Event fields and profile attributes are map<string,string>. You do not pre-bin numbers or pre-encode categories — the tokenizer infers each field's kind (numeric / categorical / text) from the data. Pass values as strings (e.g. "42.10", "5411").
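If your source records hold native Python values, one possible way to produce the string map is a small conversion step like the one below. The helper name `to_string_map` is hypothetical, and dropping missing values rather than writing the string "None" is an assumption about what you want the tokenizer to see.

```python
def to_string_map(record: dict) -> dict[str, str]:
    """Stringify raw values and skip missing ones (hypothetical helper, not part of pragmatiq)."""
    out = {}
    for key, value in record.items():
        if value is None:
            # Assumption: an absent key is preferable to the literal string "None".
            continue
        out[str(key)] = str(value)
    return out


raw = {"amount": 42.10, "mcc": 5411, "merchant": "TESCO STORES 4521", "note": None}
fields = to_string_map(raw)
```

Note that `str()` on a float drops trailing zeros (`str(42.10)` gives `"42.1"`), so format amounts explicitly if the exact textual form matters to you.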

Events

One row per event. source groups schemas (e.g. transaction, app, transfer); fields holds that event's key/value pairs. Events do not need a fixed schema across sources — the tokenizer learns each key.

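import pyarrow as pa

# Declare the map type explicitly; left to inference, pyarrow turns dicts into a struct.
fields_type = pa.map_(pa.string(), pa.string())

events = pa.table({
    "user_id": ["u1", "u1"],
    # Microseconds since the Unix epoch.
    "ts": pa.array([1_700_000_000_000_000, 1_700_003_600_000_000], type=pa.timestamp("us")),
    "source": ["transaction", "app"],
    # One list of (key, value) pairs per event; every value is a string.
    "fields": pa.array([
        [("amount", "42.10"), ("mcc", "5411"), ("merchant", "TESCO STORES 4521")],
        [("screen", "home"), ("action", "view")],
    ], type=fields_type),
})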

Profiles

One row per user: slowly-changing attributes plus lifelong milestones, each with the ts of its first occurrence (e.g. account opening). as_of is the moment at which the profile snapshot is valid; lifelong recency is measured from it.

Transfers and labels

transfers.parquet is the directed money-movement graph the AML GNN runs over (optional). Label tables hold one row per user with an eval_ts; each user's history is truncated at eval_ts before embedding, so metrics measure forecasts, not hindcasts.

Validate

pragmatiq validate data/mybank

validate checks dtypes, referential integrity, and timestamp sanity, and fails loudly with an actionable message before you spend time tokenizing. Unseen keys or values at encode time map to [UNK] with a logged warning, never a KeyError.
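The [UNK] fallback can be pictured with a toy lookup like this one. It is only a sketch of the behavior described above, not pragmatiq's tokenizer: the `encode` function, the logger name, and the dictionary vocabulary are all hypothetical.

```python
import logging

logger = logging.getLogger("encode_sketch")
UNK = "[UNK]"


def encode(token: str, vocab: dict[str, int]) -> int:
    """Look up a token id, falling back to the [UNK] id instead of raising."""
    if token in vocab:
        return vocab[token]
    # Unseen token: warn and degrade gracefully rather than raise KeyError.
    logger.warning("unseen token %r mapped to %s", token, UNK)
    return vocab[UNK]


vocab = {UNK: 0, "mcc": 1, "amount": 2}
ids = [encode(t, vocab) for t in ["mcc", "loyalty_tier"]]
```

The design point is that new keys appearing in production data degrade the representation slightly instead of breaking the encode step.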
