Python API
The public pragmatiq.api surface — generated from the live docstrings so signatures never drift.
pragmatiq/api.py is the single library surface behind the CLI, the notebooks, and
production callers. Notebooks also get PragmaModel.from_pretrained(run) and
model.embed_records(list_of_dicts) for interactive use without the shard pipeline.
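As a rough illustration of how a production caller might chain the functions listed in this reference, the sketch below runs synthesize, tokenize, pretrain, and embed in sequence. The keyword arguments come from the signatures on this page. The directory layout, the helper name run_small_pipeline, and the idea that a run lives at runs_root/run_name are assumptions made for the example, not documented behaviour. The api module is passed in as a parameter so the sketch can be exercised without the library installed.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def run_small_pipeline(api: Any, root: str | Path, run_name: str = "demo") -> dict[str, Any]:
    """Chain synthesize -> tokenize -> pretrain -> embed through the given api module."""
    root = Path(root)
    # Hypothetical layout chosen for this sketch; pragmatiq does not prescribe it.
    data_dir = root / "synth"
    shard_dir = root / "shards"
    runs_root = root / "runs"

    results: dict[str, Any] = {}
    results["synthesize"] = api.synthesize(out=data_dir, n_users=1000, seed=0)
    # Assumption: tokenize reads the directory that synthesize wrote to.
    results["tokenize"] = api.tokenize(data_dir=data_dir, out=shard_dir)
    results["pretrain"] = api.pretrain(
        shard_dir=shard_dir,
        run_name=run_name,
        model_size="small",
        runs_root=runs_root,
    )
    # Assumption: the trained run can be addressed as <runs_root>/<run_name>.
    results["embed"] = api.embed(shard_dir=shard_dir, run=runs_root / run_name)
    return results
```

With the library installed, this would presumably be called as run_small_pipeline(api, "scratch") after from pragmatiq import api. Each stage returns a dict whose keys are not listed on this page, so the sketch only collects the results without reading them.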
The reference below is generated from the live function signatures and docstrings by
scripts/docs_facts.py (and checked in CI), so it always matches the installed code.
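The contents of scripts/docs_facts.py are not shown on this page. As a minimal sketch of the general technique, the standard-library inspect module can render an entry in the same shape as the ones below (name, first docstring line, signature). The function render_entry and the stand-in embed_demo are hypothetical names used only for illustration. Annotations written as strings are printed in quotes by inspect.signature, which may explain the quoted types in the reference.

```python
import inspect
from typing import Any, Callable


def render_entry(func: Callable[..., Any], prefix: str = "api") -> str:
    """Render 'prefix.name', the first docstring line, and the signature as text."""
    doc = inspect.getdoc(func) or ""
    summary = doc.splitlines()[0] if doc else ""
    signature = f"{func.__name__}{inspect.signature(func)}"
    return "\n".join([f"{prefix}.{func.__name__}", summary, signature])


# Stand-in function for illustration only; it is not part of pragmatiq.
def embed_demo(shard_dir: "str | Path", token_budget: "int" = 16384) -> "dict[str, Any]":
    """Embed every user in a shard directory."""
    return {}


print(render_entry(embed_demo))
```

A CI check in this style could regenerate the text and fail when it differs from the committed page, though whether the project's check works that way is not stated here.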
api.synthesize
Generate a synthetic dataset (Phase 1).
synthesize(config: 'str | Path | dict[str, Any] | None' = None, out: 'str | Path' = 'data/synth', n_users: 'int | None' = None, seed: 'int | None' = None, n_workers: 'int' = 0, write_report: 'bool' = True, **overrides: 'Any') -> 'dict[str, Any]'
api.tokenize
Fit (or load) the tokenizer and write tokenized parquet shards + index.
tokenize(data_dir: 'str | Path', out: 'str | Path', config: 'str | Path | dict[str, Any] | None' = None, tokenizer_dir: 'str | Path | None' = None, max_users: 'int | None' = None, rows_per_shard: 'int' = 4096, n_workers: 'int' = 0) -> 'dict[str, Any]'
api.pretrain
Pretrain a pragmatiq model on tokenized shards (Phase 5).
pretrain(shard_dir: 'str | Path', run_name: 'str', model_size: 'str' = 'small', config: 'str | Path | dict[str, Any] | None' = None, runs_root: 'str | Path' = 'runs', resume: 'str | None' = None, **overrides: 'Any') -> 'dict[str, Any]'
api.finetune
LoRA fine-tune a trained model's adapters + head on a label table.
finetune(shard_dir: 'str | Path', run: 'str | Path', label_path: 'str | Path', config: 'str | Path | dict[str, Any] | None' = None, device: 'str' = 'auto', **overrides: 'Any') -> 'dict[str, Any]'
api.embed
Embed every user in ``shard_dir`` with a trained model.
embed(shard_dir: 'str | Path', run: 'str | Path', out: 'str | Path | None' = None, token_budget: 'int' = 16384, device: 'str' = 'auto') -> 'dict[str, Any]'
api.probe
Probe a trained model on a label table; compares to a raw-count baseline.
probe(shard_dir: 'str | Path', run: 'str | Path', label_path: 'str | Path', device: 'str' = 'auto', token_budget: 'int' = 16384, seed: 'int' = 0, with_baseline: 'bool' = True, probe_model: 'str' = 'gbdt') -> 'dict[str, Any]'
api.uplift
Evaluate communication-campaign uplift on a trained model (Phase 5).
uplift(shard_dir: 'str | Path', run: 'str | Path', label_path: 'str | Path', device: 'str' = 'auto', token_budget: 'int' = 16384, seed: 'int' = 0, learner: 'str' = 't') -> 'dict[str, Any]'
api.gnn
Run the three-way AML GNN ablation (Phase 6).
gnn(shard_dir: 'str | Path', run: 'str | Path', transfers_path: 'str | Path', aml_label_path: 'str | Path', seeds: 'tuple[int, ...]' = (0, 1, 2), device: 'str' = 'auto', epochs: 'int' = 150) -> 'dict[str, Any]'
api.quickstart
End-to-end smoke: synth → tokenize → nano pretrain → probe (Phase 8).
quickstart(out: 'str | Path' = 'runs/quickstart', n_users: 'int' = 50000, seed: 'int' = 0, model_size: 'str' = 'small', max_steps: 'int' = 400, n_workers: 'int' = 0) -> 'dict[str, Any]'
Interactive entry points
Beyond api, the model exposes two notebook-friendly methods:
from pragmatiq.models.pragmatiq import PragmaModel

model = PragmaModel.from_pretrained("runs/demo")  # loads model + tokenizer
emb = model.embed_records([  # plain dicts, no shards needed
    {"user_id": "u1", "events": [
        {"ts": 1_700_000_000_000_000, "source": "transaction",
         "fields": {"amount": "42.10", "merchant": "TESCO"}}],
     "attributes": {"country": "GB"}, "lifelong": []},
])  # -> np.ndarray [n_users, dim]

from_pretrained verifies the checkpoint's tokenizer hash against the run's tokenizer and
refuses to run on a mismatch.
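When the input starts as ordinary Python values rather than ready-made dicts, a small helper can assemble records in the shape used in the example above. The sketch below is not part of pragmatiq: make_record and to_epoch_micros are hypothetical helpers. It assumes that ts is integer microseconds since the Unix epoch (inferred only from the magnitude of the example value), that field values are strings as in the example, and that lifelong can be left empty.

```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any, Iterable

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer microseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = moment - _EPOCH
    # Integer arithmetic avoids float rounding at microsecond precision.
    return delta.days * 86_400_000_000 + delta.seconds * 1_000_000 + delta.microseconds


def make_record(
    user_id: str,
    transactions: Iterable[tuple[datetime, float, str]],
    attributes: dict[str, str] | None = None,
) -> dict[str, Any]:
    """Build one record dict from (timestamp, amount, merchant) tuples."""
    events = [
        {
            "ts": to_epoch_micros(when),  # assumption: microseconds since epoch
            "source": "transaction",
            # Field values are stringified to match the example above.
            "fields": {"amount": f"{amount:.2f}", "merchant": merchant},
        }
        for when, amount, merchant in transactions
    ]
    return {
        "user_id": user_id,
        "events": events,
        "attributes": dict(attributes or {}),
        "lifelong": [],  # left empty; its contents are not described on this page
    }
```

A list of such records could then be passed to model.embed_records in place of the hand-written literal.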