Python API

The public pragmatiq.api surface, generated from the live docstrings so signatures never drift.

pragmatiq/api.py is the single library surface behind the CLI, the notebooks, and production callers. Notebooks also get PragmaModel.from_pretrained(run) and model.embed_records(list_of_dicts) for interactive use without the shard pipeline.

The reference below is generated from the live function signatures and docstrings by scripts/docs_facts.py and checked in CI, so it stays in sync with the code. Every function listed here returns a dict[str, Any].
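As a rough illustration of how a reference like this can be produced from live code, the sketch below renders a signature with the standard library's inspect module and compares it to a documented string. It is not the contents of scripts/docs_facts.py: render_signature, matches_docs, and embed_stub are hypothetical names, and the stub uses string annotations on the assumption that this is why the signatures below show quoted types.

```python
import inspect


def render_signature(func) -> str:
    """Render ``name(params) -> return`` from the live function object."""
    return f"{func.__name__}{inspect.signature(func)}"


def matches_docs(func, documented: str) -> bool:
    """Return True when the documented signature equals the live one."""
    return render_signature(func) == documented.strip()


# Hypothetical stand-in for an API function. The string annotations are never
# evaluated by inspect.signature, so ``Path`` does not need to be imported.
def embed_stub(shard_dir: "str | Path", token_budget: "int" = 16384) -> "dict[str, Any]":
    """Embed every user in ``shard_dir`` (stub, does nothing)."""
    return {}


rendered = render_signature(embed_stub)
print(rendered)

# A CI-style check: the freshly rendered text agrees with itself, while an
# outdated default in the docs is reported as drift.
in_sync = matches_docs(embed_stub, rendered)
stale = matches_docs(embed_stub, rendered.replace("16384", "8192"))
```

A check of this kind fails the build as soon as a default or annotation changes without the page being regenerated.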

api.synthesize

Generate a synthetic dataset (Phase 1).

synthesize(config: 'str | Path | dict[str, Any] | None' = None, out: 'str | Path' = 'data/synth', n_users: 'int | None' = None, seed: 'int | None' = None, n_workers: 'int' = 0, write_report: 'bool' = True, **overrides: 'Any') -> 'dict[str, Any]'

api.tokenize

Fit (or load) the tokenizer and write tokenized parquet shards + index.

tokenize(data_dir: 'str | Path', out: 'str | Path', config: 'str | Path | dict[str, Any] | None' = None, tokenizer_dir: 'str | Path | None' = None, max_users: 'int | None' = None, rows_per_shard: 'int' = 4096, n_workers: 'int' = 0) -> 'dict[str, Any]'

api.pretrain

Pretrain a pragmatiq model on tokenized shards (Phase 5).

pretrain(shard_dir: 'str | Path', run_name: 'str', model_size: 'str' = 'small', config: 'str | Path | dict[str, Any] | None' = None, runs_root: 'str | Path' = 'runs', resume: 'str | None' = None, **overrides: 'Any') -> 'dict[str, Any]'

api.finetune

LoRA fine-tune a trained model's adapters + head on a label table.

finetune(shard_dir: 'str | Path', run: 'str | Path', label_path: 'str | Path', config: 'str | Path | dict[str, Any] | None' = None, device: 'str' = 'auto', **overrides: 'Any') -> 'dict[str, Any]'

api.embed

Embed every user in ``shard_dir`` with a trained model.

embed(shard_dir: 'str | Path', run: 'str | Path', out: 'str | Path | None' = None, token_budget: 'int' = 16384, device: 'str' = 'auto') -> 'dict[str, Any]'

api.probe

Probe a trained model on a label table; compares to a raw-count baseline.

probe(shard_dir: 'str | Path', run: 'str | Path', label_path: 'str | Path', device: 'str' = 'auto', token_budget: 'int' = 16384, seed: 'int' = 0, with_baseline: 'bool' = True, probe_model: 'str' = 'gbdt') -> 'dict[str, Any]'

api.uplift

Evaluate communication-campaign uplift on a trained model (Phase 5).

uplift(shard_dir: 'str | Path', run: 'str | Path', label_path: 'str | Path', device: 'str' = 'auto', token_budget: 'int' = 16384, seed: 'int' = 0, learner: 'str' = 't') -> 'dict[str, Any]'

api.gnn

Run the three-way AML GNN ablation (Phase 6).

gnn(shard_dir: 'str | Path', run: 'str | Path', transfers_path: 'str | Path', aml_label_path: 'str | Path', seeds: 'tuple[int, ...]' = (0, 1, 2), device: 'str' = 'auto', epochs: 'int' = 150) -> 'dict[str, Any]'

api.quickstart

End-to-end smoke: synth → tokenize → nano pretrain → probe (Phase 8).

quickstart(out: 'str | Path' = 'runs/quickstart', n_users: 'int' = 50000, seed: 'int' = 0, model_size: 'str' = 'small', max_steps: 'int' = 400, n_workers: 'int' = 0) -> 'dict[str, Any]'
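The signatures above suggest how the stages chain together by hand. The sketch below is one possible wiring, not the body of quickstart: it uses only keyword arguments that appear in the documented signatures, but the directory layout, the run name "demo", the label file name, and the idea that a run lives at runs_root / run_name are all assumptions. The api object is passed in so the flow can be dry-run with a recording stub before pointing it at the real module.

```python
from pathlib import Path
from typing import Any


def run_small_pipeline(api: Any, root, label_path, n_users: int = 1000, seed: int = 0) -> dict:
    """Chain synthesize -> tokenize -> pretrain -> probe under ``root``."""
    root = Path(root)
    data_dir = root / "synth"      # assumed layout
    shard_dir = root / "shards"    # assumed layout
    runs_root = root / "runs"      # assumed layout
    run_name = "demo"              # hypothetical run name

    results = {}
    results["synthesize"] = api.synthesize(out=data_dir, n_users=n_users, seed=seed)
    results["tokenize"] = api.tokenize(data_dir=data_dir, out=shard_dir)
    results["pretrain"] = api.pretrain(
        shard_dir=shard_dir, run_name=run_name, runs_root=runs_root
    )
    # Assumption: pretrain writes its run to runs_root / run_name.
    results["probe"] = api.probe(
        shard_dir=shard_dir, run=runs_root / run_name, label_path=label_path, seed=seed
    )
    return results


class RecordingStub:
    """Stand-in for the api module that records calls instead of running them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that are not set, i.e. the stage names.
        def call(**kwargs):
            self.calls.append((name, kwargs))
            return {"step": name}

        return call


stub = RecordingStub()
summary = run_small_pipeline(stub, "runs/scratch", "labels.parquet")
print([name for name, _ in stub.calls])
```

With the library installed, the same function would presumably be called as run_small_pipeline(api, ...) after from pragmatiq import api, supplying a real label table in place of the hypothetical labels.parquet.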

Interactive entry points

Beyond the api functions, PragmaModel exposes two notebook-friendly methods, from_pretrained and embed_records:

from pragmatiq.models.pragmatiq import PragmaModel

model = PragmaModel.from_pretrained("runs/demo")     # loads model + tokenizer
emb = model.embed_records([                           # plain dicts, no shards needed
    {"user_id": "u1", "events": [
        {"ts": 1_700_000_000_000_000, "source": "transaction",
         "fields": {"amount": "42.10", "merchant": "TESCO"}}],
     "attributes": {"country": "GB"}, "lifelong": []},
])  # -> np.ndarray [n_users, dim]
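The record layout in the example above can be assembled with a couple of small helpers. This is a sketch under assumptions: make_event and make_record are hypothetical helpers, not part of pragmatiq, and the ts value is treated as integer microseconds since the Unix epoch, which is inferred only from the magnitude of the example timestamp.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer microseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids float rounding.
    return (dt - _EPOCH) // timedelta(microseconds=1)


def make_event(when: datetime, source: str, fields: dict) -> dict:
    """Build one event dict in the shape used by the example above."""
    return {"ts": to_epoch_micros(when), "source": source, "fields": dict(fields)}


def make_record(user_id: str, events, attributes=None, lifelong=None) -> dict:
    """Build one user record with the four keys shown in the example above."""
    return {
        "user_id": user_id,
        "events": list(events),
        "attributes": dict(attributes or {}),
        "lifelong": list(lifelong or []),
    }


record = make_record(
    "u1",
    [make_event(
        datetime(2023, 11, 14, 22, 13, 20, tzinfo=timezone.utc),
        "transaction",
        {"amount": "42.10", "merchant": "TESCO"},
    )],
    attributes={"country": "GB"},
)
print(record["events"][0]["ts"])
```

A list of such records would then be passed to model.embed_records in place of the literal dicts.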

from_pretrained verifies the checkpoint's tokenizer hash against the run's tokenizer and refuses to run on a mismatch.
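The check described above follows a common pattern: fingerprint the tokenizer state, store the fingerprint with the checkpoint, and compare on load. The sketch below shows that pattern in general terms only. How pragmatiq actually hashes its tokenizer is not documented here, so tokenizer_fingerprint, verify_tokenizer, TokenizerMismatchError, and the use of SHA-256 over canonical JSON are all assumptions.

```python
import hashlib
import json


class TokenizerMismatchError(RuntimeError):
    """Raised when a checkpoint was trained with a different tokenizer."""


def tokenizer_fingerprint(vocab: dict) -> str:
    """SHA-256 over a canonical JSON dump, so key order does not matter."""
    payload = json.dumps(vocab, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_tokenizer(checkpoint_hash: str, vocab: dict) -> None:
    """Refuse to continue when the stored and actual fingerprints differ."""
    actual = tokenizer_fingerprint(vocab)
    if actual != checkpoint_hash:
        raise TokenizerMismatchError(
            f"checkpoint expects tokenizer {checkpoint_hash[:12]}, found {actual[:12]}"
        )


# Toy vocabulary standing in for real tokenizer state.
vocab = {"<pad>": 0, "TESCO": 1, "GB": 2}
stored_hash = tokenizer_fingerprint(vocab)

verify_tokenizer(stored_hash, dict(reversed(list(vocab.items()))))  # same content, passes

mismatch_detected = False
try:
    verify_tokenizer(stored_hash, {**vocab, "NEW_TOKEN": 3})
except TokenizerMismatchError:
    mismatch_detected = True
print(mismatch_detected)
```

Failing at load time means embeddings are never silently computed with token ids that the model was not trained on.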