Quickstart
Install pragmatiq and run the full pipeline — generate data, tokenize, pretrain, embed, and probe — in a few minutes on CPU.
Everything here uses synthetic data. By the end you will have a trained model and a credit-risk probe score measured against a raw-count baseline, which is evidence that the embedding carries signal beyond trivial counts.
Requirements
Python 3.11+. A GPU is optional — everything runs on CPU (slower but correct). The commands below use a virtual environment so the install is isolated.
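To confirm that an interpreter meets the stated minimum before creating the environment, a small check like the following may help. This is a minimal sketch; the helper name `meets_minimum` is illustrative and not part of pragmatiq.

```python
import sys

MIN_VERSION = (3, 11)  # minimum stated in Requirements


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True when the (major, minor) of version_info is at least `minimum`.

    `meets_minimum` is a hypothetical helper for illustration only.
    """
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old for pragmatiq"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```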
Install
git clone https://github.com/dynamiq-ai/pragmatiq.git
cd pragmatiq
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

Optional extras pull in focused tooling: .[gnn] for the AML graph, .[serve] for
ONNX/Triton export, .[nemotron] for the frozen text embedder, .[demo] for the
Streamlit app.
Run the whole pipeline
pragmatiq quickstart
# smaller/faster local smoke:
pragmatiq quickstart --n-users 2000 --max-steps 80

The same run from Python:

from pragmatiq import api
result = api.quickstart(n_users=2000, max_steps=80)
print(result["message"])

quickstart runs five stages in order:
- generate synthetic users and event histories,
- fit the key–value–time tokenizer,
- pretrain a small masked-language model,
- embed users,
- probe a gradient-boosting credit-risk classifier against a raw-count baseline.
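To make the last stage concrete, here is what a raw-count baseline feature vector could look like: one count per event type, with no ordering or timing information. This is a sketch under assumptions, not pragmatiq's implementation; the event layout (`type`, `ts`), the vocabulary, and the function name `raw_count_features` are all hypothetical.

```python
from collections import Counter


def raw_count_features(events, vocabulary):
    """Count how often each event type occurs; output order follows `vocabulary`.

    Hypothetical helper: the real baseline features may be defined differently.
    """
    counts = Counter(event["type"] for event in events)
    return [counts.get(name, 0) for name in vocabulary]


# Hypothetical event history for a single user.
history = [
    {"type": "payment", "ts": 1},
    {"type": "login", "ts": 2},
    {"type": "payment", "ts": 5},
]
vocab = ["login", "payment", "transfer"]

features = raw_count_features(history, vocab)
print(features)  # [1, 2, 0]
```

A probe trained on vectors like these sees only how often things happened, so it is a natural floor for judging whether a learned embedding adds anything.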
Read the result
The run prints a one-line summary like:
credit probe AUC 0.71 vs raw-count baseline 0.54 · run: runs/quickstart

The probe head is gradient boosting by default (HistGradientBoostingClassifier),
and the raw-count baseline uses the same classifier, so the gap reflects the
representation, not the model family. Both ROC-AUC and PR-AUC are reported (PR-AUC is
the honest headline on low-prevalence risk tasks).
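The toy example below illustrates why the two metrics can diverge when positives are rare. It is a pure-Python sketch that uses average precision as the PR-AUC estimate and assumes distinct scores; pragmatiq may compute its reported metrics differently. The data is made up: 2 positives among 100 users, ranked 5th and 10th by score.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision-at-rank over the positives (assumes at least one positive)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Made-up data: the user at rank r gets score 100 - r; positives sit at ranks 5 and 10.
ranks = range(1, 101)
scores = [100 - r for r in ranks]
labels = [1 if r in (5, 10) else 0 for r in ranks]

print(f"ROC-AUC {roc_auc(labels, scores):.2f}")            # ROC-AUC 0.94
print(f"PR-AUC  {average_precision(labels, scores):.2f}")  # PR-AUC  0.20
```

With 2% prevalence, a ranking that looks strong by ROC-AUC still puts several negatives ahead of each positive, and PR-AUC makes that visible.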
Why this is a forecast, not a hindcast
When a label table carries an eval_ts, each user's history is truncated at that point
before embedding — for both the probe and the baseline — so metrics never peek at the
outcome window.
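The truncation described above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: the record layout (`user_id`, `ts`, `eval_ts`) and the helper `truncate_history` are hypothetical, and the strict `<` comparison is one possible reading of "truncated at that point".

```python
def truncate_history(events, eval_ts):
    """Keep only events that occurred strictly before eval_ts.

    Hypothetical helper; whether the cutoff is inclusive is an assumption here.
    """
    return [event for event in events if event["ts"] < eval_ts]


# Hypothetical histories and label table.
histories = {
    "u1": [
        {"ts": 10, "type": "payment"},
        {"ts": 20, "type": "login"},
        {"ts": 35, "type": "missed_payment"},  # falls in the outcome window
    ],
    "u2": [
        {"ts": 5, "type": "login"},
        {"ts": 50, "type": "payment"},  # falls in the outcome window
    ],
}
label_table = [
    {"user_id": "u1", "eval_ts": 30, "label": 1},
    {"user_id": "u2", "eval_ts": 40, "label": 0},
]

# The same truncated history would feed both the embedding and the raw-count baseline.
truncated = {
    row["user_id"]: truncate_history(histories[row["user_id"]], row["eval_ts"])
    for row in label_table
}
print({user: [e["ts"] for e in events] for user, events in truncated.items()})
# {'u1': [10, 20], 'u2': [5]}
```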
Where to go next
- How it works: the encoder stack, temporal encoding, and the objective, with what each piece means and why.
- Bring your own data: the parquet data contract and how to point pragmatiq at real records.
The same library surface (pragmatiq/api.py) backs the CLI, notebooks, and production
callers, so anything quickstart does, you can drive stage by stage.