I built an experimental orchestration language for reproducible data science called 'T'
Hey r/datascience,
I've been working on a side project called T (or tlang) for the past year or so, and I've just tagged the v0.51.2 "Sangoku" public beta. The short pitch: it's a small functional DSL for orchestrating polyglot data science pipelines, with Nix as a hard dependency.
What problem it's trying to solve
The "works on my machine" problem for data science is genuinely hard. R and Python projects accumulate dependency drift quietly until something breaks six months later, or on someone else's machine. `uv` is great for Python and `{renv}` helps in R-land, but they don't cross language boundaries cleanly, and they don't pin system dependencies. Most orchestration tools are language-specific and require extra work to go cross-language.
T's thesis is: what if reproducibility were mandatory by design? You can't run a T script without wrapping it in a `pipeline {}` block. Every node in that pipeline runs in its own Nix sandbox. DataFrames move between R, Python, and T via Apache Arrow IPC; models move via PMML. The environment is a Nix flake, so it's bit-for-bit reproducible.
What it looks like
```
p = pipeline {
  -- Native T node
  data = node(command = read_csv("data.csv") |> filter($age > 25))

  -- rn() defines an R node; pyn() a Python node
  model_r = rn(
    -- Python or R code gets wrapped inside a <{}> block
    command = <{ lm(score ~ age, data = data) }>,
    serializer = ^pmml,
    deserializer = ^csv
  )

  -- Back to T for predictions (which could just as well have been
  -- done in another R node)
  predictions = node(
    command = data |> mutate($pred = predict(data, model_r)),
    deserializer = ^pmml
  )
}

build_pipeline(p)
```

The `^pmml`, `^csv`, etc. are first-class serializers from a registry. They handle data interchange contracts between nodes, so the pipeline builder can catch mismatches at build time rather than at runtime.
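The registry idea can be sketched roughly like this (hypothetical Python, names are mine, not T's internals): each serializer declares what kind of value it carries, so the builder can check every producer/consumer edge before running anything.

```python
from dataclasses import dataclass

# Hypothetical sketch of a serializer registry: each format declares
# the interchange type it carries, so a pipeline builder can verify
# contracts between connected nodes at build time.
@dataclass(frozen=True)
class Serializer:
    name: str
    carries: str  # "dataframe" or "model"

REGISTRY = {
    "pmml": Serializer("pmml", "model"),
    "csv": Serializer("csv", "dataframe"),
    "arrow": Serializer("arrow", "dataframe"),
}

def check_edge(producer: str, consumer: str) -> None:
    """Fail at build time if two nodes disagree on what flows between them."""
    p, c = REGISTRY[producer], REGISTRY[consumer]
    if p.carries != c.carries:
        raise TypeError(
            f"producer emits {p.carries} via ^{p.name}, "
            f"but consumer expects {c.carries} via ^{c.name}"
        )

check_edge("arrow", "csv")      # fine: both carry DataFrames
try:
    check_edge("pmml", "csv")   # mismatch caught before the pipeline runs
except TypeError as e:
    print("build error:", e)
```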
What's in the language itself
- Strictly functional: no loops, no mutable state, immutable by default (`:=` to reassign, `rm()` to delete)
- Errors are values, not exceptions. `|>` short-circuits on errors; `?|>` forwards them for recovery
- NSE column syntax (`$col`) inside data verbs, heavily inspired by dplyr
- Arrow-backed DataFrames, native CSV/Parquet/Feather I/O
- A native PMML evaluator so you can train in Python or R and predict in T without a runtime dependency
- A REPL for interactive exploration
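A rough Python analogy of the error model described above (this is not T code; the `Ok`/`Err` names are mine): errors travel through the pipe as ordinary values, the plain pipe short-circuits past them, and the recovering pipe hands them to a handler.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Rough analogy of T's errors-as-values model: Ok/Err wrap results,
# pipe() short-circuits like |>, pipe_recover() forwards like ?|>.
@dataclass
class Ok:
    value: object

@dataclass
class Err:
    message: str

Result = Union[Ok, Err]

def pipe(result: Result, fn: Callable[[object], Result]) -> Result:
    """Like |>: apply fn on success, pass an Err through untouched."""
    return fn(result.value) if isinstance(result, Ok) else result

def pipe_recover(result: Result, handler: Callable[[Err], Result]) -> Result:
    """Like ?|>: hand an Err to a recovery function, pass Ok through."""
    return handler(result) if isinstance(result, Err) else result

def parse_age(raw: str) -> Result:
    return Ok(int(raw)) if raw.isdigit() else Err(f"not a number: {raw!r}")

good = pipe(parse_age("42"), lambda n: Ok(n + 1))    # Ok(43)
bad = pipe(parse_age("oops"), lambda n: Ok(n + 1))   # Err passes through
fixed = pipe_recover(bad, lambda e: Ok(0))           # recovered to Ok(0)
```

The payoff of this style is that a failing node doesn't crash the whole pipeline; downstream nodes can decide whether to propagate or recover.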
What it's missing
- Users ;)
- Julia support (but it's planned)
What I'm looking for
Honest feedback, especially:
- Are there obvious workflow patterns that the pipeline model doesn't support?
- Any rough edges in the installation or getting-started experience?
You can try it with:
```
nix shell github:b-rodrigues/tlang
t init --project my_test_project
```

(Requires Nix with flakes enabled; the Determinate Systems installer is the easiest path if you don't have it.)
Repo: https://github.com/b-rodrigues/tlang
Docs: https://tstats-project.org
Happy to answer questions here!