I built an experimental orchestration language for reproducible data science called 'T'
Hey r/datascience,
I've been working on a side project called T (or tlang) for the past year or so, and I've just tagged the v0.51.2 "Sangoku" public beta. The short pitch: it's a small functional DSL for orchestrating polyglot data science pipelines, with Nix as a hard dependency.
What problem it's trying to solve
The "works on my machine" problem for data science is genuinely hard. R and Python projects accumulate dependency drift quietly until something breaks six months later, or on someone else's machine. `uv` is great for Python and `{renv}` helps in R-land, but they don't cross language boundaries cleanly, and they don't pin system dependencies. Most orchestration tools are language-specific and require extra work to go cross-language.
T's thesis is: what if reproducibility were mandatory by design? You can't run a T script without wrapping it in a `pipeline {}` block. Every node in that pipeline runs in its own Nix sandbox. DataFrames move between R, Python, and T via Apache Arrow IPC; models move via PMML. The environment is a Nix flake, so it's bit-for-bit reproducible.
What it looks like
```
p = pipeline {
  -- Native T node
  data = node(command = read_csv("data.csv") |> filter($age > 25))

  -- rn() defines an R node; pyn() a Python node
  model_r = rn(
    -- Python or R code gets wrapped inside a <{}> block
    command = <{ lm(score ~ age, data = data) }>,
    serializer = ^pmml,
    deserializer = ^csv
  )

  -- Back to T for predictions (which could just as well have been
  -- done in another R node)
  predictions = node(
    command = data |> mutate($pred = predict(data, model_r)),
    deserializer = ^pmml
  )
}

build_pipeline(p)
```

The `^pmml`, `^csv`, etc. are first-class serializers from a registry. They handle data interchange contracts between nodes, so the pipeline builder can catch mismatches at build time rather than at runtime.
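The registry idea can be sketched roughly like this (hypothetical Python, names are mine, not T's internals): each serializer declares what kind of value it carries, so the builder can check every producer/consumer edge before running anything.

```python
from dataclasses import dataclass

# Hypothetical sketch of a serializer registry: each format declares
# the interchange type it carries, so a pipeline builder can verify
# contracts between connected nodes at build time.
@dataclass(frozen=True)
class Serializer:
    name: str
    carries: str  # "dataframe" or "model"

REGISTRY = {
    "pmml": Serializer("pmml", "model"),
    "csv": Serializer("csv", "dataframe"),
    "arrow": Serializer("arrow", "dataframe"),
}

def check_edge(producer: str, consumer: str) -> None:
    """Fail at build time if two nodes disagree on what flows between them."""
    p, c = REGISTRY[producer], REGISTRY[consumer]
    if p.carries != c.carries:
        raise TypeError(
            f"producer emits {p.carries} via ^{p.name}, "
            f"but consumer expects {c.carries} via ^{c.name}"
        )

check_edge("arrow", "csv")      # fine: both carry DataFrames
try:
    check_edge("pmml", "csv")   # mismatch caught before the pipeline runs
except TypeError as e:
    print("build error:", e)
```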
What's in the language itself
- Strictly functional: no loops, no mutable state, immutable by default (`:=` to reassign, `rm()` to delete)
- Errors are values, not exceptions. `|>` short-circuits on errors; `?|>` forwards them for recovery
- NSE column syntax (`$col`) inside data verbs, heavily inspired by dplyr
- Arrow-backed DataFrames, native CSV/Parquet/Feather I/O
- A native PMML evaluator so you can train in Python or R and predict in T without a runtime dependency
- A REPL for interactive exploration
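A rough Python analogy of the error model described above (this is not T code; the `Ok`/`Err` names are mine): errors travel through the pipe as ordinary values, the plain pipe short-circuits past them, and the recovering pipe hands them to a handler.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Rough analogy of T's errors-as-values model: Ok/Err wrap results,
# pipe() short-circuits like |>, pipe_recover() forwards like ?|>.
@dataclass
class Ok:
    value: object

@dataclass
class Err:
    message: str

Result = Union[Ok, Err]

def pipe(result: Result, fn: Callable[[object], Result]) -> Result:
    """Like |>: apply fn on success, pass an Err through untouched."""
    return fn(result.value) if isinstance(result, Ok) else result

def pipe_recover(result: Result, handler: Callable[[Err], Result]) -> Result:
    """Like ?|>: hand an Err to a recovery function, pass Ok through."""
    return handler(result) if isinstance(result, Err) else result

def parse_age(raw: str) -> Result:
    return Ok(int(raw)) if raw.isdigit() else Err(f"not a number: {raw!r}")

good = pipe(parse_age("42"), lambda n: Ok(n + 1))    # Ok(43)
bad = pipe(parse_age("oops"), lambda n: Ok(n + 1))   # Err passes through
fixed = pipe_recover(bad, lambda e: Ok(0))           # recovered to Ok(0)
```

The payoff of this style is that a failing node doesn't crash the whole pipeline; downstream nodes can decide whether to propagate or recover.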
What it's missing
- Users ;)
- Julia support (but it's planned)
What I'm looking for
Honest feedback, especially:
- Are there obvious workflow patterns that the pipeline model doesn't support?
- Any rough edges in the installation or getting-started experience?
You can try it with:
```
nix shell github:b-rodrigues/tlang
t init --project my_test_project
```

(Requires Nix with flakes enabled; the Determinate Systems installer is the easiest path if you don't have it.)
Repo: https://github.com/b-rodrigues/tlang
Docs: https://tstats-project.org
Happy to answer questions here!