Reinforcement learning environments that train your models.
We build the environments where AI models do real work, graded against ground truth as they go. Run the same task and you get the same number, whichever lab is running it.
The corpus · success rate across 18 tests, by area · average 0.74
01 Environments
What every environment is, before the score.
Verifiable tasks
A real engineering task with a checkable outcome.
Dense reward
Every step scored against ground truth.
Real rollouts
Grounded in production work, not invented.
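As an illustration only, dense reward against ground truth could look something like the sketch below. It assumes an environment exposes its state after each step as a plain dict and that the rubric is a list of deterministic checks; the names `Check` and `step_rewards` and the state fields are hypothetical, not part of any published interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Check:
    """One ground-truth check (hypothetical): a name plus a deterministic predicate."""
    name: str
    passed: Callable[[dict], bool]


def step_rewards(states: Sequence[dict], checks: Sequence[Check]) -> List[float]:
    """Return one reward per step: the fraction of checks first satisfied at that step."""
    if not checks:
        raise ValueError("at least one check is required")
    rewards: List[float] = []
    satisfied: set = set()
    for state in states:
        now = {c.name for c in checks if c.passed(state)}
        newly = now - satisfied  # each check is credited once, when first met
        rewards.append(len(newly) / len(checks))
        satisfied |= now
    return rewards


# Illustrative rollout: three steps of a made-up repair task.
checks = [
    Check("tests_green", lambda s: s["tests_passing"] == 3),
    Check("lint_clean", lambda s: s["lint_ok"]),
]
rollout = [
    {"tests_passing": 0, "lint_ok": False},
    {"tests_passing": 3, "lint_ok": False},
    {"tests_passing": 3, "lint_ok": True},
]
print(step_rewards(rollout, checks))  # [0.0, 0.5, 0.5]
```

Because every check is a pure function of the recorded state, replaying the same rollout yields the same rewards wherever it runs.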
02 Method
One process, from a capability to a graded environment. The same five steps every time.
01 Capability
Perceive
Map a capability and its failure modes until the reward is well defined.
02 Rubric
Represent
Formalize it into a task distribution with a verifiable rubric.
03 No contamination
Build
Stand up environments that separate cleanly from eval and resist contamination.
04 Distribution
Scale
Mass-produce variants across the distribution. Early environments become training data.
05 pass@k
Choose
Score pass@k per model. Point the next environment at the tasks the models fail.
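For readers who want the arithmetic behind the last step, here is a minimal sketch. It uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n attempts were sampled and c of them passed. The page does not say which estimator is used, so treat that choice as an assumption; the helper `weakest_tasks` and the task names in the usage note are hypothetical.

```python
from math import comb
from typing import Dict, List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) / comb(n, k) is the chance that k attempts drawn
    # without replacement from the n samples contain no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)


def weakest_tasks(results: Dict[str, Tuple[int, int]], k: int, limit: int = 3) -> List[str]:
    """Rank tasks by ascending pass@k; results maps task name to (n, c)."""
    scored = {task: pass_at_k(n, c, k) for task, (n, c) in results.items()}
    return sorted(scored, key=lambda task: scored[task])[:limit]
```

With made-up results such as `{"refactor": (10, 9), "migrate": (10, 1), "triage": (10, 5)}` and `k=1`, `weakest_tasks(..., limit=2)` returns `["migrate", "triage"]`, the tasks a next environment would target first.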
03 Domains
Where the method is pointed. In priority order, by stakes.
Safety
Alignment and oversight. The first call on everything.
Priority
Defense
High-stakes capability and red-team work.
High-stakes
Science
Bio, pharma, research automation.
Research
Commerce
Agentic work on real company operations. Live today.
Live
04 Why Idler
Grounded, broad, frontier.
A
Grounded
Environments built from real production data, not invented scenarios. Less reward hacking, better transfer.
B
Broad
Coverage across coding, tool use, long-horizon tasks, and error recovery.
C
Frontier
Built for the models clearing the hardest evals, on the work they fail next.
05 Contact
Name the capability your models miss. We build the environment, graded against ground truth.