

RL Environments are Worthless

Jan 29, 2026

The market has it backwards

Over the last 12 months, dozens of "RL environment" companies have emerged to serve frontier labs. The pitch: photo-realistic clones of Slack, DoorDash, and Salesforce - a digital playground where your model can learn.

The public conversation has fixated on environments - this is wrong.

Labs are not buying environments, they are buying tasks.

It would be more accurate to call ourselves "task providers", but I suppose that doesn't make the work sound mysterious and important.

Labs are not buying environments, they want tasks

The environment itself is just a conduit for the tasks it delivers. We often compare online RL to prepping for a standardized test like the SAT.

  1. The benchmark: the real SAT. As a student, you want to ace the real test. Practicing for it teaches you skills, like manipulating algebra, that improve your score on the real test but also generalize to real-world problem solving.

  2. The environment: practice test books like Kaplan. This is the training simulator, and you may want multiple books to ace the exam.

  3. Tasks: the individual questions - algebra, reading comprehension, grammar - that you practice in the Kaplan book.

  4. Verifications: the answer keys and explanations. You need immediate feedback on the correctness of your actions in a task to update what you've learned.

  5. Policy: your test-taking strategies, pattern recognition, and reasoning capabilities that improve after taking enough practice questions and retracing your steps.

The Kaplan book is only as valuable as the quality of its practice questions - and an RL environment is only as useful as the quality of its tasks.
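To make the mapping concrete, here is a minimal sketch of that decomposition in Python. Every name and type here is hypothetical, chosen for illustration - this is not Chakra's actual schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One practice question: a goal plus its answer key."""
    prompt: str                               # what the agent is asked to do
    verify: Callable[["Environment"], bool]   # the verification: inspect end state

@dataclass
class Environment:
    """The practice book: a seeded world that hosts the tasks."""
    seed_state: dict
    tasks: list[Task]

# The policy improves by attempting tasks and receiving each verifier's
# signal as reward; the environment is only the conduit for that loop.
```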

Zooming out, a DoorDash replica by itself means nothing to a lab. It must come packaged with a realistic data distribution and "quality" tasks for RL. Mechanize correctly surmises that cheap RL tasks will waste compute, but what do quality tasks look like?

What are quality RL tasks?

  1. Difficult: They sit at the "edge" of the model's abilities. If your baseline SAT score is 1500, drilling the first or second math question over and over doesn't meaningfully improve your performance. You want to practice the hard questions; similarly, quality tasks should be challenging enough for the model.

    Codifying what "hard enough" means varies by lab, but empirically we find pass@5 below 30% to be a reasonable threshold for what labs expect (a sketch of this screen follows the list).

  2. Realistic: Tasks are in the distribution of what users will ultimately do. If every practice question is about obscure poetry while the real SAT tests prose, you're training on the wrong distribution. Likewise, if you have a DoorDash clone but none of your tasks involve search, discovery, or placing orders, your tasks don't match the realistic distribution of how end users expect to leverage the model.

  3. Reliably Verifiable: The verification mechanisms are not noisy. Imagine if the Kaplan answer key had typos - you'd learn the wrong lessons. Noisy verifications are corrupted answer keys. An aside: why build photo-realistic clones instead of using the real DoorDash? Ignoring the legal implications, the DoorDash experience changes: the assets, stores, and menu items all change. Browserbase summarized this problem well when they reviewed the WebVoyager benchmark and found dozens of tasks that rely on stale data or broken assumptions. How do you verify a task when its assumptions no longer apply? In a real production application, you can't - which is why everyone builds clones.

    Quality tasks are reliably verifiable - you can run them today, tomorrow, or next year and expect the score to stay the same for the same model configuration.

  4. Hard to hack: When in doubt, pick C. On standardized tests there's an adage that if you don't know the answer, choose one in the middle - and observably, test writers do bury answers in the middle. That strategy exploits the test rather than knowledge of the material. If your DoorDash verification just checks "did a confirmation modal appear," the model might learn to click buttons that trigger modals without ever completing a valid order. These are bad tasks because the model doesn't actually learn the skill; it learns to hack the verifier (see the verifier sketch after this list).
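As a sketch of the difficulty screen from point 1: the standard unbiased pass@k estimator (Chen et al., 2021) can be run over n rollouts per task, keeping only tasks where pass@5 falls under the ~30% threshold mentioned above. The rollout count and helper names here are illustrative assumptions, not a prescribed pipeline.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n rollouts of which c succeeded, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def hard_enough(successes: int, rollouts: int = 20,
                k: int = 5, threshold: float = 0.30) -> bool:
    """Keep tasks at the edge of the model's ability: pass@5 < 30%."""
    return pass_at_k(rollouts, successes, k) < threshold

# 1 success in 20 rollouts  -> pass@5 = 0.25: hard enough to keep.
# 2 successes in 20 rollouts -> pass@5 ~= 0.45: screened out as too easy.
```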
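And for points 3 and 4, the gap between a hackable check and a robust one is roughly the gap between verifying a surface signal and verifying end state. Below is a sketch against a hypothetical cloned-DoorDash backend; the table and column names are invented for illustration.

```python
import sqlite3

def weak_verify(ui_events: list[str]) -> bool:
    # Hackable: a policy can learn to pop the modal without ordering anything.
    return "confirmation_modal_shown" in ui_events

def robust_verify(db: sqlite3.Connection, expected_items: set[str]) -> bool:
    # Robust: check the end state the task actually cares about --
    # a placed order exists and contains exactly the requested items.
    order = db.execute(
        "SELECT id FROM orders WHERE status = 'placed' ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if order is None:
        return False
    items = {
        name for (name,) in db.execute(
            "SELECT item_name FROM order_items WHERE order_id = ?", (order[0],)
        )
    }
    return items == expected_items
```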

Ultimately, the environment is the delivery mechanism. The tasks are the product.

We obsess over task design at Chakra Labs.

If you're interested in pushing the edge of model capabilities with frontier lab researchers, please reach out.

Nirmal Krishnan

Chakra Team

Nirmal Krishnan is a co-founder at Chakra Labs. Prior to starting the company, he spent time in data, markets, and early-stage startups. He studied computer science and machine learning at Johns Hopkins, where he earned bachelor's and master's degrees in computational genomics, publishing papers on prostate cancer and induced pluripotent stem cells.
