Ryan Scott Brown

I build cloud-based systems for startups and enterprises. My background in operations gives me a unique focus on writing observable, reliable software and automating maintenance work.

I love learning and teaching about Amazon Web Services, automation tools such as Ansible, and the serverless ecosystem. I most often write code in Python, TypeScript, and Rust.

B.S. Applied Networking and Systems Administration, minor in Software Engineering from Rochester Institute of Technology.

Revisited: Delegating Authority to Capricious Agents

Earlier this year I wrote a post about semi-trusting “agentic” development tools. My workflow has evolved since then, but container-use is still my favorite way to let agents work with minimal supervision and --dangerously-allow-all.

Since August, Opus 4.5 has been released and included in Claude Code. For software development, this felt like a bigger leap than Claude 3 to 4, or Sonnet 4 to Opus 4. The main change I noticed in my own usage was how much more closely 4.5 adhered to specs like OpenAPI and JSON Schema. Tool usage also improved, instruction adherence seems better, and I have switched to using skills/ for everything except container-use and Playwright.

4 Months of container-use

Since I found out about container-use, many more “orchestration” tools have appeared, fighting to be the top of the stack for software development. It reminds me of Benedict Evans’ work on bundling, unbundling, and distribution for smartphone / mobile web applications. The article is from 2014, but the same thing is playing out again for development tools right now. At the start, OpenAI was happy to sell API usage to any tool, and there was a generation of (some now acquired) OpenAI wrappers. Now OpenAI and Anthropic both build their own tools (Codex & Claude Code), and other companies are trying to move up the stack and orchestrate “any” agent. So far, none of the current agent orchestration tools have struck me as an improvement over container-use with separate terminals.

I haven’t personally reached the stage of LLM-assisted development where I’m managing fleets of agents. I still use container-use and work on one feature branch, merging in changes from a few agents that each have their own branch.

Wrangling more than two or three agents makes it hard to review changes as they land and to keep a mental model of what is happening and how to test it. Agents can (usually) get a small feature done with minimal iteration, but my own code review becomes a bottleneck for more complex features.

Context - Human & LLM

Even though LLM context windows are bigger, context rot degrades performance quickly. Human context windows are the same size they’ve always been, and they degrade with multi-tasking. Supervising agents in real time feels productive (look how much I’m typing!), but I would rather do more design up-front and let the LLM spin without me looking over its shoulder, so I can focus on one thing at a time and review the code later.

Fully asynchronous agents that use PRs as their interface to your code seem like the final form for small, scoped tasks. Models have been improving and, in my anecdotal experience, one-shotting more tasks. This seems like a better experience for developers and, increasingly, for non-developers: prompts, specs, and context in; a single artifact out. The “year of reasoning” 2025 has been all about using more compute at test time, and synchronous sessions place natural limits on that.

Developer tools that keep interactive components (Cursor, Windsurf, and Ona, formerly Gitpod) seem like the short-term winners. Supervision and interactive sessions have been the main way I’ve used these tools. AI optimists claim that linear improvements in model performance will yield super-linear coding performance. I don’t think that’s likely, and I’m confident the future of development with LLM tooling will rely heavily on test-time compute and improved harnesses.

Disposable Code & Avoiding Blunders

A big lesson LLMs have reinforced for me is advice from Dan Luu:

I feel like it would be useful for programmers, as a field, to acknowledge that humans are bad at programming.

This is because techniques for improving at things you’re bad at are different from techniques for improving at things you’re good at. -@danluu 2021-09-28

E.g., blunder avoidance is generally high ROI when you’re bad and I’ve gotten a lot of mileage from trying to avoid blunders.

If I look at how other people operate, they often do really sophisticated/complex stuff that’s net ineffective because it increases the rate of blunders. -@danluu 2021-09-28

The main change in my own practice is how much more heavily I lean on testing and automatic remediation tools. For TypeScript, that has meant learning GritQL for Biome to find bad practices and remediate problems.
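
As a made-up but representative example, this is the kind of blunder I end up writing Grit patterns to flag: a dropped promise whose failure vanishes silently. The names are invented; the point is the before/after shape the tooling pushes me toward.

```typescript
// Hypothetical blunder: a "floating" promise whose rejection is silently lost.
async function saveUser(id: string): Promise<void> {
  // pretend this persists the user somewhere
  console.debug("saving user", id);
}

export function onSubmit(id: string): void {
  // Bad practice a rule can flag: the returned promise is ignored,
  // so a failed save never surfaces anywhere.
  // saveUser(id);

  // Remediated form: handle (or at least log) the rejection.
  void saveUser(id).catch((err) => {
    console.error("saveUser failed", err);
  });
}
```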

Between the focus on blunder avoidance and the drastic decrease in the cost of first-draft code, I have had to change my mindset. When I only wrote code manually, I treated it like a precious investment of my time: I used stacked git branches to work on multiple related changes and carefully sent them for human review. With LLMs, when I see blunders I am more apt to throw out the branch and start again with better verification tools.

Using Specs to Maintain Quality

Development in 2026 will change, but humans are poor multi-taskers and hate waiting for things and pressing “approve”. Adversarial LLM agents could compete in controlled ways, using high-level verifiable formats (OpenAPI, JSON Schema, TLA+) and tools to iterate toward the spec. Because of the need for more compute, that will probably look more like fully asynchronous agents than interactive sessions: one or more agents will build increasingly automated specs, tests, and validators while the others build the code.
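
As a rough sketch of what “iterating toward the spec” can look like mechanically, here is a TypeScript fragment that checks a candidate output against a JSON Schema with ajv and turns the failures into feedback an agent could act on. The schema and names are invented for illustration.

```typescript
import Ajv from "ajv";

// Invented response schema; in the workflow above this would come from the
// OpenAPI / JSON Schema artifacts the spec-writing agent owns.
const userSchema = {
  type: "object",
  required: ["id", "email"],
  properties: {
    id: { type: "string" },
    email: { type: "string" },
  },
  additionalProperties: false,
};

const ajv = new Ajv();
const validateUser = ajv.compile(userSchema);

// One step of the loop: validate the implementation's output and return
// the violations as text to feed back into the next iteration.
export function checkAgainstSpec(candidate: unknown): string[] {
  if (validateUser(candidate)) return [];
  return (validateUser.errors ?? []).map(
    (e) => `${e.instancePath || "/"} ${e.message ?? "failed validation"}`
  );
}
```

The point of the sketch is that the spec stays the source of truth: the implementation side only ever sees pass/fail plus the error list.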

Something like AWS’ automated reasoning checks seems like a fruitful path. I’ve dabbled in TLA+ but always hit the same problem: checking the TLA+ proof against the code as implemented. A 2024 paper fusing formal methods and LLMs uses Coq and proposes generating proofs from natural language, then iterating against the theorem prover with a separate LLM thread. My experience with LLM-written tests has been that they produce happy-path cases easily, produce basic error cases with some prompting, and often create tautological test cases. My stance on code generation used to be “LLMs can write the code, or the tests: never both.”
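
To make the tautological failure mode concrete, here is a hypothetical pair of tests using node:test (parsePort is an invented function): the first can pass no matter what the implementation does, while the second pins a behavior the spec actually promises.

```typescript
import test from "node:test";
import assert from "node:assert/strict";

// Invented function under test: it should return a number for "1" through
// "65535" and throw for anything else.
import { parsePort } from "./config";

// Tautological: compares the implementation to itself, so it says nothing
// about whether parsePort is correct.
test("parsePort is consistent (tautological)", () => {
  assert.equal(parsePort("8080"), parsePort("8080"));
});

// Spec-derived: encodes a promise from the spec, including the error path
// a first draft tends to skip.
test("parsePort rejects out-of-range values", () => {
  assert.throws(() => parsePort("70000"));
});
```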

Similar to the paper, I now use LLMs for both tests and code, in separate contexts. Unlike the authors, I haven’t learned Coq (yet). Instead, I start with a spec to build the test suite, then start a separate session with a fresh context window to implement the code, running the tests as it goes. For personal projects I’ve leaned progressively harder on Rust, which grants more guarantees by default. This cuts down on simple tests (is this thing actually a number?), and the dead-code detection helps delete swaths of unused helpers.

Coming in 2026

You can’t do anything lately without hearing about “agentic software development”, so there will obviously be a lot more change over the coming year. For now, I’ll keep my local container-based workflow and focus on verification techniques. 2026 may be the year formal methods become cool.
