TDD for AI-driven infrastructure
28 Jun 2025

Earlier today, Nick Rempel mentioned in a tweet how important it is to set up tight feedback loops for AI agents when you’re working on something.

I could definitely relate to his sentiment. Over the past few weeks I have switched all of my projects to a TDD approach, where I explicitly ask the agent to write a set of tests describing the feature it is about to build (step 1) and then ask it to make sure that all of the tests fail in exactly the way it expects them to fail (step 2).

This essentially sets up the first loop of writing tests: spell out your assumptions, execute them, see them fail, repeat until your model of the world aligns with the reality you’re going to be changing. Once the tests are done, I ask the model to commit them and then start making them pass. That is the second loop.
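To make the first loop concrete, here is a minimal sketch of what step 1 might look like; the module, function and behaviour are made up purely for illustration. The tests are written against code that does not exist yet, and the only acceptable outcome of the first run is the failure you predicted.

```python
# test_greeting.py -- hypothetical step-1 tests, written before any implementation.
# Running `pytest` at this point should fail with an import error (the module
# doesn't exist yet); any *other* failure means my model of the system is wrong.
import pytest

from jarvis.greeting import build_greeting  # intentionally missing for now


def test_greeting_uses_name():
    assert build_greeting("Nick") == "Hello, Nick!"


def test_greeting_rejects_empty_name():
    with pytest.raises(ValueError):
        build_greeting("")
```

Only once the failure mode matches the prediction do the tests get committed, and the second loop of implementing until everything is green begins.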

It is an extremely effective way of keeping the agent from making wildly wrong assumptions about the system that usually lead to poor results.

But that was me working on code in isolated environments. And then today I had a really cool related experience while working on real infrastructure.


I had a problem with some software on a personal server that we had deployed yesterday as part of my secret “Jarvis home assistant” project. Before attempting a fix myself, I decided to see if the agent could figure it out.

Setting up the “test”

First, I told it that the app was not opening in Chrome and gave it the error I was seeing (connection refused). This set up the goal for the agent: find and fix the problem.

Then I asked the agent to quickly create a comprehensive health check script that would verify everything it could imagine breaking within that specific component and display the health information to the console in a clean, easy-to-understand form.

If you have ever worked with an agent on building tests, you know how uncanny they are at coming up with insane edge cases and generally at covering the possibility space. So that’s exactly what happened here: the agent came up with an insane thousand-line script that checked more than a dozen different aspects of the system end-to-end, from DNS resolution to TCP connectivity to SSL cert validity and expiration, remote SSH checks for disk space and memory, firewall checks, Docker open-port verification, errors or warnings in the logs, and so on.
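To give a flavour of it, here is a heavily condensed sketch of what such a script might look like. The real one was far longer; the host name and port below are made up, and I am only showing a handful of the checks (DNS, TCP, SSL expiry, HTTP):

```python
#!/usr/bin/env python3
"""Condensed sketch of an agent-written health check (hypothetical host/port)."""
import datetime
import socket
import ssl
import urllib.request

HOST = "jarvis.example.com"  # hypothetical host name
PORT = 443


def check(name, fn):
    """Run one check and print a single, easy-to-scan OK/FAIL line."""
    try:
        print(f"[ OK ] {name}: {fn()}")
    except Exception as exc:  # failures are reported, never raised
        print(f"[FAIL] {name}: {exc}")


def dns_resolution():
    return socket.gethostbyname(HOST)


def tcp_connect():
    with socket.create_connection((HOST, PORT), timeout=5):
        return f"port {PORT} reachable"


def ssl_certificate():
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after))
    return f"valid, expires {expires:%Y-%m-%d}"


def http_response():
    with urllib.request.urlopen(f"https://{HOST}/", timeout=5) as resp:
        return f"HTTP {resp.status}"


if __name__ == "__main__":
    check("DNS resolution", dns_resolution)
    check("TCP connect", tcp_connect)
    check("SSL certificate", ssl_certificate)
    check("HTTP response", http_response)
```

The real script also SSH’d into the box for the disk, memory, firewall, Docker and log checks; what matters is that every check boils down to a single OK/FAIL line the agent can read back instantly.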

When it ran the script for the first time, it was pretty clear that the service was very broken: almost all the checks came back red.

The tight fixing/implementation loop

The agent quickly noticed that the service was dead (TCP connection failure), saw the error captured from the logs, and immediately knew where to go next. After addressing the configuration issue, it re-deployed the software, noticed other checks failing (an unexpected HTTP redirect), fixed those too… and within two minutes the agent had a working, verified solution, because the script gave it a tight feedback loop for quickly fixing all the issues end to end.

And what’s even cooler is that the agent has now documented that script in its Cursor rules, and I can see it using the script every time it needs to check whether the service is OK. So the script has turned into a permanent safety check the agent can rely on before and after any important change to the system. We didn’t just fix a bug; we permanently upgraded the agent’s ability to manage this system.
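For the curious, a rule like that (stored as a file under .cursor/rules/) reads roughly like the sketch below; the wording and the script path are illustrative, not a copy of the actual rule the agent wrote:

```
---
description: How to verify the Jarvis service is healthy
alwaysApply: false
---

Before and after any change that touches this service, run scripts/health_check.py
and confirm that every check reports OK. If any check fails, fix the underlying
issue and re-run the script before considering the task done.
```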

An important new behaviour

This experience wasn’t just a cool one-off. It immediately created a new default behavior for how I work with the agent. The rule is now simple: before you try to fix something based on incomplete data, you must first write a script that defines what ‘correct’ looks like. Then, you run it to make sure it fails in the way you expect. Only then, with a clear target and a reliable check, do you start making changes and use that script to guide you to the goal.

Seeing the Future

At this stage, I cannot help but feel that this will become a common pattern in how we use AI and, more generally, an important aspect of AI behavior in complex systems going forward.

The agent spends time building a narrow-use tool to gain a new capability (like reliably performing a 20-point check on a remote service) and then uses that tool to get better at its job. And, of course, nothing prevents agents from combining such tools into more and more complex and powerful capabilities that reduce the risk of guessing and codify reliable patterns for operating a complex system.

Exciting times!


If you have any suggestions for improvement or comments on the described approach, let me know! If you are interested in content like this, feel free to join my free Telegram channel where I share my thoughts on AI-related topics and relevant content I find interesting.