TDD for AI-driven infrastructure
28 Jun 2025

Earlier today, Nick Rempel mentioned in a tweet how important it is to set up tight feedback loops for AI agents when you’re working on something.

I could definitely relate to his sentiment. In all of my projects over the past few weeks I have switched to a TDD approach where I explicitly ask the agent to write a set of tests describing the feature it is going to build (step 1) and then ask it to ensure that all of the tests fail in exactly the way it expects them to fail (step 2).

This essentially sets up the first loop of writing tests: spell out your assumptions, execute them, see them fail, repeat until your model of the world aligns with the reality you’re going to be changing. Once the tests are done, I ask the model to commit them and then start making changes until they pass. That is the second loop.

It is an extremely effective way of keeping the agent from making wildly wrong assumptions about the system that usually lead to poor results.

But that was me working on code in isolated environments. And then today I had a really cool related experience while working on real infrastructure.


I had a problem with some software on a personal server that we had deployed yesterday as part of my secret “Jarvis home assistant” project. Before attempting a fix myself, I decided to see if the agent could figure it out.

Setting up the “test”

First, I told it that the app was not opening in Chrome and gave it the error I was seeing (a “connection refused” error). This set up the goal for the agent – find and fix the problem.

Then I asked the agent to quickly create a comprehensive health check script that would verify everything it could imagine breaking within that specific component and display the health information in the console in a clean, easy-to-understand form.

If you have ever worked with an agent on building tests, you know how uncanny they are at coming up with insane edge cases and at covering the possibility space in general. That’s exactly what happened here: the agent came up with a thousand-line script that checked more than a dozen different aspects of the system end-to-end – from DNS resolution to TCP connectivity to SSL certificate validity and expiration, remote SSH checks for disk space and memory, firewall checks, Docker open-port verification, errors or warnings in the logs, and so on.
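
I didn’t keep the agent’s exact script, but a heavily condensed sketch of the idea, in Python and with purely illustrative host names and thresholds, might look something like this:

```python
# Heavily condensed sketch of an agent-style health check (illustrative only).
# The host name, port and expiry threshold below are made-up placeholders.
import socket
import ssl
from datetime import datetime, timezone

HOST = "jarvis.example.com"  # placeholder, not the real server
PORT = 443
MIN_CERT_DAYS = 14

def check(name, fn):
    """Run one check, print a clean OK/FAIL line, never raise."""
    try:
        print(f"[ OK ] {name}: {fn()}")
    except Exception as exc:
        print(f"[FAIL] {name}: {exc}")

def dns_resolution():
    return socket.gethostbyname(HOST)

def tcp_connect():
    with socket.create_connection((HOST, PORT), timeout=5):
        return f"port {PORT} reachable"

def ssl_certificate():
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.create_connection((HOST, PORT), timeout=5),
                         server_hostname=HOST) as s:
        not_after = s.getpeercert()["notAfter"]
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    days_left = (expires - datetime.now(timezone.utc)).days
    if days_left < MIN_CERT_DAYS:
        raise RuntimeError(f"certificate expires in {days_left} days")
    return f"certificate valid for {days_left} more days"

if __name__ == "__main__":
    check("DNS resolution", dns_resolution)
    check("TCP connectivity", tcp_connect)
    check("SSL certificate", ssl_certificate)
    # The real script went much further: SSH checks for disk space and
    # memory, firewall rules, Docker port mappings, scanning logs for
    # errors and warnings, and so on.
```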

When it ran the script for the first time, it was pretty clear that the service was very broken: almost all the checks came back red.

The tight fixing/implementation loop

The agent quickly noticed that the service was dead (TCP connection failure), saw the error captured from the logs and immediately knew where to go next. After addressing the configuration issue, it re-deployed the software, noticed other checks failing (an unexpected HTTP redirect), fixed those… and within two minutes it had a full solution working and verified, because the script gave it a tight feedback loop for quickly fixing all the issues end to end.

And what’s even cooler is that the script is now documented in the Cursor rules, and I can see the agent running it every time it needs to check whether the service is OK. The script has turned into a permanent safety check the agent can rely on before and after any important change in the system. We didn’t just fix a bug; we permanently upgraded the agent’s ability to manage this system.

An important new behavior

This experience wasn’t just a cool one-off. It immediately created a new default behavior for how I work with the agent. The rule is now simple: before you try to fix something based on incomplete data, you must first write a script that defines what ‘correct’ looks like. Then, you run it to make sure it fails in the way you expect. Only then, with a clear target and a reliable check, do you start making changes and use that script to guide you to the goal.

Seeing the Future

At this stage, I cannot help but feel that this will become a pattern for our AI use and, in general, an important aspect of AI behavior in complex systems going forward.

The agent spends time building a narrow-use tool to gain a new capability (like reliably performing a 20-point check on a remote service) and then uses that tool to get better at its job. And, of course, nothing prevents it from combining such tools into ever more complex and powerful capabilities that reduce the risk of guessing and codify reliable patterns for operating a complex system.

Exciting times!


If you have any suggestions for improvement or comments on the described approach, let me know! If you are interested in content like this, feel free to join my free Telegram channel where I share my thoughts on AI-related topics and relevant content I find interesting.


Align, Plan, Ship: From Ideas to Iterations with PRD-Driven AI Agents
20 Jun 2025

After my last post on how the PRD → Tasklist → Work process I’ve been using for the past month has let me context switch effectively between a dozen different projects, I received a number of requests for the details of the process. To be honest, I’ve gotten so used to it over these weeks of daily use that I did not realize how new it was for me and how many people may never have heard about it. It also does not help that I don’t have a good name for the process, which makes it harder for people to find (if you have a good name for it, please let me know).

Sorry about that. Today I will try to describe the process and how I have been using it. But first, a story.


As I mentioned in my post on the history of my relationship with AI agents, I originally started with Tab completions in Cursor and eventually ended up with a more and more sophisticated setup for my daily coding. The most recent dramatic shift in my approach happened about a month ago, at the end of May 2025. Several developers told me that they were using a structured approach to prompting their AI agents that leads to much more reliable results. I did not immediately follow up on those ideas, but I started trying to center my conversations with the agent around a single large file where I would ask the model to keep the context of the project at hand, work with the model to create a rough plan, mark things as completed, and so on.

It worked and definitely helped keep the agent on track a lot better.

Then I stumbled upon a YouTube video from the podcast How To AI in which Ryan Carson explained a very structured, multi-step approach to using AI agents: use the amazing planning capabilities of powerful LLMs (like o3, Gemini 2.5, Opus 4) to create a detailed plan that LLM agents can then execute much more reliably, and systematically use the agents to keep track of progress along the way.

As I was just ramping up yet another hobby project, I decided to use the process there to plan a pretty complicated feature that would have taken me at least a few weeks of weekend coding to get to any reasonable shape.

And holy shit! That weekend evening for sure was my “feel the AGI” moment as the future suddenly felt a lot closer.

The model asked a ton of extremely insightful questions that made me think deeply about many aspects of the project that I would otherwise never have considered, or would only have discovered late in the project, leading to costly fixes, rewrites or having to live with bad decisions.

After about an hour with that process I ended up with an artifact containing insane amounts of very dense context about the project including a clear and detailed plan of action from start to completion of the feature I wanted to build. If I were responsible for creating that plan, I probably would not have done a better job.

Since then, the process has been truly transformative for how I view my interaction with AI agents and how I approach any even remotely non-trivial work. I have used it on a dozen different projects at home and at work, slightly improved a bunch of its aspects, and I don’t see myself ever going back to the previous life of naive attempts at one-shotting a solution or “vibe coding” my way to a completed feature.

Below is my attempt at defining the process as of June 19, 2025. I am fairly certain it will improve and change over time, but it may act as an example for anybody who wants to attempt it for their work.

Note: If you want to see what the end results look like, I’ve uploaded a few example PRDs and task lists to GitHub that show the actual artifacts this process generates for real projects.


Process Overview

The goal of this process is to help models deal with their limited context window (the amount of text they can “remember” in a single conversation) and to work around the unpredictability of trying to prompt your way to a working application with an unguided LLM agent.

There are three pillars to the process:

  • PRD (aka Product Requirements Document) – a document containing as much detail as possible about the problem at hand. It should give the reader a clear understanding of why we are working on the problem, what the problem is, what kind of solution we are hoping to get at the end, the success criteria for the project, etc.
  • Task list – a separate document containing a very detailed, multi-level plan for implementing the PRD. It also acts as the “persistent storage” where the agent keeps track of low-level implementation details of the solution (which files we touched, any unexpected findings from the implementation so far, links to useful documentation or other sources of context, etc.).
  • A step-by-step process of executing small sub-sections of the task list (often down to a single item) that always starts with a clean agent that knows about the PRD and the task list and is required to focus on a single simple step. This helps ground the agent and significantly reduces the scope of what the agent is required to understand.

PRD Creation Process

For any new project/feature/problem, anything non-trivial that may take me more than a couple of hours of work, I go through the following set of steps that rely on a set of Cursor rules I have added to the system and can reference in my chat.

Initial prompt

I use a special “Create a PRD” prompt to generate my PRD by opening a new chat, referencing the prompt by name and then asking the agent to create a PRD for me.

Note: I always use the biggest/smartest model I have access to (Gemini 2.5, o3, Opus 4, always in MAX mode), this is one step where there is absolutely no reason to try to save on tokens.

I often spend up to 20-30 minutes talking into my microphone with MacWhisper to brain-dump every single piece of context I have: the reasoning for the project, the context around it, my preferred technical details of the solution, and any links to relevant pieces of context (docs, project-related Cursor rules, references to source code, online URLs for articles, etc.).

The more context I give the model at this step, the smoother everything goes later on.

Follow-up Questions and Final Results Tuning

After submitting the “Create a PRD” prompt, the model comes back with up to a dozen clarifying questions, which I copy into a file and answer one by one by talking into the microphone for a while. There is no structure to it, just a bunch of thoughts (including “I don’t know, you make the call”). I always try to answer as much as possible, often including links to more resources I feel may be useful for the model.

Then I respond to the model with something like “here are my answers @answers.md“. At this point the model will think for a while and come back with a detailed PRD document for the project. I often do not accept the first draft right away and instead carefully work through it with the model to improve or clarify it.
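
The exact shape varies from project to project, but to make it concrete, the PRDs I end up with typically cover sections roughly like the hypothetical outline below (the section names are my own illustration, not a fixed template):

```markdown
# PRD: <feature name>

## Background / Why
The reasoning behind the project and the context around it.

## Problem Statement
What exactly is broken or missing today.

## Proposed Solution
The kind of solution we are hoping to get at the end,
including the preferred technical direction and constraints.

## Success Criteria
A list of observable outcomes that tell us the project is done.

## References
Links to docs, related Cursor rules, source code, relevant articles.
```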

Task List Generation

After I have a PRD, I analyze it to generate a detailed list of steps that will take the project to completion. This step is much more iterative in nature because many implementation details depend on the specifics of your particular project and the quality of the context captured in the PRD document.

First, I start a new chat in Cursor (with a big model again!), reference the PRD file the model generated in the previous step and the “Generate Task List” prompt I have stored as a separate Cursor rule.

The model will generate a new file with a short description of the problem and a set of top-level tasks needed to execute the project to completion. I carefully review and manually edit the list until I believe it completely covers all the things I want the model to do (top-level steps/phases only, not too specific). This usually takes ~5-10 minutes.

After I am happy with the task list, I tell the model to continue to the problem-breakdown phase, where it takes the list and generates a very detailed step-by-step plan for executing the project. The model is explicitly guided to keep the tasks at a level where each one can be executed by an AI agent operating at the level of a junior engineer.

At the end of the process I end up with a detailed task list that I review and commit into my repository.
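
For reference, the resulting file (something like tasks-prd-my-feature.md) tends to look roughly like the sketch below; the file names and tasks here are invented for illustration, and the real artifacts are in the GitHub examples mentioned above:

```markdown
# Tasks for prd-my-feature.md

## Relevant Files
- app/services/importer.rb – entry point for the new import flow
- spec/services/importer_spec.rb – tests covering the importer

## Notes / Learnings
- Anything surprising discovered during implementation gets captured here.

## Tasks
- [ ] 1.0 Set up the data model
  - [x] 1.1 Add the new table and migration
  - [ ] 1.2 Add model validations and associations
- [ ] 2.0 Implement the import service
  - [ ] 2.1 Parse and validate the input file
  - [ ] 2.2 Handle error reporting
- [ ] 3.0 Wire the feature into the UI
```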

Step-by-Step Tasklist Execution

From this point forward, I operate in a loop:

  • Ensure the git status is clean for the project – I want to be able to reset to this point at will or ask the model to look at the diff from the last known stable state.
  • I open a new chat and reference the “Process Task List” prompt stored as a Cursor rule. Then I either ask the model to execute a specific portion of the Task List or just tell it to do the next item on the list.
  • From this point forward, all work is focused on executing the selected scope of work and verifying it works. It could take up to an hour to finish the item with me guiding the agent through problem completion, but in the majority of cases it produces a working solution on the first attempt (given a good set of Cursor rules for the project).
  • After the work is done, the model marks the item as complete in the task list and we commit the results.

Adapting the Plan During Execution

Sometimes when I go through the process of executing tasks, I may notice that the task uncovers some piece of context that I myself was missing, or I remember or realize a detail that we want to add to the project. For those cases, I would just ask the model to adjust the plan in the task list and generate new sections as needed describing additional steps we want to take.

And it goes the other way as well. Sometimes I notice that a specific step I thought was important has become irrelevant as the definition of the problem changes during development. Then I just abandon those parts of the task list and remove them from the file. The task list is a living document, not a rigid contract.

Important: If I had to intervene during the execution of a given task, I always follow up with the following prompt:

I am going to close this chat session soon and you will lose all memory of this conversation. Please reflect on your progress so far and update the task list document (@tasks-prd-my-feature.md) with any details that would be helpful for you to perform the next steps in our plan more effectively. Anything that surprised you, anything that prevented your solution from working and required debugging or troubleshooting – include it all. Do not go into specifics of the current task, no need for a progress report, focus on distilling your experience into a set of general learnings for the future.

Continuous Improvement

As I work through the task list, the document fills with all kinds of useful details that make the work on the project easier for the agent. When I notice something that would be helpful for all agents to know when working on this repository, I ask the model to update the Cursor rules with the relevant information. Similarly, if I find myself adjusting the task list too much along the way and would prefer the agent to do something differently next time, I ask the model to update the docs used to generate the PRD and the task list. This is the constant agent training and improvement I mentioned in my previous post, and the PRD-based process makes it much easier to execute by providing space for reflection at the end of each step and at the end of each feature project.

When NOT to Use This Process

The process works best when the task is complex enough that it would take an agent at least a few hours to complete. If you’re dealing with something that can be knocked out in 30 minutes of coding, the overhead of creating a PRD and task list is probably overkill – just go ahead and build it.

There’s also a top limit to its usefulness. If the feature you want to build or the problem you want to solve is extremely complicated and would require months of work, the model will probably not be able to plan it out effectively in one shot. The context window limitations and the sheer complexity of long-term planning make it nearly impossible for even the best models to create a coherent multi-month plan that won’t fall apart when it hits reality.

For those cases, I would create a top-level PRD, split it into a set of build stages, and then create a separate PRD for each stage and go through the whole process per stage. Think of it as applying the same approach recursively – break the massive problem down into smaller, more manageable chunks that the model can actually handle.

The cutoff for the top limit is currently unclear to me, but I have successfully used the process on tasks that take me a couple of weeks of full-time work to finish. Beyond that, I start to see the quality of the planning degrade significantly, and the task lists become either too vague to be useful or so detailed that they become brittle and break as soon as you start implementing.

A Note on Task-Automation Tools

There’s been a wave of tools lately that promise to handle AI task planning and execution automatically – things like Task Master, which has become one of the more popular examples. These tools typically rely on CLI workflows or MCP servers to generate and process task lists end-to-end.

I tried using some of them.

In my experience, they look great in demos and work okay for isolated projects where you don’t care too much about the implementation details—basically “vibe coding” with LLMs on steroids. But when I tried applying them to real projects with rich context (and lots of expectations around structure and quality), they fell short.

The models running these tools didn’t have access to my Cursor rules, project-specific docs, or even a shared understanding of past design decisions. As a result, they’d often hallucinate steps based on their own assumptions rather than actual requirements. Editing or course-correcting those hallucinations ended up being more work than just writing the plan myself.

Also – and maybe this is just me – but remembering the right CLI incantations for single-task execution, in just the right format, was more cognitive load than simply editing a markdown file.

So while those tools are impressive technically, I’ve found a manual PRD + task list process to be much more reliable and controllable, especially when I actually care about what gets built and how.


If you have any suggestions for improvement or comments on the described approach, let me know! If you are interested in content like this, feel free to join my free Telegram channel where I share my thoughts on AI-related topics and relevant content I find interesting.


Context Switching with AI: The PRD → Tasklist → Work Loop
18 Jun 2025

Another observation from using AI agents for multiple projects – at work and for a bunch of pet projects at home:

The PRD → Tasklist → Work loop makes context switching surprisingly painless. I can drop a project for days, sometimes weeks, and then come back and pick it up almost instantly.

Why? Because the PRD and tasklist hold all the context the model needs to keep working – and when I read them, they page that same context back into my brain too. It’s like shared memory between me and the AI.

If I ever come back to something and parts of it don’t make sense to me, that’s a red flag. The model wouldn’t get it either. So I work with it to figure things out and update the docs with what we learned.

Over time, this creates a really effective flywheel:

  • I clarify something
  • The model gets smarter
  • The docs get better
  • I ramp up faster next time

It’s a simple loop. But it works stupidly well.


If you are interested in content like this, feel free to join my free Telegram channel where I share my thoughts on AI-related topics and relevant content I find interesting.


From Autocomplete to Apprentice: Training AI to Work in Our Codebase
5 Jun 2025

I’ve spent the last year turning large‑language‑model agents into productive teammates. Around 80% of the code I ship nowadays is written by an LLM, yet it still reflects the unspoken project rules, patterns, and habits our team has baked into the environment. This post documents the recipe that took me there.


From Tab Autocomplete to Autonomous Agents

It started with a curiosity.

July 2024 – I installed Cursor and quickly fell in love with the autocomplete. After a year on GitHub Copilot, the Tab‑Tab‑Tab feature felt like a step change. It was like having a sharp intern finish your thoughts. The suggestions were fast, helpful, and mostly correct—as long as you gave it enough local context to make an educated guess.

Early 2025 – I gave Chat a try on a personal project. For the first time, I saw what it could do across a whole file. I had to tightly manage the context and double‑check everything, but something clicked. It wasn’t just useful—it was promising, and I felt in control of the results produced by the model.

March 2025 – I turned on Agent Mode, and my process fell apart at once. The Sonnet 3.7 model would charge ahead, cheerfully rewriting parts of the system it barely understood and hallucinating non‑existent APIs. It was chaos—the kind that feels overwhelming at first, but is also oddly instructive if you pay attention. Debugging became a game of whack‑a‑mole. Some days, I spent more time undoing changes than moving forward. But under the mess, I saw potential. I started to understand the reasons why the model failed—and it all boiled down to context.

April 2025 to now – That’s when I discovered Cursor Rules. One by one, I started adding bits of project‑specific context to the system: naming conventions, testing quirks, deployment rituals. And just like that, the agent stopped acting like a rogue junior developer with full access and no supervision. It started to feel like a teammate with some tenure and reasonable understanding of the system, capable of implementing large, complex changes end‑to‑end without much involvement from my side.


Why Cursor Rules Matter

LLMs arrive pre‑trained on the internet. They don’t know your domain language, naming conventions, or deployment rituals. Cursor rules are small Markdown files that pin that tribal knowledge right next to the code. Add them, and the agent’s context window is always seeded with the right cues, ensuring your LLM partner starts each task aligned with your preferences.


My Six‑Step Onboarding Recipe

Step 1 – Start With an Empty Rulebook

When our team onboards an agent into our application, we skip the generic rulepacks. Every mature codebase bends the rules somewhere, and starting with an existing ruleset cements someone else’s preferences into your agent’s behavior, making it harder to steer.

Step 2 – Dump Context Into Chat

Open a fresh chat with a capable model (o3, Gemini 2.5, Anthropic Opus). Brain‑dump everything you know:

  • project purpose
  • domain terminology
  • architectural quirks
  • links to docs (@docs/…), READMEs, dashboards

Keep talking until you run dry; don’t worry about structure. I’ve spent over an hour at times, speaking nonstop into MacWhisper.

Step 3 – Generate the 000‑project‑info Rule

Ask the model to condense that chat into .cursor/rules/000-project-info.mdc and mark it Always. I use the numeric prefix so @0 autocompletes it later.
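
To give a rough idea of the result, the rule file ends up looking something like the sketch below (the frontmatter fields and contents are illustrative; check Cursor’s documentation for the exact .mdc format):

```markdown
---
description: Core project context – purpose, domain terms, architecture
alwaysApply: true
---

# Project info
- Purpose: what the application does and who uses it.
- Domain terminology: definitions of the project-specific vocabulary.
- Architectural quirks: the places where the system deviates from the
  obvious conventions, and why.
- Key references: links to docs (@docs/…), READMEs, dashboards.
```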

Step 4 – Keep a Living Knowledge Base

If you’re still onboarding yourself into the project, this is where AI shines. Ask it every question you can think of: what does this part do, how are things usually named, why is this structured that way? Every time you discover something new together that feels like it might help an agent make better decisions, capture it. Either update your main project info file or create a new rule file for it.

Here are some rules I have created in most of my projects:

  • 001-tech-guidelines.mdc – languages, frameworks, linters, dependency conventions.
  • 002-testing-guidelines.mdc – how to run all tests, a single file, or one example; test types; preferred TDD style (a sketch of this one appears below).
  • 003-data-model.mdc (Agent‑requested) – list of models, relationships, invariants (generated by having the model parse schema.rb and the app/models folder).

Mark the first two Always, the rest Agent‑requested so they load on demand. Some other things I find useful to include (in agent-requested mode):

  • Show the agent a page of API docs for an obscure dependency and ask it to generate a rule explaining how to use that API.
  • For any unusual pattern within the codebase – an internal abstraction layer for a database or an external service, an internal library, etc. – explain to the agent why the abstraction exists and how it is used, point it at the important pieces of relevant code (both implementation and usage), and then ask for a rule guiding an AI model in using that piece of technology.
  • Internal tooling: explain all the tools the agent has available to do its job and when and how to use them. Think linters, code-quality and coverage controls, different types of tests, and other ways to get feedback on the quality of the AI’s solution.
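
To make the 002-testing-guidelines rule mentioned above concrete, here is a hypothetical sketch, assuming an RSpec-based Rails project; the commands, file names and preferences are illustrative, not a recommendation:

```markdown
---
description: How to run and write tests in this project
alwaysApply: true
---

# Testing guidelines
- Run the full suite: `bundle exec rspec`
- Run a single file: `bundle exec rspec spec/models/user_spec.rb`
- Run one example: `bundle exec rspec spec/models/user_spec.rb:42`
- Test types: unit specs for models and services, request specs for APIs,
  system specs only for critical user flows.
- Preferred style: write the failing spec first, then implement.
```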

Step 5 – Let the Agent Struggle (Then Capture the Lesson)

Pick a trivial task you already know how to implement. Let the agent attempt it. When it stumbles, nudge it forward – not by coding, but by asking questions and pointing to clues. After the fix ships, ask:

Look back at the history of our conversation and see if you notice any patterns that would be helpful for an AI coding agent to know next time we work on something similar. Update your existing cursor rules or create new ones to persist your findings.

Then review and commit the changes. This will help the model get better at solving problems similar to what you have just done.

Step 6 – Rinse, Repeat, Refine

Repeat this for a few weeks and the model will start making fewer obvious mistakes, build things in ways that match your expectations, and often pre‑empt your next move before you’ve even typed it out.


One surprising effect of this process is how much of my tacit knowledge I’ve had to put into words: decades of habits, intuition, and project‑specific judgment calls now live in Markdown files. That knowledge doesn’t just help my agent; it lifts the whole team. As we all work with our agents, we start seeing them act on rules someone else introduced, surfacing insights and patterns we hadn’t shared before. It’s low‑friction knowledge transfer, and it works.


The Apprenticeship Model

At some point while documenting this process, I realized what it resembled: an apprenticeship. You’re bringing on a new team member, and instead of throwing manuals at them, you teach by pairing on real tasks. You guide, correct, explain. The model’s pre-training is its education, sure — but adapting it to your environment, your tools, your expectations — that part is still on us. That’s the job of a mentor, and that’s how I see this work now. Our job is changing and we may all eventually become PMs managing teams of AI agents, but today we need to be mentors first.

If you would like to hear more about my adventures with modern AI systems, feel free to join my Telegram channel where I try to share more of my experiences along with interesting content I discover daily.