Align, Plan, Ship: From Ideas to Iterations with PRD-Driven AI Agents
20 Jun 2025

After my last post on how the PRD → Tasklist → Work process I’ve been using for the past month has been instrumental in letting me effectively context switch between a dozen different projects, I received a number of requests for details about the process. To be honest, I’ve become so used to it over these weeks of daily use that I did not realize how new it was for me and how many people may never have heard of it. The fact that I don’t have a good name for the process did not make it any easier for people to find, either (if you have a good name for it, please let me know).

Sorry about that. Today I will try to describe the process and how I have been using it. But first, a story.


As I mentioned in my post on the history of my relationship with AI agents, I originally started with Tab completions in Cursor and eventually ended up with a more and more sophisticated setup for my daily coding. The most recent dramatic shift in my approach happened about a month ago, at the end of May 2025. Several developers told me that they were using a structured approach to prompting their AI agents that led to much more reliable results. I did not immediately follow up on those ideas, but I did start trying to center my conversations with the agent around a single large file where I would ask the model to keep the context of the project at hand, work with the model to create a rough plan, mark things as completed, and so on.

It worked and definitely helped keep the agent on track a lot better.

Then I stumbled upon a YouTube video from the podcast How To AI in which Ryan Carson explained a very structured, multi-step approach to using AI agents: use the impressive planning capabilities of powerful LLMs (like o3, Gemini 2.5, Opus 4) to create a detailed plan that LLM agents can execute much more reliably, then systematically use AI agents to keep track of the progress, and so on.

As I was just ramping up yet another hobby project, I decided to use the process there to plan a pretty complicated feature that would have taken me at least a few weeks of weekend coding to get into any reasonable shape.

And holy shit! That weekend evening for sure was my “feel the AGI” moment as the future suddenly felt a lot closer.

The model asked a ton of extremely insightful questions that made me think deeply about many aspects of the project that I would otherwise never have considered, or would only have discovered late in the project, leading to costly fixes, rewrites, or having to live with bad decisions.

After about an hour with that process, I ended up with an artifact containing an insane amount of very dense context about the project, including a clear and detailed plan of action from start to completion for the feature I wanted to build. If I had been responsible for creating that plan myself, I probably would not have done a better job.

Since then, the process has been truly transformative for how I view my interaction with AI agents and how I approach any even remotely non-trivial piece of work. I have used the process on a dozen different projects at home and at work, slightly improved a bunch of aspects of it, and I don’t see myself ever going back to the previous life of naive attempts at one-shotting a solution or “vibe coding” my way to a completed feature.

Below is my attempt at defining the process as of June 19, 2025. I am fairly certain it will improve and change over time, but it may act as an example for anybody who wants to attempt it for their work.

Note: If you want to see what the end results look like, I’ve uploaded a few example PRDs and task lists to GitHub that show the actual artifacts this process generates for real projects.


Process Overview

The goal of this process is to help models deal with their limited context window (the amount of text they can “remember” in a single conversation) and to work around the unpredictable nature of trying to prompt your way to a working application with an unguided LLM agent.

There are three pillars to the process:

  • PRD (aka Product Requirements Document) – a document containing as much detail as possible about the problem at hand. It should give the reader a clear understanding of why we are working on the problem, what the problem is, what kind of solution we are hoping to get at the end, a list of success criteria for the project, etc.
  • Task list – a separate document containing a very detailed, multi-level plan for implementing the PRD. It also serves as the agent’s “persistent storage”, where it keeps track of low-level implementation details of the solution: which files we touched, any unexpected findings from the implementation so far, links to useful documentation or other sources of context, etc. (A sketch of where these files might live in a repository follows this list.)
  • A step-by-step process of executing small sub-sections of the task list (often down to a single item) that always starts with a clean agent that knows about the PRD and the task list and is required to focus on a single simple step. This helps ground the agent and significantly reduces the scope of what the agent is required to understand.
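
To make this concrete, here is one possible way these artifacts can be laid out in a repository. The paths and file names are purely illustrative (mine vary between projects); the process itself does not prescribe any particular layout:

```
repo/
├── .cursor/
│   └── rules/                    # reusable prompts: create-prd, generate-tasks, process-task-list
├── tasks/
│   ├── prd-my-feature.md         # the PRD for a single feature
│   └── tasks-prd-my-feature.md   # the task list / working memory for that PRD
└── ...                           # the actual code the agent works on
```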

PRD Creation Process

For any new project, feature, or problem (anything non-trivial that may take me more than a couple of hours of work), I go through the following set of steps, which rely on a set of Cursor rules I have added to the system and can reference in my chats.

Initial Prompt

To generate a PRD, I use a special “Create a PRD” prompt: I open a new chat, reference the prompt by name, and ask the agent to create a PRD for me.

Note: I always use the biggest/smartest model I have access to (Gemini 2.5, o3, Opus 4, always in MAX mode); this is one step where there is absolutely no reason to try to save on tokens.

I often spend up to 20-30 minutes talking into my microphone with MacWhisper to brain-dump every single piece of context I have: the reasoning for the project, the context around it, my preferred technical details of the solution, and any links to relevant pieces of context (docs, project-related Cursor rules, references to source code, URLs of relevant articles, etc.).

The more context I give the model at this step, the smoother everything goes later on.
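
The “Create a PRD” prompt itself is nothing exotic: it is a reusable instruction stored as a Cursor rule. Here is an abridged, illustrative sketch of the general shape of such a rule (not a verbatim copy of mine; the frontmatter fields follow Cursor’s .mdc rule format):

```
---
description: Create a PRD from a feature brain-dump
alwaysApply: false
---
When asked to create a PRD:
1. Read the user's description of the feature and everything it references.
2. Do NOT write the PRD yet. First ask clarifying questions about goals, users,
   constraints, scope and success criteria.
3. Once the questions are answered, write a detailed PRD (background, problem,
   proposed solution, success criteria, open questions) into a new markdown file
   and ask the user to review it.
```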

Follow-up Questions and Final Results Tuning

After submitting the “Create a PRD” prompt, the model comes back with up to a dozen clarifying questions, which I copy into a file and answer one by one by talking into the microphone for a while. There is no structure to it, just a bunch of thoughts (including “I don’t know, you make the call”). I always try to answer as much as possible, often including links to more resources I feel may be useful for the model.

Then I respond to the model with something like “here are my answers @answers.md“. At this point the model will think for a while and come back with a detailed PRD document for the project. I often do not accept the first draft right away and instead carefully work through it with the model to improve or clarify it.
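
The final document differs from project to project, but its overall shape usually looks something like the outline below (the section names are just an illustration, not a fixed template):

```
# PRD: <feature name>

## Background / Why
Why we are working on this, relevant context and links.

## Problem Statement
What exactly we are trying to solve.

## Proposed Solution
The shape of the solution we want to end up with, the preferred technical
approach, constraints.

## Success Criteria
How we will know the feature is done and working.

## Open Questions
Anything still undecided, to be resolved during implementation.
```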

Task List Generation

After I have a PRD, I analyze it to generate a detailed list of steps that will lead the project to completion. This stage is a lot more iterative in nature, because the implementation plan depends heavily on the specifics of your particular project and the quality of the context captured in the PRD document.

First, I start a new chat in Cursor (with a big model again!), reference the PRD file the model generated in the previous step and the “Generate Task List” prompt I have stored as a separate Cursor rule.

The model will generate a new file with a short description of the problem and a set of top-level tasks needed to execute the project to completion. I carefully review and manually edit the list until I believe it completely covers all the things I want the model to do (top-level steps/phases only, not too specific). This usually takes ~5-10 minutes.

After I am happy with the task list, I tell the model to continue to the problem-breakdown phase, where it takes the list and generates a very detailed step-by-step plan for executing the project. The model is explicitly guided to keep the tasks at a level where each one can be executed by an AI agent operating at the level of a junior engineer.

At the end of the process I end up with a detailed task list that I review and commit into my repository.
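
The task list is plain markdown as well. As an illustration (the structure and names here are invented for this example), it might look roughly like this:

```
# Tasks: my-feature (implements @prd-my-feature.md)

## Relevant Files / Notes
- `app/services/my_feature.rb` – main entry point (created in 1.2)
- Unexpected finding: the legacy importer also writes to this table.

## Tasks
- [x] 1.0 Set up the data model
  - [x] 1.1 Add a migration for the new table
  - [x] 1.2 Create the model and basic validations
- [ ] 2.0 Implement the service layer
  - [ ] 2.1 Write the service object with unit tests
  - [ ] 2.2 Wire it into the existing controller
- [ ] 3.0 UI changes
```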

Step-by-Step Tasklist Execution

From this point forward, I operate in a loop:

  • Ensure the git status is clean for the project – I want to be able to reset to this point at will or ask the model to look at the diff from the last known stable state.
  • I open a new chat and reference the “Process Task List” prompt stored as a Cursor rule. Then I either ask the model to execute a specific portion of the task list or just tell it to do the next item on the list (a sketch of what that chat message looks like follows this list).
  • From this point forward, all work is focused on executing the selected scope of work and verifying it works. It can take up to an hour to finish an item, with me guiding the agent along the way, but in the majority of cases it produces a working solution on the first attempt (given a good set of Cursor rules for the project).
  • After the work is done, the model marks the item as complete in the task list and we commit the results.
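
For illustration, the chat message that kicks off a single item (the second step above) is usually just a couple of lines, something like the following; the rule name, file names, and task number are all made up for this example:

```
@process-task-list.mdc @prd-my-feature.md @tasks-prd-my-feature.md

Please work on task 2.1 only. When it is done and verified, mark it as
complete in the task list and stop.
```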

Adapting the Plan During Execution

Sometimes, while executing tasks, I notice that a task uncovers some piece of context that I myself was missing, or I remember or realize a detail that we want to add to the project. In those cases, I simply ask the model to adjust the plan in the task list and generate new sections as needed, describing the additional steps we want to take.

And it goes the other way as well. Sometimes I notice that a specific step I thought was important becomes irrelevant as the definition of the problem changes during development. Then I just abandon those parts of the task list and remove them from the file. The task list is a living document, not a rigid contract.

Important: If I had to intervene during the execution of a given task, I always follow up with the following prompt:

I am going to close this chat session soon and you will lose all memory of this conversation. Please reflect on your progress so far and update the task list document (@tasks-prd-my-feature.md) with any details that would be helpful for you to perform the next steps in our plan more effectively. Anything that surprised you, anything that prevented your solution from working and required debugging or troubleshooting – include it all. Do not go into specifics of the current task, no need for a progress report, focus on distilling your experience into a set of general learnings for the future.

Continuous Improvement

As I work through the task list, the document fills with all kinds of useful details that make working on the project easier for the agent. When I notice something that would be helpful for all agents to know when working on this repository, I ask the model to update the Cursor rules with the relevant information. Similarly, I sometimes ask the model to update the docs used to generate the PRD and the task list if I notice myself adjusting the task list too much along the way and would prefer the agent to do something differently next time. This is the key point of constant agent training and improvement I mentioned in my previous post, and the PRD-based process makes this improvement much easier to execute by providing space for reflection at the end of each step and at the end of a given feature project.

When NOT to Use This Process

The process works best when the task is complex enough that it would take an agent at least a few hours to complete. If you’re dealing with something that can be knocked out in 30 minutes of coding, the overhead of creating a PRD and task list is probably overkill – just go ahead and build it.

There’s also a top limit to its usefulness. If the feature you want to build or the problem you want to solve is extremely complicated and would require months of work, the model will probably not be able to plan it out effectively in one shot. The context window limitations and the sheer complexity of long-term planning make it nearly impossible for even the best models to create a coherent multi-month plan that won’t fall apart when it hits reality.

For those cases, I would create a top-level PRD, split it into a set of build stages, and then create a separate PRD for each stage and go through the whole process per stage. Think of it as applying the same approach recursively – break the massive problem down into smaller, more manageable chunks that the model can actually handle.

The cutoff for the top limit is currently unclear to me, but I have successfully used the process on tasks that take me a couple of weeks of full-time work to finish. Beyond that, I start to see the quality of the planning degrade significantly, and the task lists become either too vague to be useful or so detailed that they become brittle and break as soon as you start implementing.

A Note on Task-Automation Tools

There’s been a wave of tools lately that promise to handle AI task planning and execution automatically – things like Task Master, which has become one of the more popular examples. These tools typically rely on CLI workflows or MCP servers to generate and process task lists end-to-end.

I tried using some of them.

In my experience, they look great in demos and work okay for isolated projects where you don’t care too much about the implementation details—basically “vibe coding” with LLMs on steroids. But when I tried applying them to real projects with rich context (and lots of expectations around structure and quality), they fell short.

The models running these tools didn’t have access to my Cursor rules, project-specific docs, or even a shared understanding of past design decisions. As a result, they’d often hallucinate steps based on their own assumptions rather than actual requirements. Editing or course-correcting those hallucinations ended up being more work than just writing the plan myself.

Also – and maybe this is just me – but remembering the right CLI incantations for single-task execution, in just the right format, was more cognitive load than simply editing a markdown file.

So while those tools are impressive technically, I’ve found a manual PRD + task list process to be much more reliable and controllable, especially when I actually care about what gets built and how.


If you have any suggestions for improvement or comments on the described approach, let me know! If you are interested in content like this, feel free to join my free Telegram channel where I share my thoughts on AI-related topics and relevant content I find interesting.


Context Switching with AI: The PRD → Tasklist → Work Loop
18 Jun 2025

Another observation from using AI agents for multiple projects – at work and for a bunch of pet projects at home:

The PRD → Tasklist → Work loop makes context switching surprisingly painless. I can drop a project for days, sometimes weeks, and then come back and pick it up almost instantly.

Why? Because the PRD and tasklist hold all the context the model needs to keep working – and when I read them, they page that same context back into my brain too. It’s like shared memory between me and the AI.

If I ever come back to something and parts of it don’t make sense to me, that’s a red flag. The model wouldn’t get it either. So I work with it to figure things out and update the docs with what we learned.

Over time, this creates a really effective flywheel:

  • I clarify something
  • The model gets smarter
  • The docs get better
  • I ramp up faster next time

It’s a simple loop. But it works stupidly well.


If you are interested in content like this, feel free to join my free Telegram channel where I share my thoughts on AI-related topics and relevant content I find interesting.


From Autocomplete to Apprentice: Training AI to Work in Our Codebase
5 Jun 2025

I’ve spent the last year turning large‑language‑model agents into productive teammates. Around 80% of the code I ship nowadays is written by an LLM, yet it still reflects the unspoken project rules, patterns, and habits our team has baked into the environment. This post documents the recipe that took me there.


From Tab Autocomplete to Autonomous Agents

It started with a curiosity.

July 2024 – I installed Cursor and quickly fell in love with the autocomplete. After a year on GitHub Copilot, the Tab‑Tab‑Tab feature felt like a step change. It was like having a sharp intern finish your thoughts. The suggestions were fast, helpful, and mostly correct—as long as you gave it enough local context to make an educated guess.

Early 2025 – I gave Chat a try on a personal project. For the first time, I saw what it could do across a whole file. I had to tightly manage the context and double‑check everything, but something clicked. It wasn’t just useful—it was promising, and I felt in control of the results produced by the model.

March 2025 – I turned on Agent Mode, and my process fell apart at once. The Sonnet 3.7 model would charge ahead, cheerfully rewriting parts of the system it barely understood and hallucinating non‑existent APIs. It was chaos—the kind that feels overwhelming at first, but also oddly instructive if you pay attention. Debugging became a game of whack‑a‑mole. Some days, I spent more time undoing changes than moving forward. But under the mess, I saw potential. I started to understand the reasons why the model failed—and it all boiled down to context.

April 2025 to now – That’s when I discovered Cursor Rules. One by one, I started adding bits of project‑specific context to the system: naming conventions, testing quirks, deployment rituals. And just like that, the agent stopped acting like a rogue junior developer with full access and no supervision. It started to feel like a teammate with some tenure and reasonable understanding of the system, capable of implementing large, complex changes end‑to‑end without much involvement from my side.


Why Cursor Rules Matter

LLMs arrive pre‑trained on the internet. They don’t know your domain language, naming conventions, or deployment rituals. Cursor rules are small Markdown files that pin that tribal knowledge right next to the code. Add them, and the agent’s context window is always seeded with the right cues, ensuring your LLM partner starts each task aligned with your preferences.


My Six‑Step Onboarding Recipe

Step 1 – Start With an Empty Rulebook

When our team onboards an agent into our application, we skip the generic rulepacks. Every mature codebase bends the rules somewhere, and starting with an existing ruleset cements someone else’s preferences into your agent’s behavior, making it harder to steer.

Step 2 – Dump Context Into Chat

Open a fresh chat with a capable model (o3, Gemini 2.5, Anthropic Opus). Brain‑dump everything you know:

  • project purpose
  • domain terminology
  • architectural quirks
  • links to docs (@docs/…), READMEs, dashboards

Keep talking until you run dry; don’t worry about structure. I’ve spent over an hour at times, speaking nonstop into MacWhisper.

Step 3 – Generate the 000‑project‑info Rule

Ask the model to condense that chat into .cursor/rules/000-project-info.mdc and mark it Always. I use the numeric prefix so @0 autocompletes it later.
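
The result is an ordinary markdown file with a small frontmatter block that controls when it is loaded. Here is a trimmed-down, invented example of the shape (a real one is usually much longer):

```
---
description: Core project context
alwaysApply: true
---
# Project: internal invoicing service (example)

- Rails monolith, PostgreSQL, Sidekiq for background jobs.
- Domain terms: an "account" is a paying customer; a "workspace" groups accounts.
- Architectural quirk: all money amounts are stored as integer cents.
- Key docs: @docs/architecture.md, @docs/runbooks/deploys.md
```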

Step 4 – Keep a Living Knowledge Base

If you’re still onboarding yourself into the project, this is where AI shines. Ask it every question you can think of: what does this part do, how are things usually named, why is this structured that way? Every time you discover something new together that feels like it might help an agent make better decisions, capture it. Either update your main project info file or create a new rule file for it.

Here are some rules I have created in most of my projects:

  • 001-tech-guidelines.mdc – languages, frameworks, linters, dependency conventions.
  • 002-testing-guidelines.mdc – how to run all tests, a single file, or one example; test types; preferred TDD style.
  • 003-data-model.mdc (Agent‑requested) – list of models, relationships, invariants (generated by having the model parse schema.rb and the app/models folder).

Mark the first two Always, the rest Agent‑requested so they load on demand. Some other things I find useful to include (in agent-requested mode):

  • Show the agent a page of API docs for an obscure dependency and ask it to generate a rule explaining how to use that API.
  • For any unusual pattern within the codebase (an internal abstraction layer for a database or an external service, an internal library, etc.), explain to the agent why the abstraction exists and how it is used, point it at the important pieces of relevant code (both implementation and usage), and then ask for a rule guiding an AI model in using that piece of technology (a sketch of such a rule follows this list).
  • Internal tooling: explain all the tools the agent has available to do its job, and when and how to use them. Think linters, code quality and coverage controls, different types of tests, and other ways to get feedback on the quality of the AI’s solution.
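
To make the abstraction-layer case concrete, here is a rough, entirely invented sketch of such an agent-requested rule for a hypothetical internal HTTP client wrapper:

```
---
description: How to call external services through our internal ApiClient wrapper
alwaysApply: false
---
- Never use Net::HTTP or Faraday directly; all outbound calls go through ApiClient.
- ApiClient already handles retries, timeouts, and instrumentation; do not reimplement them.
- The implementation lives in lib/api_client.rb; typical usage examples are in
  app/services/payments/charge_service.rb.
- When adding a new external integration, build a thin service object on top of
  ApiClient and cover it with a request stub in its spec.
```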

Step 5 – Let the Agent Struggle (Then Capture the Lesson)

Pick a trivial task you already know how to implement. Let the agent attempt it. When it stumbles, nudge it forward, not by coding, but by asking questions and pointing to clues. After the fix ships, ask:

Look back at the history of our conversation and see if you notice any patterns that would be helpful for an AI coding agent to know next time we work on something similar. Update your existing cursor rules or create new ones to persist your findings.

Then review and commit the changes. This will help the model get better at solving problems similar to what you have just done.

Step 6 – Rinse, Repeat, Refine

Repeat this for a few weeks and the model will start making fewer obvious mistakes, build things in ways that match your expectations, and often pre‑empt your next move before you’ve even typed it out.


One surprising effect of this process is how much of my tacit knowledge I’ve had to put into words: decades of habits, intuition, and project‑specific judgment calls now live in Markdown files. That knowledge doesn’t just help my agent; it lifts the whole team. As we all work with our agents, we start seeing them act on rules someone else introduced, surfacing insights and patterns we hadn’t shared before. It’s low‑friction knowledge transfer, and it works.


The Apprenticeship Model

At some point while documenting this process, I realized what it resembled: an apprenticeship. You’re bringing on a new team member, and instead of throwing manuals at them, you teach by pairing on real tasks. You guide, correct, explain. The model’s pre-training is its education, sure — but adapting it to your environment, your tools, your expectations — that part is still on us. That’s the job of a mentor, and that’s how I see this work now. Our job is changing and we may all eventually become PMs managing teams of AI agents, but today we need to be mentors first.

If you would like to hear more about my adventures with modern AI systems, feel free to join my Telegram channel where I try to share more of my experiences along with interesting content I discover daily.


Thinking of the person who pressed Go on today’s CrowdStrike release
20 Jul 2024

Today’s tweet about the CrowdStrike incident, which seemingly brought the modern IT world to a standstill, reminded me of the darkest day of my professional life — when I accidentally knocked out internet access in a city of over 200,000 people.


It was my second year of university, and I worked for the largest local ISP in my home city as a junior system administrator. We had a large wireless network (~100km in diameter) covering our whole city and many surrounding rural areas. This network was used by all major commercial banks and many large enterprises in the area (bank branches, large factories, radio stations, etc).

To cover such a large area (in Ukraine, in the early 2000s), about 50% of which consisted of rural villages and towns, we basically had to build a huge Wi-Fi network: a very powerful antenna in the center, with many smaller regional points of presence connecting to it using directional Wi-Fi antennas and then distributing the traffic locally. The core router connected to the central antenna was located on the top floor of the highest building in the area, about 20 minutes away from our office.

One day I was working on some monitoring scripts for the central router (which was basically a custom-built FreeBSD server). I’d run those scripts on a local test box I had on my table, make some changes, run them again, and so on. We did not have VMs back then, so experimental work would happen on real hardware that was a clone of the production box. In the middle of my local debugging, I received a monitoring alert saying that our core router had some (non-critical) issues. Since I was on-call that day, I decided to take a look. After fixing the issue on the router, I went back to my debugging and successfully finished the job after about an hour.

And that’s where things went wrong… When I wanted to shut down my local machine, I switched to a terminal that was connected to the box, typed “poweroff”, pressed Enter… and only then realized that I had done it on the wrong server! 🤦🏻‍♂️ That second terminal window had been open ever since the monitoring alert an hour earlier, and I had just shut down the core router for our whole city-wide network!

What’s cool is that there was no blame in the aftermath of the incident. The team understood the mistake and focused on fixing the problem. We ended up having to drive to the central station and manually power the router back on; we did not have any remote power management set up for that server, and IPMI did not exist yet. Dark times indeed! 😉

As a result of that mistake, our whole city’s banking infrastructure and a bunch of other important services were down for ~30 minutes. Following the incident, we made a number of improvements to our infrastructure and our processes (I don’t remember the details now), making the system a lot more resilient to similar errors.

Looking back now, huge kudos to my bosses for not firing me back then! This incident profoundly influenced my career in many ways:

First, the thrill of managing such vast infrastructure made me want to stay in technical operations rather than shifting to pure software development, a path many of my peers chose at the time. Second, having experienced such a massive error firsthand, I have always done my absolute best to safeguard my systems against failures, optimizing for quick recovery and staying paranoid about backups and redundancy. Finally, it was a pivotal moment in my understanding of the value of a blameless incident process, long before the emergence of the modern blameless DevOps and SRE cultures — a management lesson that has deeply informed my approach to leadership and system design ever since.


Interview: Inside Shopify’s Modular Monolith
16 Jun 2024

This is my interview with Dr. Milan Milanovic, originally published in his newsletter Tech World With Milan, where we discussed Shopify’s architecture, tech stack, testing, culture, and more.

1.  Who is Oleksiy?

I have spent most of my career in technical operations (system administration, later called DevOps, nowadays encompassed by platform engineering and SRE disciplines). Along the way, I worked at Percona as a MySQL performance consultant and then operated some of the largest Ruby on Rails applications in the world, all the while following the incredible story of Shopify’s development and growth.

Finally, after decades of work in operations, when a startup I was at got acquired by Elastic, I decided to move into software engineering. After 5 years there, I needed a bigger challenge, and it felt like the right moment to join Shopify.

I started with the Storefronts group (the team responsible for Storefront themes, all the related infrastructure, and the Storefront rendering infrastructure) at Shopify at the beginning of 2022. Two years later, I can confidently say that Shopify’s culture is unique. I enjoy working with the team here because of a level of talent density I have never encountered anywhere else. Every day, I am humbled by the caliber of people I can work with and the level of problems I get to solve.

2.  What is the role of the Principal Engineer at Shopify?

Before joining Shopify, I was excited about all the possibilities associated with the Principal Engineer role. Immediately, I was surprised at how diverse the Principal Engineering discipline was at the company. We have a range of engineers here, from extremely deep and narrow experts to amazing architects coordinating challenging projects across the company. Even more impressive is that you have a lot of agency in shaping the kind of Principal Engineer you will be, provided that the work aligns with the overarching mission of making commerce better for everyone. After 2 years with the company, I found myself in a sweet spot of spending ~75% of my time doing deep technical work across multiple areas of Storefronts infrastructure, with the rest spent on project leadership, coordination, etc.

3.  The recent tweet by Shopify Engineering shows impressive results achieved by your system. What is Shopify’s overall architecture?

The infrastructure at Shopify was one of the most surprising parts of the company for me. I have spent my whole career building large, heavily loaded systems based on Ruby on Rails. Joining Shopify and knowing upfront a lot about the amount of traffic they handled during Black Friday, Cyber Monday (BFCM), and flash sales, I was half-expecting to find some magic sauce inside. But the reality turned out to be very different: the team here is extremely pragmatic when building anything. It comes from Shopify’s Founder and CEO Tobi Lütke himself: if something can be made simpler, we try to make it so. As a result, the whole system behind those impressive numbers is built on top of fairly common components: Ruby, Rails, MySQL/Vitess, Memcached/Redis, Kafka, Elasticsearch, etc., scaled horizontally.

Shopify Engineering Tweet about the amount of traffic they handled during Black Friday

What makes Shopify unique is the level of mastery the teams have built around those key components: we employ Ruby core contributors (who keep making Ruby faster), Rails core contributors (improving Rails), MySQL experts (who know how to operate MySQL at scale), and we contribute to and maintain all kinds of open-source projects that support our infrastructure. As a result, even the simplest components in our infrastructure tend to be deployed, managed, and scaled exceptionally well, leading to a system that can scale to many orders of magnitude over the baseline capacity and still perform well.

4.  What is Shopify’s tech stack?

Given that databases (and stateful systems in general) are the most complex components to scale, we focus our scaling on MySQL first. All shops on the platform are split into groups, each hosted on a dedicated set of database servers called a pod. Each pod is wholly isolated from the rest of the database infrastructure, limiting the blast radius of most database-related incidents to a relatively small group of shops. Some more prominent merchants get their dedicated pods that guarantee complete resource isolation.

Over the past year, some applications started relying on Vitess to help with the horizontal sharding of their data.

On top of the database layer is a reasonably standard Ruby on Rails stack: Ruby and Rails applications running on Puma, using Memcached for ephemeral storage needs and Elasticsearch for full-text search. Nginx + Lua is used for sophisticated tasks, from smart routing across multiple regions to rate limiting, abuse protection, etc.

This runs on top of Kubernetes hosted on Google Cloud in many regions worldwide, making the infrastructure extremely scalable and responsive to wild traffic fluctuations.

Check the full Shopify tech stack at Stackshare.

A Pods Architecture To Allow Shopify To Scale (Source: Shopify Engineering)

What are Pods exactly?

The idea behind pods at Shopify is to split all of our data into a set of completely independent database (MySQL) clusters using shop_id as the sharding key to ensure resource isolation between different tenants and localize the impact of a “noisy neighbor” problem across the platform. 

Only the databases are podded since they are the hardest component to scale. Everything else that is stateless is scaled automatically according to the incoming traffic levels and other load parameters using a custom Kubernetes autoscaler.

5. Is the monolith going to be broken into microservices?

Shopify fully embraces the idea of a Majestic Monolith—most user-facing functionality people tend to associate with the company is served by a single large Ruby on Rails application called “Shopify Core.” Internally, the monolith is split into multiple components focused on different business domains. A lot of custom (later open-sourced) machinery has been built to enforce coding standards, API boundaries between components, etc.

The rendering application behind all Shopify storefronts is completely separate from the monolith. This was one of the cases where it made perfect sense to split functionality from Core because it is relatively simple: loading data from a database, rendering Liquid code, and sending the HTML back to the user covers the absolute majority of the requests it handles. Given the amount of traffic on this application, even a small improvement in its efficiency results in enormous resource savings. So, when it was initially built, the team set several strict constraints on how the code is written, what features of Ruby we prefer to avoid, how we deal with memory usage, etc. This allowed us to build a pretty efficient application in a language we love while carefully controlling memory allocation and the resources we spend rendering storefronts.

Shopify application components

In parallel with this effort, the Ruby infrastructure team (working on YJIT, among other things) has made the language significantly faster with each release. Finally, in the last year, we started rewriting parts of this application in Rust to improve efficiency further.

Answering your question about the future of the monolith, I think outside of a few other localized cases, most of the functionality of the Shopify platform will probably be handled by the Core monolith for a long time, given how well it has worked for us so far using relatively standard horizontal scalability techniques.

6. How do you do testing?

Our testing infrastructure is a multi-layered set of checks that allows us to deploy hundreds of times daily while keeping the platform safe. It starts with a set of tests on each application: your typical unit/integration tests, etc. Those are required for a change to propagate into a deployment pipeline (based on the Shipit engine, created by Shopify and open-sourced years ago).

Shopify overall infrastructure

During the deployment, a very important step is canary testing: a change will be deployed onto a small subset of production instances, and automation will monitor a set of key health metrics for the platform. If any metrics move in the wrong direction, the change is automatically reverted and removed from production immediately, allowing developers to figure out what went wrong and try again once they fix the problem. Only after testing a change on canaries for some time does the deployment pipeline perform a full deployment. The same approach is used for significant schema changes, etc.

7. How do you do deployments?

All Shopify deployments are based on Kubernetes (running on GCP), so each application is a container (or a fleet of containers) somewhere in one of our clusters. Our deployment pipeline is built on the Shipit engine (created by Shopify and open-sourced years ago). Deployment pipelines can get pretty complex, but it mostly boils down to building an image, deploying it to canaries, waiting to ensure things are healthy, and gradually rolling out the change wider across the global fleet of Kubernetes clusters.

Shipit also maintains the deployment queue and merges multiple pull requests into a single deployment to increase the pipeline’s throughput.

Shipit open-source deployment tool by Shopify (Source)

8. How do you handle failures in the system? 

The whole system is built with a lot of redundancy and horizontal auto-scaling (where possible), which helps prevent large-scale outages. But there are always big and small fires to handle. So, we have a dedicated site reliability team responsible for keeping the platform healthy in the face of constant change and adversarial problems like bots and DDoS attacks. They have built many automated tools to help us handle traffic spikes and, if needed, degrade gracefully. Some interesting examples: they have automated traffic analysis tools helping them scope ongoing incidents down to specific pods, shops, page types, or traffic sources; the team can then control the flow of traffic by pod or shop, re-route traffic between regions, block or slow down requests from specific parts of the world, prioritize particular types of traffic, and apply anti-adversarial measures across our network to mitigate attacks.

Finally, each application has an owner team (or a set of teams) that can be paged if their application gets unhealthy. They help troubleshoot and resolve incidents around the clock (being a distributed company helps a lot here since we have people across many time zones).

9. What challenges are you working on right now in your team?

We have just finished a large project to increase the global footprint of our Storefront rendering infrastructure, rolling out new regions in Europe, Asia, Australia, and North America. The project required coordination across many different teams (from networking to databases to operations, etc.) and involved building completely new tools for filtered database replication (since we cannot replicate all of our data into all regions due to cost and data residency requirements), making changes in the application itself to allow for rendering without having access to all data, etc. This large effort has reduced latency for our buyers worldwide and made their shopping experiences smoother.

Next on our radar are further improvements in Liquid rendering performance, database access optimization, and other performance-related work.