Category: Development
Interview: Inside Shopify’s Modular Monolith
16 Jun2024

This is my interview with Dr. Milan Milanovic originally published on his newsletter Tech World With Milan where we discussed Shopify  architecture, tech stack, testing, culture, and more.

1.  Who is Oleksiy?

I have spent most of my career in technical operations (system administration, later called DevOps, nowadays encompassed by platform engineering and SRE disciplines). Along the way, I worked at Percona as a MySQL performance consultant and then operated some of the largest Ruby on Rails applications in the world, all the while following the incredible story of Shopify’s development and growth.

Finally, after decades of work in operations, when a startup I was at got acquired by Elastic, I decided to move into software engineering. After 5 years there, I needed a bigger challenge, which felt like the right moment to join Shopify.

I started with the Storefronts group (the team responsible for Storefront themes, all the related infrastructure, and the Storefront rendering infrastructure) at Shopify at the beginning of 2022. Two years later, I can confidently say that Shopify’s culture is unique. I enjoy working with the team here due to the incredible talent density I have never encountered. Every day, I am humbled by the caliber of people I can work with and the level of problems I get to solve.

2.  What is the role of the Principal Engineer at Shopify?

Before joining Shopify, I was excited about all the possibilities associated with the Principal Engineer role. Immediately, I was surprised at how diverse the Principal Engineering discipline was at the company. We have a range of engineers here, from extremely deep and narrow experts to amazing architects coordinating challenging projects across the company. Even more impressive is that you have a lot of agency in the shape of a Principal Engineer you will be, provided that the work aligns with the overarching mission of making commerce better for everyone. After 2 years with the company, I found myself in a sweet spot of spending ~75% of my time doing deep technical work across multiple areas of Storefronts infrastructure, and the rest is spent on project leadership, coordination, etc.

3.  The recent tweet by Shopify Engineering shows impressive results achieved by your system. What is Shopify’s overall architecture?

The infrastructure at Shopify was one of the most surprising parts of the company for me. I have spent my whole career building large, heavily loaded systems based on Ruby on Rails. Joining Shopify and knowing upfront a lot about the amount of traffic they handled during Black Friday, Cyber Monday (BFCM), and flash sales, I was half-expecting to find some magic sauce inside. But the reality turned out to be very different: the team here is extremely pragmatic when building anything. It comes from Shopify’s Founder and CEO Tobi Lütke himself: if something can be made simpler, we try to make it so. As a result, the whole system behind those impressive numbers is built on top of fairly common components: Ruby, Rails, MySQL/Vitess, Memcached/Redis, Kafka, Elasticsearch, etc., scaled horizontally.

Shopify Engineering Tweet about the amount of traffic they handled during Black Friday

What makes Shopify unique is the level of mastery the teams have built around those key components: we employ Ruby core contributors (who keep making Ruby faster), Rails core contributors (improving Rails), MySQL experts (who know how to operate MySQL at scale), and we contribute to and maintain all kinds of open-source projects that support our infrastructure. As a result, even the simplest components in our infrastructure tend to be deployed, managed, and scaled exceptionally well, leading to a system that can scale to many orders of magnitude over the baseline capacity and still perform well.

4.  What is Shopify’s tech stack?

Given that databases (and stateful systems in general) are the most complex components to scale, we focus our scaling on MySQL first. All shops on the platform are split into groups, each hosted on a dedicated set of database servers called a pod. Each pod is wholly isolated from the rest of the database infrastructure, limiting the blast radius of most database-related incidents to a relatively small group of shops. Some more prominent merchants get their dedicated pods that guarantee complete resource isolation.

Over the past year, some applications started relying on Vitess to help with the horizontal sharding of their data.

On top of the database layer is a reasonably standard Ruby on Rails stack: Ruby and Rails applications running on Puma, using Memcached for ephemeral storage needs and Elasticsearch for full-text search. Nginx + Lua is used for sophisticated tasks, from smart routing across multiple regions to rate limiting, abuse protection, etc.

This runs on top of Kubernetes hosted on Google Cloud in many regions worldwide, making the infrastructure extremely scalable and responsive to wild traffic fluctuations.

Check the full Shopify tech stack at Stackshare.

A Pods Architecture To Allow Shopify To Scale (Source: Shopify Engineering)

What are Pods exactly?

The idea behind pods at Shopify is to split all of our data into a set of completely independent database (MySQL) clusters using shop_id as the sharding key to ensure resource isolation between different tenants and localize the impact of a “noisy neighbor” problem across the platform. 

Only the databases are podded since they are the hardest component to scale. Everything else that is stateless is scaled automatically according to the incoming traffic levels and other load parameters using a custom Kubernetes autoscale.

5. Is the monolith going to be broken into microservices?

Shopify fully embraces the idea of a Majestic Monolith—most user-facing functionality people tend to associate with the company is served by a single large Ruby on Rails application called “Shopify Core.” Internally, the monolith is split into multiple components focused on different business domains. Many custom (later open-sourced) machinery have been built to enforce coding standards, API boundaries between components, etc.

The rendering application behind all Shopify storefronts is completely separate from the monolith. This was one of the cases where it made perfect sense to split functionality from Core because it is relatively simple. Load data from a database, render Liquid code, and send the HTML back to the user – the absolute majority of requests it handles. Given the amount of traffic on this application, even a small improvement in its efficiency results in enormous resource savings. So, when it was initially built, the team set several strict constraints on how the code is written, what features of Ruby we prefer to avoid, how we deal with memory usage, etc. This allowed us to build a pretty efficient application in a language we love while carefully controlling memory allocation and the resources we spend rendering storefronts.

Shopify application components

In parallel with this effort, the Ruby infrastructure team (working on YJIT, among other things) has made the language significantly faster with each release. Finally, in the last year, we started rewriting parts of this application in Rust to improve efficiency further.

Answering your question about the future of the monolith, I think outside of a few other localized cases, most of the functionality of the Shopify platform will probably be handled by the Core monolith for a long time, given how well it has worked for us so far using relatively standard horizontal scalability techniques.

6. How do you do testing?

Our testing infrastructure is a multi-layered set of checks that allows us to deploy hundreds of times daily while keeping the platform safe. It starts with a set of tests on each application: your typical unit/integration tests, etc. Those are required for a change to propagate into a deployment pipeline (based on the Shipit engine, created by Shopify and open-sourced years ago.

Shopify overall infrastructure

During the deployment, a very important step is canary testing: a change will be deployed onto a small subset of production instances, and automation will monitor a set of key health metrics for the platform. If any metrics move in the wrong direction, the change is automatically reverted and removed from production immediately, allowing developers to figure out what went wrong and try again when they fix the problem. Only after testing a change on canaries for some time the deployment pipeline performs a full deployment. The same approach is used for significant schema changes, etc.

7. How do you do deployments?

All Shopify deployments are based on Kubernetes (running on GCP), so each application is a container (or a fleet of containers) somewhere in one of our clusters. Our deployment pipeline is built on the Shipit engine (created by Shopify and open-sourced years ago). Deployment pipelines can get pretty complex, but it mostly boils down to building an image, deploying it to canaries, waiting to ensure things are healthy, and gradually rolling out the change wider across the global fleet of Kubernetes clusters.

Shipit also maintains the deployment queue and merges multiple pull requests into a single deployment to increase the pipeline’s throughput.

Shipit open-source deployment tool by Shopify (Source)

8. How do you handle failures in the system? 

The whole system is built with many redundancy and horizontal auto-scaling (if possible), which helps prevent large-scale outages. But there are always big and small fires to handle. So, we have a dedicated site reliability team responsible for keeping the platform healthy in the face of constant change and adversarial problems like bots and DDoS attacks. They have built many automated tools to help us handle traffic flashes and, if needed, degrade gracefully. Some interesting examples: they have automated traffic analysis tools helping them scope ongoing incidents down to specific pods, shops, page types, or traffic sources; then the team can control the flow of traffic by pod or shop, re-route traffic between regions, block or slow down requests from specific parts of the world, prioritize particular types of traffic and apply anti-adversarial measures across our network to mitigate attacks.

Finally, each application has an owner team (or a set of teams) that can be paged if their application gets unhealthy. They help troubleshoot and resolve incidents around the clock (being a distributed company helps a lot here since we have people across many time zones).

9. What challenges are you working on right now in your team?

We have just finished a large project to increase the global footprint of our Storefront rendering infrastructure, rolling out new regions in Europe, Asia, Australia, and North America. The project required coordination across many different teams (from networking to databases to operations, etc.) and involved building completely new tools for filtered database replication (since we cannot replicate all of our data into all regions due to cost and data residency requirements), making changes in the application itself to allow for rendering without having access to all data, etc. This large effort has reduced latency for our buyers worldwide and made their shopping experiences smoother.

Next on our radar are further improvements in Liquid rendering performance, database access optimization, and other performance-related work.


Edge Web Server Testing at Swiftype
28 Apr2018

This article has been originally posted on Swiftype Engineering blog.


For any modern technology company, a comprehensive application test suite is an absolute necessity. Automated testing suites allow developers to move faster while avoiding any loss of code quality or system stability. Software development has seen great benefit come from the adoption of automated testing frameworks and methodologies, however, the culture of automated testing has neglected one key area of modern web application serving stack: web application edge routing and multiplexing rulesets.

From modern load balancer appliances that allow for TCL based rule sets; local or remotely hosted varnish VCL rules; or in the power and flexibility that Nginx and OpenResty make available through LUA, edge routing rulesets have become a vital part of application serving controls.

Over the past decade or so, it has become possible to incorporate more and more logic into edge web server infrastructures. Almost every modern web server has support for scripting, enabling developers to make their edge servers smarter than ever before. Unfortunately, the application logic configured within web servers is often much harder to test than that hosted directly in application code, and thus too often software teams resort to manual testing, or worse, customers as testers, by shipping their changes to production without edge routing testing having been performed.

In this post, I would like to explain the approach Swiftype has taken to ensure that our test suites account for our use of complex edge web server logic
to manage our production traffic flow, and thus that we can confidently deploy changes to our application infrastructure with little or no risk.

Read the rest of this entry

DbCharmer Development: I Give Up
14 Nov2014

About 6 years ago (feels like an eternity in Rails world) working at Scribd I’ve started working on porting our codebase from some old version or Rails to a slightly newer one. That’s when I realized, that there wasn’t a ruby gem to help us manage MySQL connections for our vertically sharded databases (different models on different servers). I’ve started hacking on some code to replace whatever we were using back then, finished the first version of the migration branch and then decided to open the code for other people to use. That’s how the DbCharmer ruby gem was born.

For the next few years a lot of new functionality we needed has been added to the gem, making it more complex and immensely more powerful. I’ve enjoyed working on it, developing those features, contributing to the community. But then I left Scribd, stopped being a user of DbCharmer and the situation drastically changed. For quite some time (years) I would keep fighting to make the code work with newer and newer versions of Rails, struggling to wrap my head around more and more (sometimes useless) abstractions Rails Core team decided to throw into ActiveRecord.

Finally, in the last 2 years (while trying to make DbCharmer compatible with Rails 4.0) it has become more and more apparent, that I simply do not want to do this anymore. I do not need DbCharmer to support Rails 4.0+, while it is very clear that many users need it and constant nagging in the issues and the mailing list, asking for updates generated a lot of anxiety for me, anxiety I couldn’t do much about (the worst kind). As the result, since I simply do not see any good reasons to keep fighting this uphill battle (and developing stuff like this for ActiveRecord IS a constant battle!) I officially give up.

Read the rest of this entry


Adding Custom Hive SerDe and UDF Libraries to Cloudera Hadoop 4.3
26 Jul2013

Yet another small note about Cloudera Hadoop Distribution 4.3.

This time I needed to deploy some custom JAR files to our Hive cluster so that we wouldn’t need to do “ADD JAR” commands in every Hive job (especially useful when using HiveServer API).

Here is the process of adding a custom SerDE or a UDF jar to your Cloudera Hadoop cluster:

  • First, we have built our JSON SerDe and got a json-serde-1.1.6.jar file.
  • To make this file available to Hive CLI tools, we need to copy it to /usr/lib/hive/lib on every server in the cluster (I have prepared an rpm package to do just that).
  • To make sure Hive map-reduce jobs would be able to read/write JSON tables, we needed to copy our JAR file to /usr/lib/hadoop/lib directory on all task tracker servers in the cluster (the same rpm does that).
  • And last, really important step: To make sure your TaskTracker servers know about the new jar, you need to restart your tasktracker services (we use Cloudera Manager, so that was just a few mouse clicks ;-))

And this is it for today.


DbCharmer 1.7.0 Release: Rails 3.0 Support and Forced Slave Reads
1 Sep2011

This week, after 3 months in the works, we’ve finally released version 1.7.0 of DbCharmer ruby gem – Rails plugin that significantly extends ActiveRecord’s ability to work with multiple databases and/or database servers by adding features like multiple databases support, master/slave topologies support, sharding, etc.

New features in this release:

  • Rails 3.0 support. We’ve worked really hard to bring all the features we supported in Rails 2.X to the new version of Rails and now I’m proud that we’ve implemented them all and the implementation looks much cleaner and more universal (all kinds of relations in rails 3 work in exactly the same way and we do not need to implement connection switching for all kinds of weird corner-cases in ActiveRecord).
  • Forced Slave Reads functionality. Now we could have models with slaves that are not used by default, but could be turned on globally (per-controller, per-action or in a block). This is a new feature that brings our master/slave routing capabilities to a really new level – we could now use it for a really mission-critical models on demand and not be afraid of breaking major functionality of our applications by switching them to slave reads.
  • Lots of changes were made in the structure of our code and tests to make sure it would be much easier for new developers to understand DbCharmer internals and make changes in its code.

Along with the new release we’ve got a brand new web site. You can find much better, cleaner and, most importantly, correct documentation for the library on the web site. We’ll be adding more examples, will try to add more in-depth explanation of our core functions, etc.

If you have any questions about the release, feel free to ask them in our new mailing list: DbCharmer Users Group.

For more updates on our releases, you can follow @DbCharmer on Twitter.