Interview: Inside Shopify’s Modular Monolith
16 Jun 2024

This is my interview with Dr. Milan Milanovic, originally published in his newsletter Tech World With Milan, where we discussed Shopify’s architecture, tech stack, testing, culture, and more.

1.  Who is Oleksiy?

I have spent most of my career in technical operations (system administration, later called DevOps, nowadays encompassed by platform engineering and SRE disciplines). Along the way, I worked at Percona as a MySQL performance consultant and then operated some of the largest Ruby on Rails applications in the world, all the while following the incredible story of Shopify’s development and growth.

Finally, after decades of work in operations, when a startup I was at got acquired by Elastic, I decided to move into software engineering. After 5 years there, I needed a bigger challenge, which felt like the right moment to join Shopify.

I started with the Storefronts group (the team responsible for Storefront themes, all the related infrastructure, and the Storefront rendering infrastructure) at Shopify at the beginning of 2022. Two years later, I can confidently say that Shopify’s culture is unique. I enjoy working with the team here because of its incredible talent density, which I have never encountered anywhere else. Every day, I am humbled by the caliber of people I can work with and the level of problems I get to solve.

2.  What is the role of the Principal Engineer at Shopify?

Before joining Shopify, I was excited about all the possibilities associated with the Principal Engineer role. Immediately, I was surprised at how diverse the Principal Engineering discipline was at the company. We have a range of engineers here, from extremely deep and narrow experts to amazing architects coordinating challenging projects across the company. Even more impressive is that you have a lot of agency in deciding what kind of Principal Engineer you will be, provided that the work aligns with the overarching mission of making commerce better for everyone. After 2 years with the company, I have found myself in a sweet spot: I spend ~75% of my time doing deep technical work across multiple areas of Storefronts infrastructure, and the rest on project leadership, coordination, etc.

3.  The recent tweet by Shopify Engineering shows impressive results achieved by your system. What is Shopify’s overall architecture?

The infrastructure at Shopify was one of the most surprising parts of the company for me. I have spent my whole career building large, heavily loaded systems based on Ruby on Rails. Joining Shopify and knowing upfront a lot about the amount of traffic they handled during Black Friday, Cyber Monday (BFCM), and flash sales, I was half-expecting to find some magic sauce inside. But the reality turned out to be very different: the team here is extremely pragmatic when building anything. It comes from Shopify’s Founder and CEO Tobi Lütke himself: if something can be made simpler, we try to make it so. As a result, the whole system behind those impressive numbers is built on top of fairly common components: Ruby, Rails, MySQL/Vitess, Memcached/Redis, Kafka, Elasticsearch, etc., scaled horizontally.

Shopify Engineering Tweet about the amount of traffic they handled during Black Friday

What makes Shopify unique is the level of mastery the teams have built around those key components: we employ Ruby core contributors (who keep making Ruby faster), Rails core contributors (improving Rails), MySQL experts (who know how to operate MySQL at scale), and we contribute to and maintain all kinds of open-source projects that support our infrastructure. As a result, even the simplest components in our infrastructure tend to be deployed, managed, and scaled exceptionally well, leading to a system that can scale to many orders of magnitude over the baseline capacity and still perform well.

4.  What is Shopify’s tech stack?

Given that databases (and stateful systems in general) are the most complex components to scale, we focus our scaling on MySQL first. All shops on the platform are split into groups, each hosted on a dedicated set of database servers called a pod. Each pod is wholly isolated from the rest of the database infrastructure, limiting the blast radius of most database-related incidents to a relatively small group of shops. Some more prominent merchants get their dedicated pods that guarantee complete resource isolation.

Over the past year, some applications started relying on Vitess to help with the horizontal sharding of their data.

On top of the database layer is a reasonably standard Ruby on Rails stack: Ruby and Rails applications running on Puma, using Memcached for ephemeral storage needs and Elasticsearch for full-text search. Nginx + Lua is used for sophisticated tasks, from smart routing across multiple regions to rate limiting, abuse protection, etc.

This runs on top of Kubernetes hosted on Google Cloud in many regions worldwide, making the infrastructure extremely scalable and responsive to wild traffic fluctuations.

Check the full Shopify tech stack at Stackshare.

A Pods Architecture To Allow Shopify To Scale (Source: Shopify Engineering)

What are Pods exactly?

The idea behind pods at Shopify is to split all of our data into a set of completely independent database (MySQL) clusters using shop_id as the sharding key to ensure resource isolation between different tenants and localize the impact of a “noisy neighbor” problem across the platform. 

Only the databases are podded since they are the hardest component to scale. Everything else that is stateless is scaled automatically according to the incoming traffic levels and other load parameters using a custom Kubernetes autoscaler.
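To make the pod idea more concrete, here is a minimal sketch of routing by shop_id. It is not Shopify’s actual code: the `PodDirectory` class, the modulo fallback, and the host naming scheme are all assumptions made for illustration. The only parts taken from the description above are that shop_id is the sharding key and that some merchants can be placed on dedicated pods.

```python
from dataclasses import dataclass, field


@dataclass
class PodDirectory:
    """Maps each shop to the isolated MySQL cluster (pod) that holds its data."""

    pod_count: int
    # Explicit placements, e.g. a large merchant pinned to a dedicated pod.
    pinned: dict[int, int] = field(default_factory=dict)

    def pin(self, shop_id: int, pod_id: int) -> None:
        self.pinned[shop_id] = pod_id

    def pod_for(self, shop_id: int) -> int:
        # Pinned shops win; everyone else falls back to a simple modulo
        # placement (an assumption made only to keep the sketch short).
        return self.pinned.get(shop_id, shop_id % self.pod_count)

    def database_host(self, shop_id: int) -> str:
        # Hypothetical host naming scheme.
        return f"mysql-pod-{self.pod_for(shop_id)}.example.internal"


directory = PodDirectory(pod_count=4)
directory.pin(shop_id=42, pod_id=7)  # a large merchant on a dedicated pod
print(directory.database_host(10))   # mysql-pod-2.example.internal
print(directory.database_host(42))   # mysql-pod-7.example.internal
```

Because every query is scoped to a single shop, a problem on one pod only affects the shops that live on it.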

5. Is the monolith going to be broken into microservices?

Shopify fully embraces the idea of a Majestic Monolith: most user-facing functionality people tend to associate with the company is served by a single large Ruby on Rails application called “Shopify Core.” Internally, the monolith is split into multiple components focused on different business domains. A lot of custom (later open-sourced) machinery has been built to enforce coding standards, API boundaries between components, etc.

The rendering application behind all Shopify storefronts is completely separate from the monolith. This was one of the cases where it made perfect sense to split functionality from Core because it is relatively simple: load data from a database, render Liquid code, and send the HTML back to the user. That covers the vast majority of the requests it handles. Given the amount of traffic on this application, even a small improvement in its efficiency results in enormous resource savings. So, when it was initially built, the team set several strict constraints on how the code is written, what features of Ruby we prefer to avoid, how we deal with memory usage, etc. This allowed us to build a pretty efficient application in a language we love while carefully controlling memory allocation and the resources we spend rendering storefronts.

Shopify application components

In parallel with this effort, the Ruby infrastructure team (working on YJIT, among other things) has made the language significantly faster with each release. Finally, in the last year, we started rewriting parts of this application in Rust to improve efficiency further.

Answering your question about the future of the monolith, I think outside of a few other localized cases, most of the functionality of the Shopify platform will probably be handled by the Core monolith for a long time, given how well it has worked for us so far using relatively standard horizontal scalability techniques.

6. How do you do testing?

Our testing infrastructure is a multi-layered set of checks that allows us to deploy hundreds of times daily while keeping the platform safe. It starts with a set of tests on each application: your typical unit/integration tests, etc. Those are required for a change to propagate into a deployment pipeline (based on the Shipit engine, created by Shopify and open-sourced years ago).

Shopify overall infrastructure

During the deployment, a very important step is canary testing: a change is deployed onto a small subset of production instances, and automation monitors a set of key health metrics for the platform. If any metrics move in the wrong direction, the change is automatically reverted and removed from production immediately, allowing developers to figure out what went wrong and try again when they fix the problem. Only after a change has been tested on canaries for some time does the deployment pipeline perform a full deployment. The same approach is used for significant schema changes, etc.
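The canary decision described above can be sketched roughly as follows. This only illustrates the general technique and says nothing about how Shopify’s automation is implemented: the `HealthSample` fields, the `canary_verdict` function, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    """Aggregated health metrics for a group of instances (hypothetical fields)."""

    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile response time


def canary_verdict(
    baseline: HealthSample,
    canary: HealthSample,
    max_error_increase: float = 0.005,  # assumed tolerance
    max_latency_ratio: float = 1.2,     # assumed tolerance
) -> str:
    """Compare canaries against the rest of the fleet and decide what to do."""
    if canary.error_rate - baseline.error_rate > max_error_increase:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


fleet = HealthSample(error_rate=0.010, p95_latency_ms=200.0)
print(canary_verdict(fleet, HealthSample(0.011, 210.0)))  # promote
print(canary_verdict(fleet, HealthSample(0.030, 200.0)))  # rollback
```

A real pipeline would evaluate many more signals over a time window, but the shape of the decision is the same: compare canaries to the baseline and revert as soon as a metric moves in the wrong direction.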

7. How do you do deployments?

All Shopify deployments are based on Kubernetes (running on GCP), so each application is a container (or a fleet of containers) somewhere in one of our clusters. Our deployment pipeline is built on the Shipit engine (created by Shopify and open-sourced years ago). Deployment pipelines can get pretty complex, but it mostly boils down to building an image, deploying it to canaries, waiting to ensure things are healthy, and gradually rolling out the change wider across the global fleet of Kubernetes clusters.

Shipit also maintains the deployment queue and merges multiple pull requests into a single deployment to increase the pipeline’s throughput.
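As a rough illustration of why batching helps throughput, here is a small sketch of a deployment queue that groups several pull requests into one deploy. It is not Shipit’s API or its actual algorithm: `next_batch`, the batch size limit, and the pull request labels are assumptions for the example.

```python
from collections import deque


def next_batch(queue: deque, max_batch_size: int = 3) -> list:
    """Take up to max_batch_size queued pull requests for a single deployment."""
    batch = []
    while queue and len(batch) < max_batch_size:
        batch.append(queue.popleft())  # preserve the order changes were queued in
    return batch


pending = deque(["pr-101", "pr-102", "pr-103", "pr-104", "pr-105"])
first_deploy = next_batch(pending)
second_deploy = next_batch(pending)
print(first_deploy)   # ['pr-101', 'pr-102', 'pr-103']
print(second_deploy)  # ['pr-104', 'pr-105']
```

Five changes ship in two pipeline runs instead of five, at the cost of needing to bisect the batch if the canaries reject it.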

Shipit open-source deployment tool by Shopify (Source)

8. How do you handle failures in the system? 

The whole system is built with a lot of redundancy and horizontal auto-scaling (where possible), which helps prevent large-scale outages. But there are always big and small fires to handle. So, we have a dedicated site reliability team responsible for keeping the platform healthy in the face of constant change and adversarial problems like bots and DDoS attacks. They have built many automated tools to help us handle traffic flashes and, if needed, degrade gracefully. Some interesting examples: they have automated traffic analysis tools helping them scope ongoing incidents down to specific pods, shops, page types, or traffic sources; then the team can control the flow of traffic by pod or shop, re-route traffic between regions, block or slow down requests from specific parts of the world, prioritize particular types of traffic and apply anti-adversarial measures across our network to mitigate attacks.

Finally, each application has an owner team (or a set of teams) that can be paged if their application gets unhealthy. They help troubleshoot and resolve incidents around the clock (being a distributed company helps a lot here since we have people across many time zones).

9. What challenges are you working on right now in your team?

We have just finished a large project to increase the global footprint of our Storefront rendering infrastructure, rolling out new regions in Europe, Asia, Australia, and North America. The project required coordination across many different teams (from networking to databases to operations, etc.) and involved building completely new tools for filtered database replication (since we cannot replicate all of our data into all regions due to cost and data residency requirements), making changes in the application itself to allow for rendering without having access to all data, etc. This large effort has reduced latency for our buyers worldwide and made their shopping experiences smoother.

Next on our radar are further improvements in Liquid rendering performance, database access optimization, and other performance-related work.


Farewell to Elastic and The New Chapter
23 Feb 2022

Today was my last day at Elastic. After 4.5 years with the company and almost 9 years with the Swiftype (and later Enterprise Search) team, I have decided to move on and see what else is out there. I wanted to use this post to clarify the reasoning behind the decision because a lot of people have been reaching out over the past month wondering about the details.

For most of my career (for at least 15-17 years before joining Elastic) I have worked in small to medium-sized startups, always in SaaS, moving really fast and having my impact on the business be mostly tied to my ability to ship. I loved that, even though it was often painful and stressful. My brain ended up being trained to derive dopamine from the constant feeling of shipping, constant feeling of overcoming challenges and solving problems.

Then, we got acquired by Elastic – a truly amazing company, honestly, the best company I have ever worked for and, unfortunately, a company that ships packaged software, with its inherent effects on the development process. At first, while my projects revolved around security and compliance, then around internal code migrations, integration into the Elastic ecosystem, etc., I felt really happy – I would do the really challenging work and derive pleasure from overcoming those challenges.

Unfortunately, over time the initial rush of excitement faded away and I realized that the most challenging problems within our product had been solved and I had ended up working on a packaged product, building features with a release cycle of 6-8 weeks. The product is amazing, the features are really exciting, but the very long feedback cycle simply did not work for me after so many years in SaaS.

I still believe in Elastic, I love the company, the team and the amazing culture we have built over the years. But I need to ship faster, move faster, get feedback from my users sooner. After considering different options, I have decided to join Shopify – a large and fast moving Rails-based company, where I hope to get a chance to once again experience the thrill of fast delivery and tight feedback loops. Let’s see how it goes 🙂


Edge Web Server Testing at Swiftype
28 Apr 2018

This article was originally posted on the Swiftype Engineering blog.


For any modern technology company, a comprehensive application test suite is an absolute necessity. Automated testing suites allow developers to move faster while avoiding any loss of code quality or system stability. Software development has benefited greatly from the adoption of automated testing frameworks and methodologies. However, the culture of automated testing has neglected one key area of the modern web application serving stack: web application edge routing and multiplexing rulesets.

From modern load balancer appliances that allow for Tcl-based rule sets, to local or remotely hosted Varnish VCL rules, to the power and flexibility that Nginx and OpenResty make available through Lua, edge routing rulesets have become a vital part of application serving controls.

Over the past decade or so, it has become possible to incorporate more and more logic into edge web server infrastructures. Almost every modern web server has support for scripting, enabling developers to make their edge servers smarter than ever before. Unfortunately, the application logic configured within web servers is often much harder to test than that hosted directly in application code, and thus too often software teams resort to manual testing, or worse, customers as testers, by shipping their changes to production without edge routing testing having been performed.

In this post, I would like to explain the approach Swiftype has taken to ensure that our test suites account for our use of complex edge web server logic to manage our production traffic flow, and thus that we can confidently deploy changes to our application infrastructure with little or no risk.

Read the rest of this entry


Compliance-Driven Development or the Story Behind Swiftype’s SOC2 Certification
18 Jan 2018

Based on my experience, just a decade ago not many people within the Silicon Valley startup community considered compliance an important stepping stone in a company’s development roadmap. And when it came to compliance for startups, it was nearly synonymous with PCI DSS, the mandatory certification used by the credit card industry. Over the last few years though, the rise in the number of startups working with large amounts of private and confidential data (fintech, healthcare, etc.) and subsequently the rise in the magnitude of data breaches have led our industry to accept the idea that compliance and certifications are not just for the “big guys”. Nowadays, even very small companies are pressed to go through formal certifications if they want people to trust them with private or confidential data.

That is exactly what happened to Swiftype at the beginning of 2017. While preparing for a public release of our latest product (Swiftype Enterprise Search), we understood that it was going to involve a lot of confidential information and we would need to be able to assure our customers of our capabilities to protect their data. In addition to the marketing aspect, there was a security angle to the problem as well: we were looking for a standard framework that could be used by our small team to ensure the safety of customer data, guiding us through the process. Based on those considerations, we decided to go through a formal SOC 2 certification. In this article, I will describe our journey towards the certification and our findings along the way.

Read the rest of this entry


My Favourite Books in 2017
2 Jan 2018

Following the very ambitious and successful 2016 challenge, I decided to keep the goal at the same level of 36 books for 2017 to prove to myself that it was sustainable and not a one-off success. Surprising myself, I crushed the goal and finished 39 books this year. Below is a summary of the best of those books.

Business, Management and Leadership

After changing my job at the beginning of 2017 and returning to Swiftype to focus on Technical Operations team leadership, I continued working on improving my skills in this area and read a number of truly awesome books:

  • “The Effective Executive: The Definitive Guide to Getting the Right Things Done” by Peter F. Drucker: this classic has immediately become one of my favourite leadership books of all time. There are many useful lessons I learned from it (like the notion that all knowledge workers should consider themselves executives in some sense), but the most powerful was the part on executive time management.
  • “Hatching Twitter: A True Story of Money, Power, Friendship, and Betrayal” by Nick Bilton: a truly horrifying “Game of Thrones”-like story behind the early years of Twitter. I didn’t think shit like that actually happened in real life… I guess the book made me grow up a little and realize that simply doing your best to push your company forward is not always enough. I’d highly recommend this book to anybody working in a fast growing company or thinking about starting a VC-backed business.
  • “Shoe Dog: A Memoir by the Creator of NIKE” by Phil Knight: a great story of a great company built by regular people striving for quality results. Heavily reinforces the notion that to be an entrepreneur you need to be a bit crazy and slightly masochistic. Overall, a fascinating tale of the multi-decade development of a company, a strong contrast with all the modern stories about internet businesses. A must-read for people thinking about starting a business.

Health, Medicine and Mortality

I have always been fascinated by the history of medicine, medical stories and the inner workings of the modern medical system. Unfortunately, this year I’ve had to interact with it a lot and that made me seriously consider the fact of our mortality. This has led me on a quest to learn more about the topics of medicine, mortality and philosophy.

  • “When Breath Becomes Air” by Paul Kalanithi: fantastic memoir! A terrifying, depressing, beautifully described story of a young neurosurgeon, his cancer diagnosis and his battle with the horrible disease up to the very end of his life. I found the story of Paul very relatable and, just like Atul Gawande’s book I read last year, it brought forth very important questions on how we should deal with our own mortality. Paul gave us a great example of one of the options for how we may choose to spend our last days, the same way we may want to spend our lives: “You can’t reach perfection, but you can believe in an asymptote toward which you are ceaselessly striving”.
  • “The Emperor of All Maladies” by Siddhartha Mukherjee: probably the best book on cancer out there (based on my limited research). The author takes us on a long, very interesting and terrifying trip through the dark ages of the human war against cancer and explains why, after so much time, we are still only starting to understand how to deal with it and there is still a long road ahead. Highly recommended to anybody interested in the history of medicine or wanting to understand more about the reasons behind a malady that kills more than 8 million people each year.
  • “Complications: A Surgeon’s Notes on an Imperfect Science” by Atul Gawande: once again, one of my favourite authors manages to explain the hard problem of complications in healthcare and give us a sobering look at the limits and fallibilities of modern medicine.
  • Bonus: “On The Shortness Of Life” by Seneca: it is amazing how something written 2000 years ago can have such profound relevance today. I found this short book really inspiring and it has led me to start adopting some Stoic techniques, including mindfulness and meditation.

Miscellaneous

A few more books I found very interesting:

  • “Born a Crime: Stories From a South African Childhood” by Trevor Noah: listened to this book on Audible and absolutely loved it! I loved hearing Noah’s voice describing his crazy childhood in South Africa, mixing fun and absolutely horrifying details of his life there and the struggles he had to endure being a coloured kid under and right after Apartheid.
    Even though it was never as scary as what Noah describes in his book, I found in his stories a lot of things I could relate to based on my childhood in the late USSR and then in 1990s Ukraine, which was going through an economic meltdown with all of the usual attributes like crime and crazy unemployment.
  • “I Can’t Make This Up: Life Lessons” by Kevin Hart: I have never been a particular fan of Kevin Hart. Not that I disliked him, just didn’t really follow his career. This book (I absolutely recommend the audiobook version!) ended up being one of the biggest literary surprises ever for me: it is the funniest inspirational read and the most inspiring comic memoir I’ve ever read (or, in this case, listened to). Kevin’s dedication to his craft, his work ethic and perseverance are truly inspiring and his success is absolutely well-earned.
  • “Kingpin: How One Hacker Took Over the Billion-Dollar Cybercrime Underground” by Kevin Poulsen: terrifying read… I’ve never realized how close the early years of my career as a systems administrator and developer took me to the crazy world of underground computer crime that was unfolding around us.
    I’ve spent a few weeks wondering if doing what Max and other people in this story did is the result of an innate personality trait or just a set of coincidences, a bad hand life deals a computer specialist, turning them into a criminal. For many people working in this industry, it is always about the craft, the challenge of building systems (just like the bind hack was for Max) and I am not sure there is a point in one’s career when you make a conscious decision to become a criminal. Unfortunately, even after finishing the book I don’t have an answer to this question.
    The book is a fascinating primer on the effects of bad and the need for good security in today’s computerized society and I’d highly recommend it to everybody working with computers on a daily basis.
  • “Modern Romance” by Aziz Ansari: very interesting insight into the crazy modern world of dating and romance. Made me really appreciate the fact that I have already found the love of my life and hope I will never need to participate in the technology-driven culture today’s singles have to deal with. Really recommend listening to the audiobook, Aziz is very funny even when he’s talking about a serious topic like this.
  • “The Year of Living Danishly: My Twelve Months Unearthing the Secrets of the World’s Happiest Country” by Helen Russell: really liked this book. It offers a glimpse into a society surprisingly different from what many modern North Americans would consider normal. Reading about all kinds of Danish customs, I would think back to the times I grew up in the USSR and realize that modern Danish life is very close to what was promised by the party back then. The only difference: they’ve managed to make it work long term.
    Even though not many of us could or would want to relocate to Denmark or affect our government policies, there is a lot in this book that many of us could apply in our lives: trusting people more, striving for a better work-life balance, exercising more, surrounding ourselves with beautiful things, etc.

I hope you enjoyed this overview of the best books I’ve read in 2017. Let me know if you liked it!