The Multi-Monorepo
tl;dr: Use top-level folders to group your codebase based on access rights (such as open source vs proprietary) and use a `git-subtree` approach to sync these folders to their own fully-functional monorepos.
In this post I will walk through the options and drawbacks of working with many-repos, and introduce the idea of the multi-monorepo.
Monorepos are great
You check out a single repo. Tooling configuration (lint, style, etc.) can be shared across all packages. Make a change to some code in a package/service. Use your IDE’s refactoring to find all usages of the modified code. All dependent packages’ tests are run. Publishable packages are rebuilt and published. Deployable packages are deployed. And all this can be done in a feature branch. You’ve heard this all before.
In a “many-repo” setup, the need to modify code in just one external repo quickly becomes a nightmare.
Problems with “many-repos”
Let’s take the example of needing to make a breaking API change to a transitive dependency of a service.
E.g. Service depends on Foo which depends on Bar. You make a change in Bar, which requires a change to Foo, which requires a change to Service.
(Figure: Service depends on Foo, which depends on Bar.)
The key is being able to make these three changes on an atomic feature branch that is deployable to a test environment, can be checked out, manually tested and reviewed on a co-worker’s computer, and can be discussed in an online PR tool.
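To make the ripple effect of that chain concrete, here is a minimal sketch that walks a dependency graph and lists every package that has to change (or at least be retested) when one package changes. The `dependsOn` map and the `affectedBy` function are hypothetical names made up for this illustration, not part of any tool mentioned in this post.

```typescript
// Hypothetical dependency map: package name -> packages it depends on directly.
type DependencyGraph = Record<string, string[]>;

const dependsOn: DependencyGraph = {
  Service: ["Foo"],
  Foo: ["Bar"],
  Bar: [],
};

// Returns the changed package followed by every package that depends on it,
// directly or transitively, in the order they are discovered.
function affectedBy(graph: DependencyGraph, changed: string): string[] {
  const affected: string[] = [changed];
  const seen = new Set<string>(affected);
  // Items pushed during the loop are visited too, so the array doubles as a work queue.
  for (const current of affected) {
    for (const [pkg, deps] of Object.entries(graph)) {
      if (deps.includes(current) && !seen.has(pkg)) {
        seen.add(pkg);
        affected.push(pkg);
      }
    }
  }
  return affected;
}

console.log(affectedBy(dependsOn, "Bar").join(" -> ")); // Bar -> Foo -> Service
```

In a many-repo setup, each name in that list is a separate repo that needs its own branch, PR and release.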
To access dependencies in other repos your options are:
- git submodules
  Your repo stores a reference to a commit of another repo and pulls in the files via git CLI commands.
- many-repo management CLI tools
  You check out multiple repos and symlink the packages between repos. These CLI tools allow you to sync changes. E.g. meta, gita, gr
- package registries
  Importing published packages via a package registry like npm.
- git subtrees
  Store a copy of the code from the other repo in your git tree and sync changes back and forth.
Git submodules
Submodules only store a reference to a commit, so if you want to make a feature branch, you need to manually create that branch in two other repos and keep them in sync. This can be tedious and error-prone.
Many-repo management tools + symlinks
If you check out multiple repos, you can use the aforementioned tools to branch multiple repos with one command. You would then symlink the packages together.
Symlinks in Node.js present many difficulties due to the way that the Node.js module resolver uses the `realpath` rather than the logical path to resolve dependencies, breaking peer dependency resolution in some cases.
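The difference between the logical path and the real path is easy to see with a small experiment. The sketch below builds a throwaway folder layout (all folder names are made up for the illustration), symlinks a package into a `node_modules` folder, and compares the two paths using Node's `fs.realpathSync`. It assumes a platform where creating symlinks is permitted.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Builds <tmp>/bar-repo/bar (the real package) and a symlink to it at
// <tmp>/service/node_modules/bar, then reports both views of that path.
function demoSymlinkPaths(): { logical: string; real: string; target: string } {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "multi-monorepo-"));
  const target = path.join(root, "bar-repo", "bar");
  const nodeModules = path.join(root, "service", "node_modules");
  fs.mkdirSync(target, { recursive: true });
  fs.mkdirSync(nodeModules, { recursive: true });

  const logical = path.join(nodeModules, "bar");
  fs.symlinkSync(target, logical, "dir");

  const result = {
    logical,
    real: fs.realpathSync(logical),
    target: fs.realpathSync(target),
  };
  fs.rmSync(root, { recursive: true, force: true }); // clean up the temp layout
  return result;
}

const paths = demoSymlinkPaths();
console.log("logical path:", paths.logical);
console.log("real path:   ", paths.real);
console.log("real path is the symlink target:", paths.real === paths.target);
```

Because resolution starts from the real path by default, a `require` inside the linked package walks up from `bar-repo/bar` and never sees `service/node_modules`, which is how a peer dependency installed only in the service can fail to resolve.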
And what happens if you need to branch from a branch? Or pull a deployed tag when fixing a production bug? It starts getting insane rather quickly. Whereas with a single repo everything is extremely simple.
Package registries
This is the most painful way. Sharing code via package registries means you would probably check out two repos for Foo and Bar, have a script to symlink everything together, make your changes, run tests in three separate repos, then unlink, publish alpha versions, and hope everything works.
The symlink issues mentioned above create a real risk that what you tested and had working locally will break when using published packages. If a bug appears that wasn’t covered by a test, then you have two more packages to publish and you will lean heavily on your tooling for this. If anything goes wrong, you are in trouble.
Git subtrees
Git subtrees involve syncing commits from other repos inside your actual repo. When a new commit is made in another repo, you run a command that pulls in those commits to a sub-folder in your code base. When you want to push updates back out, you filter all commits that touched that sub-folder and merge them back.
The advantage here is that there is a single commit SHA identifying all your deployed code and its dependencies. If you need to make changes to Foo and Bar, then you simply make the changes as one commit, test, push, deploy, done. No complicated tooling to think about, or anything that will prevent you from fixing your bug.
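As a rough sketch of that pull and push cycle, the helper below builds the argument lists for `git subtree pull` and `git subtree push`. It only constructs the commands and does not run them. The `SubtreeLink` type, the function names, the remote URL and the branch name are all placeholders invented for this example.

```typescript
// Hypothetical description of one synced folder; every value used below is a placeholder.
interface SubtreeLink {
  prefix: string; // sub-folder inside your repo
  remote: string; // URL (or remote name) of the other repo
  branch: string; // branch to sync with
}

// Arguments for pulling new commits from the other repo into the sub-folder.
function pullArgs(link: SubtreeLink): string[] {
  return ["subtree", "pull", `--prefix=${link.prefix}`, link.remote, link.branch];
}

// Arguments for splitting out the commits that touched the sub-folder and pushing them back.
function pushArgs(link: SubtreeLink): string[] {
  return ["subtree", "push", `--prefix=${link.prefix}`, link.remote, link.branch];
}

const barLink: SubtreeLink = {
  prefix: "libs/bar",
  remote: "git@example.com:org/bar.git",
  branch: "main",
};

console.log(["git", ...pullArgs(barLink)].join(" "));
console.log(["git", ...pushArgs(barLink)].join(" "));
```

The argument arrays could be handed to something like Node's `child_process.execFile("git", args)`, but whether to squash history, which branch to target and how to handle conflicts are decisions this sketch leaves out.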
The question at this point, though, is: what is the point of even having other repos? Why can’t we just use a monorepo?
Why can’t we just use a monorepo?
Monorepos have their disadvantages too: noisy commit logs, slower CI pipelines, and a confusing, unfriendly experience for open-source contributors, who usually must check out and build the entire codebase.
But there is one main thing they cannot do: access rights per package/service.
The most common example would be a company open-sourcing a package that is a dependency in their private codebase.
When this is necessary it brings us back to all the downsides of many-repos mentioned above. I would argue that this is a huge disincentive for companies to open-source more of their code.
The “multi-monorepo”
The multi-monorepo approach is simply a monorepo, with one or more other repos synced in using `git-subtree`.
You would create a private repository similar to this:
- package.json
- repos
  - org
    - services
      - service-a
    - libs
      - foo
      - bar
  - org-open-source
    - libs
      - some-open-source-thing
  - org-experiments
    - libs
      - some-unstable-thing
  - client-project
The idea is to separate your code into top-level folders based upon access rights and sync these to other monorepos. I keep these folders in a folder called `repos`. (NOTE: Some may not necessarily be separate git repos as I mention below but I couldn’t think of a better name.)
Splitting at the top level makes it clear to your team what is open-source and what is confidential, and it allows simple tooling to verify that no confidential code leaks into the open-source world.
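One possible form of such a check, sketched under assumptions: given a list of packages and the top-level folder each lives in, flag any package in a public folder that depends on a package outside the public folders. The `PackageInfo` shape and the `findLeaks` function are hypothetical, and a real check would have to read this information from the package manifests.

```typescript
// Hypothetical package record: where the package lives and what it depends on.
interface PackageInfo {
  name: string;
  repo: string; // top-level folder under repos/, e.g. "org" or "org-open-source"
  dependencies: string[];
}

// Returns one message per dependency that points from a public folder
// to a package living outside the public folders.
function findLeaks(packages: PackageInfo[], publicRepos: string[]): string[] {
  const repoOf = new Map<string, string>();
  for (const pkg of packages) repoOf.set(pkg.name, pkg.repo);

  const leaks: string[] = [];
  for (const pkg of packages) {
    if (!publicRepos.includes(pkg.repo)) continue;
    for (const dep of pkg.dependencies) {
      const depRepo = repoOf.get(dep);
      // Names we do not know are assumed to be external registry packages and skipped.
      if (depRepo !== undefined && !publicRepos.includes(depRepo)) {
        leaks.push(`${pkg.name} (${pkg.repo}) depends on ${dep} (${depRepo})`);
      }
    }
  }
  return leaks;
}

const examplePackages: PackageInfo[] = [
  { name: "foo", repo: "org", dependencies: ["some-open-source-thing"] },
  { name: "bar", repo: "org", dependencies: [] },
  { name: "some-open-source-thing", repo: "org-open-source", dependencies: ["bar"] },
];

console.log(findLeaks(examplePackages, ["org-open-source"]));
```

Private code depending on open-source code is fine. Only the opposite direction is reported, since the open-source monorepo could not build without the private package.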
There are other reasons to split code at the top level. Not all of these folders need to be synced to other monorepos; some exist simply to replicate “many-repo” ergonomics (a vague word, but it will make sense below). These reasons include:
- Code stability/quality. In a many-repo world you would create a new repo for code experiments. Experiments might not require as high a level of test coverage or code quality checks as the rest of your code base. You would probably keep this in an ‘experiments’ branch and rebase often. Even so, the mental load is greatly reduced from having an entire other folder and the code quality rules can easily be set to ignore code in this folder.
- Working with other companies. Say your company is a web design agency. Maybe you have some libraries or a framework in active development that you use to support the projects you build for clients. You could work on your client’s project in your main monorepo alongside your framework or internal tooling. This framework or tooling would be published to a package registry. You would then sync your client’s code to a repo for them to use which would reference your open-source framework packages from the package registry.
- Solo developer working on many projects. Say you have a few private side-projects you are working on outside of your day job. You probably will want to share some code or tooling at one point between them. For example say you have 5 small websites and you don’t want to rewrite the same authentication code over and over. You could create a “repo” folder for each of the projects. If one starts to take off and you decide to hire someone to work on it — you just sync that to another repo.
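Since only some of these top-level folders are synced elsewhere, one way to picture the setup is a small manifest in which the remote is optional. This is only a sketch: the `RepoFolder` type, the `syncedFolders` function and every URL below are made up for illustration.

```typescript
// Hypothetical manifest entry for one top-level folder under repos/.
interface RepoFolder {
  folder: string;
  remote?: string; // omitted when the folder is not synced to another repo
}

const manifest: RepoFolder[] = [
  { folder: "repos/org" },
  { folder: "repos/org-open-source", remote: "git@example.com:org/open-source.git" },
  { folder: "repos/org-experiments" },
  { folder: "repos/client-project", remote: "git@example.com:client/project.git" },
];

// Keeps only the folders that have somewhere to sync to.
function syncedFolders(entries: RepoFolder[]): Required<RepoFolder>[] {
  return entries.filter((e): e is Required<RepoFolder> => e.remote !== undefined);
}

for (const entry of syncedFolders(manifest)) {
  console.log(`${entry.folder} <-> ${entry.remote}`);
}
```

Turning a side-project into its own repo later would then just mean adding a `remote` to its entry.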
Implementation
Currently I use the excellent git-subrepo project, a shell script similar to subtree but easier to use. It’s quite a complicated script, and when things go wrong it can be difficult to repair without spending a good bit of time grokking how it works. For this reason I was planning on rewriting it in TypeScript to make it easier to recover from a bad state and to provide more helpful error messages, but I recently discovered adeira/shipit, which I will try first.
Some other interesting similar tools:
- Facebook built FBShipIt to help them open-source code from their monorepo.
- I recently found adeira/shipit (from Kiwi.com devs), which is a FBShipIt implementation in JavaScript that I am very eager to try out. The project has been active since Jun 2020. They call the master source-of-truth repo the “universe” repository.
- The PHP Symfony framework uses the splitsh/lite tool. There is a good presentation on that page about this topic and lots of good thinking. They call it a “mono/manyrepo”.
- korfuri/awesome-monorepo has a great list of monorepo tools, including some that allow repo syncing.
- Other cool monorepo tools you should check out include Rush (from Microsoft), Nx, and the upcoming TurboRepo.
Where to next?
Monorepos have been growing in popularity for a while, yet the tooling is still not there, and they are only accessible to larger dev shops with the resources to invest in it.
My hope is that multi-monorepo tooling will continue to improve over the coming years to allow more companies to easily open-source parts of their stack, and that all developers can reap the benefits of monorepos whilst remaining able to open-source their work, and to build more projects with more easily shared common code. IDE integration is also needed to allow seamless workflows.
I am working on a simple monorepo framework myself called Live, built upon the pnpm package manager, and I hope to open-source everything soon. The goal is to achieve a seamless multi-monorepo workflow with a simple CLI and IDE integrations for WebStorm and VSCode to allow developers to share more code between projects and with the open-source community.
Keen to hear your thoughts!