Mono-Repos to the Rescue: Organizing and Archiving Old GitHub Projects
Has your number of GitHub repos increased over time? And they all seem strong point of reference and you don't want to delete them? Let's dive into mono-repos for their safe and sound archival!
Mono Repos
Mono Repos are not a direct component in version control systems like git by itself, rahter its just a concept of merging together multiple repositories into a single one for grouping them together, and they're getting popular now a days after a long time.
I've recently moved most of my repositories that were either doing nothing (either prototypes or not-in-use or are just old) into a the public mono-repository Animeshz/archives-public and a private one named archives-private.
Goals
Since mono repos are not a direct concept in git, a lot of guides or articles shows different ways of achieving this, and some are not good enough because saying something a mono repo can also be done by just copy pasting the repository and deleting its .git
folder, which in my opinion loses a lot of information about it.
So I do have some design goals while doing it, namely,
- All the commits should be preserved as-is, with dates and authors.
- The commits should be easily identifyable to which repository it belonged.
- The repositories that will go into the mono-repo should have a designated directory of residence and should not clog other files and folders.
The approach
I've gone through many approaches ranging from git rebase
, git filter-branch
, git env-filter
and many more, which I'll link in the end if you want to have a read, but I'll quickly wrap my approach and compare it with other approaches.
First of all, creating a repository which will be our mono-repo, with https://github.new, and cloning it as usual,
git clone https://github.com/Animeshz/archives-public
Then each repository that has to be archived, we clone it with the specific branch,
git clone https://github.com/<user>/<repository-name> --branch <branch-name>
cd <repository-name>
First, we'll make its history linear so that we can export the whole branch as patch with git format-patch
later on. We'll do this by removing the merge diamond, that is created by canonical merges see my previous article, which can be done by rebase.
This can be done by git rebase
which will rewrite the history by applying commit one after another chronologically,
git rebase --root -X theirs
git merge --ff $(git commit-tree remotes/origin/${BRANCH:-HEAD}^{tree} -m "Fix after making history linear" -p HEAD)
The -X theirs
accepts the merger branch's changes as git merge
happens to do to keep the merge diamond, and then make a merge commit which accounts for the conflict resolution, here we do something similar, we accept theirs and then whatever delta we see from the rebase completion to the original state of the repository were, we apply it with a fast forward merge. Effectively making our history linear.
Now, we can proceed with our main logic, rewriting history such that the whole repo is contained in a directory as well as renaming commits such that they're identifyable to be part of current repo,
We can do this by git filter-branch
and git env-filter
, but git filter-branch
when ran, itself suggests the use of newren/git-filter-repo. So we'll go with suggested method.
nix profile install nixpkgs#git-filter-repo # or by your package-manager
NAME=<repo-name>
git filter-repo --replace-message <(echo "regex:^==>[$NAME] ") --path-rename :"$NAME"/
This should move all the files into subdirectory going by name $NAME
rewriting the commit history, also prefixing each commit message by [$NAME]
.
WARNING
While almost all the inbuilt git commands allows you reverting changes, this command repacks the git objects and garbage collect immediately, thus loosing the ability to go back with git reflog
, if you don't want this behavior you can pass --partial
to not garbage collect in the end.
See docs for more info.
For instance for repo named overengineered-pastebin
, something like this will happen:
$ git log --format=fuller -p
...
commit 3520a5c93bb65103f23532d4d37ffcc33c727587
Author: Animesh Sahu <animeshsahu19@yahoo.com>
AuthorDate: Fri Nov 4 12:03:22 2022 +0530
Commit: Animesh Sahu <animeshsahu19@yahoo.com>
CommitDate: Fri Nov 4 12:03:22 2022 +0530
[overengineered-pastebin] Your commit message
diff --git a/overengineered-pastebin/frontend/src/components/LanguageSelektor.vue b/overengineered-pastebin/frontend/src/components/LanguageSelektor.vue
...
Now the only thing remaining to do is to export the commits and import them into our mono-repo,
git format-patch -k --no-stat --stdout --root HEAD > my.patch
cd /path/to/archives-public/
git am -k -3 --committer-date-is-author-date --empty=drop "/path/to/archiving/repo/my.patch"
This should super-impose the new repo contained in a designated directory by previous commands with the mono-repo by applying the commits from the repo that is to be archived.
Final Script
I've turned this into an interactive script by using charmbracelet/gum.
#!/bin/bash
MONOREPO="$(realpath "$1")"
REPO="$(gum input --placeholder "Repository URL")"
BRANCH="$(gum input --placeholder "Branch (empty for default)")"
NAME="$(gum input --placeholder "Custom name of subrepo")"
NAME="${NAME:-${REPO##*/}}"
NAME="${NAME%.git}"
options=""
if [[ -n "$BRANCH" ]]; then
options+="-b $BRANCH"
fi
[[ -d "/tmp/$NAME" ]] && gum confirm "Delete directory /tmp/$NAME (already exists)" && rm -rf "/tmp/$NAME"
git clone $options "$REPO" "/tmp/$NAME"
cd "/tmp/$NAME"
git rebase --root -X theirs
git merge --ff $(git commit-tree remotes/origin/${BRANCH:-HEAD}^{tree} -m "Fix after making history linear" -p HEAD)
git filter-repo --replace-message <(echo "regex:^==>[$NAME] ") --path-rename :"$NAME"/ -f
git format-patch -k --no-stat --stdout --root HEAD > my.patch
cd $MONOREPO && git am -k -3 --committer-date-is-author-date --empty=drop "/tmp/$NAME/my.patch"
gum confirm "Delete cloned repo" && rm -rf /tmp/$NAME
Conclusion
I've crunched down 110+ repositories on my github profile into bare 26, splitting them into archives-public and archives-private, removing stale forks in between as well. This let me clean up the profile for an ergonomic view into it.
References
- Rebase way of doing it (way too slow), bash
git -c rebase.instructionFormat='%s%nexec GIT_COMMITTER_DATE="%cD" GIT_AUTHOR_DATE="%aD" git commit --amend -m "bhao bhao: $(git show -s --format=%%s)"' rebase -i --root
- Manual patching with sed
- Using
git filter-branch
- Git filter-repo manual
Further Reading
Backlinks