Anonymizing a Git Repository
Assume you have worked with others on a repository and want to make it public afterwards, or you want to open a pull request for a nice feature you created, but the names should be removed for privacy reasons.
At university, students produce great results during the semester, but some prefer not to be named when a repository is published. Here is how to get this done.
Steps (for Windows 8.1 environment)
1. Make an export from the current repository and import it into a new one (the one that will be public)
2. Filter the names and e-mail addresses in all branches
3. Remove references to old commits
4. Push it to your new repository's remote URL (see our result here)
1. Make an export from the current repository and import it into a new one
First create an empty folder (/empty/repository) and run git init in it.
The rest is easy with git fast-export. See the official docs.
git fast-export --all | (cd /empty/repository && git fast-import)
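To try the export and import without touching a real project, here is a minimal sketch for a POSIX shell such as Git Bash. The directory names, the file and the student identity are made up for the demonstration; only the fast-export and fast-import pipe is the command from above.

```shell
#!/bin/sh
set -e
# Throwaway directories standing in for the private and the public repository.
work=$(mktemp -d)
src="$work/private"
dst="$work/public"

# Build a tiny source repository with one commit (the identity is made up).
git init -q "$src"
cd "$src"
echo "hello" > readme.txt
git add readme.txt
git -c user.name="Jane Student" -c user.email="jane@example.com" commit -q -m "initial commit"

# The target must exist and be initialized before fast-import can run in it.
git init -q "$dst"

# Stream every ref from the source into the target.
git fast-export --all | (cd "$dst" && git fast-import --quiet)

# Count the commits that arrived in the target.
imported=$(git -C "$dst" rev-list --all --count)
echo "imported commits: $imported"
```

The working tree of the target stays empty after the import; the files only show up once a branch is checked out there.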
Note: git fast-export offers an --anonymize option, but it removes more than the names: tag names and commit messages are obfuscated as well. That is too much in most cases.
2. Filter out names and e-mail addresses
This is the tough one. git filter-branch --env-filter supports it.
Note: On Windows, the argument of --env-filter must be wrapped in double quotes (" ") instead of single quotes (' '). Ironically, git then evaluates the shell syntax inside the quotes in a Cygwin-style shell, so translating the Linux syntax to Windows batch syntax fails. The following works for overwriting all names and e-mail addresses:
Pay attention to the space before the -- that follows the exports. git is very sensitive here.
Done? Not completely. In my case, students committed with messages that contained their student ID, like “s12123 added file router.ts” and so on. I wanted to remove the student IDs. This can be done by filtering the commit messages:
git filter-branch -f --msg-filter "sed -e 's/s[0-9]\+/someone/g'" -- --all
This sed command replaces every string that consists of an s followed by one or more digits with the string someone.
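The substitution can be tried on its own before running it over the whole history. A small sketch, assuming GNU sed (the \+ is a GNU extension); the second sample message is made up to show a side effect of the pattern.

```shell
#!/bin/sh
# Sample commit message in the style described above.
message="s12123 added file router.ts"

# Same substitution as in the msg-filter.
cleaned=$(printf '%s\n' "$message" | sed -e 's/s[0-9]\+/someone/g')
echo "$cleaned"

# The pattern is not anchored, so it also fires inside longer words.
inside=$(printf '%s\n' "see class3 notes" | sed -e 's/s[0-9]\+/someone/g')
echo "$inside"
```

Because the match is not limited to whole words, it is worth checking the rewritten messages with git log afterwards, or tightening the pattern to the exact ID format used.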
3. Remove references to old commits
git keeps backups of the old branch pointers under refs/original/, and these give easy access to the original names. git update-ref can delete them, but doing this manually for every backed-up reference is painful. Make it a batch process:
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
Note: Windows has no xargs. To make this work, install a lightweight Cygwin-like toolset on your Windows machine. I used GOW.
4. Push it to the new repository's remote URL
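A sketch of the cleanup on a throwaway repository, again for a POSIX shell. The two extra commands after the batch line (git reflog expire and git gc) are my addition and not part of the steps above: they drop the reflog entries and unreachable objects that would otherwise still hold the old commits locally. The repository content and identity are made up.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with one commit by a made-up student.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "hello" > readme.txt
git add readme.txt
git -c user.name="Jane Student" -c user.email="jane@example.com" commit -q -m "initial commit"

# Any rewrite leaves a backup of each rewritten ref under refs/original/.
git filter-branch -f --env-filter 'export GIT_AUTHOR_NAME="Anonymized"' -- --all >/dev/null
before=$(git for-each-ref refs/original/ | wc -l | tr -d ' ')

# Delete the backup refs one by one.
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

# Addition: expire reflogs and prune objects that are no longer reachable.
git reflog expire --expire=now --all
git gc -q --prune=now

after=$(git for-each-ref refs/original/ | wc -l | tr -d ' ')
echo "backup refs before: $before, after: $after"
```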
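A minimal sketch of the push, assuming a POSIX shell. A local bare repository stands in for the real remote URL, and the remote name public, the tag v1.0 and the file are made up; with a hosted repository you would put its URL in place of the local path.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# A local bare repository stands in for the new public remote (assumption).
git init -q --bare "$work/public.git"

# The anonymized repository to publish.
git init -q "$work/anonymized"
cd "$work/anonymized"
echo "hello" > readme.txt
git add readme.txt
git -c user.name="Anonymized" -c user.email="anonymized@example.com" commit -q -m "initial commit"
git tag v1.0

# Register the remote, then push all branches and all tags.
git remote add public "$work/public.git"
git push -q public --all
git push -q public --tags

# Check what arrived on the remote side.
pushed_branches=$(git -C "$work/public.git" for-each-ref refs/heads/ | wc -l | tr -d ' ')
pushed_tags=$(git -C "$work/public.git" for-each-ref refs/tags/ | wc -l | tr -d ' ')
echo "branches: $pushed_branches, tags: $pushed_tags"
```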
Compact Batch version with Bonus
I combined all the steps into one batch command. I also added --tag-name-filter so that git fixes the tags automatically: because filtering creates new commits with different hashes, the tags need to be updated to point to them.
git filter-branch -f --env-filter "export GIT_COMMITTER_NAME=Anonymized; export GIT_COMMITTER_EMAIL=[email protected]; export GIT_AUTHOR_NAME=Anonymized; export GIT_AUTHOR_EMAIL=[email protected]"^
 --msg-filter "sed -e 's/s[0-9]\+/someone/g'"^
 --tag-name-filter cat^
 -- --all
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
Note: Pay attention to the ^ Windows escape character. At the end of a line it keeps the newline from being interpreted, so the command continues on the next line.
Warning: The process is not 100% safe. With some effort, repository lurkers can find out names and original e-mail addresses. If you cannot risk this, consider simply exporting your repository files and committing them as a single commit to a new repository. This is the only way to make sure no traces are left.
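One possible way to do such a single-commit export, sketched for a POSIX shell with tar available. It uses git archive to copy only the files of the current commit; that choice, the directory names, the identities and the commit message are my assumptions, not part of the procedure above.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# Source repository with two commits by a made-up student.
git init -q "$work/private"
cd "$work/private"
echo "one" > a.txt
git add a.txt
git -c user.name="Jane Student" -c user.email="jane@example.com" commit -q -m "s12123 first"
echo "two" > b.txt
git add b.txt
git -c user.name="Jane Student" -c user.email="jane@example.com" commit -q -m "s12123 second"

# Export only the files of the current commit, without any history.
mkdir "$work/public"
git archive HEAD | tar -xf - -C "$work/public"

# Commit the snapshot as a single commit in a fresh repository.
cd "$work/public"
git init -q .
git add -A
git -c user.name="Anonymized" -c user.email="anonymized@example.com" commit -q -m "Initial public release"

commits=$(git rev-list --all --count)
authors=$(git log --format='%an' | sort -u)
echo "$commits commit(s) by $authors"
```

The price is that the public repository loses all history, so names or IDs that appear inside the files themselves still have to be checked by hand.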