How to Use Private Repos with Colab
2022-03-01
I forgot that this is a tech blog sometimes. Some notes from work.
(If you just want an answer to your problem, scroll to "A clean setup for private repos" below. This solution does not use personal access tokens, so your team doesn't need to create a billion GitHub accounts.)
Background
Colab is Google's version of a Jupyter notebook in the cloud. It has clean APIs to everything Google and is pre-installed with some nice libraries. I begrudgingly admit that it's not bad.
But Colab still has the same code problems that all Jupyter notebooks have:
- no code review
- no unit tests
- no tooling support
- no reuse across notebooks
- no change history
- no namespacing
- no moral decency
The fix for all of these, of course, is to put important code in a repo then pull that code in as needed. But it's not obvious how to do that.
The setup for public repos
If your repo is public, this is a little easier:
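A minimal sketch of the clone step, assuming a hypothetical repo URL and folder name (both are placeholders, as are the helper names `clone_command` and `clone_repo`):

```python
import subprocess
from pathlib import Path

# Hypothetical placeholder; substitute your own public repository.
REPO_URL = "https://github.com/your-org/your-repo.git"


def clone_command(repo_url: str, dest: str) -> list[str]:
    """Build the git command that clones repo_url into dest."""
    return ["git", "clone", repo_url, dest]


def clone_repo(repo_url: str, dest: str) -> None:
    """Clone the repo, unless dest already exists (e.g. the cell was re-run)."""
    if not Path(dest).exists():
        subprocess.run(clone_command(repo_url, dest), check=True)
```

In a notebook cell you would then call `clone_repo(REPO_URL, "your-repo")`. The shell-magic equivalent is a one-liner: `!git clone` followed by the URL.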
Then your cloned repository is on the filesystem, and you can do whatever you want with it. If you feel fancy, you can even add a setup.py to your repo then pip install it, though this will cause problems if your dependencies clash with what Colab already provides:
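Roughly, the install step might look like the sketch below. The folder name is a placeholder and the helper names are made up for illustration. The path is resolved to an absolute one because pip would read a bare name like your-repo as a package to look up rather than as a local folder.

```python
import subprocess
import sys
from pathlib import Path


def pip_install_command(repo_dir: str) -> list[str]:
    """Build a pip command that installs the package found in repo_dir."""
    # sys.executable -m pip targets the interpreter the notebook is running on.
    # Resolving to an absolute path makes pip treat the argument as a local path.
    return [sys.executable, "-m", "pip", "install", str(Path(repo_dir).resolve())]


def install_repo(repo_dir: str) -> None:
    """Install the cloned repo; raises CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(repo_dir), check=True)
```

After cloning, `install_repo("your-repo")` would make the package importable from any cell.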
But if your code is sensitive or proprietary, keeping it public isn't a great business plan.
Some resources online will suggest getting an access token for your account then using that to fetch the repo. But if you work in a larger team, everyone has to set up GitHub, get an access token, and modify the notebook to use it. It's just not a clean or simple solution.
A clean setup for private repos
Here's a clean setup that doesn't require all of your Colab users to create GitHub accounts:
- Create a new public/private key pair that you will use only for this integration, e.g. through ssh-keygen -t ed25519. Keep it simple: default directory, no passphrase.
- Set the public key (~/.ssh/id_ed25519.pub) as the deploy key on your private repo. You have the option to enable write access, but this is foolish without good cause. Keep it read-only.
- Dump the private key (~/.ssh/id_ed25519) into a string that you save in your Colab. Your security instincts will scream at you not to do this. But (a) anyone with access to the Colab could already see your private code anyway and (b) this private key is used only for the repo read, not for anything else.
- Add some logic to write your private key string to ~/.ssh/id_ed25519. If the filesystem copy is ever lost, this logic will just write the key back to where it needs to be. But before you run this code, please test that the content of your string is identical to the content of ~/.ssh/id_ed25519.
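The identity check in the last step can be a few lines run on the machine where you generated the key. This is a sketch, and `key_matches_file` is a name made up for illustration. It compares bytes rather than text, so small differences such as a dropped trailing newline show up too.

```python
from pathlib import Path


def key_matches_file(key_text: str, key_path: Path) -> bool:
    """Return True if key_text is byte-for-byte identical to the file at key_path."""
    return key_path.read_bytes() == key_text.encode("utf-8")
```

For example, `key_matches_file(PRIVATE_KEY, Path.home() / ".ssh" / "id_ed25519")` should return True before you let any code overwrite that file, where `PRIVATE_KEY` stands for whatever variable holds your pasted string.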
And now you can get your repo. End-to-end, it looks something like this:
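A sketch of the whole flow, under a few assumptions: the key string, repo address, and folder name are placeholders, the helper names are invented for this example, and the repo is hosted on GitHub. The ssh-keyscan step is there because a non-interactive clone cannot answer the host-key prompt. It trusts whatever key the network returns on first contact, so pin GitHub's published host key instead if that worries you.

```python
import subprocess
from pathlib import Path

# Hypothetical placeholders: substitute your own deploy key and repo.
PRIVATE_KEY = """-----BEGIN OPENSSH PRIVATE KEY-----
(paste the contents of ~/.ssh/id_ed25519 here)
-----END OPENSSH PRIVATE KEY-----
"""
REPO_SSH_URL = "git@github.com:your-org/your-private-repo.git"
REPO_DIR = "your-private-repo"


def install_key(key_text: str, home: Path) -> Path:
    """Write the deploy key to <home>/.ssh/id_ed25519 and return its path."""
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    key_path = ssh_dir / "id_ed25519"
    key_path.write_bytes(key_text.encode("utf-8"))
    key_path.chmod(0o600)  # ssh rejects private keys that others can read
    return key_path


def trust_github(home: Path) -> None:
    """Record GitHub's host key so the clone does not stop at a prompt."""
    scan = subprocess.run(
        ["ssh-keyscan", "-t", "ed25519", "github.com"],
        check=True, capture_output=True, text=True,
    )
    with open(home / ".ssh" / "known_hosts", "a") as known_hosts:
        known_hosts.write(scan.stdout)


def fetch_private_repo(key_text: str, repo_url: str, repo_dir: str) -> None:
    """Restore the key if needed, then clone the repo once."""
    home = Path.home()
    install_key(key_text, home)
    if not Path(repo_dir).exists():
        trust_github(home)
        subprocess.run(["git", "clone", repo_url, repo_dir], check=True)
```

Calling `fetch_private_repo(PRIVATE_KEY, REPO_SSH_URL, REPO_DIR)` at the top of the notebook is safe to re-run: the key is rewritten each time and the clone is skipped once the folder exists.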
Security notes
Anyone with access to the Colab will be able to see your repo. So if your repo contains other secrets you don't want people to see, perhaps split it in two and use only the safe version with Colab.
Final thoughts
Credit where credit is due: thanks to Felix Müller for describing this approach. My main changes were to use ed25519 and to clean up one of the code examples. I also used this post as an excuse to complain about Jupyter notebooks, which is always a meritorious deed.