How to Use Private Repos with Colab

2022-03-01

Sometimes I forget that this is a tech blog. Some notes from work.

(If you just want an answer to your problem, scroll to "A clean setup for private repos" below. This solution does not use personal access tokens, so your team doesn't need to create a billion GitHub accounts.)

Background

Colab is Google's version of a Jupyter notebook in the cloud. It has clean APIs to everything Google and is pre-installed with some nice libraries. I begrudgingly admit that it's not bad.

But Colab still has the same code problems that all Jupyter notebooks have:

  • no code review
  • no unit tests
  • no tooling support
  • no reuse across notebooks
  • no change history
  • no namespacing
  • no moral decency

The fix for all of these, of course, is to put important code in a repo then pull that code in as needed. But it's not obvious how to do that.

The setup for public repos

If your repo is public, this is a little easier:

! git clone https://github.com/my_username/my_repository.git

Then your cloned repository is on the filesystem, and you can do whatever you want with it. If you feel fancy, you can even add a setup.py to your repo then pip install it, though this will cause problems if your dependencies clash with what Colab already provides:

! pip install git+https://github.com/my_username/my_repository.git
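
If you skip the pip install and just clone, the notebook still needs to know where the code lives before you can import it. Here is a minimal sketch of one way to do that. The helper name `add_repo_to_path` is made up for illustration, and the `/content/my_repository` path in the usage comment assumes you cloned into Colab's default working directory.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir: str) -> str:
    """Put a cloned repo directory at the front of sys.path.

    Returns the resolved directory so the caller can see what was added.
    """
    resolved = str(Path(repo_dir).resolve())
    # Avoid stacking duplicates when the cell is re-run.
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Hypothetical usage in a notebook cell, after the clone:
# add_repo_to_path("/content/my_repository")
# import my_module  # a module at the top level of your repo
```

This sidesteps the dependency clashes that pip install can cause, at the cost of not installing your repo's dependencies for you.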

But if your code is sensitive or proprietary, keeping it public isn't a great business plan.

Some resources online will suggest getting an access token for your account then using that to fetch the repo. But if you work in a larger team, everyone has to set up GitHub, get an access token, and modify the notebook to use it. It's just not a clean or simple solution.

A clean setup for private repos

Here's a clean setup that doesn't require all of your Colab users to create GitHub accounts:

  1. Create a new public/private key pair that you will use only for this integration, e.g. through ssh-keygen -t ed25519. Keep it simple: default directory, no passphrase.

  2. Set the public key (~/.ssh/id_ed25519.pub) as the deploy key on your private repo. You have the option to enable write access, but this is foolish without good cause. Keep it read-only.

  3. Dump the private key (~/.ssh/id_ed25519) into a string that you save in your Colab notebook. Your security instincts will scream at you not to do this. But (a) anyone with access to the notebook could already see your private code anyway, and (b) this key is a deploy key that grants only read access to this one repo, nothing else.

  4. Add some logic to write your private key string to ~/.ssh/id_ed25519. If the filesystem copy is ever lost, this logic will just write the key back to where it needs to be. But before you run this code, please test that the content of your string is identical to the content of ~/.ssh/id_ed25519.
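
The check in step 4 is easy to get wrong by eye, so here is a small sketch of it. It is meant to run on the machine where you generated the key, not in Colab. The helper name `key_matches_file` is an invention for this example. A likely culprit when the two differ is a missing final newline in the pasted string.

```python
from pathlib import Path


def key_matches_file(key_text: str, key_path: str) -> bool:
    """Return True if key_text is byte-for-byte identical to the file at key_path."""
    file_bytes = Path(key_path).expanduser().read_bytes()
    return file_bytes == key_text.encode("utf-8")


# Hypothetical usage, with GITHUB_PRIVATE_KEY defined as in the notebook:
# assert key_matches_file(GITHUB_PRIVATE_KEY, "~/.ssh/id_ed25519")
```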

And now you can get your repo. End-to-end, it looks something like this:

GITHUB_PRIVATE_KEY = """-----BEGIN OPENSSH PRIVATE KEY-----
my super cool key
-----END OPENSSH PRIVATE KEY-----
"""

# Create the directory if it doesn't exist.
! mkdir -p /root/.ssh
# Write the key
with open("/root/.ssh/id_ed25519", "w") as f:
  f.write(GITHUB_PRIVATE_KEY)
# Add github.com to our known hosts
! ssh-keyscan -t ed25519 github.com >> ~/.ssh/known_hosts
# Restrict the key permissions, or else SSH will complain.
! chmod go-rwx /root/.ssh/id_ed25519

# Note the `git@github.com` syntax, which will fetch over SSH instead of
# HTTPS.
! git clone git@github.com:my_username/my_repository.git
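
If you'd rather keep the key-writing step in plain Python, something like the sketch below should work. It is an alternative to the mkdir, write, and chmod lines above, not what the original setup uses, and `install_private_key` is a hypothetical name. It creates the file with owner-only permissions from the start and is safe to re-run if the filesystem copy is lost.

```python
import os
from pathlib import Path


def install_private_key(key_text: str, key_path: str) -> None:
    """Write key_text to key_path, readable and writable by the owner only."""
    path = Path(key_path)
    # Create the .ssh directory if it doesn't exist.
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    # Open with mode 0o600 so a new file is never group- or world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(key_text)
    # If the file already existed, os.open kept its old mode, so reset it.
    os.chmod(path, 0o600)


# Hypothetical usage in the notebook:
# install_private_key(GITHUB_PRIVATE_KEY, "/root/.ssh/id_ed25519")
```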

Security notes

Anyone with access to the Colab will be able to see your repo. So if your repo contains other secrets you don't want people to see, perhaps split it in two and use only the safe version with Colab.

Final thoughts

Credit where credit is due: thanks to Felix Müller for describing this approach. My main change was to use ed25519 and to clean up one of the code examples. I also used this post as an excuse to complain about Jupyter notebooks, which is always a meritorious deed.