Hello friends, welcome back. First, let’s talk about trouble.

A long while back, I was working at a now-defunct ISP installing T1 circuits for various businesses. As the newcomer to the scene, I was given the job of configuring the equipment that we would ship to the customer and helping them get things set up.

Sometimes, our customers would experience trouble with their circuit. When we would receive these reports, a veteran colleague of mine at the time would consult the original trouble generator that he had created and we would have a good laugh.

Having been inspired by similar projects that use neural networks to write text, I wondered what it would be like to train a neural network to write trouble.

Deeper down the learning hole

Like many projects of this nature, this one started as a Jupyter notebook. Google Colaboratory is an incredible resource because it gives you free access to a GPU for training your model.

Once I had gotten a PyTorch language model to write trouble, I was stuck with the same problem that I have heard from many others: getting a trained model from a notebook into production sucks.

In the early stages of development, I would retrain the model in the notebook, have it save the corpus and the model to files, and download them to my local machine. Then I would deploy a new version of trouble onto Google App Engine, where I am running a Flask app.

I wanted to automate this entire process using Google Cloud Build and, after a fair amount of work, I’ve managed that. The model can now be retrained on new additions to the trouble database and deployed with one command.

Before we begin

There are a number of steps that I’m omitting here, such as setting up credentials for gcloud, creating a Google App Engine app, and using PyTorch. I’d like to focus this article on the pieces that automate the model pipeline.


The first major set of clues came from this blog. In that article, the author describes how to start a Google Compute instance with access to a GPU, download a notebook and execute it with papermill.

Armed with this new set of tools, I refactored my notebook to write the corpus and model to Google Cloud Storage so that it could be fetched in subsequent build steps.

At the end of my training notebook, I added the following code:

# Full reference: https://cloud.google.com/storage/docs/gsutil/commands/mb
bucket_name = 'artificial-trouble'
!gsutil mb gs://{bucket_name}
!gsutil cp corpus gs://{bucket_name}/
!gsutil cp trained_trouble.pt gs://{bucket_name}/

Now when the training is complete, the corpus and the model files that are generated when the notebook is executed will be uploaded to GCS.
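One thing to watch with `!` shell escapes is that a failing command does not stop the notebook, so a failed upload can go unnoticed when the notebook runs unattended. Here is a minimal sketch of the same upload in plain Python that raises on failure. It assumes gsutil is on the PATH and that the bucket already exists; the helper names upload_commands and upload_artifacts are hypothetical, not part of my notebook.

```python
import subprocess


def upload_commands(bucket_name, paths):
    """Build one `gsutil cp` command per artifact."""
    return [["gsutil", "cp", path, f"gs://{bucket_name}/"] for path in paths]


def upload_artifacts(bucket_name, paths, run=subprocess.run):
    """Run each upload; check=True raises CalledProcessError on failure."""
    for command in upload_commands(bucket_name, paths):
        run(command, check=True)
```

In the notebook this would be called as upload_artifacts('artificial-trouble', ['corpus', 'trained_trouble.pt']). The run parameter exists only so the function can be exercised without touching GCS.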

Running a notebook from within Google Cloud Build

The next major hurdle was to automate the running of the notebook from Google Cloud Build.

I started by uploading my notebook and the following executor script to GCS. After running the training notebook with papermill, the script uses the gcloud utility on the instance to delete that instance, so nothing has to stay running between training runs.

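#!/bin/bash

if lspci -vnn | grep NVIDIA > /dev/null 2>&1; then
  # Nvidia card found, need to check if driver is up
  if ! nvidia-smi > /dev/null 2>&1; then
    echo "Installing driver"
    # Assumption: the Deep Learning VM image ships its driver installer at this path
    /opt/deeplearning/install-driver.sh
  fi
fi

# The notebook locations are passed in as instance metadata attributes
INPUT_NOTEBOOK_GCS_FILE=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/input_notebook -H "Metadata-Flavor: Google")
OUTPUT_NOTEBOOK_GCS_FOLDER=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/output_notebook -H "Metadata-Flavor: Google")

# Assumed minimal run step: fetch the notebook, execute it with papermill, upload the result
gsutil cp "${INPUT_NOTEBOOK_GCS_FILE}" /tmp/input.ipynb
papermill /tmp/input.ipynb /tmp/output.ipynb
gsutil cp /tmp/output.ipynb "${OUTPUT_NOTEBOOK_GCS_FOLDER}"

# Look up this instance's identity, then delete it so it stops accruing cost
INSTANCE_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/instance/name -H "Metadata-Flavor: Google")
INSTANCE_ZONE="/"$(curl http://metadata.google.internal/computeMetadata/v1/instance/zone -H "Metadata-Flavor: Google")
# The zone comes back as projects/<number>/zones/<zone>; keep only the last component
INSTANCE_ZONE="${INSTANCE_ZONE##/*/}"
INSTANCE_PROJECT_NAME=$(curl http://metadata.google.internal/computeMetadata/v1/project/project-id -H "Metadata-Flavor: Google")
gcloud --quiet compute instances delete "${INSTANCE_NAME}" --zone "${INSTANCE_ZONE}" --project "${INSTANCE_PROJECT_NAME}"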

I uploaded the script to my project’s Google Cloud Storage bucket using gsutil: gsutil cp notebook_executor.sh gs://artificial-trouble/
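The metadata lookups that the executor script does with curl are plain HTTP requests carrying a Metadata-Flavor: Google header. As a rough sketch of the same idea in Python (the function names are hypothetical, and fetch_metadata only works from inside a Compute Engine instance): the zone value comes back as a full path of the form projects/<number>/zones/<zone>, and the short zone name is the safest form to hand to gcloud.

```python
import urllib.request

METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"


def metadata_url(path):
    """Full URL for a metadata key such as 'instance/zone'."""
    return f"{METADATA_ROOT}/{path.lstrip('/')}"


def fetch_metadata(path, timeout=5):
    """Read one metadata value; only reachable from inside a GCE instance."""
    request = urllib.request.Request(
        metadata_url(path), headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def zone_name(zone_path):
    """Reduce 'projects/<number>/zones/<zone>' to just '<zone>'."""
    return zone_path.rsplit("/", 1)[-1]
```

On the instance, zone_name(fetch_metadata('instance/zone')) would give a value suitable for the --zone flag.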

I can invoke that script in the Cloud Build pipeline by specifying it as a startup-script-url. This takes the place of the script or bash function that the blog post’s author was using:

# Start training the model
- name: 'gcr.io/cloud-builders/gcloud'
  id: train
  args:
  - compute
  - instances
  - create
  - 'notebook-executor'
  - --zone=us-west1-b
  - --image-family=tf-latest-cu100
  - --image-project=deeplearning-platform-release
  - --maintenance-policy=TERMINATE
  - --accelerator=type=nvidia-tesla-p100,count=1
  - --machine-type=n1-standard-8
  - --boot-disk-size=200GB
  - --scopes=https://www.googleapis.com/auth/cloud-platform
  - --metadata=input_notebook=gs://artificial-trouble/trouble_trainer.ipynb,output_notebook=gs://artificial-trouble/,startup-script-url=gs://artificial-trouble/notebook_executor.sh
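The --metadata flag is what ties this step to the executor script: each key=value pair becomes readable on the instance under instance/attributes/<key> on the metadata server. The following sketch (the helper names are hypothetical) shows that mapping, and rejects commas because they would be read as pair separators in this simple form.

```python
def metadata_flag(attributes):
    """Join key=value pairs into one --metadata flag for gcloud."""
    for key, value in attributes.items():
        if "," in key or "," in value:
            raise ValueError(f"comma not supported in metadata pair: {key}")
    pairs = ",".join(f"{key}={value}" for key, value in attributes.items())
    return f"--metadata={pairs}"


def attribute_path(key):
    """Metadata server path where the instance can read the value back."""
    return f"instance/attributes/{key}"


flag = metadata_flag({
    "input_notebook": "gs://artificial-trouble/trouble_trainer.ipynb",
    "output_notebook": "gs://artificial-trouble/",
    "startup-script-url": "gs://artificial-trouble/notebook_executor.sh",
})
```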

Wait, wait, wait…

Once the instance running the training notebook has been started, Cloud Build moves straight on to the subsequent steps of the pipeline. I wanted the build to wait until the training had finished before proceeding, so I added a bash script as a step that monitors the notebook-executor instance for completion. All the usual caveats about error handling apply, as this is pretty flimsy checking:

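#!/bin/bash

# Succeeds (returns 0) once notebook-executor is no longer RUNNING
check() {
    status=$(gcloud compute instances list | grep notebook-executor | awk '{ print $6 }')
    if [[ $status == 'RUNNING' ]]; then
        return 1
    else
        return 0
    fi
}

while ! check; do
    echo "Waiting"
    sleep 5s
done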

I uploaded my script to my GCS bucket and added new steps to my cloudbuild.yaml file:

# Download wait training wait script wait
- name: gcr.io/cloud-builders/gsutil
  args: ['cp', 'gs://artificial-trouble/wait_for_training.sh', '.']
# Loop waiting for training to complete
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bash'
  args:
    - wait_for_training.sh
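One way to make the polling in wait_for_training.sh a little less flimsy is to ask gcloud for the status field directly and to give up after a deadline. This is only a sketch under assumptions: the function names, the one-hour timeout, and the injectable sleep and clock parameters are mine, and it treats anything other than RUNNING (including a deleted instance) as finished, just like the bash version.

```python
import subprocess
import time


def instance_status(name="notebook-executor"):
    """Return the instance status, or '' if no such instance is listed."""
    result = subprocess.run(
        ["gcloud", "compute", "instances", "list",
         f"--filter=name={name}", "--format=value(status)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_training(get_status=instance_status, interval=5, timeout=3600,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the instance stops RUNNING; return False on timeout."""
    deadline = clock() + timeout
    while get_status() == "RUNNING":
        if clock() >= deadline:
            return False
        print("Waiting")
        sleep(interval)
    return True
```

A build step could exit non-zero when wait_for_training() returns False, so a hung training run fails the build instead of blocking it indefinitely.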

Putting it all together

Now that I have a newly trained model, the rest of my cloudbuild.yaml file will clone my project from GitHub, download my new model and corpus from Google Storage, and then deploy a new version of my app:

# Git clone.
- name: 'gcr.io/cloud-builders/git'
  args: ['clone', 'git@github.com:whoahbot/artificial-trouble']
  volumes:
  - name: 'ssh'
    path: /root/.ssh

# Copy trained files
- name: gcr.io/cloud-builders/gsutil
  args: ['cp', 'gs://artificial-trouble/corpus', '/workspace/corpus']
- name: gcr.io/cloud-builders/gsutil
  args: ['cp', 'gs://artificial-trouble/trained_trouble.pt', '/workspace/trained_trouble.pt']

# Deploy a new version of the app
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['app', 'deploy']

Combining all of the steps outlined here into one cloudbuild.yaml file, I can now train and deploy my model with one gcloud command: gcloud builds submit --config cloudbuild.yaml.

Privacy and permissions

In order to be able to clone a private project from GitHub, you’ll need to add a few steps around key management that are well detailed here.

You’ll also need to add several IAM roles to your Google Cloud Build service account so that it can boot a compute instance, decrypt keys for a private GitHub repo, and act as a service account user when issuing gcloud commands. I ended up adding the following roles to my Cloud Build service account:

App Engine Admin
Cloud Build Service Account
Cloud KMS CryptoKey Decrypter
Compute Instance Admin (beta)
Service Account User
Storage Object Viewer
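If you would rather grant these from the command line than click through the console, the sketch below generates one gcloud projects add-iam-policy-binding command per role. The role IDs are my best mapping of the display names above and the member string assumes the default Cloud Build service account, so check both against your project's IAM page before running anything. The function name and the example project values are hypothetical.

```python
# Assumed mapping from console display names to role IDs; verify before use.
ROLE_IDS = {
    "App Engine Admin": "roles/appengine.appAdmin",
    "Cloud Build Service Account": "roles/cloudbuild.builds.builder",
    "Cloud KMS CryptoKey Decrypter": "roles/cloudkms.cryptoKeyDecrypter",
    "Compute Instance Admin (beta)": "roles/compute.instanceAdmin",
    "Service Account User": "roles/iam.serviceAccountUser",
    "Storage Object Viewer": "roles/storage.objectViewer",
}


def binding_commands(project_id, project_number):
    """Build one add-iam-policy-binding command per role."""
    member = f"serviceAccount:{project_number}@cloudbuild.gserviceaccount.com"
    return [
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member={member}", f"--role={role}"]
        for role in ROLE_IDS.values()
    ]


# Print the commands for review rather than executing them.
for command in binding_commands("my-project", "123456789012"):
    print(" ".join(command))
```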


Thanks for reading if you’ve made it this far. If things aren’t working the way you expected, perhaps you should consult trouble for a diagnosis?