On Demand QA Environments

The Problem

Let's just lead with this. The Engineering team at Numerator will double in the next year...

An engineering team generally doesn't go from 20 to 40, or 40 to 80, without demand from customers. That demand requires more engineers to build out features, fix bugs, scale systems, and provide support. As an engineering team scales, which of these does it inherently need?

  1. More Snacks in the Office
  2. Bevi and LaCroix to get that seltzer fix
  3. Tools to make developing software faster

Kind of an unfair question, because all three are right. But I don't feel qualified to write a blog post about carbonated water and snacks, so the best answer at this particular moment is number three. What can become an afterthought for a successful engineering team is a development environment where developers can work fast and deliver those million-dollar features.

For a fast-growing startup, the development environment at Numerator has been great. Vagrant environments are fired up across multiple teams: with just a few commands, a developer is running a near-production environment locally. This covers almost all local testing, but what happens when you want to test on the real thing, or at least something a little closer to it?

There are QA environments for that. Prior to this change there were three of them (for a team of eight or so engineers); each is simply a few EC2 instances that comprise most of what's in a production environment. To get another one, a developer would likely have to ask a DevOps person to bring up another stack. If the DevOps person is busy, this costs the developer time and can delay a feature launch. Managing database migrations across multiple branches in these environments caused a number of headaches as well. These issues, plus a few more, led us to develop a system that could support on-demand creation of QA environments.

What we wanted (needed) to accomplish

  1. New environments in minutes not days
  2. Environments expire, no need to waste $$$ on stale environments
  3. Notifications and user-feedback on stack creation
  4. Environments can be created by anyone with proper Jenkins credentials
  5. Non-engineers can view/test what developers are working on

With that in mind, we got to work. Most of the pieces were already in place; it was just a matter of putting them all together. We could combine technology we were already using into a robust system that could improve everyone's workflow. Let's take a look at the implementation details.

The Infrastructure

The infrastructure/architecture of the project is composed of Terraform, Ansible, and a few AWS services to store and keep track of state.


Terraform

    Module composed of:
        EC2 Instances
        Route 53 Records

Ansible

    Dynamic Inventories (more information below)
    Playbook for setup on the instance
    Playbook for deploy on the instance

AWS Lambda

    QA Status Slack Command
    Destroy stacks that are expired

AWS State Storage

    Store information about stacks

Making it user friendly

Jenkins Pipelines  
    Create a new stack 
    Deploy an existing stack  
    Destroy a stack 

See It

Create a new Stack

A user can create a stack with a couple of text entries and a button click. A screenshot from the Jenkins UI:


Wait for it

A brand new stack will be up in ~15 minutes.

Track Status With Slack


Or Jenkins


Deploy to the new stack

Made a modification and need to see the new code? Once the stack is up, a developer can deploy to it at will. Via... you guessed it, another Jenkins pipeline.


Track Instances Created

At any point you can use a handy Slack command to view all of the stacks currently created.
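As a rough illustration, the message-formatting half of such a command could look like the sketch below. The record fields (`name`, `owner`, `expires_at`) and function name are hypothetical, not Numerator's actual implementation; the real command runs behind a Slack slash command backed by Lambda.

```python
def format_stack_status(stacks):
    """Render a list of stack records as a Slack-friendly status message.

    `stacks` is a list of dicts like
    {"name": ..., "owner": ..., "expires_at": ...} -- hypothetical fields,
    roughly what a stack-state table might hold.
    """
    if not stacks:
        return "No QA stacks are currently running."
    lines = ["*Current QA stacks:*"]
    for stack in stacks:
        # Backticks render the stack name in monospace in Slack
        lines.append("- `{name}` (owner: {owner}, expires: {expires_at})".format(**stack))
    return "\n".join(lines)
```

The Lambda behind the slash command would fetch the records from wherever stack state is stored and return this string in its HTTP response to Slack.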


Destroying the Stacks

The stacks have a default lifespan of 24 hours. A scheduled Lambda task runs every fifteen minutes and checks whether any stacks need to be destroyed. If any do, it calls terraform destroy on them. This portion could use some improvement; we could monitor usage to better know when to destroy a stack, but for now we destroy based purely on a timestamp.
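The expiry check itself is simple. Here is a minimal sketch of the logic such a scheduled Lambda might run; the function and field names are hypothetical, and the actual task would follow up by shelling out to terraform destroy for each expired stack.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LIFESPAN = timedelta(hours=24)

def expired_stacks(stacks, now=None, lifespan=DEFAULT_LIFESPAN):
    """Return the names of stacks whose age exceeds their lifespan.

    `stacks` is a list of dicts like
    {"name": ..., "created_at": "<ISO 8601 timestamp>"} -- hypothetical
    fields, roughly what a stack-state table might hold.
    """
    now = now or datetime.now(timezone.utc)
    doomed = []
    for stack in stacks:
        created = datetime.fromisoformat(stack["created_at"])
        if now - created > lifespan:
            doomed.append(stack["name"])
    return doomed

# The scheduled handler would then destroy each expired stack, e.g.:
# for name in expired_stacks(load_stacks()):
#     subprocess.run(["terraform", "destroy", "-auto-approve", name])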

Code Examples

Dynamic Inventories
Since Numerator is an AWS shop, we took Ansible's dynamic EC2 inventory script and modified it to fit our needs. Once those modifications were made, we could pass this Python script via the -i parameter to our Ansible setup and deploy scripts.

The output of the script is a JSON-style inventory (host lists elided here).

  {
    "qa": {
      "children": ["cleaner", "web", "worker", "stateful"],
      "vars": {
        "url": "elb.net",
        "s_names": "elb.net"
      }
    },
    "cleaner": [...],
    "web": [...],
    "worker": [...],
    "stateful": [...]
  }
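To make the idea concrete, here is a minimal sketch of the grouping step a trimmed-down dynamic inventory script might perform. The tag names (`qa_stack`, `role`) and the flattened instance shape are assumptions for illustration, not Numerator's actual script, which is a modified version of Ansible's stock EC2 inventory.

```python
def build_inventory(instances, stack_name):
    """Group EC2 instance descriptions into an Ansible-style inventory dict.

    `instances` is a list of dicts like
    {"private_ip": ..., "tags": {"qa_stack": ..., "role": ...}} -- a
    hypothetical flattening of what boto3's describe_instances returns.
    """
    groups = {"cleaner": [], "web": [], "worker": [], "stateful": []}
    for inst in instances:
        tags = inst.get("tags", {})
        if tags.get("qa_stack") != stack_name:
            continue  # only include hosts belonging to this QA stack
        role = tags.get("role")
        if role in groups:
            groups[role].append(inst["private_ip"])
    # Parent "qa" group whose children are the role groups; shared vars
    # (like the ELB URL) would be filled in here as well.
    inventory = {"qa": {"children": sorted(groups), "vars": {}}}
    inventory.update(groups)
    return inventory
```

A real inventory script would print `json.dumps(build_inventory(...))` when invoked with `--list`, which is the contract Ansible expects from a `-i` executable.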

And the Jenkins Pipeline/Groovy code to call that.

    ansiblePlaybook(
        playbook: 'plays/project/setup_qa.yaml',
        inventory: 'inventories/custom_ec2.py',
        credentialsId: 'creds',
        colorized: true
    )

Jenkins Pipeline - Declarative Example

Let's look at a somewhat full example of a Jenkins pipeline. Once the syntax is grasped, it's fairly easy to apply it to most use cases.

Basic Structure

pipeline {
    agent {
        label 'my_deploy_agent'
    }
    environment {
        ENV_VAR_1 = 'test'
        TERRAFORM_BRANCH = 'master'
        ANSIBLE_BRANCH = 'master'
    }
    stages {
        stage('Stage1') { ... }
        stage('Stage2') { ... }
        stage('StageN') { ... }
    }
    post {
        success { ... }
        failure { ... }
    }
}

Looking at a Stage

The production version of the create-stack pipeline has four stages.

  1. Terraform -> creates the instances necessary for a QA environment
  2. Wait for SSH -> an Ansible playbook that pings the server until it's ready to accept SSH connections from a Jenkins user
  3. Ansible Setup -> playbook that runs setup
  4. Ansible Deploy -> playbook that runs deploy

A focus on the Terraform stage

Each stage is made up of steps, and a stage can have many of them. The Terraform stage's steps are as follows.

  • Send a Slack Message to inform the user that we've started Stack Creation
  • Checkout the Terraform Git Repo
  • Run the Terraform Command

And here's the code to do just that.

        stage('Terraform') {
            steps {
                slackSend(
                    color: "#D4DADF",
                    message: "`${env.JOB_NAME}` - #${env.BUILD_NUMBER} Started by user ${env.USER} (<${env.BUILD_URL}|Open>)\nStack: `${env.STACK_NAME}`",
                    channel: "#your-slack-channel")
                git(
                    credentialsId: 'ABCDEF',
                    url: 'git@github.com:git_repo',
                    branch: env.TERRAFORM_BRANCH)
                withCredentials([[
                    $class: 'AmazonWebServicesCredentialsBinding',
                    credentialsId: 'jenkins-credentials',
                    accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                    secretKeyVariable: 'AWS_SECRET_ACCESS_KEY'
                ]]) {
                    sh(script: 'terraform apply qa-bi-ondemand stackname=${STACK_NAME} verbose=1', returnStdout: true)
                }
            }
        }


We've been using the on-demand QA environments for a couple of months now, and they have drastically improved the workflow of our engineering team. They've also brought some quality-of-life improvements for the DevOps engineers. No longer do we have Slack conversations about who gets the QA or staging environment next. Now the conversations have turned to more riveting topics, like: what constitutes a sandwich?

If you need more examples, or help constructing these environments, feel free to reach out.