Test your infrastructure code with Terratest

With the emergence of Infrastructure as Code (Ansible, Puppet, Heat or Terraform), we’d like to take advantage of the good practices brought by the Software Craftsmanship movement to guarantee the quality of our infrastructure code. Every professional developer knows that ensuring code quality requires tests. One of the resulting practices is TDD, or Test-Driven Development.

As a reminder, a TDD cycle goes as follows: write a test; check that it fails; write just enough code to make it pass; rerun the test and check that it passes along with all the previous tests; finally, refactor the code before starting a new cycle.

The three main steps of TDD

The advantage of this practice is a short feedback loop, which lets us detect bugs as early as possible. It also makes it possible to meet requirements with minimal complexity, and therefore to design better code.

While the tools that let developers practise TDD have matured in most languages, this is not yet the case for Infrastructure as Code tools. The very good post “The Wizard” shows that Ansible code can already be written efficiently with TDD...

...but as far as provisioning tools like Terraform are concerned, it's different.

Manual test pains

Terraform, built by HashiCorp, allows us to define infrastructure in a high-level language and to deploy it in a cloud provider environment like Amazon Web Services or Google Cloud Platform.

Today, when we want to test Terraform code, we run the tests manually. We can’t do it locally: you can’t test a VPC installation on your machine using localhost. So we test directly in our cloud environments.

These tests can take several minutes and often include many steps, some of them manual, which forces us either to wait or to switch context regularly, delaying the feedback loop. Then we must validate that the generated infrastructure matches our expectations, using a web console or a command line. Sometimes we connect directly to a machine to check that a file is present. Then we delete the machines and start again…

Terratest helps us automate what we would otherwise do manually, possibly several times over while correcting our mistakes.

Terratest

What is Terratest? It’s a Go library that helps you create and automate tests for Infrastructure as Code written with Terraform or Packer, targeting IaaS providers like Amazon and Google, or a Kubernetes cluster.

Terratest is developed by Gruntwork, an American company working in partnership with HashiCorp, which has open-sourced more than 300,000 lines of code and uses Terratest to maintain its codebase. Terratest has been available on GitHub since April 2018; the repository is listed in the sources at the end of this article if you want to play with it.

Why use Terratest?

There are many advantages to testing your code with a library like Terratest. Here is a non-exhaustive list:

  • Test complex infrastructure behaviours
  • End to end tests
  • Infrastructure documentation
  • No need for ops to maintain a permanent production-like infrastructure to test their changes
  • Fast feedback loop allowing efficient bug fixing
  • Tests that can be launched quickly and regularly
  • Ensure resilience when upgrading tool versions
  • Test cloud-init-like scripts
  • Validate deployed AMIs

Moreover, Terratest provides many examples in its Git repository, easing the usage of the library.

How does it work?

To write Terraform code following Test Driven Infrastructure principles, we'll proceed in stages:

  • We'll start by writing the test in Go, in a *_test.go file such as instance_test.go. We'll take care to write assertions that can clearly pass or fail. Following the TDD principle, we first check that the test fails by running "go test instance_test.go".
  • Then we'll write our Terraform code describing our infrastructure, trying to keep the project modular.
  • We'll relaunch Terratest, this time to deploy our infrastructure on our favourite IaaS. Be aware that Terratest really builds the infrastructure: it runs terraform apply, which can obviously incur costs.
  • Terratest runs the tests. To validate the compliance of our infrastructure, the library can call HTTP endpoints, connect to machines over SSH to execute commands, upload files, query cloud provider APIs and read Terraform outputs.
  • Finally, the test infrastructure is destroyed by terraform destroy.
  • Test results are displayed in the console.

Hands on

Objective: to play with the tool, we’ll test an instance initialization script by connecting directly to the machine and verifying its content.

Installation

To discover the Terratest library you can clone its Git repository. The code repository contains both modules and many examples. The examples give a good idea of what the tool can do and are a good starting point.

Since Terratest is a Go library, you obviously need Go installed on your machine. To install the Terratest modules, it is preferable to use a dependency manager such as dep.

We can therefore install the module that will be used to test Terraform in this way:

dep ensure -add github.com/gruntwork-io/terratest/modules/terraform

Or like this with go get:

go get github.com/gruntwork-io/terratest/modules/terraform

It is advisable to declare all dependencies in the Gopkg.toml file, which also lets you pin their versions.

Then make sure you have the necessary access to your cloud provider. In our case, we use AWS and load the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. We’ll also make sure we have a private/public key pair to connect to the created machines, with the public key registered in AWS. For this example, we named the key "terratest_key".

Finally, we’ll create a project with empty files structured in this way:

Project Structure

Dependencies

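To start, we use the aptly named "test" Go package and import the modules we need into the instance_test.go file (shown here for the first test only):

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)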

First test: The SSH Key

Then we declare our first test function, which must follow the form func TestXxx(*testing.T) to be picked up by the go test command. We start by testing that a machine is created with the expected SSH key.

func TestInstanceSshKey(t *testing.T) {}
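The naming rule can be made concrete. The sketch below re-implements, for illustration only, the check that go test applies to function names: the name must start with Test, and whatever follows must not begin with a lowercase letter. The helper name isTestFuncName is hypothetical and is not part of the Go toolchain's public API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isTestFuncName reports whether go test would accept name as a test function
// name: it must start with "Test", and the remainder must not begin with a
// lowercase letter.
func isTestFuncName(name string) bool {
	if !strings.HasPrefix(name, "Test") {
		return false
	}
	rest := name[len("Test"):]
	if rest == "" {
		return true // a function named exactly "Test" is accepted
	}
	r, _ := utf8.DecodeRuneInString(rest)
	return !unicode.IsLower(r)
}

func main() {
	for _, name := range []string{"TestInstanceSshKey", "Testinstance", "InstanceTest"} {
		fmt.Printf("%-20s %v\n", name, isTestFuncName(name))
	}
}
```

A function called Testinstance or InstanceTest would compile but would be silently ignored by go test, which is worth remembering when a test "passes" suspiciously fast.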

With the "configuration" part in place (the configureTerraformOptions helper, which returns the Terraform options used by every test), it's time for action. We want to initialize our Terraform working directory and apply our code (terraform init + terraform apply): the InitAndApply function does this. We also want to destroy all our machines at the end of the tests (terraform destroy). The defer keyword adds the Destroy call to the list of actions performed when the test function returns.

defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
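Registering the cleanup before the apply matters: a Go test that stops through t.FailNow (or t.Fatal) exits via runtime.Goexit, which still runs the deferred calls already registered. The following stdlib-only sketch simulates that behaviour; runLifecycle and the step names are hypothetical stand-ins for the real Terraform calls, and nothing is deployed.

```go
package main

import (
	"fmt"
	"runtime"
)

// runLifecycle simulates a test body running in its own goroutine, as go test
// does. It records the steps that were executed.
func runLifecycle(failAssert bool) []string {
	var steps []string
	done := make(chan struct{})
	go func() {
		defer close(done)
		// Registered before the apply, so it runs however the function exits.
		defer func() { steps = append(steps, "destroy") }()
		steps = append(steps, "apply")
		if failAssert {
			runtime.Goexit() // this is how t.FailNow stops a test
		}
		steps = append(steps, "assert ok")
	}()
	<-done
	return steps
}

func main() {
	fmt.Println(runLifecycle(false))
	fmt.Println(runLifecycle(true))
}
```

In both runs "destroy" is the last recorded step, including the run that is aborted right after "apply".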

Finally, we fetch the Terraform output that holds the name of the instance's SSH key. This key will later allow Terratest to connect to the machine. Our first assertion therefore checks that the key has the expected name.

instanceSshKey := terraform.Output(t, terraformOptions, "instance_key")
assert.Equal(t, "terratest_key", instanceSshKey)
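To picture what reading an output involves, here is a stdlib-only sketch that parses the JSON document printed by terraform output -json and extracts one string value. The sample document and the helper name outputValue are assumptions made for this example; it is not how Terratest is implemented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfOutput mirrors one entry of the JSON printed by `terraform output -json`.
type tfOutput struct {
	Sensitive bool            `json:"sensitive"`
	Value     json.RawMessage `json:"value"`
}

// outputValue returns the string output called name, or an error if the
// output is missing or is not a string.
func outputValue(doc []byte, name string) (string, error) {
	var outputs map[string]tfOutput
	if err := json.Unmarshal(doc, &outputs); err != nil {
		return "", err
	}
	out, ok := outputs[name]
	if !ok {
		return "", fmt.Errorf("output %q not found", name)
	}
	var s string
	if err := json.Unmarshal(out.Value, &s); err != nil {
		return "", fmt.Errorf("output %q is not a string: %w", name, err)
	}
	return s, nil
}

// sampleOutputs is a hand-written example, not captured from a real run.
var sampleOutputs = []byte(`{"instance_key": {"sensitive": false, "type": "string", "value": "terratest_key"}}`)

func main() {
	key, err := outputValue(sampleOutputs, "instance_key")
	fmt.Println(key, err)
}
```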

Here is our first complete test function:

func TestInstanceSshKey(t *testing.T) {
    terraformOptions := configureTerraformOptions(t)
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    instanceSshKey := terraform.Output(t, terraformOptions, "instance_key")
    assert.Equal(t, "terratest_key", instanceSshKey)
}
  • Run the test

The following command:

go test instance_test.go

… runs our test and rewards us with a joyful:

FAIL: TestInstanceSshKey (0.27s)

Since go test is not very verbose by default, we’ll rerun the command with the -v option to get more details.

A quick glance at the logs shows that Terratest didn't create anything, which is normal: we haven't implemented anything yet. On the other hand, the logs show that it initialized the working directory (terraform init), and we now find Terraform .tfstate files.

  • Implementation

We now want to use Terraform to build our EC2 Instance. In the main.tf file we declare the following:

resource "aws_instance" "example" {
    ami           = "ami-5026902d"
    instance_type = "t2.micro"
    key_name = "terratest_key"
}

This simple snippet describes a t2.micro CentOS 7 instance (selected by the "ami" property) and sets the instance's SSH key. This key already exists in AWS: we created and uploaded it during the installation.

Let's now add the instance key name to the output.tf file, which declares the outputs of our code:

output "instance_key" {
    value = "${aws_instance.example.key_name}"
}
  • New attempt

We restart the tests with the -v option which will allow us to have a more complete report.

The logs show the Terraform initialization step (init), then the application (apply). We also see the requested output: the instance key name. Then we observe the destruction of the machine.

And finally, the relief:

--- PASS: TestInstanceSshKey (73.80s)
PASS
ok      command-line-arguments  73.812s

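Go caches test results: if you run the same command twice without changing the Go code, the answer is instantaneous but the tests are not actually replayed. When this article was written, setting the environment variable GOCACHE=off forced the execution. Recent Go versions no longer accept that value; the documented way to bypass the test cache is the -count=1 flag (go test -count=1 instance_test.go).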

No refactoring to do, our code is very simple.

If we connect to the console, we see that our instance has already been destroyed. The test lasted a little over a minute and above all without any intervention on our part. On the other hand, we haven't tested anything interesting, except that Terraform was doing its job well and we're getting an output from it.

Second test: The public IP

Let's now write our second test. Our final goal is to connect to a machine to check the presence of a file. We must therefore ensure that the instance is publicly accessible. Let's start by coding the test.

To improve modularity, we will write the test in an independent function. This allows us to run each test independently of the others. On the other hand, it reduces the readability of the output logs and requires the creation and destruction of a new instance for each test, a very time-consuming operation.

Let's create the new test function: TestInstanceIp. The structure of our instance_test.go file now looks like this:

func configureTerraformOptions(t *testing.T) *terraform.Options {...}
func TestInstanceSshKey(t *testing.T) {...}
func TestInstanceIp(t *testing.T) {...}

We want to make sure that our instance has a public IP, and we would like to use the Terratest aws module for this purpose. As in the first test, let's add the calls that create and destroy our small architecture:

terraformOptions := configureTerraformOptions(t)
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)

Let's make sure the test is red first. To obtain the IP from AWS, we need the instance ID. The ID is a Terraform output, so we retrieve it exactly as we did for the SSH key name.

We obtain the following assertion:

instanceID := terraform.Output(t, terraformOptions, "instance_id")
instanceIPFromInstance := aws.GetPublicIpOfEc2Instance(t, instanceID, awsRegion)
assert.Equal(t, "fake_ip", instanceIPFromInstance)

Let’s verify that our test is red (there are a lot of logs between the two results):

--- PASS: TestInstanceSshKey (45.02s)
=== RUN   TestInstanceIp
---
--- FAIL: TestInstanceIp (52.22s)
instance_test.go:50:
Error Trace: instance_test.go:50
Error: Not equal:
expected: "fake_ip"
actual : "35.180.230.122"

Indeed, the IP of the machine is not "fake_ip". Our test fails, we can rely on it.

  • Implementation

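We then realize that AWS instances are created with a public IP by default (at least in the default VPC, which is where this instance lands since we specify no subnet). We will therefore check that the IP returned by the Terraform output is identical to the one Terratest retrieves from AWS.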

Let's add the following output in output.tf:

output "instance_id" {
    value = "${aws_instance.example.id}"
}
output "instance_public_ip" {
    value = "${aws_instance.example.public_ip}"
}

Then let’s modify our test function TestInstanceIp() in order to compare the two values.

The complete function:

func TestInstanceIp(t *testing.T) {
    terraformOptions := configureTerraformOptions(t)
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    instanceIP := terraform.Output(t, terraformOptions, "instance_public_ip")
    instanceID := terraform.Output(t, terraformOptions, "instance_id")
    instanceIPFromInstance := aws.GetPublicIpOfEc2Instance(t, instanceID, awsRegion)
    assert.Equal(t, instanceIP, instanceIPFromInstance)
}

We restart the tests:

--- PASS: TestInstanceIp (79.52s)
PASS
ok      command-line-arguments  124.562s

Victory, it’s green!

Okay, it’s green, but the details are not very explicit.

Third test: File Content

We have verified that the instance is created with our SSH key and a public IP address. We will now build on these tests to check that a script launched at machine initialization writes to a file.

We want to check that the file "/tmp/salut" contains the string "Hello, World".

We use the ssh package and its CheckSshCommandE() function to execute the "cat /tmp/salut" command on the machine and compare its output with the expected string.

Which gives us the assertion:

expectedText := "Hello, World"
command := "cat /tmp/salut" // Command executed on the target machine
actualText, err := ssh.CheckSshCommandE(t, publicHost, command)
assert.NoError(t, err)
assert.Equal(t, expectedText, strings.TrimSpace(actualText)) // echo ends the file with a newline

Now we have to tell Terratest how to connect to the machine, whose address we read from the Terraform output. We use the default user 'ec2-user' and the SSH key loaded in our SSH agent.

publicIP := terraform.Output(t, terraformOptions, "instance_public_ip")
publicHost := ssh.Host{ Hostname:  publicIP, SshUserName: "ec2-user", SshAgent: true, }

Finally, since it can take up to a few minutes for the instance to start, we need to make sure that Terratest tries to connect to the machine several times before declaring a failure. So we ask for 30 tries, one every 5 seconds; a failed attempt returns an error so that the next one can run.

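maxRetries := 30
timeBetweenRetries := 5 * time.Second
description := fmt.Sprintf("SSH to public host %s", publicIP)
retry.DoWithRetry(t, description, maxRetries, timeBetweenRetries, func() (string, error) {
    actualText, err := ssh.CheckSshCommandE(t, publicHost, command)
    if err != nil {
        return "", err
    }
    // Return an error instead of asserting, so that a failed attempt is
    // retried rather than marking the whole test as failed.
    if strings.TrimSpace(actualText) != expectedText {
        return "", fmt.Errorf("expected %q but got %q", expectedText, actualText)
    }
    return "", nil
})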
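What a retry helper of this kind does can be pictured with the standard library alone. In this sketch, retryUntil and flakyAction are hypothetical names written for the example (this is not Terratest's implementation): the action is retried until it succeeds or the attempts run out, and only then is a failure reported.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryUntil calls action up to maxRetries times, sleeping between attempts.
// It returns the first successful result, or the last error wrapped in a
// "gave up" message.
func retryUntil(maxRetries int, pause time.Duration, action func() (string, error)) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		out, err := action()
		if err == nil {
			return out, nil
		}
		lastErr = err
		if attempt < maxRetries {
			time.Sleep(pause)
		}
	}
	return "", fmt.Errorf("gave up after %d attempts: %w", maxRetries, lastErr)
}

// flakyAction simulates a machine that only answers from attempt readyAt on.
func flakyAction(readyAt int) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls < readyAt {
			return "", errors.New("dial tcp: i/o timeout")
		}
		return "Hello, World", nil
	}
}

func main() {
	out, err := retryUntil(30, 10*time.Millisecond, flakyAction(3))
	fmt.Println(out, err)
}
```

The important design point is that the action reports problems through its returned error. An assertion made inside the action would fail the test on the very first unsuccessful attempt, defeating the purpose of retrying.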

Here is the complete TestFileContent() test function:

func TestFileContent(t *testing.T) {
    terraformOptions := configureTerraformOptions(t)
    // Register the cleanup first so that it also runs if InitAndApply fails halfway.
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    publicIP := terraform.Output(t, terraformOptions, "instance_public_ip")
    publicHost := ssh.Host{
        Hostname:    publicIP,
        SshUserName: "ec2-user",
        SshAgent:    true,
    }
    maxRetries := 30
    timeBetweenRetries := 5 * time.Second
    description := fmt.Sprintf("SSH to public host %s", publicIP)
    expectedText := "Hello, World"
    command := "cat /tmp/salut"
    retry.DoWithRetry(t, description, maxRetries, timeBetweenRetries, func() (string, error) {
        actualText, err := ssh.CheckSshCommandE(t, publicHost, command)
        if err != nil {
            return "", err
        }
        // Returning an error (rather than asserting) lets the next attempt run.
        if strings.TrimSpace(actualText) != expectedText {
            return "", fmt.Errorf("expected %q but got %q", expectedText, actualText)
        }
        return "", nil
    })
}

It's quite tedious and requires Go programming knowledge. On the other hand, we now do automatically what we would otherwise do manually (cat on the file).

  • What did we forget?

We run the test: go test -v instance_test.go -run TestFileContent

Running command cat /tmp/salut on ec2-user@35.180.190.131:22
"returned an error: dial tcp 35.180.190.131:22: i/o timeout. Sleeping for 5s and will try again."

Oops, the instance's SSH port is not open...

In the main.tf file, we create a security group that allows SSH connections to the machine from any IP, and assign it to our instance:

resource "aws_security_group" "ssh" {
    ingress {
        from_port = "22"
        to_port   = "22"
        protocol  = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
    }
}
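When a test times out like this, a plain TCP dial is a quick way to tell a closed port from a failing command. The sketch below uses only the standard library; portOpen and localListener are hypothetical helpers, and the local listener merely stands in for the instance so that the example runs without any cloud resource.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// portOpen reports whether a TCP connection to addr succeeds within timeout.
func portOpen(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// localListener opens a throwaway listener on a free local port and returns
// it together with its address.
func localListener() (net.Listener, string) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	return ln, ln.Addr().String()
}

func main() {
	ln, addr := localListener()
	fmt.Println("listener up:  ", portOpen(addr, time.Second))
	ln.Close()
	fmt.Println("listener down:", portOpen(addr, time.Second))
}
```

Against a real instance, the address would be the public IP followed by ":22".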

Let’s retry: go test -v instance_test.go -run TestFileContent

New error:

Running command cat /tmp/salut on ec2-user@35.180.190.131:22
"returned an error: Process exited with status 1. Sleeping for 5s and will try again."

The message is not very meaningful, but we know that we have not created the file yet. So we finish our main.tf by adding a script that writes to the file /tmp/salut when the machine is created:

resource "aws_instance" "example" {
    ami           = "ami-5026902d"
    instance_type = "t2.micro"
    key_name      = "terratest_key"
    vpc_security_group_ids = ["${aws_security_group.ssh.id}"]
    user_data = <<-EOF
    #!/bin/bash
    echo 'Hello, World' > /tmp/salut
    EOF
}

Voilà!

Running command cat /tmp/salut on ec2-user@35.180.208.120
--- PASS: TestFileContent (0.61s)
PASS
ok      command-line-arguments  0.623s

We implemented and tested writing to a file with an initialization script. In this exercise, the script is not tested in its entirety (we could still check the permissions, etc.), but this already gives an idea of what we can accomplish with this tool.

Going further

These non-exhaustive tests allowed us to see some aspects of Terratest. By digging a little into the repository, we found examples of more complex tests to validate the behavior of your architecture over time, such as a deployment without service interruption.

The repository also describes how to split tests into stages, using environment variables, to avoid systematically creating and destroying instances between tests. In addition, Terraform generates a tfstate, a JSON file describing the state of the infrastructure, which can be reused to run a test several times without rebuilding or destroying the instances.
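The idea of stages driven by environment variables can be sketched as follows. The SKIP_ prefix, the stage names and the helper stagesToRun are assumptions of this example rather than a description of Terratest's API; the lookup function is passed in so that the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// stagesToRun returns the stages that should execute, dropping every stage
// whose SKIP_<stage> variable is set to a non-empty value.
func stagesToRun(stages []string, getenv func(string) string) []string {
	var kept []string
	for _, stage := range stages {
		if getenv("SKIP_"+stage) == "" {
			kept = append(kept, stage)
		}
	}
	return kept
}

func main() {
	// With SKIP_deploy=true and SKIP_teardown=true in the environment, only
	// "validate" remains, so an existing instance can be re-tested in place.
	fmt.Println(stagesToRun([]string{"deploy", "validate", "teardown"}, os.Getenv))
}
```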

In the same repository, we find mechanisms to randomize the regions in which the infrastructure is created, to ensure that creation works in all of them. We also find out how to randomize machine names to avoid conflicts.
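Randomized names are easy to picture with the standard library. In this sketch, uniqueName is a hypothetical helper (not the one shipped with Terratest) that appends eight random hexadecimal characters to a prefix, so that two test runs in the same account are very unlikely to collide.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueName returns prefix followed by a dash and 8 random hex characters.
func uniqueName(prefix string) (string, error) {
	buf := make([]byte, 4) // 4 random bytes encode to 8 hex characters
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return prefix + "-" + hex.EncodeToString(buf), nil
}

func main() {
	name, err := uniqueName("terratest")
	fmt.Println(name, err)
}
```

The generated name would then be passed to Terraform as a variable and used for the instance's Name tag or the security group name.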

It can be interesting to integrate Terratest into a continuous integration platform and to run the tests at each update of the infrastructure code. Building environments only for the duration of the tests is more economical than keeping dedicated environments running all the time.

Finally, the same repo includes test examples and modules covering other AWS services such as S3, RDS, CloudWatch, IAM and VPCs.

Not to mention that Terratest covers other tools: Packer, GCP and Kubernetes.

Bonus: Cloud Nuke

A risk identified by Gruntwork is that resources are left behind after failed test runs, leaving stale instances mixed in with those really in use. This was a significant cost for them: according to their post, cleaning up reduced their AWS bill by about 85%. They therefore developed a tool, Cloud Nuke, that regularly cleans the environment of such instances, launch configurations, load balancers and orphaned EIPs. Everything created more than an hour ago is considered lost, an hour being longer than the execution of all their tests. Their test environment uses a separate AWS account to avoid the risk of destroying machines in other environments.

Reservations

  • Unlike other Infrastructure as Code testing tools such as Test Kitchen or Molecule, the library does not abstract the logic of creating, testing or destroying the environment. It is therefore up to the ops to implement mechanisms such as retries in their tests.
  • Once again, be careful: the cloud environment in which Terratest builds the test infrastructure must be isolated from other environments. It would be a shame to destroy production machines while trying to test your infrastructure…
  • For Terratest to interact with the cloud services under test, it must have the appropriate rights. This potentially means managing a new user or a powerful role and granting it the necessary permissions.
  • The test report is either too concise (a single FAIL/PASS line in non-verbose mode) or too verbose, with all the details of Terraform creation and destruction drowning out the useful information.
  • The library offers little for the less "traditional" AWS services, which are, in its defence, extremely numerous.

What about the others in all this?

There are other infrastructure testing tools, such as kitchen-terraform (a set of Test Kitchen plugins written in Ruby), the very young rspec-terraform (also in Ruby), or the Terraform testing framework.

Terratest focuses on the functional behaviour of the overall infrastructure rather than on the individual properties of its components. The library focuses on automating tasks that validate a behaviour rather than observe a state. For example, we would rather make real HTTP calls and check the return code than check that the httpd service is running on the server.
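As an illustration of validating behaviour rather than state, the sketch below performs a real HTTP request and checks the status code. It uses only the standard library; statusOf and startFakeService are hypothetical helpers, and the local test server stands in for a deployed web server so that the example runs anywhere.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusOf performs a real GET request and returns the HTTP status code.
func statusOf(url string) (int, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// startFakeService stands in for the deployed web server. It returns the
// service URL and a function that shuts the service down.
func startFakeService() (string, func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, World")
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := startFakeService()
	defer stop()
	code, err := statusOf(url)
	fmt.Println(code, err)
}
```

In a real test, the URL would be built from a Terraform output such as the instance's public IP, and the call would typically be wrapped in a retry loop while the machine boots.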

Conclusion

We have seen that it is possible, although complicated, to write tests in Go that guarantee the properties of an infrastructure produced by Terraform code.

Terratest fulfils its promise of generating ephemeral environments and automating tests. It takes care of destroying the machines at the end of the execution. It also allows a large number of parameters to be tested on AWS EC2 instances.

The tool would benefit from being easier to use, covering more target services, producing more readable reports and scaling better. We can expect TDD for infrastructure to mature quickly, and the arrival of these new tools to democratize the practice, so that provisioning tools will soon be easy to test. With such libraries driving infrastructure from imperative languages, we can even imagine infrastructure code itself moving towards that paradigm.

Sources

https://github.com/gruntwork-io/terratest

https://blog.gruntwork.io/open-sourcing-terratest-a-swiss-army-knife-for-testing-infrastructure-code-5d883336fcd5

https://blog.gruntwork.io/cloud-nuke-how-we-reduced-our-aws-bill-by-85-f3aced4e5876

https://blog.octo.com/tdi-ou-test-driven-infrastructure/ (in French)