By Joe Stevens, Tech Lead, Infrastructure at Ascend
At Ascend, we’ve put a lot of emphasis on cloud portability for our product, and that’s enabled by a few different open source technologies. We run everything on Kubernetes and manage system state in MySQL plus a large-scale blob store (Google Storage/S3/Azure Blob Storage). The whole stack is spun up by Terraform and bootstrapped by a few dozen Jinja-templated YAML files. We had a hackathon recently, and I figured we should put this portability to the test: with 48 hours and a couple of engineers, I would port our product to Azure.
Given the makeup of our stack, it should be pretty trivial to move clouds. Write a lump of Terraform for the new environment, throw together some glue code in the homegrown CLI, and you’re off to the races, right? A cloud is a cloud is a cloud.
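For a sense of what that lump looks like on Azure, the starting point is little more than a resource group and an AKS cluster. The sketch below is illustrative only: it assumes a reasonably recent azurerm provider, and the names, region, and node sizes are placeholders rather than our actual configuration.

```
# Illustrative only: a minimal AKS footprint, not Ascend's actual Terraform.
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "ascend" {
  name     = "ascend-hackathon"
  location = "westus2"
}

resource "azurerm_kubernetes_cluster" "ascend" {
  name                = "ascend-aks"
  location            = azurerm_resource_group.ascend.location
  resource_group_name = azurerm_resource_group.ascend.name
  dns_prefix          = "ascend"

  # A node pool for core services (placeholder count and size).
  default_node_pool {
    name       = "core"
    node_count = 3
    vm_size    = "Standard_DS2_v2"
  }

  identity {
    type = "SystemAssigned"
  }
}
```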
Surprisingly enough, it almost went like that. However, it wouldn’t be a hackathon without a few snags. One of the more peculiar ones we came across was setting up the public ingress for our UI. We use network load balancers (NLBs) for our public ingresses so we can map traffic directly to the NodePort of our internal nginx proxy. Traditionally, we’ve configured this by mapping the NLB to the high-availability node pool where our core services reside, pointing it at a predetermined port in the 30000+ NodePort range, and relying on the `externalIPs` block in our service definition to advertise the necessary ports so the Kubernetes-level cloud provider can integrate properly. In the case of nginx, that looks like this:
```
apiVersion: v1
kind: Service
metadata:
  labels:
    app: {{ name }}-ext
  name: {{ name }}-ext
spec:
  selector:
    app: {{ name }}-pod
  ports:
  - name: http
    port: 80
    targetPort: 80
    {% if config.environment == "gcp" %}
    {% elif config.environment == "aws" %}
    nodePort: 30080
    {% endif %}
  - name: https
    port: 443
    targetPort: 443
    {% if config.environment == "gcp" %}
    {% elif config.environment == "aws" %}
    nodePort: 30443
    {% endif %}
  type: NodePort
  externalIPs:
  {% for ip in config.externalIps %}
  - {{ ip }}
  {% endfor %}
```

Obviously, at this point we had this model figured out. I’d just slap another if block in, mess with the numbers until something worked, and then figure out how to pipe it all through from Terraform provisioning to YAML render/deploy. It didn’t quite work out that way. I pushed harder and harder to remove variables and baby the service into advertising correctly to the load balancer, but I kept getting unhealthy port checks on every instance in the pool. Eventually, I had to relent: it was clear there was something fundamentally wrong with how I was approaching the problem.

As I dove deeper into the integration, one of the first things I noticed was that, for Google Cloud (GCP), we didn’t actually need to predetermine a NodePort; the GCP provider will happily negotiate ports with your firewall rule. It turns out that’s the key to the story: the way each cloud provider chooses to implement its Terraform and Kubernetes providers can and does differ wildly, even when the underlying services may not be all that different. As I returned to square one, I realized that all of the tutorials for setting this up on Azure Kubernetes Service (AKS) described launching a cluster and then launching your LoadBalancer service, without ever independently managing the external load balancer resource (with lovely documentation here). (RTFM) Naturally, the moment we followed those instructions and refactored our YAML to look like the below, everything just worked.
```
apiVersion: v1
kind: Service
metadata:
  labels:
    app: {{ name }}-ext
  name: {{ name }}-ext
spec:
  selector:
    app: {{ name }}-pod
  ports:
  - name: http
    port: 80
    targetPort: 80
    {% if config.environment == "gcp" %}
    {% elif config.environment == "aws" or config.environment == "azure" %}
    nodePort: 30080
    {% endif %}
  - name: https
    port: 443
    targetPort: 443
    {% if config.environment == "gcp" %}
    {% elif config.environment == "aws" or config.environment == "azure" %}
    nodePort: 30443
    {% endif %}
  {% if config.environment == "azure" %}
  loadBalancerIP: {{ config.externalIp }}
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  {% else %}
  type: NodePort
  externalIPs:
  {% for ip in config.externalIps %}
  - {{ ip }}
  {% endfor %}
  {% endif %}
```

There were a couple of lessons here. First, trying to shove a cloud provider into your model of how you think the world should work is going to bring you a lot of pain. While my early assumption that a cloud is a cloud is a cloud wasn’t necessarily wrong (they all have VMs, managed DBs, large-scale blob storage, basic networking, etc.), the way each cloud is opinionated about how those pieces should be lego’d together may be significantly different from what you’re used to. That will manifest in how their Terraform and Kubernetes providers are configured, what their IAM model looks like, and more.

The second was that Kubernetes cloud providers are really first-class cloud resource managers. There are limits to what they can do, though: no matter where you fall on the chicken-and-egg debate, certain network and VM resources need to exist before the provider in your cluster can go to work (a rough sketch of what that looks like on Azure follows below). However, once your cluster is running and authenticated with your cloud provider, it can manage resources in a much more dynamic manner than Terraform. In my experience, these Kubernetes cloud providers have delivered far more toward a true cloud-agnosticism story than Terraform has.

At the end of the day, we were able to port Ascend to Azure successfully, which made for a fun presentation!
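To make that chicken-and-egg point concrete, here is roughly what the pre-existing piece looks like on the Terraform side for the ingress above. This is a minimal sketch rather than our actual configuration: it assumes the azurerm provider and the AKS cluster from the earlier sketch, and the resource names are placeholders.

```
# Minimal sketch (placeholder names): the static public IP has to exist before
# the in-cluster Azure cloud provider can bind the LoadBalancer Service to it.
resource "azurerm_public_ip" "ui_ingress" {
  name                = "ascend-ui-ingress"
  # Created in the AKS-managed node resource group (exported by the cluster
  # sketched earlier) so the cluster identity can use it without extra
  # role assignments.
  resource_group_name = azurerm_kubernetes_cluster.ascend.node_resource_group
  location            = azurerm_resource_group.ascend.location
  allocation_method   = "Static"
  sku                 = "Standard" # match the cluster's load balancer SKU
}

# The rendered Service's loadBalancerIP ({{ config.externalIp }} in the
# template above) would then be fed from this address.
output "ui_ingress_ip" {
  value = azurerm_public_ip.ui_ingress.ip_address
}
```

If the IP lives in a different resource group instead, the Azure docs have you add the service.beta.kubernetes.io/azure-load-balancer-resource-group annotation to the Service and grant the cluster identity access to that group.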