Spring Boot on EKS: Terraform, Helm & GenAI Guide

Introduction

If you've been learning cloud and DevOps, you've probably heard terms like microservices, Kubernetes, and EKS thrown around constantly. But understanding them in theory is very different from actually deploying a real application end to end.

In this blog, I'll walk you through exactly what I did — deploying the Spring PetClinic Microservices application on AWS Elastic Kubernetes Service (EKS) — covering every step, every command, every mistake I made, and how I fixed them.

What is Spring PetClinic?

Spring PetClinic is a sample application built by the Spring team at VMware to demonstrate best practices in Spring Boot development. The microservices version breaks the monolith into 8 independent services that communicate over a network — each with its own responsibility:

What Did I Deploy?

Locally — all 8 services running with Docker Compose on my MacBook
On AWS — all 8 services running on a 2-node EKS cluster, accessible via a public AWS Application Load Balancer URL
GenAI — connected the AI chat service to OpenAI GPT-4o-mini so you can ask "Does George Franklin's cat Leo have any upcoming visits?"

Why This Project?

Most tutorials show you one service, one container, one deployment. Real-world applications don't work that way.

I chose this project because it forced me to understand:

How multiple services talk to each other
Why startup order matters in distributed systems
How to manage secrets (API keys) safely in Kubernetes
How to use Infrastructure as Code (Terraform) instead of clicking around the AWS console
How to monitor a live system with real observability tools

If you can deploy this, you can deploy almost anything.

Architecture Overview

Here's how the pieces fit together on AWS:

Internet
    │
    ▼
AWS ALB (Application Load Balancer)
    │
    ▼
EKS Cluster (2x t3.medium nodes)
    │
    ├── api-gateway          ← all traffic enters here
    ├── customers-service    ← talks to MySQL
    ├── vets-service         ← talks to MySQL
    ├── visits-service       ← talks to MySQL
    ├── genai-service        ← talks to OpenAI API
    ├── config-server        ← all services pull config from here
    ├── discovery-server     ← all services register here
    └── admin-server         ← monitors all services

Supporting:
    ├── Amazon ECR           ← Docker image registry
    ├── Amazon RDS / MySQL   ← database (deployed in-cluster)
    ├── Prometheus           ← metrics scraping
    └── Grafana              ← metrics dashboards

Prerequisites

Before starting, make sure you have these tools installed:

Verify everything is installed:

docker --version
git --version
aws --version
kubectl version --client
helm version
terraform --versionConfigure AWS credentials:

Configure AWS credentials:

aws configure
# Enter your Access Key ID, Secret Access Key, region (ap-south-1), output format (json)

Verify your AWS identity:

aws sts get-caller-identity

Expected output:

{
    "UserId": "AIDXXXXXXXXXXXXXXXXX",
    "Account": "XXXXXXXXXXXXXXXXX",
    "Arn": "arn:aws:iam::XXXXXXXXXXXXXXXXX:user/terraform-user"
}

Step 1: Clone and Run Locally with Docker Compose

Clone the Repository

git clone https://github.com/varungopal286/spring-petclinic-microservices.git
cd spring-petclinic-microservices

Set Up Environment Variables

The GenAI service needs an OpenAI API key. Create a .env file at the project root:

touch .env

Add the following inside .env:

OPENAI_API_KEY=your-openai-api-key-here
AZURE_OPENAI_KEY=
AZURE_OPENAI_ENDPOINT=

🔒 Security note: This .env file is in .gitignore and will never be pushed to GitHub. Your API key stays on your machine only.

Apple Silicon (M1/M2/M3) Users — Important

The Docker images are built for AMD64 (Linux servers). Apple Silicon Macs run ARM64. Set this before running anything:

export DOCKER_DEFAULT_PLATFORM=linux/amd64

Start All Services

docker compose up -d

Docker will pull all 8 images and start them in dependency order. This takes 3–5 minutes on first run.

Verify Everything is Running

docker ps

You should see 11 containers running (8 services + Zipkin + Prometheus + Grafana).

docker compose ps

All services should show Up status. Config-server and discovery-server will show healthy.

Access the Application

http://localhost:8761 — Eureka showing all services registered

http://localhost:9090 — Spring Boot Admin showing all services UP (green)

Spring PetClinic UI homepage showing the Welcome page at http://localhost:8080

Step 2: The Startup Order — Why Config Server Goes First

This was one of the most important things I learned from this project.

Every microservice in this app needs to:

Fetch its configuration from config-server on startup Register itself with discovery-server so other services can find it
If customers-service starts before config-server is ready, it crashes immediately because it can't load its config. If api-gateway starts before discovery-server, it can't route traffic because there's nothing registered.

How Docker Compose Handles This

In docker-compose.yml, each service has a depends_on block:

customers-service:
  depends_on:
    config-server:
      condition: service_healthy
    discovery-server:
      condition: service_healthy

service_healthy means Docker waits until the healthcheck passes — not just until the container starts, but until the service is actually responding.

Config-server's healthcheck:

healthcheck:
  test: ["CMD", "curl", "-I", "http://config-server:8888"]
  interval: 5s
  timeout: 5s
  retries: 10

How Kubernetes Handles This (Init Containers)

On EKS, Docker Compose's depends_on doesn't exist. Instead, the Helm chart uses init containers — small containers that run before the main app starts and block until a TCP connection succeeds:

initContainers:
  - name: wait-config
    image: busybox:1.35
    command:
      - sh
      - -c
      - |
        until nc -z config-server 8888; do
          echo "Waiting for config-server..."
          sleep 5
        done

The pod won't start its main container until this init container exits successfully. This guarantees startup order on Kubernetes just like depends_on does on Docker Compose.

Step 3: AWS Infrastructure with Terraform

Instead of clicking through the AWS console, all infrastructure is defined as code using Terraform. This means:

The entire infrastructure can be created with one command
It can be destroyed with one command (no orphaned resources costing money)
It can be version-controlled and reviewed like code

What Terraform Creates

terraform/
├── vpc.tf          → VPC, 3 public + 3 private subnets, NAT Gateway, IGW
├── eks.tf          → EKS cluster, managed node group (2x t3.medium)
├── ecr.tf          → ECR repositories for all 8 Docker images
├── iam.tf          → IAM roles for EKS nodes and ALB controller (IRSA)
├── backend.tf      → S3 remote state + DynamoDB locking
└── variables.tf    → Region, cluster name, node size

Remote State Backend

Terraform state is stored in S3 (not locally) so it's safe, shareable, and backed up:

terraform {
  backend "s3" {
    bucket         = "petclinic-tfstate-721449410291"
    key            = "petclinic/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "petclinic-terraform-locks"
    encrypt        = true
  }
}

DynamoDB prevents two people from running terraform apply at the same time (state locking).

Deploy the Infrastructure

cd terraform
terraform init
terraform plan    # review what will be created — 60 resources
terraform apply   # type 'yes' when prompted

This takes about 15–20 minutes. EKS cluster creation is the slowest part.

Expected output:

Apply complete! Resources: 60 added, 0 changed, 0 destroyed.

Outputs:
cluster_name = "petclinic-cluster"
configure_kubectl_command = "aws eks update-kubeconfig --name petclinic-cluster --region ap-south-1"
ecr_registry = "721449410291.dkr.ecr.ap-south-1.amazonaws.com"

Connect kubectl to the Cluster

aws eks update-kubeconfig --name petclinic-cluster --region ap-south-1
kubectl get nodes

Expected output:

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-1-xxx.ap-south-1.compute.internal  Ready    <none>   5m    v1.32.x
ip-10-0-2-xxx.ap-south-1.compute.internal  Ready    <none>   5m    v1.32.x

AWS EKS console — Cluster overview page showing petclinic-cluster Active

Terminal showing kubectl get nodes with both nodes Ready

Step 4: Push Docker Images to ECR

Login to ECR

aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin \
  721449410291.dkr.ecr.ap-south-1.amazonaws.com

Expected: Login Succeeded

Tag and Push Each Image

REGISTRY=721449410291.dkr.ecr.ap-south-1.amazonaws.com

docker tag springcommunity/spring-petclinic-config-server \
  $REGISTRY/petclinic/spring-petclinic-config-server:latest
docker push $REGISTRY/petclinic/spring-petclinic-config-server:latest
# Repeat for: discovery-server, api-gateway, customers-service,
# vets-service, visits-service, admin-server, genai-service

AWS ECR console showing all 8 repositories with images pushed

Step 5: Deploy to EKS with Helm

What is Helm?

Helm is the package manager for Kubernetes. Instead of writing 8 separate deployment YAML files, Helm uses templates — one deployment template that loops over all services and generates the Kubernetes manifests automatically.

Install ALB Controller (for public access)

The ALB controller watches for Kubernetes Ingress resources and automatically creates AWS Application Load Balancers:

helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=petclinic-cluster \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=<IRSA_ROLE_ARN>

Create the OpenAI Secret

The GenAI service needs the OpenAI API key as a Kubernetes Secret — stored encrypted in etcd, not in any YAML file or image:

kubectl create secret generic openai-secret \
  --from-literal=OPENAI_API_KEY=your-key-here \
  -n petclinic

Deploy MySQL

kubectl create namespace petclinic
helm install mysql bitnami/mysql \
  --namespace petclinic \
  --set auth.database=petclinic \
  --set auth.username=petclinic \
  --set auth.password=petclinic

Deploy PetClinic with Helm

helm install petclinic ./helm/petclinic \
  --namespace petclinic \
  --set registry=721449410291.dkr.ecr.ap-south-1.amazonaws.com \
  --set imageTag=latest

Verify All Pods are Running

kubectl get pods -n petclinic

Expected output:

NAME                               READY   STATUS    RESTARTS   AGE
api-gateway-xxxx                   1/1     Running   0          5m
config-server-xxxx                 1/1     Running   0          8m
customers-service-xxxx             1/1     Running   0          5m
discovery-server-xxxx              1/1     Running   0          7m
genai-service-xxxx                 1/1     Running   0          5m
vets-service-xxxx                  1/1     Running   0          5m
visits-service-xxxx                1/1     Running   0          5m
admin-server-xxxx                  1/1     Running   0          5m

Get the Public URL

kubectl get ingress -n petclinic

The ADDRESS column shows your public ALB URL. Open it in your browser.

Load Balancers console showing the ALB Active

Step 6: GenAI Service — AI-Powered Pet Chat

The genai-service connects to OpenAI GPT-4o-mini and lets you ask natural language questions about pets and vets in the system.

The API key is injected into the pod via a Kubernetes Secret — it never appears in any YAML file, Docker image, or GitHub repository:

# In helm/petclinic/values.yaml
genai-service:
  extraEnv:
    - name: OPENAI_API_KEY
      valueFrom:
        secretKeyRef:
          name: openai-secret
          key: OPENAI_API_KEY

Try it — go to the PetClinic UI → click "Ask our AI Assistant" → type:

"Does George Franklin's cat Leo have any upcoming visits?"

The service queries the database and answers using GPT.

Step 7: Observability — Prometheus, Grafana, and Zipkin

Prometheus

Prometheus scrapes metrics from all Spring Boot services via their /actuator/prometheus endpoint every 15 seconds. Metrics include:

JVM memory usage
HTTP request rate and latency
Active database connections
Garbage collection stats

Access locally at: http://localhost:9091

Grafana

Grafana visualizes the Prometheus metrics as dashboards. I used the pre-built Spring Boot dashboard (ID: 11378) which shows:

Memory usage per service
Request rate and response times
JVM thread count
Error rate over time

Access locally at: http://localhost:3030 (username: admin, password: admin)

Zipkin (Distributed Tracing)

Zipkin traces requests across multiple services. When a user searches for a vet, the request flows:

browser → api-gateway → vets-service → MySQL

Zipkin captures the full trace with timing for each hop so you can see exactly where time is being spent.

Access locally at: http://localhost:9411

Challenges I Faced (and How I Fixed Them)

ARM64 vs AMD64 — Platform Mismatch

Problem: My MacBook runs Apple Silicon (ARM64). The Spring PetClinic images are built for AMD64 (what AWS EC2 instances run). Pulling images without specifying the platform caused containers to crash silently.

Fix:

export DOCKER_DEFAULT_PLATFORM=linux/amd64
docker pull --platform linux/amd64 springcommunity/spring-petclinic-discovery-server

api-gateway Killed by OOM (Exit Code 137)

Problem: The api-gateway container kept crashing with exit code 137 (Out of Memory). Docker was killing it because it exceeded the 512MB memory limit.

Fix: Raised memory limits across all services in docker-compose.yml:

api-gateway:
  deploy:
    resources:
      limits:
        memory: 1024M  # was 512M

Also freed up memory by stopping other Docker containers running from unrelated projects.

Terraform Destroy Blocked by Orphaned Security Groups

Problem: After helm uninstall, two security groups created by the ALB controller were still attached to the VPC. Terraform couldn't delete the VPC because of these dependencies.

Fix: Manually identified and deleted the orphaned security groups:

aws ec2 describe-security-groups \
  --filters "Name=vpc-id,Values=vpc-02c9d5892b3fe6432" \
  --region ap-south-1

aws ec2 delete-security-group --group-id sg-09ad695898a268db6 --region ap-south-1
aws ec2 delete-security-group --group-id sg-08355334cbcb467bc --region ap-south-1
terraform destroy

⚠️ Lesson: Always run helm uninstall before terraform destroy. The ALB controller creates AWS resources outside Terraform's management — you must clean those up manually first.

Maven Build Failed — --load Flag Not Supported

Problem: The build script used docker buildx build --load which isn't supported in Docker 20.10.22.

Fix: Skipped rebuilding from source entirely. The official Spring PetClinic images on Docker Hub were functionally identical to what we needed. Tagged and pushed the pre-built images directly to ECR.

Grafana Not Accessible on Port 3000

Problem: Grafana wasn't loading at http://localhost:3000.

Fix: Ran docker ps and found Grafana was mapped to port 3030, not 3000:

0.0.0.0:3030->3000/tcp

The docker-compose.yml had 3030:3000. Always check docker ps output before assuming a port.

Key Takeaway

The single biggest lesson from this project: distributed systems fail in ways that single-service apps never do.

When you have 8 services, the question isn't if something will fail — it's which service failed, why, and in what order. Startup order, memory limits, platform architecture, network dependencies, secret management — all of these are non-issues in a monolith, but every single one of them hit me during this project.

What I'd change for a real production deployment:

Use Amazon RDS instead of in-cluster MySQL (data persistence, automated backups)
Add GitHub Actions CI/CD — auto-build and deploy on every push to main
Use AWS Secrets Manager instead of Kubernetes Secrets for the OpenAI key (rotation, audit trail)
Enable Cluster Autoscaler so nodes scale up under load and scale down to save cost
Add NetworkPolicies to restrict which services can talk to each other

Conclusion

In this project, I went from zero to a fully running microservices application on AWS EKS — locally with Docker Compose, and in the cloud with Terraform, ECR, Helm, and an Application Load Balancer. I connected a real AI service, set up observability with Prometheus and Grafana, and learned how distributed systems actually fail and recover.

If you're learning DevOps or cloud engineering, this kind of hands-on project teaches you more in a weekend than months of tutorial-watching.

GitHub Repository: https://github.com/varungopal286/spring-petclinic-microservices

If you found this useful, follow me on Hashnode for more hands-on cloud and DevOps content. Drop a comment below if you hit any of the same issues — happy to help.

This project was part of the DMI programme — a hands-on cohort-based experience where you build real DevOps and cloud skills in a team environment.

DMI Cohort 3 is starting 27 June — if you want to build real DevOps skills, apply here. Happy building!

Deploying Spring PetClinic Microservices on AWS EKS — A Complete End-to-End Guide

Introduction

What is Spring PetClinic?

What Did I Deploy?

Why This Project?

Architecture Overview

Prerequisites

Step 1: Clone and Run Locally with Docker Compose

Step 2: The Startup Order — Why Config Server Goes First

Step 3: AWS Infrastructure with Terraform

Step 4: Push Docker Images to ECR

Step 5: Deploy to EKS with Helm

Step 6: GenAI Service — AI-Powered Pet Chat

Step 7: Observability — Prometheus, Grafana, and Zipkin

Challenges I Faced (and How I Fixed Them)

Key Takeaway

Conclusion

Comments

More from this blog

From Classroom to Cloud: How My Team Deployed a Microservices App on AWS EKS

Command Palette

Introduction

What is Spring PetClinic?

What Did I Deploy?

Why This Project?

Architecture Overview

Prerequisites

Step 1: Clone and Run Locally with Docker Compose

Step 2: The Startup Order — Why Config Server Goes First

Step 3: AWS Infrastructure with Terraform

Step 4: Push Docker Images to ECR

Step 5: Deploy to EKS with Helm

Step 6: GenAI Service — AI-Powered Pet Chat

Step 7: Observability — Prometheus, Grafana, and Zipkin

Challenges I Faced (and How I Fixed Them)

Key Takeaway

Conclusion

Comments

More from this blog