From Classroom to Cloud: How My Team Deployed a Microservices App on AWS EKS
A hands-on account of deploying Spring PetClinic Microservices on AWS EKS as the team's QA & Demo Lead

When I joined the DMI programme, I knew we'd be working on real cloud infrastructure. What I didn't expect was how much I'd learn from debugging a system I only half-owned.
This is the story of our group project — deploying Spring PetClinic Microservices on AWS Elastic Kubernetes Service (EKS) — and the lessons I took away as the team's QA & Demo Lead.
What We Built
Spring PetClinic is a classic Spring Boot reference application. Our team of nine took the microservices version and brought it fully to life on AWS, turning it into a production-grade distributed system.
The final architecture included:
API Gateway — single entry point, load balanced via AWS ELB
Eureka Discovery Server — service registry
Config Server — centralised configuration via an external GitHub repo
Business services — Customers, Vets, and Visits, each with their own MySQL database on EBS-backed PVCs
GenAI Service — a custom Spring AI integration backed by the OpenAI API
Observability stack — Prometheus, Grafana, and Zipkin for metrics and distributed tracing
All services ran in a dedicated spring-petclinic namespace across a 3-node EKS cluster in us-east-1, with container images stored in AWS ECR and deployments driven by GitHub Actions CI/CD.
My Role: QA & Demo Lead (P9)
My responsibilities covered integration, quality assurance, and the live demonstration. In practice, that meant owning three technical areas:
1. GenAI Service Deployment
I wrote the Kubernetes manifest for the GenAI service from scratch: a Deployment and a Service, with environment variables wired in from a Kubernetes secret (openai-secret) and imagePullPolicy: Always so that each new pod pulls the current image from ECR instead of reusing a cached one. I also added the service to the GitHub Actions CI pipeline so every push triggered an automated image build and push to ECR.
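To make the shape of that manifest concrete, here is a minimal sketch written as a shell function that prints the YAML, so it can be piped straight into kubectl. It is not the manifest from our repository. The resource name genai-service, port 8084, the secret key OPENAI_API_KEY, and the image URI passed in are all illustrative assumptions; only the namespace, the secret name openai-secret, and imagePullPolicy: Always come from the project as described above.

```shell
# Sketch only: resource names, port and secret key are assumptions.
# Prints a Deployment + Service for the GenAI service to stdout.
render_genai_manifest() {
  local image="$1"   # full image URI, supplied by the caller
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-service
  namespace: spring-petclinic
spec:
  replicas: 1
  selector:
    matchLabels:
      app: genai-service
  template:
    metadata:
      labels:
        app: genai-service
    spec:
      containers:
        - name: genai-service
          image: ${image}
          imagePullPolicy: Always      # always check the registry on pod start
          ports:
            - containerPort: 8084      # assumed port
          env:
            - name: OPENAI_API_KEY     # assumed variable and key name
              valueFrom:
                secretKeyRef:
                  name: openai-secret
                  key: OPENAI_API_KEY
---
apiVersion: v1
kind: Service
metadata:
  name: genai-service
  namespace: spring-petclinic
spec:
  selector:
    app: genai-service
  ports:
    - port: 8084
      targetPort: 8084
EOF
}
```

With cluster access, something like render_genai_manifest "$IMAGE_URI" | kubectl apply -f - would apply it, where IMAGE_URI is a placeholder for the ECR image reference. Reading the key through secretKeyRef keeps the OpenAI credential out of the manifest and the Git history.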
2. API Gateway — Circuit Breaker Fix
After deploying the GenAI service, requests were silently failing. The root cause: Resilience4j's default TimeLimiter timeout is 1 second — far too short for an OpenAI API call. I patched it to 60 seconds via environment variable injection in the gateway deployment, restoring end-to-end functionality without touching application source code.
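The environment variable approach works because Spring Boot maps environment variables onto property names: dashes are dropped, dots become underscores, and the result is upper-cased. The sketch below applies that rule and then sets the variable on the gateway. The property key and the deployment name api-gateway are assumptions for illustration; which key the gateway actually honours depends on how its circuit breaker is configured.

```shell
# Convert a Spring property name to its environment-variable form using
# Spring Boot's relaxed-binding rules: drop dashes, turn dots into
# underscores, upper-case the result.
spring_env_name() {
  printf '%s\n' "$1" | sed -e 's/-//g' -e 's/\./_/g' | tr '[:lower:]' '[:upper:]'
}

# Hypothetical helper: inject a 60s TimeLimiter timeout into the gateway.
# The deployment name and property key are assumptions, not project facts.
# Not called here because it needs cluster access.
patch_gateway_timeout() {
  local var
  var="$(spring_env_name 'resilience4j.timelimiter.configs.default.timeout-duration')"
  kubectl set env deployment/api-gateway -n spring-petclinic "${var}=60s"
}
```

Changing the environment of a Deployment alters its pod template, so Kubernetes rolls out new gateway pods with the longer timeout and no image rebuild is needed.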
3. QA Test Infrastructure
I built a full automated test suite from the ground up — 39 test cases across four bash scripts:
k8s-validation.sh — pod and PVC health
health-check.sh — Spring actuator endpoints
api-tests.sh — end-to-end API coverage
monitoring-check.sh — Prometheus and Grafana availability
In the final test round, all four suites passed: 39 of 39 test cases, with one known defect documented alongside the results. I also wrote the test plan, test results history, and the 7-minute live demo script used on presentation day.
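For a sense of how scripts like these can be structured, here is a minimal sketch of a pass/fail harness plus one pod-health check. It is not the code from our suite. The function names and the GATEWAY_URL placeholder are hypothetical, and the parsing assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...).

```shell
PASS=0
FAIL=0

# Run a command quietly and record the result under a readable name.
check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1)); echo "PASS  $name"
  else
    FAIL=$((FAIL + 1)); echo "FAIL  $name"
  fi
}

# Count pods that are not fully ready.
# stdin: output of `kubectl get pods --no-headers`.
# Note: finished Job pods (STATUS Completed) would also be counted.
count_unready_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) bad++
  } END { print bad + 0 }'
}

# True when every pod in the namespace is Running and fully ready.
all_pods_ready() {
  [ "$(kubectl get pods -n spring-petclinic --no-headers | count_unready_pods)" -eq 0 ]
}

# Example wiring (needs cluster access; GATEWAY_URL is a placeholder):
#   check "all pods ready" all_pods_ready
#   check "gateway health" curl -fsS "$GATEWAY_URL/actuator/health"
#   echo "passed=$PASS failed=$FAIL"; [ "$FAIL" -eq 0 ]
```

Keeping the parsing in a function that reads stdin means it can be exercised with canned kubectl output, without a live cluster.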
What Went Well
The CI/CD pipeline was a highlight. Every merged PR triggered an automated ECR build, and rolling deployments meant zero-downtime updates. During the chaos demo, a pod was killed deliberately — Kubernetes replaced it in under 8 seconds.
The observability stack paid off. When the discovery server entered a CrashLoopBackOff, Grafana dashboards pointed us straight at the problem. The root cause turned out to be a single misconfigured URL in a Spring profile: http://config-server:8761 instead of http://config-server:8888. Port 8761 is the Eureka default and 8888 is the Config Server default, so the URL pointed at a port where the config server was not listening. A small typo, caught cleanly because we had visibility.
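A typo like this can also be caught before it reaches the cluster. The following is a hedged sketch of a lint we did not have at the time: it flags any line that references config-server on a port other than 8888. The function name is hypothetical, and the pattern assumes the service is addressed as config-server, as in the URL above.

```shell
# Print (with line numbers) every line on stdin that points at config-server
# on a port other than 8888. Returns non-zero when nothing suspicious is
# found, because that is how grep reports "no match".
find_bad_config_urls() {
  grep -nE 'config-server:[0-9]+' | grep -vE 'config-server:8888([^0-9]|$)'
}
```

Fed a profile file on stdin, it prints only the offending lines, so a CI step could fail the build whenever the output is non-empty.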
What I Would Do Differently
Treat the config server as an API contract, not a detail. Spring Cloud Config silently overrides local list properties with whatever is in the remote repo — no merge, no warning. We hit this late and it cost time. Next time I'd document that behaviour on day one and own it explicitly.
Define integration test contracts earlier. Because each service was owned by a different person, integration gaps appeared late. A shared API contract document from week one, including the expected response time of each downstream call, would likely have surfaced the circuit breaker timeout issue before it showed up as failed requests at the gateway.
Final Thoughts
This project gave me something solo work rarely does: the discipline of debugging across ownership boundaries, reasoning about systems I didn't build, and communicating failures clearly enough for teammates to act on them.
Cloud-native development is a team sport. The seams between services — the config layer, the timeouts, the secret references — are where real-world failures live. Learning to work confidently at those seams is, I think, the most transferable skill this project gave me.
This project was part of the DMI programme — a hands-on cohort-based experience where you build real DevOps and cloud skills in a team environment.
