Leonardo.Ai – STAR Interview Prep (DevOps Engineer)
Agenda: 45m technical (Terraform follow-up, AWS system design, containers/CI/CD) + 20m competencies (problem solving, communication, attention to detail) + 10m your questions.
Quick Question List to Memorize
- Terraform module hardening (S3 module challenge)
- GitOps platform migration (ECS → K8s)
- Resilience to third-party failures (retry storm incident)
- Zero-downtime delivery with DB change pressure
- Operational readiness & service ownership
- Problem solving under constraints (Viator prioritization)
- Communication & persuasion (platform adoption)
- Attention to detail (policy + infra guardrails)
- Conflict management with data (pairing mandate)
- Learning from failure (optimization bias)
- Major incident command (retry storm)
- Balance feature velocity vs stability (modernization)
- Disagreement with product on technical direction (error budget)
Technical STAR Q&A (spoken, "I" voice)
Terraform module hardening (S3 module challenge)
- Situation: I inherited a request at Wizard.Ai to add a reusable S3 module that every team could consume. The constraints were strict: everything had to be encrypted at rest and in transit, public-by-default was forbidden, and names had to follow `wizardai-<name>-<environment>` across dev/stage/prod.
- Task: I needed to ship a safe-by-default module that junior engineers could use without footguns, but still give seniors escape hatches for edge cases.
- Action: I wrapped `terraform-aws-modules/s3-bucket` v5.9.0 instead of reinventing it. I enforced the naming convention in locals and validated the `environment` input. I defaulted server-side encryption to AES256 with bucket keys and attached policies to deny insecure transport and require the latest TLS. I blocked public access unless teams explicitly set a flag, kept versioning on, and standardized tagging. I exposed opt-ins for logging, replication, and KMS so the module could grow without hacks. I added pre-commit hooks (`fmt`, `validate`, `tflint`, `terraform-docs`) and an example plan path so CI could prove it (see the sketch after this story).
- Result: I delivered a secure, low-friction module that juniors could adopt immediately and seniors could extend. It’s CI-ready, documents itself via terraform-docs, and sets a clear baseline for adding SSE-KMS and access logging next.
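If the interviewer asks for specifics, a minimal sketch of the wrapper's shape: naming, environment validation, and safe defaults. Variable and local names are mine; the module inputs are as I recall them from the upstream terraform-aws-modules/s3-bucket docs and should be re-checked against the pinned version before quoting them.

```hcl
# Sketch of the wrapper module: naming convention, environment validation,
# and secure-by-default settings (illustrative, not the production code).
variable "name" {
  description = "Logical bucket name, without prefix or environment suffix"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "environment must be one of: dev, stage, prod."
  }
}

locals {
  # Enforces the wizardai-<name>-<environment> convention in one place.
  bucket_name = "wizardai-${var.name}-${var.environment}"
}

module "bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "5.9.0"

  bucket = local.bucket_name

  # Secure-by-default posture: deny plaintext transport, block public access.
  attach_deny_insecure_transport_policy = true
  block_public_acls                     = true
  block_public_policy                   = true
  ignore_public_acls                    = true
  restrict_public_buckets               = true

  # AES256 with bucket keys as the encryption default.
  server_side_encryption_configuration = {
    rule = {
      bucket_key_enabled = true
      apply_server_side_encryption_by_default = {
        sse_algorithm = "AES256"
      }
    }
  }

  versioning = {
    enabled = true
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```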
GitOps platform migration (Kubernetes)
- Situation: At Domain, I walked into 100+ services each with their own ALB and ECS cluster. Rollbacks took a full rebuild, east-west traffic rode public endpoints, and costs were climbing.
- Task: I had to consolidate onto Kubernetes with GitOps without stalling product teams.
- Action: I re-architected the VPC from /16 to /8 with /20 subnets to avoid IP exhaustion before density increased. I chose ArgoCD plus Gateway API so we could decouple deploy from release and secure internal routing. I built golden-path Helm values to hide K8s complexity and adopted a hydrated-manifests repo for auditability and DR. I ran progressive migrations with weighted DNS (sketched after this story) and held live office hours to migrate services in under an hour.
- Result: I hit 50% adoption in seven months, kept 99.9% uptime during migration, cut MTTR to ~15 minutes, dropped compute costs ~18%, and teams self-served deployments through GitOps.
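For the weighted-DNS cutover, a sketch of the Route 53 side. Record names and variables are placeholders, not Domain's real ones; the pattern is two weighted records for the same hostname, shifting weight from the old ECS ALB to the new Gateway API endpoint one step at a time.

```hcl
# Old path: ECS ALB keeps most of the traffic during early migration.
resource "aws_route53_record" "svc_ecs" {
  zone_id        = var.zone_id
  name           = "orders.internal.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "ecs"
  records        = [var.ecs_alb_dns_name]

  weighted_routing_policy {
    weight = 90
  }
}

# New path: Gateway API endpoint starts with a small slice of traffic.
resource "aws_route53_record" "svc_k8s" {
  zone_id        = var.zone_id
  name           = "orders.internal.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "k8s"
  records        = [var.gateway_hostname]

  weighted_routing_policy {
    weight = 10
  }
}
```

Cutting over is then a reviewable one-line weight change per service, and rollback is the same change in reverse.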
Resilience to third-party failures (retry storm)
- Situation: I was on point during a 26-minute auction outage caused by a retry storm on 3rd-party 429s that saturated our queues.
- Task: I had to restore service fast and make sure this failure class never returned.
- Action: As IC, I ordered an immediate rollback to stop the bleeding. I used ELK to spot the thundering herd pattern (90% 429s) and then standardized a resiliency layer: circuit breakers, exponential backoff with jitter, and chaos tests in CI that inject 429/500s. I templated the pattern so every team could adopt it without bespoke code.
- Result: I got the system back up and we have had zero recurrences. Services now degrade gracefully when third parties fail, and resilience checks run in CI before releases.
Zero-downtime delivery with DB change pressure
- Situation: I needed faster releases, but risky database migrations were the main source of downtime risk.
- Task: I had to enable frequent deploys without breaking schema-dependent services.
- Action: I enforced the expand/contract migration pattern, gated risky changes behind feature flags, and used Argo Rollouts for canaries. I wrote migration runbooks and wired health/SLO checks to trigger automatic rollback if error budgets burned too fast.
- Result: We shipped multiple times per week with no DB-related downtime and always had a predictable rollback path.
Operational readiness & service ownership
- Situation: Teams were shipping fast but without SLOs, and outages were creeping up.
- Task: I needed to shift the org to service ownership and make compliance the easy path.
- Action: I defined tiered service contracts (Tier0–3) and negotiated non-negotiables for Tier1. I built an automated readiness scorecard (PagerDuty, backups, structured logs) inside our internal developer portal and added a pre-flight checklist to the SDLC for capacity and DR testing.
- Result: Tier1 uptime reached 99.98%, incidents dropped ~70%, and the framework became the default way teams self-remediate.
Competency/Behavioral STAR Q&A (spoken, "I" voice)
Problem solving under constraints (prioritization)
- Situation: At Viator, the loudest complaint was GitLab token rotation, but the thing actually hurting us was a 3-day prod→staging DB refresh that kept breaking tests.
- Task: I needed to prioritize for the most impact per engineering hour.
- Action: I delivered a “good enough” token script in two days to remove the noise. Then I re-architected the refresh: I split sanitize/import, dropped unused high-volume tables, and parallelized PII scrubbing.
- Result: I cut refresh time to under four hours, enabled daily refreshes, and restored test reliability while avoiding weeks on low-ROI work.
Communication & persuasion (platform adoption)
- Situation: Product teams feared K8s would slow them down.
- Task: I had to win adoption without mandating it.
- Action: I used a strangler-fig approach: I put Gateway API in front of ECS so teams got rate-limiting/auth for free. I built a golden-path Helm/Argo template so they only edited a 10-line values file. I ran office hours and live-migrated services in under an hour to prove it.
- Result: Adoption hit 50% in seven months, MTTR dropped to 15 minutes, satisfaction was 4.7/5, and compute costs fell ~18%.
Attention to detail (policy + infra guardrails)
- Situation: The S3 module needed to be secure by default for less-experienced users.
- Task: I had to prevent misconfigurations at the source.
- Action: I validated environment values, enforced naming, denied insecure transport, defaulted SSE, blocked public access by default, and added pre-commit hooks plus examples. I required explicit opt-ins for public access and KMS (the transport guardrail is sketched after this story).
- Result: Teams consumed the module safely without review churn, and CI could enforce policy-as-code from day one.
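If asked what "deny insecure transport" actually means, a sketch of the policy document it expands to (the upstream module can attach an equivalent statement via a single flag; the s3_bucket_arn output name is from that module and worth verifying against its docs):

```hcl
# Deny any plaintext access and any TLS version below 1.2 on the bucket.
data "aws_iam_policy_document" "transport_guardrails" {
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      module.bucket.s3_bucket_arn,
      "${module.bucket.s3_bucket_arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }

  statement {
    sid     = "DenyOldTls"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      module.bucket.s3_bucket_arn,
      "${module.bucket.s3_bucket_arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "NumericLessThan"
      variable = "s3:TlsVersion"
      values   = ["1.2"]
    }
  }
}
```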
Conflict management with data (pairing mandate)
- Situation: My manager mandated 100% pairing; the team feared losing velocity.
- Task: I needed to de-escalate and get data.
- Action: I framed a four-week experiment, tested ping-pong vs strong-style vs async, and paired first on a complex ArgoCD refactor to gather evidence.
- Result: We landed on a hybrid: pairing for onboarding and complex design, optional otherwise. Onboarding time dropped 60% and morale recovered.
Learning from failure (optimization bias)
- Situation: I was over-designing K8s lifecycle management and risked delaying the MVP.
- Task: I had to favor delivery over perfection.
- Action: I defined a sacrificial architecture for v1, time-boxed RFCs, and invited a red-team engineer to cut scope ruthlessly.
- Result: We shipped two weeks early, and the pragmatic approach became the team standard.
Incident leadership (retry storm)
- Situation: I led during the 26-minute retry-storm outage.
- Task: Restore service and harden the system.
- Action: I ordered the rollback, then standardized circuit breakers and exponential backoff, added chaos tests to CI, and templated the pattern for all services.
- Result: No recurrences, and teams treat resilience as a first-class requirement.
Balance feature velocity vs stability (modernization)
- Situation: At Domain, deploys were manual/risky, but we couldn’t pause features.
- Task: Modernize CI/CD and observability without a feature freeze.
- Action: I picked a lighthouse service, migrated it with zero downtime and trunk-based flow, then packaged a migration kit (Terraform modules, CI templates, runbook) so product engineers could migrate the other services. I coached them and added observability/IM workshops.
- Result: We finished in five months (one month early), moved to multi-times/week deploys with zero downtime, and teams owned their stacks.
Disagreement with product on technical direction (error budget)
- Situation: A team wanted to ship two features with less than five minutes of error budget left.
- Task: Prevent an SLA breach.
- Action: I modeled the breach risk and reframed the choice as “features vs contract violation.” I proposed a two-week Reliability Sprint and introduced a “Code Yellow” policy blocking non-critical deploys when budget <10%.
- Result: We fixed the top issues, restored availability to 99.95%, avoided penalties, and made error budgets a planning KPI.
Quick candidate questions (10m)
- How is infra split post-Canva acquisition—shared vs independent stack?
- Current deployment model for AI workloads (GPU/Inferentia) and bottlenecks?
- How do you measure platform success (SLOs, velocity, cost targets)?
- Governance for Terraform changes (approvals, policy-as-code, drift detection)?
- Preferred whiteboard patterns/constraints for system design?
- Post-incident review culture—blameless RCAs and automated follow-ups?
- Data isolation/PII handling for customer traffic through models?
Day-of reminders
- Re-read these stories; keep answers ~90–120s.
- Skim Leonardo.Ai product, AWS keynote, CEO podcast, Canva news → 2–3 talking points + 1 tailored question each.
- Be ready to explain the S3 module defaults and one improvement (SSE-KMS + access logging; a raw-provider sketch is at the end of these notes).
- Practice a 5-minute AWS infra whiteboard: edge → app → data → observability → security → resilience.
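For the SSE-KMS + access-logging improvement, a raw-provider sketch of the target configuration. In the real module I would flip its own inputs rather than add standalone resources next to it; the bucket, KMS key, and log-bucket references below are placeholders.

```hcl
# Upgrade default encryption from AES256 to SSE-KMS with a customer-managed key.
resource "aws_s3_bucket_server_side_encryption_configuration" "kms" {
  bucket = module.bucket.s3_bucket_id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = true
  }
}

# Ship server access logs to a central, dedicated log bucket.
resource "aws_s3_bucket_logging" "access_logs" {
  bucket        = module.bucket.s3_bucket_id
  target_bucket = var.access_log_bucket_id
  target_prefix = "s3-access/${local.bucket_name}/"
}
```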