Leonardo.Ai – STAR Interview Prep (DevOps Engineer)
Agenda: 45m technical (Terraform follow-up, AWS system design, containers/CI/CD) + 20m competencies (problem solving, communication, attention to detail) + 10m your questions.
Quick Question List to Memorize
- Terraform module hardening (S3 module challenge)
- GitOps platform migration (ECS → K8s)
- Resilience to third-party failures (retry storm incident)
- Zero-downtime delivery with DB change pressure
- Operational readiness & service ownership
- Problem solving under constraints (Viator prioritization)
- Communication & persuasion (platform adoption)
- Attention to detail (policy + infra guardrails)
- Conflict management with data (pairing mandate)
- Learning from failure (optimization bias)
- Major incident command (retry storm)
- Balance feature velocity vs stability (modernization)
- Disagreement with product on technical direction (error budget)
Technical STAR Q&A (spoken, "I" voice)
Terraform module hardening (S3 module challenge)
- Situation: I inherited a request at Wizard.Ai to add a reusable S3 module that every team could consume. The constraints were strict: everything had to be encrypted at rest and in transit, public-by-default was forbidden, and names had to follow `wizardai-<name>-<environment>` across dev/stage/prod.
- Task: I needed to ship a safe-by-default module that junior engineers could use without footguns, but still give seniors escape hatches for edge cases.
- Action: I wrapped `terraform-aws-modules/s3-bucket` v5.9.0 instead of reinventing it. I enforced the naming convention in locals and validated the `environment` input. I defaulted server-side encryption to AES256 with bucket keys and attached policies to deny insecure transport and require the latest TLS. I blocked public access unless teams explicitly set a flag, kept versioning on, and standardized tagging. I exposed opt-ins for logging, replication, and KMS so the module could grow without hacks. I added pre-commit hooks (`fmt`, `validate`, `tflint`, `terraform-docs`) and an example plan path so CI could prove it (see the sketch after this story).
- Result: I delivered a secure, low-friction module that juniors could adopt immediately and seniors could extend. It’s CI-ready, documents itself via terraform-docs, and sets a clear baseline for adding SSE-KMS and access logging next.
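If the interviewer asks for specifics, a minimal sketch of the wrapper's shape: naming, environment validation, and safe defaults. Variable and local names are mine; the module inputs are as I recall them from the upstream terraform-aws-modules/s3-bucket docs and should be re-checked against the pinned version before quoting them.

```hcl
# Sketch of the wrapper module: naming convention, environment validation,
# and secure-by-default settings (illustrative, not the production code).
variable "name" {
  description = "Logical bucket name, without prefix or environment suffix"
  type        = string
}

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "stage", "prod"], var.environment)
    error_message = "environment must be one of: dev, stage, prod."
  }
}

locals {
  # Enforces the wizardai-<name>-<environment> convention in one place.
  bucket_name = "wizardai-${var.name}-${var.environment}"
}

module "bucket" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "5.9.0"

  bucket = local.bucket_name

  # Secure-by-default posture: deny plaintext transport, block public access.
  attach_deny_insecure_transport_policy = true
  block_public_acls                     = true
  block_public_policy                   = true
  ignore_public_acls                    = true
  restrict_public_buckets               = true

  # AES256 with bucket keys as the encryption default.
  server_side_encryption_configuration = {
    rule = {
      bucket_key_enabled = true
      apply_server_side_encryption_by_default = {
        sse_algorithm = "AES256"
      }
    }
  }

  versioning = {
    enabled = true
  }

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```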
GitOps platform migration (Kubernetes)
- Situation: At Domain, I walked into 100+ services each with their own ALB and ECS cluster. Rollbacks took a full rebuild, east-west traffic rode public endpoints, and costs were climbing.
- Task: I had to consolidate onto Kubernetes with GitOps without stalling product teams.
- Action: I re-architected the VPC from /16 to /8 with /20 subnets to avoid IP exhaustion before density increased. I chose ArgoCD plus Gateway API so we could decouple deploy from release and secure internal routing. I built golden-path Helm values to hide K8s complexity and adopted a hydrated-manifests repo for auditability and DR. I ran progressive migrations with weighted DNS (sketched after this story) and held live office hours to migrate services in under an hour.
- Result: I hit 50% adoption in seven months, kept 99.9% uptime during migration, cut MTTR to ~15 minutes, dropped compute costs ~18%, and teams self-served deployments through GitOps.
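For the weighted-DNS cutover, a sketch of the Route 53 side. Record names and variables are placeholders, not Domain's real ones; the pattern is two weighted records for the same hostname, shifting weight from the old ECS ALB to the new Gateway API endpoint one step at a time.

```hcl
# Old path: ECS ALB keeps most of the traffic during early migration.
resource "aws_route53_record" "svc_ecs" {
  zone_id        = var.zone_id
  name           = "orders.internal.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "ecs"
  records        = [var.ecs_alb_dns_name]

  weighted_routing_policy {
    weight = 90
  }
}

# New path: Gateway API endpoint starts with a small slice of traffic.
resource "aws_route53_record" "svc_k8s" {
  zone_id        = var.zone_id
  name           = "orders.internal.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "k8s"
  records        = [var.gateway_hostname]

  weighted_routing_policy {
    weight = 10
  }
}
```

Cutting over is then a reviewable one-line weight change per service, and rollback is the same change in reverse.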
Resilience to third-party failures (retry storm)
- Situation: I was on point during a 26-minute auction outage caused by a retry storm on 3rd-party 429s that saturated our queues.
- Task: I had to restore service fast and make sure this failure class never returned.
- Action: As IC, I ordered an immediate rollback to stop the bleeding. I used ELK to spot the thundering herd pattern (90% 429s) and then standardized a resiliency layer: circuit breakers, exponential backoff with jitter, and chaos tests in CI that inject 429/500s. I templated the pattern so every team could adopt it without bespoke code.
- Result: I got the system back up and we have had zero recurrences. Services now degrade gracefully when third parties fail, and resilience checks run in CI before releases.
Zero-downtime delivery with DB change pressure
- Situation: I needed faster releases, but risky database migrations were the main source of downtime risk.
- Task: I had to enable frequent deploys without breaking schema-dependent services.
- Action: I enforced the expand/contract migration pattern, gated risky changes behind feature flags, and used Argo Rollouts for canaries. I wrote migration runbooks and wired health/SLO checks to trigger automatic rollback if error budgets burned too fast.
- Result: We shipped multiple times per week with no DB-related downtime and always had a predictable rollback path.
Operational readiness & service ownership
- Situation: Teams were shipping fast but without SLOs, and outages were creeping up.
- Task: I needed to shift the org to service ownership and make compliance the easy path.
- Action: I defined tiered service contracts (Tier0–3) and negotiated non-negotiables for Tier1. I built an automated readiness scorecard (PagerDuty, backups, structured logs) inside our internal developer portal and added a pre-flight checklist to the SDLC for capacity and DR testing.
- Result: Tier1 uptime reached 99.98%, incidents dropped ~70%, and the framework became the default way teams self-remediate.
Competency/Behavioral STAR Q&A (spoken, "I" voice)
Problem solving under constraints (prioritization)
- Situation: At Viator, the loudest complaint was GitLab token rotation, but the thing actually hurting us was a 3-day prod→staging DB refresh that kept breaking tests.
- Task: I needed to prioritize for the most impact per engineering hour.
- Action: I delivered a “good enough” token script in two days to remove the noise. Then I re-architected the refresh: I split sanitize/import, dropped unused high-volume tables, and parallelized PII scrubbing.
- Result: I cut refresh time to under four hours, enabled daily refreshes, and restored test reliability while avoiding weeks on low-ROI work.
Communication & persuasion (platform adoption)
- Situation: Product teams feared K8s would slow them down.
- Task: I had to win adoption without mandating it.
- Action: I used a strangler-fig approach: I put Gateway API in front of ECS so teams got rate-limiting/auth for free. I built a golden-path Helm/Argo template so they only edited a 10-line values file. I ran office hours and live-migrated services in under an hour to prove it.
- Result: Adoption hit 50% in seven months, MTTR dropped to 15 minutes, satisfaction was 4.7/5, and compute costs fell ~18%.
Attention to detail (policy + infra guardrails)
- Situation: The S3 module needed to be secure by default for less-experienced users.
- Task: I had to prevent misconfigurations at the source.
- Action: I validated environment values, enforced naming, denied insecure transport, defaulted SSE, blocked public access by default, and added pre-commit hooks plus examples. I required explicit opt-ins for public access and KMS (the transport guardrail is sketched after this story).
- Result: Teams consumed the module safely without review churn, and CI could enforce policy-as-code from day one.
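If asked what "deny insecure transport" actually means, a sketch of the policy document it expands to (the upstream module can attach an equivalent statement via a single flag; the s3_bucket_arn output name is from that module and worth verifying against its docs):

```hcl
# Deny any plaintext access and any TLS version below 1.2 on the bucket.
data "aws_iam_policy_document" "transport_guardrails" {
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      module.bucket.s3_bucket_arn,
      "${module.bucket.s3_bucket_arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }

  statement {
    sid     = "DenyOldTls"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      module.bucket.s3_bucket_arn,
      "${module.bucket.s3_bucket_arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "NumericLessThan"
      variable = "s3:TlsVersion"
      values   = ["1.2"]
    }
  }
}
```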
Conflict management with data (pairing mandate)
- Situation: My manager mandated 100% pairing; the team feared losing velocity.
- Task: I needed to de-escalate and get data.
- Action: I framed a four-week experiment, tested ping-pong vs strong-style vs async, and paired first on a complex ArgoCD refactor to gather evidence.
- Result: We landed on a hybrid: pairing for onboarding and complex design, optional otherwise. Onboarding time dropped 60% and morale recovered.
Learning from failure (optimization bias)
- Situation: I was over-designing K8s lifecycle management and risked delaying the MVP.
- Task: I had to favor delivery over perfection.
- Action: I defined a sacrificial architecture for v1, time-boxed RFCs, and invited a red-team engineer to cut scope ruthlessly.
- Result: We shipped two weeks early, and the pragmatic approach became the team standard.
Incident leadership (retry storm)
- Situation: I led during the 26-minute retry-storm outage.
- Task: Restore service and harden the system.
- Action: I ordered the rollback, then standardized circuit breakers and exponential backoff, added chaos tests to CI, and templated the pattern for all services.
- Result: No recurrences, and teams treat resilience as a first-class requirement.
Balance feature velocity vs stability (modernization)
- Situation: At Domain, deploys were manual/risky, but we couldn’t pause features.
- Task: Modernize CI/CD and observability without a feature freeze.
- Action: I picked a lighthouse service, migrated it with zero downtime and trunk-based flow, then packaged a migration kit (Terraform modules, CI templates, runbook) so product engineers could migrate the other services. I coached them and added observability/IM workshops.
- Result: We finished in five months (one month early), moved to multi-times/week deploys with zero downtime, and teams owned their stacks.
Disagreement with product on technical direction (error budget)
- Situation: A team wanted to ship two features with less than five minutes of error budget left.
- Task: Prevent an SLA breach.
- Action: I modeled the breach risk and reframed the choice as “features vs contract violation.” I proposed a two-week Reliability Sprint and introduced a “Code Yellow” policy blocking non-critical deploys when budget <10%.
- Result: We fixed the top issues, restored availability to 99.95%, avoided penalties, and made error budgets a planning KPI.
Quick candidate questions (10m)
- How is infra split post-Canva acquisition—shared vs independent stack?
- Current deployment model for AI workloads (GPU/Inferentia) and bottlenecks?
- How do you measure platform success (SLOs, velocity, cost targets)?
- Governance for Terraform changes (approvals, policy-as-code, drift detection)?
- Preferred whiteboard patterns/constraints for system design?
- Post-incident review culture—blameless RCAs and automated follow-ups?
- Data isolation/PII handling for customer traffic through models?
Day-of reminders
- Re-read these stories; keep answers ~90–120s.
- Skim Leonardo.Ai product, AWS keynote, CEO podcast, Canva news → 2–3 talking points + 1 tailored question each.
- Be ready to explain the S3 module defaults and one improvement (SSE-KMS + access logging; a raw-provider sketch is at the end of these notes).
- Practice a 5-minute AWS infra whiteboard: edge → app → data → observability → security → resilience.
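For the SSE-KMS + access-logging improvement, a raw-provider sketch of the target configuration. In the real module I would flip its own inputs rather than add standalone resources next to it; the bucket, KMS key, and log-bucket references below are placeholders.

```hcl
# Upgrade default encryption from AES256 to SSE-KMS with a customer-managed key.
resource "aws_s3_bucket_server_side_encryption_configuration" "kms" {
  bucket = module.bucket.s3_bucket_id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = true
  }
}

# Ship server access logs to a central, dedicated log bucket.
resource "aws_s3_bucket_logging" "access_logs" {
  bucket        = module.bucket.s3_bucket_id
  target_bucket = var.access_log_bucket_id
  target_prefix = "s3-access/${local.bucket_name}/"
}
```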