Leonardo.Ai (Canva) – Platform/DevOps Interview Study Guide
Executive Summary
- Role: DevOps Engineer (Platform Tribe), Australia-wide, hybrid multi-cloud with AI/GPU focus.
- What matters: AWS-first platform strength (ECS/EKS), Terraform/IaC discipline, CI/CD velocity, observability + incident excellence, multi-tenant reliability, and willingness to extend into GCP/GCore + GPU workloads.
- Top requirements: AWS & container networking fundamentals; Terraform; CI/CD build/maintenance; observability/incident response; documentation and clear comms; curiosity about GPU/AI infra.
- Culture signals: Remote-friendly Australian squads; high-velocity experimentation with guardrails; documentation and transparency; Canva-influenced structure with startup pace; collaboration and learning guilds.
Key Talking Points (from resume)
- AWS platform migrations: ECS→K8s for 100+ services with 99.9% uptime (Domain); EC2→ECS migration with security/scale (illion).
- CI/CD acceleration: 92% lead-time reduction (illion); containerized pipelines +50% reliability (Envato); GitOps for 15+ teams (Domain).
- Observability/incident outcomes: SLO Helm charts 40% adoption and burn-rate alerts (Domain); Datadog + PagerDuty MTTR -40% (Envato); Avinet incident program MTTR -70%.
- IaC credibility: Terraform full-environment migration (Avinet); standardized Helm/GitOps patterns (Domain).
- Automation & DX: AI Slack bot cutting first-response 50% and GitLab token automation (Viator); Backstage metrics for DX transparency.
- Gaps to acknowledge: No direct GCP/GCore or GPU ops yet—lean on quick ramp plan and analogous migrations.
Company Context Highlights
- Canva acquired Leonardo.Ai in mid-2024; it operates as a high-velocity AI product unit backed by Canva resources.
- Multi-cloud stance: AWS primary, with GCP/GCore/Cloudflare also in the mix; heavy GPU usage for generative AI.
- Likely interview focus: practical design/ops scenarios, on-call and SLO ownership, CI/CD/infra hygiene, documentation habits, and cross-squad collaboration.
Q&A by Category
Technical Depth
Q1. Walk me through how you’d design and harden an EKS-based platform on AWS for an AI image service, including networking (VPC, subnets), pod security, secrets management, and how you’d handle multi-tenant workloads safely.
A. I’d start by isolating blast radius: a dedicated VPC per environment with a /22 for clusters, a /24 per AZ, private subnets for nodes, and tightly controlled public subnets only for ALB/CloudFront. Ingress via ALB + WAF, with CloudFront for cache/edge auth like I did at Envato to trim costs and absorb spikes. EKS runs managed node groups plus tainted GPU pools. Pod security: Pod Security Standards (restricted) in place of the deprecated PSPs, OPA/Gatekeeper or Kyverno policies for runAsNonRoot, readOnlyRootFilesystem, and signed images (Cosign), enforced namespaces per tenant, and strict NetworkPolicies (deny-all by default), similar to the ECS→K8s migration at Domain. Secrets: AWS Secrets Manager + CSI driver, with IRSA per service to avoid node credentials; short TTLs and rotation hooks. Multi-tenancy: namespace-per-tenant with ResourceQuota/LimitRanges, tenant-specific ingress paths or separate ALBs where noisy-neighbour risk warrants it, and dedicated GPU pools for premium tiers. Align logging/metrics by workspace to keep SLOs per tenant, mirroring Domain’s 99.9% uptime approach with SLO Helm charts (40% adoption). GitOps (ArgoCD) manages desired state and gives auditability; a pre-prod canary env per tenant catches policy drift before promotion.
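A minimal sketch of the deny-all-by-default NetworkPolicy baseline per tenant namespace, using the official kubernetes Python client; the namespace name is illustrative and cluster access is assumed to be configured:

```python
from kubernetes import client, config

def apply_default_deny(namespace: str) -> None:
    """Create a deny-all ingress/egress NetworkPolicy in a tenant namespace."""
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="default-deny-all", namespace=namespace),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(),   # empty selector = applies to all pods
            policy_types=["Ingress", "Egress"],      # no allow rules listed = deny both directions
        ),
    )
    client.NetworkingV1Api().create_namespaced_network_policy(namespace, policy)

if __name__ == "__main__":
    apply_default_deny("tenant-acme")  # hypothetical tenant namespace
```

In practice this would ship inside the per-tenant Helm chart rather than imperative code, with explicit allow rules layered on top for DNS, ingress, and approved service-to-service paths.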
Q2. Describe your approach to building repeatable infrastructure with Terraform across multiple accounts and regions; how do you structure modules, manage state, and roll out changes with minimal blast radius?
A. Pattern: a root "live" repo with account/region overlays and versioned, small modules in a separate registry. Modules stay opinionated and single-purpose (vpc, eks, alb, iam-role, secrets), exposing only safe variables. State: one state per bounded context (network, identity, platform, data) per account/region, stored in S3 with bucket policies + SSE-KMS; DynamoDB for locks. I used this at Avinet and Domain to avoid cross-account blast radius. Promotion: feature branches → plan against ephemeral workspaces → PR with plan output → apply via pipeline with manual approval for shared layers (VPC/IAM). Use -target sparingly; prefer canary applies in a non-critical account mirroring prod. For multi-region, I pin provider aliases and keep replication configs explicit to avoid accidental drift. Drift detection via scheduled plans; version pinning on providers to avoid surprise upgrades. Outputs are limited and consumed via data sources to prevent module sprawl. Rollouts are phased: lower env → shadow region → primary, similar to the illion EC2→ECS migration where we cut lead time by 92% while controlling risk.
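A small sketch of the scheduled drift-detection plans mentioned above, assuming the terraform CLI is on PATH and each stack's backend is already configured; the directory layout is illustrative:

```python
import subprocess
import sys

def detect_drift(stack_dir: str) -> bool:
    """terraform plan -detailed-exitcode returns 0 (clean), 2 (changes/drift), or 1 (error)."""
    subprocess.run(["terraform", "init", "-input=false"], cwd=stack_dir, check=True)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=stack_dir,
    )
    if result.returncode == 1:
        raise RuntimeError(f"terraform plan failed in {stack_dir}")
    return result.returncode == 2

if __name__ == "__main__":
    stacks = [  # illustrative per-bounded-context layout
        "live/prod/ap-southeast-2/network",
        "live/prod/ap-southeast-2/platform",
    ]
    drifted = [s for s in stacks if detect_drift(s)]
    if drifted:
        print("Drift detected in:", ", ".join(drifted))
        sys.exit(2)  # non-zero so the scheduler/CI job can raise an alert
```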
Q3. How would you design CI/CD for containerized services using GitHub Actions or GitLab to ensure build repeatability, image signing, and progressive delivery (canary/blue-green) into Kubernetes?
A. Impact first: deterministic builds reduce rollbacks; at illion this cut lead time 92%. Pipeline shape: lint/test → build with pinned base images → SBOM (Syft) → scan (Trivy) → sign (Cosign) → push to ECR/GAR. Cache builder layers but pin versions to avoid drift; use reproducible tags (git SHA, semver). An admission controller enforces signed images and disallows the :latest tag, like the policies I enforced during Domain’s K8s rollout. Deploy via GitOps (ArgoCD) syncing Helm/Kustomize manifests; progressive delivery using Argo Rollouts (canary/blue-green) with metrics-based analysis (Prometheus, Datadog) and auto-rollback thresholds. Secrets pulled at runtime via CSI + IRSA. Workflows templated: at Envato I shipped containerized pipelines (+50% reliability) by standardising reusable pipeline libraries and OPA checks. For governance, require peer review and protected branches; ephemeral preview envs for PRs when cost allows. Artifact promotion is pull-based: the staging tag is promoted to prod through a manifest bump, keeping registries immutable.
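A hedged sketch of the build → SBOM → scan → sign → push steps as a single pipeline helper, assuming docker, syft, trivy, and cosign are installed on the runner; the registry URI is a placeholder:

```python
import subprocess

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def build_scan_sign(repo: str, git_sha: str) -> str:
    """Build an immutably tagged image, emit an SBOM, fail on high/critical CVEs, then sign."""
    image = f"{repo}:{git_sha}"                          # reproducible tag derived from the commit
    run(["docker", "build", "--pull", "-t", image, "."])
    run(["syft", image, "-o", "spdx-json"])              # SBOM to stdout; attach as a CI artifact
    run(["trivy", "image", "--exit-code", "1", "--severity", "HIGH,CRITICAL", image])
    run(["docker", "push", image])
    run(["cosign", "sign", "--yes", image])              # keyless or KMS-backed, depending on CI setup
    return image

if __name__ == "__main__":
    # Placeholder registry/repo; in a real pipeline these come from CI variables.
    build_scan_sign("123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/inference-api", "abc1234")
```

The same logic usually lives in a reusable workflow or pipeline library so every service gets identical guarantees.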
Q4. What’s your strategy for observability on a microservice/Kubernetes stack—cover metrics, logs, traces, SLOs/error budgets, and how you’d wire alerting to reduce noise while catching real customer impact.
A. Anchor on user-facing SLOs; at Domain I drove SLO Helm charts with 40% adoption tied to 99.9% uptime. Metrics: Prometheus/AMP scraping with service monitors; RED/USE per workload. Logs: structured JSON → Fluent Bit → OpenSearch/CloudWatch; sampling kept sane. Traces: OpenTelemetry sidecars/SDK to Datadog (mirroring the Envato migration) with consistent service naming and baggage for tenant/user. Correlate resources via trace_id in logs. Error budgets feed rollout gates and on-call posture. Alerts: budget burn rates (fast/slow), high p95/p99, saturation signals (CPU, GPU, queue depth), and ingress 5xx with per-tenant slicing. Noise reduction: dedupe at the router, multi-step routing (warn → page when sustained), and ownership tags to the right squad—similar to PagerDuty MTTR reduction efforts (Envato -40%, Avinet -70%). Dashboards: golden signals + release overlays. Post-incident alerts added only after RCAs to avoid sprawl.
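A minimal sketch of the fast/slow burn-rate check behind those alerts, assuming windowed error ratios are already queried from Prometheus or Datadog; the thresholds follow the common multi-window pattern and are assumptions rather than Leonardo-specific values:

```python
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    """How many times faster than 'allowed' the error budget is being consumed."""
    error_budget = 1.0 - slo_target           # 0.001 for a 99.9% SLO
    return error_ratio / error_budget

def should_page(ratio_5m: float, ratio_1h: float, ratio_6h: float) -> bool:
    """Fast burn: both 5m and 1h windows hot. Slow burn: both 1h and 6h windows hot."""
    fast = burn_rate(ratio_5m) > 14.4 and burn_rate(ratio_1h) > 14.4
    slow = burn_rate(ratio_1h) > 6.0 and burn_rate(ratio_6h) > 6.0
    return fast or slow

# Example: a sustained 1.5% error rate against a 99.9% SLO pages immediately.
print(should_page(ratio_5m=0.015, ratio_1h=0.015, ratio_6h=0.002))  # True
```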
Q5. Given a production incident with elevated 5xx rates on an ingress controller, how do you triage and stabilize quickly, and what post-incident hardening would you implement?
A. Stabilise fast: freeze deploys and roll back canary weights if using Argo Rollouts. Check health: ingress/pod readiness, upstream 5xx vs 4xx split, recent config changes. If the ingress is bottlenecked, scale replicas and surge capacity; shift a portion of traffic to a healthy AZ/region if available (mirrors Domain DR playbooks). Validate certs/WAF rules. If upstreams are failing, route to the last-known-good version and use feature-flag kill switches. Observability: slice by tenant/path to spot a noisy neighbour. Post-incident: add synthetic checks per edge path, set HPA driven by QPS/error rate, enforce config validation (schema checks/OPA) before rollout, and add admission policies for sane timeouts/body sizes. Improve dashboards for ingress saturation and connection reuse. Run a blameless RCA; at Avinet I formalised this and cut MTTR 70%. Document the runbook, add chaos tests for ingress failover, and ensure canary + auto-rollback are tied to budget burn. Tighten WAF rules only after testing to avoid false positives.
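For the tenant/path slicing step, a quick triage sketch using the Prometheus HTTP API; the metric and tenant label names are assumptions (they depend on the ingress controller and relabelling in place):

```python
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"   # illustrative in-cluster address

def top_5xx_by_tenant(window: str = "5m") -> list[tuple[str, float]]:
    """Return (tenant, 5xx req/s) pairs sorted worst-first via an instant query."""
    query = (
        "sum by (tenant) ("
        f'rate(nginx_ingress_controller_requests{{status=~"5.."}}[{window}])'
        ")"
    )
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    pairs = [(r["metric"].get("tenant", "unknown"), float(r["value"][1])) for r in results]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

if __name__ == "__main__":
    for tenant, rate in top_5xx_by_tenant()[:5]:
        print(f"{tenant}: {rate:.2f} 5xx/s")
```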
Q6. How would you enable GPU workloads in Kubernetes (e.g., node pools, device plugins, scheduling/quotas) and keep cost/performance balanced for bursty generative workloads?
A. Separate GPU node groups with taints; enforce tolerations and resource requests/limits so only GPU jobs land there. Install the NVIDIA device plugin + DCGM exporter for health/metrics. Use node feature discovery for matching GPU types. Quotas by namespace and priority classes protect premium tiers; queue inference/batch jobs through a workload orchestrator (KEDA + custom queue metrics). Scale to zero for low-traffic models; bin-pack by GPU and memory. For bursts, combine on-demand capacity with a smaller spot pool and PDB-aware draining; pre-warm minimal nodes to avoid cold starts. Image caching (a registry mirror) reduces spin-up time. I haven’t shipped GCore/GPU-specific stacks yet; I’d ramp via vendor labs, run synthetic benchmarks, and pilot in a shadow environment before production. Monitoring: GPU utilisation, memory, PCIe errors; autoscale on queue depth + utilisation. Use cost visibility dashboards (similar to Envato’s Cloudflare cost work) to tune reservations vs spot.
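A brief sketch of what the taint/toleration and GPU resource contract looks like from the workload side, using the kubernetes Python client; the node label, taint key, and PriorityClass name are assumptions that would mirror the GPU node group's actual configuration:

```python
from kubernetes import client

def gpu_inference_pod(name: str, image: str, gpus: int = 1) -> client.V1Pod:
    """Pod that only schedules onto tainted GPU nodes and requests whole GPUs."""
    container = client.V1Container(
        name="inference",
        image=image,
        resources=client.V1ResourceRequirements(
            requests={"nvidia.com/gpu": str(gpus), "memory": "16Gi"},
            limits={"nvidia.com/gpu": str(gpus), "memory": "16Gi"},
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"workload": "gpu-inference"}),
        spec=client.V1PodSpec(
            containers=[container],
            node_selector={"node.kubernetes.io/instance-type": "g5.2xlarge"},  # assumed GPU pool label
            tolerations=[client.V1Toleration(
                key="nvidia.com/gpu", operator="Exists", effect="NoSchedule",  # matches the pool taint
            )],
            priority_class_name="premium-inference",  # hypothetical PriorityClass protecting premium tiers
        ),
    )
```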
Q7. You’re asked to add a small GCP/GCore footprint to an existing AWS-first platform. What portability and abstraction choices would you make (networking, identity, Terraform providers, secrets) to avoid heavy vendor lock-in?
A. Keep abstractions thin: standardise on Terraform with provider aliasing; modules stay cloud-specific but share interfaces. Networking: similar CIDR/VPC layout; use Cloudflare/CloudFront-style edge for consistent entry, and per-cloud ingress with WAF parity. Identity: central IdP (OIDC/SAML) → workload identities (IRSA on AWS, Workload Identity on GCP) mapped via a common RBAC model; avoid baking IAM ARNs into app logic. Secrets: the app talks to a small wrapper that selects AWS Secrets Manager or GCP Secret Manager via env/namespace; avoid multi-cloud secret replication unless needed. Artifact/image: push to per-cloud registries but sign once (Cosign) and verify via admission. State: per-cloud backends (S3+DynamoDB, GCS with built-in locking) with identical folder layouts. I’ve not shipped GCore yet; I’d prototype in a sandbox, document the differences (ingress, IAM), and add CI tests for both providers. Keep observability vendor-agnostic via OTEL exporters.
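A minimal sketch of that thin secrets wrapper, assuming boto3 and google-cloud-secret-manager are installed and the active provider is chosen per environment via an env var; secret names and project IDs are placeholders:

```python
import os

def get_secret(name: str) -> str:
    """Resolve a secret from the active cloud's manager; callers never touch provider SDKs."""
    provider = os.environ.get("CLOUD_PROVIDER", "aws")   # set per environment/namespace
    if provider == "aws":
        import boto3
        sm = boto3.client("secretsmanager")
        return sm.get_secret_value(SecretId=name)["SecretString"]
    if provider == "gcp":
        from google.cloud import secretmanager
        sm = secretmanager.SecretManagerServiceClient()
        project = os.environ["GCP_PROJECT"]              # assumed to be injected per environment
        path = f"projects/{project}/secrets/{name}/versions/latest"
        return sm.access_secret_version(request={"name": path}).payload.data.decode("utf-8")
    raise ValueError(f"Unsupported provider: {provider}")

# Usage (illustrative): database_url = get_secret("inference-api-database-url")
```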
Q20. How would you structure IAM across multiple AWS accounts and EKS clusters to give squads least-privilege access to deploy while protecting shared platform components?
A. An AWS Organization with separate accounts per env/team; shared services isolated. Use AWS SSO/IdP → role assumption with permission sets per persona (viewer, deployer, platform-admin). In EKS, map SSO roles to Kubernetes RBAC groups via aws-auth/access-entry identity mappings, and use IRSA for workloads so node IAM credentials never leak. Squads get namespace-scoped roles, limited to their resources; platform components live in locked namespaces with tighter RBAC. CI/CD uses dedicated roles with narrow permissions (assume-role to deploy, write to specific ECR repos). Terraform state per domain reduces blast radius. Audit via CloudTrail + GuardDuty; IAM Access Analyzer for drift. This mirrors Domain’s GitOps approach, where teams self-served but the platform remained protected.
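A short sketch of the narrowly scoped CI deploy role in action, using boto3 STS; the role ARN and session naming are placeholders:

```python
import boto3

def deploy_session(role_arn: str, pipeline_id: str) -> boto3.Session:
    """Exchange the CI runner's identity for a short-lived, narrowly scoped deploy role."""
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"ci-deploy-{pipeline_id}",
        DurationSeconds=900,                 # keep sessions short; deploys should be quick
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Usage (placeholder ARN): the returned session can only do what the role's IAM policy
# allows, e.g. push to approved ECR repos and update the squad's own services.
session = deploy_session("arn:aws:iam::123456789012:role/ci-deployer-squad-a", "run-4711")
ecr = session.client("ecr")
```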
System Design
Q8. Design a multi-region architecture for an AI image generation API that balances latency, GPU capacity, model artifact distribution, and graceful degradation when a region is impaired.
A. Entry via Cloudflare for geo-DNS and WAF. Regions: primary + secondary GPU regions; static/media via R2/S3 with cross-region replication; model artifacts stored once and distributed using registry caching proxies close to clusters. Control plane: GitOps per region with pinned versions; the data plane uses queuing to route jobs to the nearest healthy GPU pool, with circuit breakers. Latency: edge routing prefers low RTT; cache common prompts/responses where legal. Degradation: if GPUs are constrained, degrade to smaller models or queue with a premium-priority fast lane (similar to how I protected SLOs at Domain). Health checks feed weighted DNS; if a region is browned out, shift read traffic progressively while keeping write isolation to avoid split-brain. Observability unified (OTEL → Datadog/Prometheus); burn-rate alerts per region. DR tested regularly, as we did at Avinet/Domain. For artifact validity, enforce signed models and immutable tags; promotion flows run per region but source from a single registry of record.
Q9. How would you secure and operate an internal model registry/artifact store for large models? Cover access control, immutability, promotion workflows, and supply-chain integrity.
A. Use a private registry (ECR/Artifactory/Harbor) behind VPC endpoints. Access: SSO + RBAC; write limited to CI/CD roles, read per-team. Immutability: registry-level immutable tags; promotions use new tags (dev→stage→prod) rather than mutating. Sign artifacts (Cosign) and store attestations; admission controllers verify signatures and provenance. Integrate malware/LLM-safety scans pre-publish. Promotion workflow mirrors what I used for containers: build → scan → sign → push dev; promotion via PR in GitOps repo that references the digest. Replicate to secondary regions with checksum verification. Audit logging on pulls/pushes, alerts on unusual egress. Storage with lifecycle policies for old versions. For supply chain, pin base models, record SBOM/manifests, and restrict registry to private networks. Tie deployments to approved digests so rollouts can be traced and rolled back cleanly.
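A small sketch of the checksum verification applied after cross-region replication, using only the standard library; the file path and expected digest are illustrative:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a large model file without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_replica(path: str, expected_digest: str) -> None:
    """Fail loudly if a replicated artifact does not match the registry of record."""
    actual = sha256_of(path)
    if actual != expected_digest:
        raise RuntimeError(
            f"Digest mismatch for {path}: expected {expected_digest}, got {actual}"
        )

# Usage (illustrative path/digest): run before admitting the artifact to the regional registry.
# verify_replica("/data/models/example-model.safetensors", "9f86d081884c7d65...")
```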
Q10. Propose a traffic management approach across AWS and a secondary cloud using Cloudflare: how would you handle DNS, health checks, failover, and gradual traffic shifting?
A. Cloudflare as the single DNS and WAF front door. Use load balancing pools per cloud/region with health checks hitting synthetic probes on /healthz and critical flows. Normal state: weighted routing favouring AWS, small steady trickle to secondary to keep it warm. Shifts: adjust weights gradually (canary) based on SLO burn and capacity. Failover: mark pool down on consecutive failures; TTL kept low but not thrashy. Cache static responses at edge; use Argo Smart Routing if available. Origin shields per cloud to reduce egress. For ingress paths with GPU work, route premium tenants to the healthiest pool first; if constrained, queue with clear error budgets. All changes flow via GitOps for auditability. Mirrors Envato’s Cloudflare cost/control work where we cut 25% spend while keeping resilience.
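A sketch of the gradual weight-shifting decision logic only (the actual pool update would go through a GitOps-managed Cloudflare config change, since API details vary by account setup); the burn-rate thresholds and caps are assumptions:

```python
def next_weights(aws_weight: float, secondary_weight: float,
                 aws_burn_rate: float, step: float = 0.10) -> tuple[float, float]:
    """Shift traffic toward the secondary pool while the primary's SLO burn is hot,
    then drift back to the warm-trickle baseline once it recovers. Weights sum to 1.0."""
    if aws_burn_rate > 6.0:                                    # sustained budget burn on the primary
        secondary_weight = min(secondary_weight + step, 0.50)  # cap the shift per policy
    elif aws_burn_rate < 1.0:                                  # healthy again
        secondary_weight = max(secondary_weight - step, 0.05)  # keep the secondary warm
    return round(1.0 - secondary_weight, 2), round(secondary_weight, 2)

# Example: primary burning budget fast, secondary at its 5% warm trickle.
print(next_weights(0.95, 0.05, aws_burn_rate=8.0))  # -> (0.85, 0.15)
```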
Q19. If GPU capacity is constrained, how would you design a queuing and scheduling approach (priorities, backpressure, autoscaling signals) to protect latency SLAs for premium tiers?
A. Use a queue-per-tier with priority (premium > standard > batch). Admission control checks current GPU/queue depth; premium gets preemption rights via priority classes and PDB-aware evictions for best-effort jobs. Backpressure: return 429/Retry-After for non-premium when queues exceed SLO thresholds; degrade to smaller models if allowed. Autoscaling signals: GPU util, queue depth, and request latency drive both HPA and cluster autoscaler for GPU pools (on-demand + spot). Warm pools for premium to avoid cold starts. This mirrors how I protected SLOs during Domain migrations. Observability: per-tenant dashboards and alerts on burn-rate + queue age. Cost guardrails surfaced to teams so they can choose speed vs spend.
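A compact sketch of the tiered queue with explicit backpressure described above; tier names, depth thresholds, and the 429/Retry-After contract are assumptions for illustration (a production version would sit on SQS/Redis rather than in memory):

```python
import heapq
import itertools

TIER_PRIORITY = {"premium": 0, "standard": 1, "batch": 2}    # lower number = served first
MAX_NON_PREMIUM_DEPTH = 200                                   # assumed SLO-derived threshold

class GpuJobQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, dict]] = []
        self._seq = itertools.count()                         # FIFO tie-break within a tier

    def submit(self, job: dict, tier: str) -> tuple[int, dict]:
        """Admit a job, or push back with 429 + Retry-After for non-premium tiers under load."""
        depth = len(self._heap)
        if tier != "premium" and depth >= MAX_NON_PREMIUM_DEPTH:
            return 429, {"Retry-After": "30"}                 # explicit backpressure, not silent queuing
        heapq.heappush(self._heap, (TIER_PRIORITY[tier], next(self._seq), job))
        return 202, {"queue_depth": depth + 1}

    def next_job(self) -> dict | None:
        """Workers pull the highest-priority, oldest job; premium always jumps the line."""
        return heapq.heappop(self._heap)[2] if self._heap else None

# Queue depth and oldest-job age feed the GPU pool autoscaler alongside utilisation.
```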
Leadership & Collaboration
Q11. Describe a time you aligned product and infra teams on delivery risk versus speed for a launch. How did you communicate trade-offs and measure success?
A. Situation: At Domain, migrating 100+ services from ECS to K8s while keeping 99.9% uptime. Task: Balance launch speed with risk for product squads. Action: Ran joint workshops to map critical paths and tagged services into risk tiers; introduced SLO-backed canaries and GitOps so teams could self-serve. Presented options with impact (e.g., staged rollout vs big-bang) and used lead-time + error budget projections. Communicated weekly in concise updates, showing what we’d stop if risk rose. Result: 100+ services migrated with no major incidents; 40% SLO chart adoption; product teams hit their dates. Success measured by zero Sev1s during rollout and steady release velocity. This mirrors my Australian style—direct, transparent, and data-led.
Q12. How do you build documentation and runbooks into fast-moving platform work so squads can self-serve without creating a support bottleneck?
A. I embed docs in the delivery path. For the GitOps migrations at Domain and containerized pipelines at Envato, every change shipped with: 1) concise README in the repo (how to deploy, rollback, common errors), 2) runbooks linked from dashboards and alerts, and 3) Backstage-style catalog entries with ownership and golden paths. I keep templates for playbooks (validate, fix, verify) and require them in PR definitions of done. Office hours + short Looms to show flows; FAQs updated after each ticket. Metrics: drop in support tickets and time-to-first-success for new teams; at Envato we saw support pings fall as pipelines standardised. Documentation lives close to code and stays versioned with releases to avoid drift.
Q13. Tell me about leading an on-call improvement effort—what changes did you make to rotations, dashboards, or escalation that reduced pain and improved MTTR?
A. Situation: Envato’s on-call MTTR was high and alerts noisy. Task: Reduce pain and MTTR. Action: Introduced tiered rotations (L1 product, L2 platform) with clear ownership; rewired alerts to SLO burn and golden signals, cutting low-value noise. Migrated observability to Datadog with consistent tags; built dashboards showing release overlays and dependency health. Added PagerDuty runbooks and auto-rollback hooks for bad deploys, similar to Avinet’s incident work. Result: MTTR dropped ~40% at Envato; at Avinet a similar play cut MTTR ~70%. Pages per shift fell, and responders had clearer steps. Communication stayed concise and blameless, improving morale.
Company-Specific
Q14. Leonardo.Ai runs fast-paced AI product squads. How would you balance guardrails (policies, quotas, budgets) with the need for rapid experimentation on the platform?
A. Set defaults, not roadblocks. Provide golden paths: pre-approved templates (Helm/ArgoCD) with policies baked in—runAsNonRoot, signed images, network policies. Quotas and budgets per squad/tenant; premium tiers can burst with approvals. For GPUs, priority classes and queueing protect SLOs while allowing ad-hoc trials. Offer self-serve toggles via Backstage or Slack bot (I built similar AI support bots at Viator) for creating preview envs with expiry. Observability and cost dashboards per team to give autonomy with visibility. Governance via OPA/Kyverno so experiments stay safe. Fast feedback—ephemeral canaries and feature flags—keeps iteration speed. Regularly review guardrails with squads to prune unused rules; this keeps the Australian directness: clear rules, quick exceptions with rationale.
Q15. Given Leonardo.Ai’s multi-cloud stance under Canva, how would you decide which services stay AWS-native versus adopting cross-cloud primitives?
A. Decision lens: 1) latency to customers, 2) data gravity/compliance, 3) operational maturity, 4) portability cost. Stay AWS-native when you need deep integration (EKS, IAM, KMS, ALB, Dynamo for control data) and when operational runbooks already exist. Use cross-cloud/portable for edge (Cloudflare), CI/CD (GitHub/GitLab), observability (OTEL exporters with Datadog/Prometheus), and model registry with signed artifacts that can replicate. Storage: object stores per cloud with a single source of truth and replication. Keep app code free of provider SDKs by using thin service layers. I’d prototype GCP/GCore pieces in isolation first (I’ve not shipped GCore yet) and only promote when SLOs and runbooks match AWS parity.
Behavioral
Q16. Describe a challenging production incident you owned end-to-end. What was your role, how did you communicate during the event, and what lasting fixes did you drive?
A. Situation: Avinet outage with cascading service failures and long MTTR. Task: Restore service and prevent recurrence. Action: I took incident commander role—froze deploys, established a single bridge, delegated diagnostics (DB, API, network). Kept stakeholders updated every 15 minutes with clear status/ETA, using concise Australian-style comms. Restored service via controlled rollback and traffic shaping. Post-incident, I led RCA: added Terraform-managed DR paths, improved dashboards, and rewired alerts to SLO burn. Rotations clarified ownership. Result: MTTR dropped ~70% afterward; on-calls reported fewer surprises. Documentation and drills kept the improvements sticky.
Q17. Share an example where you had to deliver platform changes with limited knowledge of a new stack (e.g., GCP). How did you ramp up quickly and manage risk?
A. Situation: Asked to extend AWS-first patterns into a new environment (analogy to a GCP ask). Task: deliver without deep prior stack knowledge. Action: I built a sandbox, read provider docs, and paired with in-house SMEs; started with low-risk components (logging, IAM mappings) and wrote thin Terraform modules mirroring AWS interfaces. Kept blast radius low: applied in non-prod, used feature flags, and ran chaos tests. Documented differences (networking, IAM) and updated runbooks. Result: delivered a working slice with no prod incidents, then expanded. If doing this on GCP/GCore today, I’d follow the same play: lab first, pilot service, measure SLO parity, and only then widen adoption.
Q18. Tell me about a time you pushed back on a risky deployment timeline. How did you negotiate a safer plan while keeping the team moving?
A. Situation: At illion, product pushed for a rapid EC2→ECS cutover. Task: Avoid downtime while meeting business pressure. Action: Presented data: current failure modes, missing runbooks, and projected rollback time. Proposed a compromise—phased blue/green with canaries and synthetic checks, plus daytime deploy windows. Kept scope focused, trimmed non-critical changes, and prepped rollback scripts. Communicated clearly with leadership on trade-offs and got sign-off for a staged approach. Result: Cutover achieved with 92% lead-time improvement and no major incidents; teams kept momentum because they had clear gates and ownership.
Authentic Interview Intel (external)
- Research status: partial (Leonardo.Ai/Canva-specific question data scarce).
- No credible external questions reused; one generic behavioral prompt (“Tell me about yourself”) surfaced with low confidence; discriminatory content in scraped data was discarded.
- Use the above set as the primary preparation source.
Questions to Ask Them
- How are GPU workloads scheduled and prioritized across product teams today, and where are the current bottlenecks?
- What “golden path” exists for new services—CI/CD templates, observability defaults, and governance policies?
- How do squads balance experimentation with cost controls for AI workloads, and what tooling supports that (dashboards, budgets, alerts)?
- What does on-call look like for platform vs product teams, and how are incidents triaged across Canva/Leonardo boundaries?
- How is multi-cloud traffic managed today (Cloudflare, DNS weights), and what failure drills are run?
- Where is the biggest documentation or runbook gap slowing adoption of platform capabilities?
- What’s the near-term roadmap for model registry/governance and safety reviews?
Metrics & Validation
- Questions: 20 across Technical Depth, System Design, Leadership/Collaboration, Company-Specific, Behavioral.
- Validation: avg quality 7.94/10; 0 answers failed thresholds (Relevance ≥7, Authenticity ≥7, Standout ≥6).
- External interview intel: partial (no reliable technical Qs surfaced).
Notes & Alignment Checks
- Strengths: AWS/K8s, Terraform, CI/CD, observability, incident leadership, GitOps enablement.
- Gaps: Direct GCP/GCore and GPU production experience—mitigate by highlighting sandbox/benchmark plan and quick-ramp pattern used in past migrations.
- Tone to use: direct, confident, collaborative (Australian professional style), STAR for behavioral, impact-first for technical.