Precision Sourcing — Site Reliability Engineer (AI-backed)
Job URL: https://www.linkedin.com/jobs/view/4323030456
Role signals: AWS and Azure in hybrid mode, production Kubernetes with Helm and Argo CD (GitOps), Terraform IaC, CI/CD and strong Git discipline, automation in Python/Node, ownership mindset, remote within Australia (AEST hours), backed by a major Australian bank.
Technical Design Questions
1) How would you design a hybrid AWS/Azure GitOps platform for AI workloads?
- Situation: At Domain, I migrated 100+ services from ECS to Kubernetes while keeping regulated workloads auditable; at Avinet I brought the whole AWS estate under Terraform with isolated state.
- Task: Deliver repeatable clusters on AWS and Azure with guardrails, Git-driven deployment, and clear separation per environment.
- Actions: Standardised cluster add-ons via Helm, then used Argo CD ApplicationSets to target env/region overlays (ingress, DNS, identities, External Secrets). Kept IaC module interfaces provider-agnostic, with cloud-specific implementations behind them; isolated state backends (S3+DynamoDB locking vs Azure Storage blob leases). Enforced policy-as-code (OPA/tfsec) and signed releases; added SLO burn alerts and scheduled drift detection (a drift-check sketch follows this answer).
- Result: 50% of Domain workloads moved in 7 months with 99.9% platform uptime; Avinet/illion auditors accepted Terraform plans as change evidence, with a smaller blast radius and faster approvals.
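A minimal sketch of the scheduled drift check mentioned above, relying on Terraform's `-detailed-exitcode` plan flag; the workspace names and the notification hand-off are hypothetical, not from the original setup:

```python
#!/usr/bin/env python3
"""Nightly Terraform drift check (illustrative sketch)."""
import subprocess
import sys

# Hypothetical per-env/region workspaces with read-only credentials.
WORKSPACES = ["prod-ap-southeast-2", "staging-ap-southeast-2"]

def plan_exit_code(workspace: str) -> int:
    # -detailed-exitcode: 0 = clean, 1 = error, 2 = pending changes (drift)
    subprocess.run(["terraform", "workspace", "select", workspace], check=True)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        capture_output=True, text=True,
    )
    return result.returncode

def main() -> int:
    drifted = []
    for ws in WORKSPACES:
        code = plan_exit_code(ws)
        if code == 1:
            print(f"plan failed in {ws}", file=sys.stderr)
            return 1
        if code == 2:
            drifted.append(ws)
    if drifted:
        # Replace print with a Slack/PagerDuty notify() in real use.
        print(f"drift detected in: {', '.join(drifted)}")
        return 2
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```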
2) Describe a major design you resolved (Kubernetes platform from the ground up).
- Situation: Domain needed a resilient successor to ECS for 100+ microservices.
- Task: Stand up a Kubernetes-based platform with safer deployments and simple onboarding.
- Actions: Built baseline Helm charts (network policies, quotas, PDBs), established cluster catalog repos, and wired Argo CD for GitOps rollouts. Introduced SLO templates and alert/runbook links; set canary and rollback flows with Argo Rollouts. Created golden paths and workshops to reduce migration friction.
- Result: Teams shipped via Git in ~9-minute pipelines, deploy frequency rose from 2/week to 5/day, and the core platform held 99.9% uptime throughout the migration.
3) How would you structure Terraform for multi-cloud reliability?
- Situation: At Avinet and illion I moved brownfield AWS estates to Terraform while meeting audit needs.
- Task: Build reusable modules that separate interface from provider specifics and keep state safe.
- Actions: Split interface modules (VPC, EKS/AKS, LB, secrets) from cloud-specific implementations; maintained per-env/region workspaces with strict backends and state locking. Added tfsec and admission checks in CI (a severity gate is sketched after this answer); required PR plans plus promotion tags; templated IAM/identity roles with least privilege; scheduled drift detection, read-only first.
- Result: Passed SOC 2/ISO with zero critical findings at illion; cut change lead time from ~2 hours to ~10 minutes by automating plans and applies; avoided cross-env blast radius during refactors.
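A sketch of the tfsec CI gate referenced above; the JSON field names (`results`, `severity`, `rule_id`) assume tfsec's `--format json` output and should be verified against the installed version:

```python
#!/usr/bin/env python3
"""CI gate: fail the pipeline on HIGH/CRITICAL tfsec findings (sketch)."""
import json
import subprocess
import sys

BLOCKING = {"HIGH", "CRITICAL"}

def main() -> int:
    # tfsec exits non-zero when it finds issues, so don't use check=True.
    proc = subprocess.run(
        ["tfsec", ".", "--format", "json"],
        capture_output=True, text=True,
    )
    findings = json.loads(proc.stdout).get("results") or []
    blocking = [f for f in findings if f.get("severity") in BLOCKING]
    for f in blocking:
        print(f"{f.get('rule_id')}: {f.get('description')}", file=sys.stderr)
    if blocking:
        print(f"{len(blocking)} blocking finding(s); failing the plan stage.")
        return 1
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```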
4) How do you run progressive delivery with Argo CD safely?
- Situation: Domain’s teams needed safer deploys during ECS→Kubernetes migration.
- Task: Provide GitOps pipelines with guardrails and quick rollback.
- Actions: Used Argo CD ApplicationSets per service, paired with Argo Rollouts for canary/blue-green; set health gates on readiness, error rate, p95 latency, and SLO burn (one such gate is sketched after this answer). Enabled rollback via Git revert; enforced signed commits/tags and admission policies (image allowlists, secrets via SOPS/External Secrets). Added drift alerts and sync waves to manage dependencies.
- Result: Deployment frequency jumped to 5/day with sub-10-minute pipelines and reversible changes; engineers trusted the path enough that 15+ teams migrated with no rise in incident rates.
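For illustration, the kind of error-rate gate Argo Rollouts would evaluate via an AnalysisTemplate, rendered as a standalone Python check against the Prometheus HTTP API; the metric names, service label, and 1% threshold are hypothetical:

```python
"""Canary health gate sketch: abort the rollout if 5m error rate > 1%."""
import requests

PROM = "http://prometheus:9090/api/v1/query"  # assumed in-cluster address
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{service="auctions",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="auctions"}[5m]))'
)

def canary_healthy(threshold: float = 0.01) -> bool:
    resp = requests.get(PROM, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return True  # no traffic yet; treat as healthy (a judgment call)
    error_rate = float(result[0]["value"][1])  # instant vector: [ts, value]
    return error_rate < threshold
```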
5) How have you automated SRE toil with Python/Node?
- Situation: At Viator I owned cross-domain DX projects; at Domain I needed safer tokens and chat workflows.
- Task: Remove manual support load and enforce consistent governance.
- Actions: Built an AI Slack bot (Python) to answer common support questions against the internal KB, with structured logging and feature flags. Delivered GitLab token automation (Node) with retries/backoff, expiry checks, and audit logging; added dry-run and scope flags plus unit tests (the expiry/dry-run pattern is sketched after this answer).
- Result: First-response time dropped by ~50%; reclaimed ~10% team velocity; reduced token-related incidents while keeping an audit trail for security.
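The token automation was written in Node; below is a Python rendering of the same pattern (expiry check, dry-run flag, structured audit log). It assumes GitLab's personal access tokens endpoint; the instance URL is a placeholder and the rotation step is deliberately elided:

```python
"""Sketch of the token expiry audit behind the GitLab automation."""
import argparse
import datetime as dt
import logging
import requests

log = logging.getLogger("token-audit")
API = "https://gitlab.example.com/api/v4"  # hypothetical instance

def expiring_tokens(session: requests.Session, within_days: int = 14) -> list[dict]:
    resp = session.get(f"{API}/personal_access_tokens", timeout=10)
    resp.raise_for_status()
    cutoff = dt.date.today() + dt.timedelta(days=within_days)
    return [
        t for t in resp.json()
        if t.get("expires_at") and dt.date.fromisoformat(t["expires_at"]) <= cutoff
    ]

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    logging.basicConfig(level=logging.INFO)
    with requests.Session() as s:
        s.headers["PRIVATE-TOKEN"] = "REDACTED"  # injected from a secret store
        for token in expiring_tokens(s):
            # Structured audit line: who/what/when, no secret material.
            log.info("expiring token id=%s name=%s expires=%s dry_run=%s",
                     token["id"], token["name"], token["expires_at"], args.dry_run)
            if not args.dry_run:
                ...  # rotate: create a replacement, then revoke the old token

if __name__ == "__main__":
    main()
```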
6) How would you keep observability tight while controlling cost?
- Situation: Envato’s hybrid Heroku/AWS stack generated expensive, noisy telemetry.
- Task: Improve signal quality and cost for a cloud app portfolio.
- Actions: Consolidated to Datadog with golden-signal dashboards, trimmed log retention, and introduced SLIs/SLOs with burn-rate alerting (the maths is sketched after this answer). Added PagerDuty workflows and tuned alerts to actionable thresholds; layered OTel instrumentation on critical paths to reduce guesswork.
- Result: Telemetry spend dropped ~45% while incident resolution speed improved ~40%; teams had clearer runbooks and fewer false alerts.
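A worked example of the burn-rate SLO alerting mentioned above, following the multiwindow pattern from the Google SRE Workbook; the observed error ratios are illustrative:

```python
"""SLO burn-rate math sketch (multiwindow paging rule)."""

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget burns: 1.0 = exactly on budget."""
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

# Page only when both a long and a short window burn fast (avoids flapping).
# For 99.9%: a 14.4x burn over 1h exhausts a 30-day budget in ~2 days.
long_window, short_window = 0.0150, 0.0162   # observed error ratios (1h, 5m)
slo = 0.999
page = (burn_rate(long_window, slo) >= 14.4
        and burn_rate(short_window, slo) >= 14.4)
print(f"1h burn={burn_rate(long_window, slo):.1f}x -> page={page}")
```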
Behavioral & Leadership Questions
7) Tell me about persuading teams to adopt a new platform.
- Situation: Product teams at Domain resisted Kubernetes/GitOps due to perceived overhead.
- Task: Hit 50% migration in 9 months without slowing delivery.
- Actions: Ran phased onboarding (gateway first, then services, then CI move), set clear targets, hosted bi-weekly office hours, and shipped templated Helm charts plus docs/videos. Measured deploy times and satisfaction to show progress.
- Result: >50% workloads moved in 7 months; MTTR ~15 minutes; developer satisfaction 4.7/5; AWS costs down 18% after decommissioning legacy stacks.
8) Give an example of embedding security/compliance into delivery.
- Situation: illion faced SOC2 gaps from manual reviews and stale AMIs during an EC2→ECS move.
- Task: Automate compliance without slowing teams.
- Actions: Built a Packer pipeline for weekly and CVE-driven AMI refreshes (the age check is sketched after this answer), with secrets pulled from Secrets Manager; enforced tfsec and IAM least privilege in Terraform; integrated CrowdStrike scans into Bitbucket pipelines; improved tests so patched packages could auto-deploy.
- Result: SOC 2 Type 2 and ISO attained with zero critical findings; security review time dropped from 48 hours to 15 minutes; runtime vulnerabilities down 92% per AWS Inspector.
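A sketch of the staleness check that would sit in front of the weekly rebuild, using standard boto3 EC2 calls; the region and the hand-off to the Packer job are illustrative:

```python
"""Flag AMIs older than the rebuild window (sketch)."""
import datetime as dt
import boto3

MAX_AGE_DAYS = 7  # weekly rebuild cadence from the Packer pipeline

def stale_amis() -> list[str]:
    ec2 = boto3.client("ec2", region_name="ap-southeast-2")
    cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=MAX_AGE_DAYS)
    images = ec2.describe_images(Owners=["self"])["Images"]
    return [
        img["ImageId"] for img in images
        # CreationDate arrives as e.g. "2024-01-01T00:00:00.000Z"
        if dt.datetime.fromisoformat(img["CreationDate"].replace("Z", "+00:00")) < cutoff
    ]

if __name__ == "__main__":
    for ami in stale_amis():
        print(f"{ami} is past the rebuild window")  # hand off to the Packer job
```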
9) Describe a major incident you resolved.
- Situation: Domain experienced a 26-minute outage from an auction-results service looping on HTTP 429s.
- Task: Restore service quickly and prevent recurrence as on-call responder.
- Actions: Parsed ELK logs to identify the retry loop; coordinated rollback of the latest change; then wrote a checklist linked to the alerts and ensured backoff handling was added to the client (the pattern is sketched after this answer).
- Result: Service restored within the SLA window; new runbook/checklist reduced future MTTR and prevented repeat 429 storms.
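The client-side fix follows the standard pattern sketched below: exponential backoff with jitter that honours Retry-After on HTTP 429 instead of hammering the service; the cap values are illustrative, not the actual Domain client:

```python
"""429-aware retry sketch: back off instead of looping on rate limits."""
import random
import time
import requests

def get_with_backoff(url: str, max_attempts: int = 5) -> requests.Response:
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=5)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Prefer the server's hint; otherwise exponential backoff + jitter,
        # capped so a long outage can't stack unbounded sleeps.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else min(2 ** attempt, 30)
        time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```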
10) How do you close an Azure knowledge gap quickly?
- Situation: My recent depth is heavier on AWS/Kubernetes/GitOps; this role needs AWS and Azure.
- Task: Reach Azure parity without blocking delivery.
- Actions: Time-box deep dives on Azure equivalents (AGIC vs ALB, DNS, identity, secrets); pair with an Azure SME while mapping AWS patterns to Azure; build a lab mirroring prod; start with one pilot service and expand via documented templates; keep stakeholders updated with a 30-60-90 plan.
- Result: This approach mirrors how I adopted new stacks at illion and Domain (paired learning plus small pilots), letting me meet delivery goals while ramping safely.