Precision Sourcing — Site Reliability Engineer (AI-backed)
Job URL: https://www.linkedin.com/jobs/view/4323030456
Role signals: AWS and Azure in hybrid mode, production Kubernetes with Helm and Argo CD (GitOps), Terraform IaC, CI/CD and strong Git discipline, automation in Python/Node, ownership mindset, remote within Australia (AEST hours), backed by a major Australian bank.
Technical Design Questions
1) How would you design a hybrid AWS/Azure GitOps platform for AI workloads?
- Situation: At Domain, I migrated 100+ services from ECS to Kubernetes while keeping regulated workloads auditable; at Avinet I brought the whole AWS estate under Terraform with isolated state.
- Task: Deliver repeatable clusters on AWS and Azure with guardrails, Git-driven deployment, and clear separation per environment.
- Actions: Standardised cluster add-ons via Helm, then used Argo CD ApplicationSets to target env/region overlays (ingress, DNS, identities, External Secrets). Kept IaC module interfaces provider-agnostic, with cloud-specific implementations behind them; isolated state backends (S3+DynamoDB locking vs Azure Storage blob leases). Enforced policy-as-code (OPA/tfsec) and signed releases; added SLO burn alerts and scheduled drift detection (a drift-check sketch follows this answer).
- Result: 50% of Domain workloads moved in 7 months with 99.9% platform uptime; Avinet/illion auditors accepted Terraform plans as change evidence, with a smaller blast radius and faster approvals.
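A minimal sketch of the scheduled drift check mentioned above, relying on Terraform's `-detailed-exitcode` plan flag; the workspace names and the notification hand-off are hypothetical, not from the original setup:

```python
#!/usr/bin/env python3
"""Nightly Terraform drift check (illustrative sketch)."""
import subprocess
import sys

# Hypothetical per-env/region workspaces with read-only credentials.
WORKSPACES = ["prod-ap-southeast-2", "staging-ap-southeast-2"]

def plan_exit_code(workspace: str) -> int:
    # -detailed-exitcode: 0 = clean, 1 = error, 2 = pending changes (drift)
    subprocess.run(["terraform", "workspace", "select", workspace], check=True)
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        capture_output=True, text=True,
    )
    return result.returncode

def main() -> int:
    drifted = []
    for ws in WORKSPACES:
        code = plan_exit_code(ws)
        if code == 1:
            print(f"plan failed in {ws}", file=sys.stderr)
            return 1
        if code == 2:
            drifted.append(ws)
    if drifted:
        # Replace print with a Slack/PagerDuty notify() in real use.
        print(f"drift detected in: {', '.join(drifted)}")
        return 2
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```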
2) Describe a major design you resolved (Kubernetes platform from the ground up).
- Situation: Domain needed a resilient successor to ECS for 100+ microservices.
- Task: Stand up a Kubernetes-based platform with safer deployments and simple onboarding.
- Actions: Built baseline Helm charts (network policies, quotas, PDBs), established cluster catalog repos, and wired Argo CD for GitOps rollouts. Introduced SLO templates and alert/runbook links; set canary and rollback flows with Argo Rollouts. Created golden paths and workshops to reduce migration friction.
- Result: Teams shipped via Git in ~9-minute pipelines, deploy frequency rose from 2/week to 5/day, and the core platform held 99.9% uptime throughout the migration.
3) How would you structure Terraform for multi-cloud reliability?
- Situation: At Avinet and illion I moved brownfield AWS estates to Terraform while meeting audit needs.
- Task: Build reusable modules that separate interface from provider specifics and keep state safe.
- Actions: Split interface modules (VPC, EKS/AKS, LB, secrets) from cloud-specific implementations; maintained per-env/region workspaces with strict backends and state locking. Added tfsec and admission checks in CI (a severity gate is sketched after this answer); required PR plans plus promotion tags; templated IAM/identity roles with least privilege; scheduled drift detection, read-only first.
- Result: Passed SOC 2/ISO with zero critical findings at illion; cut change lead time from ~2 hours to ~10 minutes by automating plans and applies; avoided cross-env blast radius during refactors.
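A sketch of the tfsec CI gate referenced above; the JSON field names (`results`, `severity`, `rule_id`) assume tfsec's `--format json` output and should be verified against the installed version:

```python
#!/usr/bin/env python3
"""CI gate: fail the pipeline on HIGH/CRITICAL tfsec findings (sketch)."""
import json
import subprocess
import sys

BLOCKING = {"HIGH", "CRITICAL"}

def main() -> int:
    # tfsec exits non-zero when it finds issues, so don't use check=True.
    proc = subprocess.run(
        ["tfsec", ".", "--format", "json"],
        capture_output=True, text=True,
    )
    findings = json.loads(proc.stdout).get("results") or []
    blocking = [f for f in findings if f.get("severity") in BLOCKING]
    for f in blocking:
        print(f"{f.get('rule_id')}: {f.get('description')}", file=sys.stderr)
    if blocking:
        print(f"{len(blocking)} blocking finding(s); failing the plan stage.")
        return 1
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```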
4) How do you run progressive delivery with Argo CD safely?
- Situation: Domain’s teams needed safer deploys during ECS→Kubernetes migration.
- Task: Provide GitOps pipelines with guardrails and quick rollback.
- Actions: Used Argo CD ApplicationSets per service, paired with Argo Rollouts for canary/blue-green; set health gates on readiness, error rate, p95 latency, and SLO burn (one such gate is sketched after this answer). Enabled rollback via Git revert; enforced signed commits/tags and admission policies (image allowlists, secrets via SOPS/External Secrets). Added drift alerts and sync waves to manage dependencies.
- Result: Deployment frequency jumped to 5/day with sub-10-minute pipelines and reversible changes; engineers trusted the path enough that 15+ teams migrated with no rise in incident rates.
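For illustration, the kind of error-rate gate Argo Rollouts would evaluate via an AnalysisTemplate, rendered as a standalone Python check against the Prometheus HTTP API; the metric names, service label, and 1% threshold are hypothetical:

```python
"""Canary health gate sketch: abort the rollout if 5m error rate > 1%."""
import requests

PROM = "http://prometheus:9090/api/v1/query"  # assumed in-cluster address
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{service="auctions",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{service="auctions"}[5m]))'
)

def canary_healthy(threshold: float = 0.01) -> bool:
    resp = requests.get(PROM, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return True  # no traffic yet; treat as healthy (a judgment call)
    error_rate = float(result[0]["value"][1])  # instant vector: [ts, value]
    return error_rate < threshold
```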
5) How have you automated SRE toil with Python/Node?
- Situation: At Viator I owned cross-domain DX projects; at Domain I needed safer tokens and chat workflows.
- Task: Remove manual support load and enforce consistent governance.
- Actions: Built an AI Slack bot (Python) to answer common support questions against the internal KB, with structured logging and feature flags. Delivered GitLab token automation (Node) with retries/backoff, expiry checks, and audit logging; added dry-run and scope flags plus unit tests (the expiry/dry-run pattern is sketched after this answer).
- Result: First-response time dropped by ~50%; reclaimed ~10% team velocity; reduced token-related incidents while keeping an audit trail for security.
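The token automation was written in Node; below is a Python rendering of the same pattern (expiry check, dry-run flag, structured audit log). It assumes GitLab's personal access tokens endpoint; the instance URL is a placeholder and the rotation step is deliberately elided:

```python
"""Sketch of the token expiry audit behind the GitLab automation."""
import argparse
import datetime as dt
import logging
import requests

log = logging.getLogger("token-audit")
API = "https://gitlab.example.com/api/v4"  # hypothetical instance

def expiring_tokens(session: requests.Session, within_days: int = 14) -> list[dict]:
    resp = session.get(f"{API}/personal_access_tokens", timeout=10)
    resp.raise_for_status()
    cutoff = dt.date.today() + dt.timedelta(days=within_days)
    return [
        t for t in resp.json()
        if t.get("expires_at") and dt.date.fromisoformat(t["expires_at"]) <= cutoff
    ]

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    logging.basicConfig(level=logging.INFO)
    with requests.Session() as s:
        s.headers["PRIVATE-TOKEN"] = "REDACTED"  # injected from a secret store
        for token in expiring_tokens(s):
            # Structured audit line: who/what/when, no secret material.
            log.info("expiring token id=%s name=%s expires=%s dry_run=%s",
                     token["id"], token["name"], token["expires_at"], args.dry_run)
            if not args.dry_run:
                ...  # rotate: create a replacement, then revoke the old token

if __name__ == "__main__":
    main()
```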
6) How would you keep observability tight while controlling cost?
- Situation: Envato’s hybrid Heroku/AWS stack generated expensive, noisy telemetry.
- Task: Improve signal quality and cost for a cloud app portfolio.
- Actions: Consolidated to Datadog with golden-signal dashboards, trimmed log retention, and introduced SLIs/SLOs with burn-rate alerting (the maths is sketched after this answer). Added PagerDuty workflows and tuned alerts to actionable thresholds; layered OTel instrumentation on critical paths to reduce guesswork.
- Result: Telemetry spend dropped ~45% while incident resolution speed improved ~40%; teams had clearer runbooks and fewer false alerts.
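A worked example of the burn-rate SLO alerting mentioned above, following the multiwindow pattern from the Google SRE Workbook; the observed error ratios are illustrative:

```python
"""SLO burn-rate math sketch (multiwindow paging rule)."""

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget burns: 1.0 = exactly on budget."""
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

# Page only when both a long and a short window burn fast (avoids flapping).
# For 99.9%: a 14.4x burn over 1h exhausts a 30-day budget in ~2 days.
long_window, short_window = 0.0150, 0.0162   # observed error ratios (1h, 5m)
slo = 0.999
page = (burn_rate(long_window, slo) >= 14.4
        and burn_rate(short_window, slo) >= 14.4)
print(f"1h burn={burn_rate(long_window, slo):.1f}x -> page={page}")
```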
Behavioral & Leadership Questions
7) Tell me about persuading teams to adopt a new platform.
- Situation: Product teams at Domain resisted Kubernetes/GitOps due to perceived overhead.
- Task: Hit 50% migration in 9 months without slowing delivery.
- Actions: Ran phased onboarding (gateway first, then services, then CI move), set clear targets, hosted bi-weekly office hours, and shipped templated Helm charts plus docs/videos. Measured deploy times and satisfaction to show progress.
- Result: >50% workloads moved in 7 months; MTTR ~15 minutes; developer satisfaction 4.7/5; AWS costs down 18% after decommissioning legacy stacks.
8) Give an example of embedding security/compliance into delivery.
- Situation: illion faced SOC2 gaps from manual reviews and stale AMIs during an EC2→ECS move.
- Task: Automate compliance without slowing teams.
- Actions: Built a Packer pipeline for weekly and CVE-driven AMI refreshes (the age check is sketched after this answer), with secrets pulled from Secrets Manager; enforced tfsec and IAM least privilege in Terraform; integrated CrowdStrike scans into Bitbucket pipelines; improved tests so patched packages could auto-deploy.
- Result: SOC 2 Type 2 and ISO attained with zero critical findings; security review time dropped from 48 hours to 15 minutes; runtime vulnerabilities down 92% per AWS Inspector.
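A sketch of the staleness check that would sit in front of the weekly rebuild, using standard boto3 EC2 calls; the region and the hand-off to the Packer job are illustrative:

```python
"""Flag AMIs older than the rebuild window (sketch)."""
import datetime as dt
import boto3

MAX_AGE_DAYS = 7  # weekly rebuild cadence from the Packer pipeline

def stale_amis() -> list[str]:
    ec2 = boto3.client("ec2", region_name="ap-southeast-2")
    cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=MAX_AGE_DAYS)
    images = ec2.describe_images(Owners=["self"])["Images"]
    return [
        img["ImageId"] for img in images
        # CreationDate arrives as e.g. "2024-01-01T00:00:00.000Z"
        if dt.datetime.fromisoformat(img["CreationDate"].replace("Z", "+00:00")) < cutoff
    ]

if __name__ == "__main__":
    for ami in stale_amis():
        print(f"{ami} is past the rebuild window")  # hand off to the Packer job
```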
9) Describe a major incident you resolved.
- Situation: Domain experienced a 26-minute outage from an auction-results service looping on HTTP 429s.
- Task: Restore service quickly and prevent recurrence as on-call responder.
- Actions: Parsed ELK logs to identify the retry loop; coordinated rollback of the latest change; then wrote a checklist linked to the alerts and ensured backoff handling was added to the client (the pattern is sketched after this answer).
- Result: Service restored within the SLA window; new runbook/checklist reduced future MTTR and prevented repeat 429 storms.
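The client-side fix follows the standard pattern sketched below: exponential backoff with jitter that honours Retry-After on HTTP 429 instead of hammering the service; the cap values are illustrative, not the actual Domain client:

```python
"""429-aware retry sketch: back off instead of looping on rate limits."""
import random
import time
import requests

def get_with_backoff(url: str, max_attempts: int = 5) -> requests.Response:
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=5)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Prefer the server's hint; otherwise exponential backoff + jitter,
        # capped so a long outage can't stack unbounded sleeps.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else min(2 ** attempt, 30)
        time.sleep(delay + random.uniform(0, 1))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```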
10) How do you close an Azure knowledge gap quickly?
- Situation: My recent depth is heavier on AWS/Kubernetes/GitOps; this role needs AWS and Azure.
- Task: Reach Azure parity without blocking delivery.
- Actions: Time-box deep dives on Azure equivalents (AGIC vs ALB, DNS, identity, secrets); pair with an Azure SME while mapping AWS patterns to Azure; build a lab mirroring prod; start with one pilot service and expand via documented templates; keep stakeholders updated with a 30-60-90 plan.
- Result: This approach mirrors how I adopted new stacks at illion and Domain (paired learning plus small pilots), letting me meet delivery goals while ramping safely.