Title: SRE Infrastructure Engineer
Location: SFO, CA (5 Days Onsite)
Job Description:
We are seeking an SRE Infrastructure Engineer with 8+ years of professional experience ensuring the reliability, scalability, and performance of Google Cloud-based services through automation, monitoring, and proactive engineering. Key responsibilities include managing infrastructure as code (Terraform), optimizing GKE/Kubernetes, responding to incidents, implementing SLIs/SLOs, and minimizing manual toil.
This role requires close collaboration with cross‑functional teams, adherence to DevOps and Agile practices, and ownership of service quality and delivery.
Key Responsibilities
- GCP Infrastructure Management: Design, deploy, and maintain robust infrastructure components, including VPCs, Compute Engine, GKE (Kubernetes), and storage solutions.
- Automation & IaC: Use Terraform or Deployment Manager to manage cloud resources, and build CI/CD pipelines to automate deployments. Minimize manual, repetitive tasks by developing automation scripts and custom tools that streamline deployments and operations.
- Observability: Develop monitoring, alerting, and logging systems (e.g., Cloud Monitoring, Prometheus, Grafana).
- Incident Management: Act as primary on-call and first responder for production incidents and system outages, and conduct deep-dive root cause analysis (post-mortems) to prevent recurrence.
- CI/CD Pipeline Management: Design and support automated deployment pipelines using Jenkins, ArgoCD, Artifactory, GitLab CI, or GitHub Actions, following DevSecOps practices.
- Reliability Engineering: Define and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) covering latency, traffic, errors, and saturation.
- Optimization & Security: Proactively optimize infrastructure for cost, performance, and security compliance.
- AI Workload Reliability: Focus specifically on AI workload health and Google Compute Engine (GCE) visibility.
Mandatory Technical Skills & Competencies
- Experience: 8+ years in SRE, DevOps, or systems engineering, specifically with Google Cloud Platform.
- Technical Skills: Deep knowledge of Linux, Kubernetes (GKE), networking (VPCs, CDNs), and containerization.
- Programming: Proficiency in scripting/programming languages like Python, Go, or Shell.
- Methodologies: Strong understanding of GitOps, CI/CD pipelines, and SRE principles (error budgets, toil reduction).
- Strong troubleshooting skills across the full stack (network, OS, application).
- Ability to balance system stability with the need for rapid deployment.
- Observability Tools: Experience implementing monitoring and logging stacks such as Prometheus, Grafana, or the Google Cloud Operations Suite.
- Excellent collaboration skills for working with development teams on service ownership.
Soft Skills
- Strong problem-solving and analytical skills
- Clear communication with technical and non‑technical stakeholders
- Ownership mindset and production‑grade engineering discipline
- Ability to work independently and within cross‑functional teams
Neha Chaudhary