Site Reliability Engineer with NVIDIA (DGX) & Cisco Experience -- San Jose, CA (remote/hybrid)

Job Title: Site Reliability Engineer

Location: San Jose, CA (remote/hybrid)
Job Type: Subcontract
Job Mode: Remote/Hybrid

Job Description:
Responsibilities & Required Skills/Experience:

1) NVIDIA (DGX) – A100/H100/H200
2) Cisco UCS-C885A
3) Docker
4) NVIDIA-certified professionals preferred
5) Infrastructure knowledge of the above technologies
6) DevOps Automation
   • CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
   • Terraform, Ansible
   • Python
7) Enterprise-grade Kubernetes cluster (Red Hat OpenShift preferred) and/or Google Anthos
The AI Infrastructure SRE is responsible for the following. The role requires technical knowledge of high-performance compute, NVIDIA DGX/GPUs, and/or Cisco Unified Computing System (UCS).

• Handle availability, latency, scalability, and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
• Drive capacity planning, performance analysis, instrumentation, and other non-functional systems requirements.
• Automate operational capabilities using Python, Ansible, Terraform, Go, and similar tools.
• Deliver automation through CI/CD pipelines, chatbots, and similar channels.
• Implement metrics-driven processes to ensure service quality targets are met.

Send resumes to hemanth@flexontechnologies.com
