JOB TYPE: Subcontract
JOB MODE: Remote/Hybrid
Job Description:
Responsibilities & Required Skills/Experience:
1) NVIDIA DGX (A100/H100/H200)
2) Cisco UCS-C885A
3) Docker
4) NVIDIA-certified professionals preferred
5) Infrastructure knowledge of the above platforms
6) DevOps Automation:
   • CI/CD systems (e.g., GitLab, GitHub Actions, Jenkins)
   • Terraform, Ansible
   • Python
7) Enterprise-grade Kubernetes clusters (Red Hat OpenShift preferred) and/or Google Anthos
The AI Infrastructure SRE Engineer needs technical knowledge of high-performance compute, NVIDIA DGX/GPUs and/or Cisco Unified Computing System (UCS), and is responsible for the following:
• Handle availability, latency, scalability and efficiency of NVIDIA and Cisco UCS infrastructure by building reliability engineering into the development life cycle, with a focus on fault-tolerant approaches.
• Drive capacity planning, performance analysis, instrumentation, and other non-functional system requirements.
• Automate operational capabilities using Python, Ansible, Terraform, Go, etc.
• Deliver automation through CI/CD pipelines, chatbots, etc.
• Implement metrics-driven processes to ensure service quality targets are met.
Send resumes to hemanth@flexontechnologies.com
