Role – AI Reliability Engineer (SRE) for Gen AI Systems
Location – Tampa, FL (Onsite)
Job Type – C2C/W2
Job Description:
- Minimum 13 years of experience.
Position Overview
We are seeking a highly skilled AI Reliability Engineer (SRE) for Gen AI Systems to join our application engineering team. In this role, you will bridge the gap between advanced application development, cloud infrastructure, and machine learning operations. You will have two core mandates: building and maintaining the applications and infrastructure that power our large language models (LLMs), and designing autonomous, agentic AI workflows that eliminate operational toil and automate incident response.
Key Responsibilities
- AI Infrastructure Reliability: Design, scale, and maintain highly available infrastructure for LLM training, fine-tuning, and inference workloads.
- Agentic Operations: Architect and deploy multi-agent GenAI systems to automate alert triage, root cause analysis (RCA), and self-healing system remediation.
- GPU & Cluster Management: Optimize GPU orchestration, cluster health, and compute utilization across large-scale Kubernetes clusters.
- Performance Monitoring: Define and monitor non-traditional SLOs/SLIs, including Time-to-First-Token (TTFT), Inter-Token Latency, and cost-per-query limits.
- Data & Vector Pipeline Ops: Ensure the reliability, latency, and synchronization of vector databases and Retrieval-Augmented Generation (RAG) pipelines.
- Incident Management & ChatOps: Integrate LLMs and agentic frameworks into ChatOps tooling (e.g., Slack, Teams) to provide real-time, natural-language incident assistance.
- Security & Guardrails: Implement infrastructure-level guardrails that protect LLM endpoints against prompt injection and data leakage, and that limit the impact of hallucinated output and compliance violations.
Required Technical Skills
- Infrastructure & DevOps: Deep expertise in Kubernetes (EKS/GKE), Infrastructure as Code (Terraform), and CI/CD deployment pipelines.
- Software Engineering: Strong proficiency in Python or Go, with experience building tool integrations via APIs and the Model Context Protocol (MCP).
- GenAI Engineering: Hands-on experience with LLM orchestration frameworks (e.g., AutoGen, LangChain, LlamaIndex).
- Data & Vector Systems: Experience managing distributed vector databases (e.g., Pinecone, Milvus, Qdrant, or pgvector).
- Observability: Advanced knowledge of cloud monitoring stacks (Datadog, Prometheus, OpenTelemetry) applied to both standard infrastructure and AI workloads (e.g., Triton Inference Server monitoring).
Preferred Qualifications
- Background in implementing semantic caching layers to optimize cloud and API token costs.
- Proven track record of turning traditional engineering runbooks into executable code for automated agents.
Tel: +1 630 536 8202 Ext. 5576
Dir: +1 630 937 0276