SRE Team Leader & Escalation Manager
Why Join Us?
We are looking for a technically strong, AI-savvy SRE Team Lead & Escalation Manager to own production reliability, incident management, and cross-functional prioritization. This role leads our AI-driven automation strategy, drives self-healing infrastructure development, and sets a new standard for modern reliability engineering.

Key Responsibilities
- Lead and mentor the SRE team; improve monitoring, alerting, and observability.
- Own production incidents and escalations end-to-end, from mitigation to RCA and corrective action.
- Lead the design and development of self-healing systems capable of detecting, diagnosing, and remediating incidents autonomously.
- Drive automation of repetitive operational workflows using AI/ML-based solutions to reduce toil and MTTR.
- Manage the cross-functional Squad handling customer and production issues; align priorities across Support, QA, R&D, and Sources.
- Track key operational metrics and lead long-term reliability improvements.

Qualifications
- 3-5 years of experience in SRE or incident management.
- Mandatory: hands-on experience applying AI/ML to operational challenges (AIOps, anomaly detection, LLM-based automation, or auto-remediation).
- Proven track record of automating workflows and reducing manual toil at scale.
- Strong cloud background (AWS/Azure/GCP) and experience with Kubernetes, Docker, and CI/CD.
- Proficiency with observability tools (Grafana, Prometheus, ELK) and scripting (Python, Bash).
- Demonstrated leadership in high-pressure, cross-functional environments.

Advantages
- Background in cybersecurity or SaaS platforms.
- Experience with LLMOps, AI agents, or orchestration platforms (e.g., n8n, Temporal).

Key Attributes
- Strong ownership, accountability, and composure under pressure.
- Passionate about leveraging AI to automate workflows, reduce toil, and accelerate incident resolution.
- Visionary about self-healing operations: able to both define the strategy and drive its implementation.
- Collaborative leader with the ability to align cross-functional stakeholders.
- Technically hands-on, systems-level thinker with the drive to engineer scalable, long-term solutions.