Director, Site Reliability Engineering (Ref: 188516)
- eTail
- Hybrid, Atlanta, United States
- permanent
- $200,000.00 per annum
About the Role
About Us
Our client is a leader within the Food & Beverage industry, recognized for its dedication to quality and innovation in food service. The organization is committed to continuously improving its offerings and operational processes to deliver exceptional value to its customers while maintaining a culture focused on excellence and responsibility.
Job Description
Overview of the Position
In collaboration with our client, we are seeking a highly experienced Director of Site Reliability Engineering (SRE). This critical leadership position will oversee the development and implementation of reliability, observability, and automation strategies within cloud-native, multi-tenant systems. The successful candidate will be pivotal in ensuring high performance, availability, and efficiency across both production and customer-facing environments, all while fostering a culture of continuous improvement and operational excellence.
Key Responsibilities
- Transform the SRE team into a proactive engineering force that champions innovation, automation, and valuable business outcomes.
- Develop and implement an advanced SRE roadmap that emphasizes self-healing capabilities, dynamic scaling, and platform resilience.
- Evolve existing SLAs, SLOs, and SLIs into predictive reliability frameworks aligned with business objectives, and formalize executive-level reporting on SLOs.
- Lead the transformation of observability into a predictive, AI/ML-oriented function that prioritizes anomaly detection, early alerts, and service health forecasting.
- Improve incident response mechanisms by incorporating intelligent automation, optimizing runbooks, and refining on-call protocols for rapid issue resolution.
- Implement chaos engineering and resilience testing practices across vital systems, establishing formalized capacity stress testing and failover validation processes.
- Enhance CI/CD pipelines to facilitate secure, high-frequency deployments with automatic rollback capabilities and agile environment provisioning.
- Institutionalize Infrastructure as Code (IaC) methodologies to enable reliable and traceable infrastructure management at scale.
- Refine FinOps strategies to provide actionable analytics on cost versus performance trade-offs and the ROI of services.
- Promote collaboration between SRE, Security, and Compliance teams for improved detection, triage, and resolution of security incidents.
- Maintain a balance between system reliability and deployment speed by analyzing stability metrics and error rates.
- Conduct blameless postmortems for significant incidents to foster a culture of growth and learning.
- Steer go-live activities for critical brand launches and expansions of the NextGen platform.
- Collaborate with architecture and product teams to integrate observability, scalability, and cost considerations into solutions.
- Revamp disaster recovery protocols to comply with stringent Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), ensuring fully automated failover mechanisms.
- Address existing technical debt while preventing new debt from accumulating.
- Manage vendor performance, including contract renewals and compliance for third-party tooling and infrastructure partnerships.
- Ensure contractor audits, identity governance reviews, and system access evaluations are completed thoroughly and on time each quarter.
- Foster a workplace culture that encourages continuous learning, experimentation, and innovation through coaching, advanced training opportunities, and stretch assignments.
- Utilize agile retrospectives, SLIs, and service assessments to establish an ongoing improvement framework.
- Enhance the team's visibility and influence across the organization by aligning technical initiatives with business impacts.
Requirements
- A Bachelor's degree in Information Systems or a related discipline is required.
- A minimum of 10 years of experience in software development or information technology is essential.
- At least 5 years of experience working with cloud-native platforms, preferably Azure.
- A minimum of 5 years in DevOps and/or Site Reliability Engineering is required.
- A minimum of 4 years of experience leading teams of engineers is critical.
- In-depth understanding of Infrastructure as Code (IaC) is necessary.
- Familiarity with CI/CD automation within a pipeline-based Software Development Life Cycle (SDLC) is essential.
- Experience working in a Scrum team is beneficial.
Benefits
- Competitive salary and performance incentives.
- Generous paid time off policy, including holidays and personal days.
- Comprehensive health benefits covering medical, dental, and vision.
- Retirement savings plan with employer match.
- Ongoing professional development and training opportunities.
- Flexible working arrangements to promote work-life balance.
Other
The selected candidate will have the opportunity to lead transformative initiatives within a vibrant organization known for its commitment to innovation within the Food & Beverage sector. This role offers a chance to significantly impact operational reliability and performance while working in a supportive and forward-thinking environment.
