[Remote] Director of Cloud Operations
Note: This is a remote position open to candidates in the USA. Firstup is a company dedicated to improving employee experience through innovative communication solutions. It is seeking a Director of Cloud Operations to lead its cloud infrastructure and operational practices, ensuring reliability and efficiency across its SaaS platform while fostering a high-performing team.
Responsibilities
- Own the availability, performance, and resilience of our multi-region AWS platform
- Drive improvements in system reliability through well-defined SLIs/SLOs, error budgets, and proactive engineering practices
- Lead efforts to reduce MTTR and improve incident response effectiveness across the organization
- Guide architecture decisions for microservices, Kubernetes (EKS), and serverless workloads to ensure scalability and fault tolerance
- Advance our observability strategy using Datadog, ensuring actionable insights across infrastructure and applications
- Establish and refine incident management practices, including on-call processes, escalation paths, and post-incident reviews
- Act as an incident commander for critical events and contribute to the on-call rotation
- Elevate operational standards through automation, standardization, and adoption of modern best practices
- Drive cost optimization initiatives across AWS environments without compromising performance or reliability
- Leverage AI and automation to improve operational efficiency, accelerate root cause analysis, and enhance system insights
- Continuously improve CI/CD pipelines (CircleCI) and infrastructure-as-code practices (Terraform)
- Lead, mentor, and support a distributed team of CloudOps engineers across the US and UK
- Foster a culture of accountability, learning, and continuous improvement
- Provide technical guidance while enabling the team to grow in ownership and capability
- Ensure stability and support for existing customers while maintaining clear operational boundaries with the cloud platform
Skills
- 10+ years in cloud infrastructure, SRE, or DevOps roles
- 3+ years of experience leading CloudOps/SRE teams
- Proven track record of leading operational or platform transformations in a SaaS environment
- Experience operating multi-region, customer-facing systems at scale
- Strong hands-on experience with AWS (multi-region architectures)
- Strong hands-on experience with Kubernetes (EKS) and containerized environments
- Solid understanding of microservices and distributed systems design
- Familiarity with serverless architectures and modern cloud-native patterns
- Deep experience with incident management, on-call operations, and reliability engineering practices
- Strong understanding of SLO/SLI frameworks, monitoring strategies, and performance optimization
- Demonstrated ability to balance hands-on technical work with team leadership
- Collaborative, pragmatic leader who can influence across teams and functions
- Passion for building and supporting high-performing teams
- Focus on continuous improvement, with a bias toward measurable outcomes
- Infrastructure as Code (Terraform preferred)
- CI/CD pipelines (CircleCI or similar)
- Observability platforms (Datadog or equivalent)
Benefits
- Excellent PTO program
- Great health benefits
- A casual and friendly environment
- Remote work
- A leadership team that truly believes in your growth, both personal and professional