[Remote] Lead Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. BillingPlatform is an industry-leading, fast-growing SaaS company that offers a cloud-based revenue lifecycle management platform. The company is seeking a Lead Site Reliability Engineer to own and improve on-call processes, manage SLOs, and strengthen system reliability across infrastructure, observability, and engineering practices.
Responsibilities
- Own and improve on-call processes, incident response playbooks, and post-mortem culture
- Define, track, and manage SLOs, SLIs, and error budgets for critical services
- Lead blameless post-mortems and drive systematic reliability improvements
- Respond to production incidents and coordinate cross-functional resolution
- Design, build, and maintain scalable AWS infrastructure using IaC (Terraform, Pulumi)
- Manage Kubernetes clusters and containerized workloads in production
- Build and maintain CI/CD pipelines to improve deployment speed and reliability
- Evaluate and implement tooling to enhance developer productivity and system stability
- Implement monitoring, alerting, and distributed tracing (Prometheus, Grafana, Datadog, Jaeger)
- Identify and resolve performance bottlenecks across services, networks, and databases
- Build dashboards and runbooks for self-service operational insights
- Partner with engineering teams to embed reliability practices (load testing, capacity planning, chaos engineering)
- Conduct architecture reviews with a focus on reliability and operability
Skills
- 5+ years of experience in SRE, DevOps, or infrastructure engineering
- Deep expertise with AWS and cloud-native architectures
- Strong experience with Kubernetes and container orchestration at scale
- Hands-on experience with infrastructure-as-code tools (Terraform or Pulumi)
- Proficiency in Python, Go, or Bash
- Experience with observability tools (Prometheus, Grafana, Datadog, or similar)
- Strong understanding of SLOs, SLIs, and error budgets
- Experience with service mesh technologies (Istio, Linkerd)
- Familiarity with chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos)
- Background in Oracle database reliability and administration
- Contributions to open-source infrastructure projects
- Experience in a high-growth SaaS or product-led environment
- Excellent English communication skills (written and spoken)
Benefits
- Competitive compensation with a robust benefits package, including medical, dental, vision, LTD, HSA, FSA, free virtual mental health counseling, and health and wellness perks
- Medical insurance coverage effective on the first day of employment
- 401(k) match that is 100% immediately vested
- Discretionary and charitable time-off program
- Home office setup allowance for fully remote employees