[Remote] Sr. Engineering Manager, MLOps
Note: This is a remote position open to candidates in the USA. Quince is a tech company disrupting the retail industry by leveraging AI, analytics, and automation. They are seeking a Senior Engineering Manager, MLOps, to build and scale the infrastructure that supports production-grade machine learning, ensuring seamless operations for their Data Scientists and AI Researchers.
Responsibilities
- Define the MLOps Vision & Strategy: Architect a long-term roadmap that transitions ML workflows from manual scripts to a fully automated, self-service platform for all Quince Data Scientists and AI Researchers
- Own the "Paved Road" for Production: Build and maintain the end-to-end infrastructure for model training, deployment, and serving, ensuring researchers can move from "idea to production" with zero friction
- Drive Strategic Prioritization: Partner with business leaders to align infrastructure investments with core e-commerce drivers like real-time personalization, dynamic pricing, and inventory forecasting
- Lead "Build vs. Buy" Evaluations: Make high-judgment decisions on when to leverage cloud-native services (e.g., SageMaker, Vertex AI) versus building custom internal tools to optimize for cost, speed, and flexibility
- Guarantee System Scalability & Reliability: Oversee the uptime and performance of production ML services, ensuring the stack can handle massive traffic surges and seasonal spikes without degradation
- Manage Compute Governance & Costs: Direct the optimization of high-cost computational resources, such as GPU clusters and cloud instances, balancing high-performance training needs with fiscal responsibility
- Recruit and Mentor Top Talent: Build and lead a high-performing team of ML Infra and DevOps engineers, providing technical coaching, career pathing, and performance management
- Establish MLOps Standards: Drive the adoption of best practices in CI/CD for ML, Infrastructure as Code (IaC), and automated testing to ensure a modular and maintainable system
- Bridge the Research-Engineering Gap: Act as the primary cross-functional lead, translating the complex needs of AI Researchers into actionable engineering requirements for the infrastructure team
- Define and Track Velocity Metrics: Establish KPIs for the infrastructure team, such as model deployment frequency, mean time to recovery (MTTR), and infrastructure cost per inference
- Champion Operational Excellence: Lead root-cause analyses (RCAs) for production failures and foster a culture of accountability where systemic fixes are prioritized over "quick patches"
- Stay Ahead of the AI Curve: Monitor emerging trends in LLM-ops, vector databases, and real-time feature engineering to ensure Quince’s infrastructure remains competitive and future-proof
Skills
- 10+ years of industry experience, including at least 3 years in a leadership or management role specifically focused on ML Infrastructure, MLOps, or large-scale Data Platform engineering
- Proven track record of building and scaling MLOps platforms that support the full model lifecycle—from data ingestion and distributed training to real-time inference and monitoring
- Deep technical expertise in cloud-native infrastructure (preferably AWS) and orchestration tools like Kubernetes (EKS), Docker, and Infrastructure as Code (Terraform/Pulumi)
- Hands-on experience with ML frameworks and tooling, such as PyTorch, TensorFlow, Kubeflow, or SageMaker, and a strong opinion on how to integrate them into a cohesive developer experience
- Expertise in building and managing Feature Stores and high-throughput data pipelines (using tools like Spark, Flink, or Kafka) to ensure data consistency across training and serving
- Experience partnering with AI Research and Data Science teams to understand their unique workflows and translate research needs into robust, scalable engineering solutions
- Strong understanding of CI/CD for ML, including automated testing for models, model versioning, and 'blue-green' or 'canary' deployment strategies
- Demonstrated ability to manage high-cost compute resources, with experience optimizing GPU utilization and cloud spend in a hyper-growth environment
- Excellence in operational leadership, with a history of driving service availability, performance, and stability through rigorous on-call rotations and root-cause analysis
- A product-oriented mindset, with the ability to treat infrastructure as a platform and prioritize the roadmap based on researcher velocity and business ROI
- Exceptional communication and influence skills, capable of navigating ambiguity and building consensus across engineering, product, and data science leadership
- Kindness and high standards: You move fast and push for excellence, but you do so as a supportive team player who fosters a culture of psychological safety and extreme candor
Benefits
- Bonus and equity may also be provided for eligible roles