[Remote] Senior Cloud DevOps & Infrastructure Engineer
Note: This is a remote position open to candidates in the USA. Diverse Lynx is seeking a Senior Cloud DevOps & Infrastructure Engineer with a focus on GCP and AI. The role involves designing, deploying, and maintaining secure and scalable cloud infrastructure on a multi-cloud platform with GCP as the primary cloud, while implementing GitOps best practices and supporting AI/ML workloads.
Responsibilities
- Infrastructure as Code (IaC): Architect and provision production-grade infrastructure using Terraform. Manage state files and modules, and ensure infrastructure immutability
- AI/ML: Work with LLMs in a multi-cloud environment
- Kubernetes & Containerization: Design and manage clusters. Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images). Manage complex deployments using Helm Charts
- CI/CD & GitOps: Build end-to-end CI/CD pipelines using GitLab CI. Implement GitOps workflows to synchronize infrastructure and application state
- Design, configure, and manage scalable and secure cloud infrastructure for MLOps
- AI Infrastructure Support: Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage)
- Production Support & Troubleshooting: Act as the primary escalation point for deployment failures and for network and infrastructure issues. Perform Root Cause Analysis (RCA)
- Security & Compliance: Implement 'Secure by Design' principles
- Apply good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS)
- Multi-Cloud Strategy: While GCP is primary, maintain and support secondary environments in AWS (and potentially Azure) to ensure business continuity
Skills
- 6 to 8 years of experience in Cloud Infrastructure & DevOps Engineering
- Expert in Kubernetes, Terraform, and GitLab CI/CD
- Experience supporting AI/ML workloads
- Architect and provision production-grade infrastructure using Terraform
- Experience with LLMs in a multi-cloud environment
- Design and manage Kubernetes clusters
- Create and optimize Dockerfiles (multi-stage builds, distroless/hardened images)
- Manage complex deployments using Helm Charts
- Build end-to-end CI/CD pipelines using GitLab CI
- Implement GitOps workflows to synchronize infrastructure and application state
- Design, configure, and manage scalable and secure cloud infrastructure for MLOps
- Configure and maintain environments suitable for AI/ML workloads (GPU node pools, LLM integration, large model serving, high-performance storage)
- Act as the primary escalation point for deployment failures and for network and infrastructure issues
- Perform Root Cause Analysis (RCA)
- Implement 'Secure by Design' principles
- Good knowledge of network security, identity and privileged access management, and landing zone concepts for cloud platforms (Azure, AWS)
- Maintain and support secondary environments in AWS (and potentially Azure)
- Deep expertise in GCP (Compute Engine, GKE, Cloud Storage, IAM)
- Strong working knowledge of AWS (EC2, EKS, S3, IAM)
- Knowledge of programming languages (Python required; Java, C#, or JavaScript is a plus)
- Advanced proficiency in Kubernetes
- Ability to write and manage custom Helm charts
- Experience with Ingress Controllers (Nginx), Service Mesh, and Autoscaling (HPA/VPA/Cluster Autoscaler)
- Expert-level knowledge of GitLab CI/CD (writing .gitlab-ci.yml, runners, artifacts, caching)
- Understanding of GitOps principles
- Strong hands-on experience with Terraform for provisioning cloud resources across multiple environments (Dev/Stage/Prod)
- Proficiency in Bash/Shell scripting and Python
- Strong Linux administration skills
- Experience setting up monitoring using cloud-native tools, Prometheus, and Grafana
- Experience with Azure Cloud infrastructure
- Knowledge of Identity Providers (Keycloak, Azure AD/Entra ID) and OIDC integration
- Experience with Service Mesh
- Understanding of ITIL processes (Incident/Change Management) and tools like ServiceNow, JIRA
- Basic understanding of Python/Flask/FastAPI applications to assist developers in troubleshooting