[Remote] Sys/Cloud Admin/Incident Response Engineer
Note: This is a remote position open to candidates in the USA. i4DM provides Federal agencies with access to highly skilled professionals for complex mission challenges. The company is seeking an experienced Sys/Cloud Admin/Incident Response Engineer to support enterprise monitoring operations, incident detection, and response activities for a mission-critical platform within the Department of Veterans Affairs environment.
Responsibilities
- Administer, monitor, and support cloud and platform services, virtual infrastructure, and hosted applications to maintain system health, availability, and performance
- Configure, tune, and maintain monitoring, logging, and alerting solutions to improve visibility across infrastructure, applications, and service dependencies
- Validate alert accuracy, reduce noise, and help ensure operational issues are detected proactively through effective observability practices
- Perform routine system administration tasks such as environment checks, service restarts, access support, patch coordination, and operational maintenance activities
- Monitor incident queues and system alerts, perform initial triage, document impact, and execute defined escalation procedures for incidents affecting mission-critical services
- Participate in major incident response activities, including troubleshooting, log review, coordination with engineering teams, and support for service restoration efforts
- Follow incident response playbooks, severity models, and communication protocols to support timely resolution and accurate status reporting
- Document incident timelines, actions taken, recovery steps, and supporting evidence to enable post-incident review and continuous improvement
- Support coordination during operational events by working across infrastructure, application, DevSecOps, SRE, and service management teams
- Provide clear, timely updates on incident status, service impact, troubleshooting progress, and recovery actions to internal stakeholders
- Escalate issues appropriately based on impact, urgency, and established operational procedures
- Maintain accurate operational records in ticketing, incident, and knowledge management systems
- Partner with engineers and platform teams to improve dashboards, alerts, runbooks, and operational procedures supporting reliable service delivery
- Identify recurring operational issues, alert gaps, and system weaknesses, and recommend practical improvements to reduce incident frequency and response time
- Support automation efforts for routine operational tasks, alert correlation, remediation workflows, and incident response activities where applicable
- Contribute to post-incident reviews, root cause analysis activities, and implementation of corrective or preventive actions
- Help maintain operational reporting on incidents, system health, availability, and response metrics to support service-level objectives and operational reviews
- Ensure incident records, escalation paths, standard operating procedures, and response documentation remain current and usable
- Support compliance with operational policies, security requirements, and change management practices in cloud and enterprise environments
- Participate in on-call or after-hours operational support, as required, in a 24x7 mission-driven environment
Skills
- Bachelor's degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field; equivalent relevant experience may be considered
- 3+ years of experience in systems administration, cloud operations, site reliability, network operations, incident response, or enterprise production support roles
- Hands-on experience supporting Windows and/or Linux server environments, cloud-hosted infrastructure, and enterprise application platforms
- Experience with monitoring, logging, and observability tools used to detect, investigate, and troubleshoot service disruptions
- Working knowledge of incident management processes, ticketing workflows, escalation practices, and service restoration procedures in ITIL-aligned environments
- Ability to analyze logs, alerts, and system behavior to support troubleshooting and rapid issue resolution
- Strong written and verbal communication skills, with the ability to document incidents and coordinate effectively across technical and non-technical stakeholders
- Ability to work in a 24x7, SLA-driven environment and participate in operational response activities under time-sensitive conditions
- Candidates must be eligible to obtain and maintain a Public Trust clearance
- Experience supporting VA or other Federal Government environments, including familiarity with operational reporting, service management, and compliance expectations
- Experience with cloud and platform technologies such as AWS, Azure, Kubernetes, container platforms, virtualization, or hybrid infrastructure
- Familiarity with enterprise monitoring and observability platforms such as Splunk, Dynatrace, CloudWatch, Azure Monitor, Grafana, or similar tools
- Experience using scripting or automation tools such as PowerShell, Python, Bash, or infrastructure automation frameworks to streamline operational tasks
- Exposure to DevSecOps, Site Reliability Engineering (SRE), SAFe Agile, or modern incident response and post-incident review practices
- Relevant certifications such as AWS Certified SysOps Administrator, Azure Administrator Associate, CompTIA Security+, ITIL Foundation, Splunk, or similar credentials
Company Overview