[Remote] Sys/Cloud Admin/Incident Response Engineer

Remote, Full-time

Note: This is a remote position open to candidates in the USA. i4DM provides Federal agencies with access to highly skilled professionals for complex mission challenges. The company is seeking an experienced Sys/Cloud Admin/Incident Response Engineer to support enterprise monitoring operations, incident detection, and response activities for a mission-critical platform within the Department of Veterans Affairs environment.

Responsibilities

Administer, monitor, and support cloud and platform services, virtual infrastructure, and hosted applications to maintain system health, availability, and performance
Configure, tune, and maintain monitoring, logging, and alerting solutions to improve visibility across infrastructure, applications, and service dependencies
Validate alert accuracy, reduce noise, and help ensure operational issues are detected proactively through effective observability practices
Perform routine system administration tasks such as environment checks, service restarts, access support, patch coordination, and operational maintenance activities
Monitor incident queues and system alerts, perform initial triage, document impact, and execute defined escalation procedures for incidents affecting mission-critical services
Participate in major incident response activities, including troubleshooting, log review, coordination with engineering teams, and support for service restoration efforts
Follow incident response playbooks, severity models, and communication protocols to support timely resolution and accurate status reporting
Document incident timelines, actions taken, recovery steps, and supporting evidence to enable post-incident review and continuous improvement
Support coordination during operational events by working across infrastructure, application, DevSecOps, SRE, and service management teams
Provide clear, timely updates on incident status, service impact, troubleshooting progress, and recovery actions to internal stakeholders
Escalate issues appropriately based on impact, urgency, and established operational procedures
Maintain accurate operational records in ticketing, incident, and knowledge management systems
Partner with engineers and platform teams to improve dashboards, alerts, runbooks, and operational procedures supporting reliable service delivery
Identify recurring operational issues, alert gaps, and system weaknesses, and recommend practical improvements to reduce incident frequency and response time
Support automation efforts for routine operational tasks, alert correlation, remediation workflows, and incident response activities where applicable
Contribute to post-incident reviews, root cause analysis activities, and implementation of corrective or preventive actions
Help maintain operational reporting on incidents, system health, availability, and response metrics to support service-level objectives and operational reviews
Ensure incident records, escalation paths, standard operating procedures, and response documentation remain current and usable
Support compliance with operational policies, security requirements, and change management practices in cloud and enterprise environments
Participate in on-call or after-hours operational support, as required, in a 24x7 mission-driven environment

Skills

Bachelor's degree in Information Technology, Computer Science, Engineering, Cybersecurity, or a related field; equivalent relevant experience may be considered
3+ years of experience in systems administration, cloud operations, site reliability, network operations, incident response, or enterprise production support roles
Hands-on experience supporting Windows and/or Linux server environments, cloud-hosted infrastructure, and enterprise application platforms
Experience with monitoring, logging, and observability tools used to detect, investigate, and troubleshoot service disruptions
Working knowledge of incident management processes, ticketing workflows, escalation practices, and service restoration procedures in ITIL-aligned environments
Ability to analyze logs, alerts, and system behavior to support troubleshooting and rapid issue resolution
Strong written and verbal communication skills, with the ability to document incidents and coordinate effectively across technical and non-technical stakeholders
Ability to work in a 24x7, SLA-driven environment and participate in operational response activities under time-sensitive conditions
Candidates must be eligible to obtain and maintain a Public Trust clearance
Experience supporting VA or other Federal Government environments, including familiarity with operational reporting, service management, and compliance expectations
Experience with cloud and platform technologies such as AWS, Azure, Kubernetes, container platforms, virtualization, or hybrid infrastructure
Familiarity with enterprise monitoring and observability platforms such as Splunk, Dynatrace, CloudWatch, Azure Monitor, Grafana, or similar tools
Experience using scripting or automation tools such as PowerShell, Python, Bash, or infrastructure automation frameworks to streamline operational tasks
Exposure to DevSecOps, Site Reliability Engineering (SRE), SAFe Agile, or modern incident response and post-incident review practices
Relevant certifications such as AWS Certified SysOps Administrator, Azure Administrator Associate, CompTIA Security+, ITIL Foundation, Splunk, or similar credentials

Company Overview

i4DM provides a full range of information technology consulting services to government and commercial clients. Founded in 2002, it is headquartered in Millersville, Maryland, USA, and has a workforce of 51-200 employees. Its website is https://www.i4dm.com.
