We are seeking a highly skilled, hands-on Data Center Engineer to support a GPU-accelerated data center environment with direct-to-chip liquid cooling infrastructure. This role is critical to maintaining uptime, responding to infrastructure incidents, and ensuring operational excellence across power, cooling, server, and network environments.
The successful candidate will serve as the onsite technical expert during incidents and maintenance activities, providing real-time troubleshooting, physical investigations, vendor coordination, and accurate communication with internal teams, customers, and service providers.
This position is ideal for professionals who thrive in mission-critical environments and can remain calm, methodical, and decisive during high-pressure situations.
Tasks
Power Incident Response
- Respond immediately to power-related incidents affecting data center operations.
- Investigate facility-level and rack-level power issues, including UPS systems, PDUs, breakers, and server power supplies.
- Execute Emergency Operating Procedures (EOPs) and Maintenance Operating Procedures (MOPs).
- Safely isolate faulty equipment and perform approved recovery actions.
- Coordinate with facilities teams and remote engineering teams during incident resolution.
- Monitor infrastructure recovery and verify restoration of servers, switches, CDUs, and supporting systems.
Physical Network Troubleshooting
Incident Management & Bridge Call Communication
- Participate in incident bridge calls with facilities teams, network engineers, vendors, management, and customers.
- Provide accurate, factual, and verified onsite observations.
- Maintain detailed timelines of actions taken and observations made during incidents.
- Update incident tickets and documentation in real time.
- Ensure proper operational records are maintained for audit and customer reporting purposes.
Vendor Coordination & Warranty Support
- Escort and supervise vendors, contractors, and third-party engineers onsite.
- Coordinate warranty repairs and hardware replacement activities.
- Validate the quality and completion of vendor-performed work.
- Ensure deployments and repairs meet operational and quality standards.
Rack & Stack / Deployment Activities
- Install and deploy servers, switches, and supporting infrastructure.
- Route and organize fiber, DAC, and copper cabling according to standards.
- Maintain proper cable management, labeling, and rack organization.
- Verify deployment quality and installation accuracy.
Inventory & Asset Management
Preventive Maintenance & Daily Operations
Requirements
Required Qualifications
- 3–5 years of experience in Data Center Operations, Critical Facilities, Infrastructure Support, or a similar environment.
- Strong understanding of data center power systems, including:
  - UPS systems
  - PDUs
  - Circuit breakers
  - Redundant power architectures
- Hands-on experience with:
  - Fiber optics
  - DAC cables
  - Structured cabling
  - Transceivers and optics
  - Physical network troubleshooting
- Experience working with servers and network switches.
- Ability to follow runbooks, SOPs, MOPs, and escalation procedures.
- Strong troubleshooting and problem-solving skills.
- Excellent verbal and written communication skills.
- Ability to work independently during critical incidents.
Preferred Qualifications
- Experience supporting GPU clusters, AI infrastructure, HPC environments, or liquid-cooled data centers.
- Familiarity with CDUs, facility chillers, and advanced cooling systems.
- Understanding of mission-critical facility operations.
- Experience working with enterprise hardware vendors and warranty processes.
Physical Requirements
- Ability to lift and move equipment up to 50 lbs (23 kg).
- Ability to work comfortably in hot-aisle environments.
- Ability to wear required PPE.
- Ability to kneel behind racks, work in confined spaces, and access under-floor infrastructure when required.