
AI Operations & Infrastructure Engineer

Invictus International
United States, Maryland, Fort Meade
4409 Llewellyn Avenue
Jun 23, 2026

Title: AI Operations & Infrastructure Engineer

Location: Fort Meade, MD

Clearance: TS/SCI with a CI Polygraph

Job Details:



  • Manage and maintain AI computing platforms, including GPUs and other specialized hardware
  • Install and configure GPU drivers and software
  • Oversee the AI software stack and tools
  • Implement and manage containerization technologies like Docker and Kubernetes
  • Configure and optimize networking infrastructure for AI workloads, including InfiniBand and Ethernet
  • Manage storage solutions for AI data, considering performance and capacity requirements
  • Deploy and manage data processing units (DPUs) to accelerate data center workloads
  • Monitor and manage AI cluster health and resource utilization
  • Implement workload management and scheduling tools like Slurm and Kubernetes
  • Ensure efficient power and cooling for AI infrastructure to maintain optimal operating conditions
  • Configure high-performance networking solutions for AI and machine learning workloads
  • Optimize network performance to ensure maximum throughput and minimal latency for AI computations
  • Implement and fine-tune network protocols to enhance data transfer speeds and efficiency
  • Integrate NVIDIA networking products with existing AI infrastructure, including servers, GPUs, and storage systems
  • Deploy networking solutions in data centers to ensure seamless connectivity between AI components
  • Diagnose and resolve networking issues impacting AI workloads to maintain optimal system performance
  • Provide technical support and guidance to teams managing AI infrastructure
  • Collaborate with data scientists, researchers, and IT professionals to understand networking requirements and challenges
  • Lead deployment and validation of servers and systems for AI-enabled platforms
  • Configure and manage network topologies, BMC, OOB, TPM, power, and cooling
  • Install, upgrade, and validate GPU-based servers, BlueField DPUs, cables, and transceivers
  • Perform firmware upgrades, hardware validation, and storage setup
  • Configure and administer physical and logical resources, including MIG partitioning and BlueField platforms
  • Install and configure operating systems, cluster software, drivers, containers (Docker), and NGC CLI
  • Manage and orchestrate clusters using NVIDIA Base Command Manager, Slurm, Pyxis, Enroot, and Run:ai
  • Perform stress, benchmarking, and burn-in tests using HPL, NCCL, NVIDIA NeMo, and ClusterKit
  • Verify cabling, firmware/software versions, and network signal quality
  • Troubleshoot and resolve hardware, software, storage, and performance faults
  • Replace faulty components and optimize systems for AMD/Intel platforms
  • Monitor, document, and report on cluster health, resource usage, and job performance
  • Ensure secure, efficient, and scalable operation of NVIDIA AI infrastructure, including user access and workload management



Requirements:



  • Qualified candidates must hold an active NVIDIA Professional Certification in AI Networking, AI Infrastructure, or AI Operations
  • Prior direct, hands-on professional experience administering NVIDIA GPU and data processing unit (DPU) technologies, AI software stacks, and data center environments for high-performance AI workloads
  • Comprehensive expertise in deploying and maintaining AI compute platforms, requiring proficiency in containerization and workload orchestration using Docker, Kubernetes, Slurm, NVIDIA Base Command Manager, and Run:ai
  • Must be capable of configuring physical and logical resources, including Multi-Instance GPU (MIG) partitioning and BlueField platforms, while overseeing critical facility elements such as power, cooling, and storage solutions
  • Demonstrated advanced skills in AI networking, specifically configuring and optimizing high-performance InfiniBand and Ethernet fabrics to ensure maximum throughput and minimal latency
  • Current active TS/SCI clearance with a CI Polygraph


Equal Opportunity Employer/Veterans/Disabled
