We’re looking for a Machine Learning Engineer (MLE) to scale the training and deployment of large transformer-based models. You’ll work across training infrastructure, inference optimization, and reinforcement learning (RL) pipelines in multi-GPU and multi-node environments.

Responsibilities:

  • Performance engineering of training, inference, and RL infrastructure for large language models
  • Implementing parallelization strategies (data, tensor, pipeline, context) and optimizing bottlenecks
  • Building fault-tolerant training systems with checkpointing and recovery
  • Designing RL pipelines from reward modeling to policy optimization, including trainer-inference communication
  • Verifying numerical correctness and validating methods from recent publications

Qualifications:

  • Experience in training and deploying large neural networks in production
  • Expert-level PyTorch
  • Multi-node, multi-GPU training and debugging experience
  • Deep understanding of GPU memory management and distributed systems profiling
  • Hands-on RL experience, including policy-gradient methods (GRPO, PPO, etc.)

Preferred:

  • Multi-modal model training (e.g. separate vision encoders) and MoE / expert parallelism
  • NVIDIA GPU programming (Triton, CUTLASS, custom CUDA kernels) and deep NCCL knowledge
  • FP8 or FP4 training experience
  • Familiarity with TorchTitan, SGLang, vLLM, Megatron, etc.
  • Track record of open-source contributions

ARTIFICIAL INTELLIGENCE MADE HUMAN

NODES

THE AI ACCELERATOR COMPANY