We’re looking for a Machine Learning Engineer (MLE) to scale the training and deployment of large transformer-based models. You’ll work across training infrastructure, inference optimization, and reinforcement learning (RL) pipelines in multi-GPU and multi-node environments.
Responsibilities:
- Performance engineering of training, inference, and RL infrastructure for large language models
- Implementing parallelization strategies (data, tensor, pipeline, context) and eliminating performance bottlenecks
- Building fault-tolerant training systems with checkpointing and recovery
- Designing RL pipelines from reward modeling to policy optimization, including trainer-inference communication
- Verifying numerical correctness and validating methods from recent publications
Qualifications:
- Experience in training and deploying large neural networks in production
- Expert-level PyTorch
- Multi-node, multi-GPU training and debugging experience
- Deep understanding of GPU memory management and of profiling distributed systems
- Hands-on RL experience, including policy-optimization methods (GRPO, PPO, etc.)
Preferred:
- Multi-modal model training (e.g. separate vision encoders) and MoE / expert parallelism
- NVIDIA GPU programming (Triton, CUTLASS, custom CUDA kernels) and deep NCCL knowledge
- FP8 or FP4 training experience
- Familiarity with TorchTitan, SGLang, vLLM, Megatron, etc.
- Track record of open-source contributions