Architect and implement efficient ML inference pipelines for large language models.
Responsibilities:
- Design and implement high-performance inference pipelines
- Optimize model serving for throughput, latency, and cost across different workloads
- Collaborate with research and product teams to integrate inference into real-world applications
- Enhance and maintain the deployment pipeline, and monitor production clusters
- Debug production inference issues
- Stay current with the latest inference techniques and open-source serving frameworks
Qualifications:
- Deep experience developing and tuning LLM inference frameworks (e.g., vLLM)
- Strong communication skills; able to work both independently and as part of a team
- Experience with cloud infrastructure (AWS, GCP, Azure) and Kubernetes
- Passion for AI and practical ML systems
- Experience building, deploying, and operating highly available, scalable, distributed cloud services