Architect and implement efficient ML inference pipelines for large language models.

Responsibilities:

  • Design and implement high-performance inference pipelines
  • Optimize model serving for throughput, latency, and cost across different workloads
  • Collaborate with research and product teams to integrate inference into real-world applications
  • Enhance and maintain the deployment pipeline, and monitor production clusters
  • Debug production inference issues
  • Stay up to date with the latest inference techniques and open-source frameworks

Qualifications:

  • Deep experience developing and tuning LLM inference frameworks (e.g., vLLM)
  • Solid communication skills; ability to work independently and within a team
  • Experience with cloud infrastructure (AWS, GCP, Azure) and Kubernetes
  • Passion for AI and practical ML systems
  • Experience building, deploying, and operating highly available, scalable, distributed cloud services