We’re looking for an MLE (Pretraining Data) to lead the construction and scaling of large-scale training corpora for frontier open-source transformer models. You’ll focus on dataset design, filtering, synthetic data generation, mixture experiments, and empirical evaluation to improve model quality at scale.
Responsibilities:
- Collecting, filtering, and synthesizing pretraining-scale datasets
- Designing dataset mixtures and running controlled ablations
- Performing dataset comparisons and empirical evaluations across training runs
- Developing end-to-end pipelines for collecting, processing, and evaluating datasets
- Scaling and maintaining large training corpora across diverse sources
- Collaborating with training and infrastructure teams to align data strategy with model scaling
Qualifications:
- Experience building or scaling large pretraining datasets
- Experience running dataset ablations and mixture experiments
- Strong Python engineering skills
- Experience with distributed data processing systems
- Deep understanding of how dataset composition affects model behavior
Preferred:
- Experience with distributed data processing frameworks such as Datatrove, Dask, or Spark for large-scale dataset construction and transformation (see the sketch after this list)
- Familiarity with synthetic data orchestration systems (e.g., NeMo DataDesigner) and large-scale generation, filtering, and evaluation workflows
- Experience working with or building large-scale curated datasets similar to the FineWeb family, such as FineWeb-Edu and FinePDFs
- Familiarity with open model training initiatives such as SmolLM, BLOOM (BigScience), and Nemotron, including exposure to pretraining mixtures, scaling, and evaluation
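To give a flavor of the pipeline work this role involves, here is a minimal sketch of a quality-filtering pass over a sharded JSONL corpus using Dask. The paths and filter heuristics are illustrative assumptions, not a description of our actual pipeline:

```python
import json
import dask.bag as db

def keep(doc: dict) -> bool:
    """Toy quality gate: the length and language-score thresholds
    here are illustrative assumptions, not production heuristics."""
    text = doc.get("text", "")
    return len(text) > 200 and doc.get("lang_score", 0.0) >= 0.9

# Read sharded JSONL, parse, filter, and write filtered shards back out.
# "corpus/raw/*.jsonl" and "corpus/filtered/*.jsonl" are hypothetical paths.
filtered = (
    db.read_text("corpus/raw/*.jsonl")
      .map(json.loads)
      .filter(keep)
      .map(json.dumps)
)
filtered.to_textfiles("corpus/filtered/*.jsonl")
```

In practice the same map/filter structure scales from a laptop to a cluster by swapping the Dask scheduler, which is one reason frameworks like those named above are useful for corpus construction.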