We’re looking for an MLE (Pretraining Data) to lead the construction and scaling of large-scale training corpora for frontier, open-source transformer models. You’ll focus on dataset design, filtering, synthetic data generation, mixture experiments, and empirical evaluation to improve model quality at scale.

Responsibilities:

  • Collecting, filtering, and synthesizing pretraining-scale datasets
  • Designing dataset mixtures and running controlled ablations (see the sketch following this list)
  • Performing dataset comparisons and empirical evaluations across training runs
  • Developing end-to-end pipelines for collecting, processing, and evaluating datasets
  • Scaling and maintaining large training corpora across diverse sources
  • Collaborating with training and infrastructure teams to align data strategy with model scaling
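
To give a flavor of this work, here is a minimal Python sketch of a document quality filter and a mixture sampler. The source names, weights, and thresholds are hypothetical illustrations, not a description of our actual pipeline:

    import random

    # Hypothetical mixture weights over data sources (illustrative only).
    MIXTURE_WEIGHTS = {"web": 0.6, "code": 0.25, "papers": 0.15}

    def passes_quality_filter(doc: str, min_chars: int = 200) -> bool:
        """Toy heuristic: drop very short or mostly non-alphabetic documents."""
        if len(doc) < min_chars:
            return False
        alpha_ratio = sum(ch.isalpha() for ch in doc) / len(doc)
        return alpha_ratio > 0.6

    def sample_source(rng: random.Random) -> str:
        """Draw a data source in proportion to its mixture weight."""
        sources, weights = zip(*MIXTURE_WEIGHTS.items())
        return rng.choices(sources, weights=list(weights), k=1)[0]

In practice, mixture experiments vary weights like these across controlled training runs and compare downstream evaluations.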


Qualifications:

  • Experience building or scaling large pretraining datasets
  • Experience running dataset ablations and mixture experiments
  • Strong Python engineering skills
  • Experience with distributed data processing systems
  • Deep understanding of how dataset composition affects model behavior


Preferred:

  • Experience with distributed data processing frameworks such as Datatrove, Dask, or Spark for large-scale dataset construction and transformation (a brief sketch follows this list)
  • Familiarity with synthetic data orchestration systems (e.g., NeMo DataDesigner) and large-scale generation, filtering, and evaluation workflows
  • Experience working with or building large-scale curated datasets similar to FineData, specifically FineWebEDU and FinePDFs
  • Familiarity with open model training initiatives such as SmolLM, BLOOM (BigScience), and Nemotron, including exposure to pretraining mixtures, scaling, and evaluation
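
For a sense of the scale-out processing these frameworks enable, here is a minimal Dask sketch; the input path and length threshold are hypothetical:

    import dask.bag as db

    # Hypothetical input location; real corpora would be sharded across storage.
    docs = db.read_text("data/crawl/*.txt", blocksize="64MiB")

    # Apply a cheap length filter in parallel across partitions.
    kept = docs.filter(lambda line: len(line) > 200)

    # Count surviving records as a quick sanity check of the filter's yield.
    print(kept.count().compute())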

ARTIFICIAL INTELLIGENCE MADE HUMAN

NODES

THE AI ACCELERATOR COMPANY
