Research Engineer, Text Data Research – MSL FAIR

  • Meta
  • Menlo Park, CA
  • Full-Time
  • 2 weeks ago
  • $154,003/year to $217,000/year
Published
May 4, 2026

Research Engineer, Text Data Research – MSL FAIR: our view in 3 lines...

  • The Role: An engineer role for building and curating large-scale training data for advanced large language models within an AI research lab.
  • The Person: Someone who can design and build scalable data curation systems and pipelines, execute high-priority pre-, mid-, and post-training data projects, and lead complex technical initiatives for LLM training data.
  • Requirements: Experience with LLM/NLP research and with pre-training or mid-training data curation is required; hands-on experience with PyTorch, SQL, Spark, and Hive is preferred.

Job Description

Meta is seeking AI research engineers to help us build the data foundation for Meta's most advanced Large Language Models. We're looking for engineers with LLM expertise to join us in working with data at scale and pushing beyond the data ceiling.

Our team contributes to data curation across all stages of LLM development (pre-training, mid-training, post-training) and all domains/modalities (e.g., web, code, agent, multilingual). We tackle the hardest challenges at trillion-scale, including organic data curation, synthetic data generation, agent and interaction data, and frontier paradigms that redefine what's possible.

Based in Meta Superintelligence Labs (MSL) within the Fundamental AI Research Organization (FAIR), you'll directly contribute to Meta’s frontier models like Llama, while having the chance to collaborate with researchers and engineers across MSL.

Responsibilities

  • Collaborate with cross-functional teams to develop Meta’s next foundational models
  • Architect efficient and scalable data curation systems and pipelines
  • Fundamentally improve our data velocity across workflows and projects by contributing to the advancement of data tooling
  • Execute on high-priority projects in pre-training, mid-training, or post-training data curation
  • Apply specialized expertise in agentic data, synthetic data, reasoning data, web parsing, coding data, data scaling laws, or datamix optimization
  • Lead complex technical projects end-to-end

Minimum Qualifications

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 1+ year of industry research experience in LLM/NLP or related AI/ML models
  • Experience owning and/or driving complex technical projects end-to-end
  • Practical experience with pre-training or mid-training data curation for large foundational models and experience working with organic, synthetic, agentic, or reasoning data for LLMs
  • Demonstrated data infrastructure and software background, and experience building data tooling and services
  • Published research in leading peer-reviewed conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP) and/or demonstrated significant industry influence in the field of AI

Preferred Qualifications

  • Experience working on frontier-quality/state-of-the-art Large Language Models
  • Master's degree or PhD in Computer Science or a related technical field
  • Hands-on experience with modeling frameworks like PyTorch
  • Hands-on experience with SQL and large-scale data handling, and familiarity with frameworks like Spark and Hive

$154,003/year to $217,000/year + bonus + equity + benefits

Key Skills
Apache Spark, LLM, Data Curation, Data Pipelines, Large-scale Data Handling, ML, NLP, SQL, PyTorch, Hive
