ML Safety Research Engineer

Apple

San Francisco, California

Posted today

Apple Services Engineering (ASE) powers many AI features across App Store, Music, Video and more. We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating systemic biases and maintain safe and trustworthy experiences across our AI tools and models.

Our team, part of Apple Services Engineering, is looking for an ML Research Engineer to lead the design and continuous development of automated safety benchmarking methodologies. In this role, you will investigate how media-related agents behave, develop rigorous evaluation frameworks and techniques, and establish scientific standards for assessing the risks they pose and their safety performance. This role supports the development of scalable evaluation techniques that ensure our engineers have the right tools to assess candidate models and product features for responsible and safe performance. The capabilities you build will allow for the generation of benchmark datasets and evaluation methodologies for model and application outputs, at scale, to enable engineering teams to translate safety insights into actionable engineering and product improvements. This role blends deep technical expertise with strong analytical judgment to develop tools and capabilities for assessing and improving the behavior of advanced AI/ML models.

You will work cross-functionally with Engineering and Project Managers, Product, and Governance teams to develop a suite of technologies to ensure that AI experiences are reliable, safe, and aligned with human expectations. The successful candidate will take a proactive approach to working independently and collaboratively on a wide range of projects. In this role, you will work alongside a small but impactful team, collaborating with ML and data scientists, software developers, project managers, and other teams at Apple to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks.

Responsibilities

Design scientifically-grounded benchmarking methodologies covering multiple dimensions of responsibility and safety across several media and application marketplace use cases

Develop automated evaluation pipelines that collect, automatically judge, and analyze model outputs with respect to safety policies, at scale

Create and curate datasets, tasks, and feature usage scenarios that represent realistic and adversarial use cases across multiple languages, markets, and domains

Define and validate new metrics for complex phenomena such as multi-turn agentic interaction patterns

Apply statistical rigor and reproducibility to the above-mentioned objectives

Work closely with engineering and research teams to translate experimental findings into actionable model improvements and safety mitigations

Publish internal reports and external papers

Monitor evolving industry practices and academic work to ensure benchmarks remain relevant

Minimum Qualifications

Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience

• years of work experience either as a postdoc or in the industry

Strong research background in empirical evaluation, experimental design, or benchmarking

Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)

Deep familiarity with software engineering workflows and developer tools

Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems

Strong analytical and communication skills, including the ability to write clear reports

Technical Skills

Proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)

Experience working with large datasets, annotation tools, and model evaluation pipelines

Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns

Ability to design taxonomies, categorization schemes, and structured labeling frameworks

Analytical Strength: Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights

Communication: Strong ability to stitch together qualitative and quantitative insights into actionable guidance; strong ability to communicate complex architectures and systems to a variety of stakeholders

Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field

Preferred Qualifications

Publications in AI/ML evaluation or related fields

Experience with automated testing frameworks

Experience constructing human-in-the-loop or multi-turn evaluation setups

Intermediate or Advanced Proficiency in Swift

Familiarity with RAG systems, reinforcement learning, agentic architectures, and model fine-tuning

Expertise in designing annotation guidelines and validation instruments and techniques

Background in human factors, social science, and/or safety assessment methodologies

Pay & Benefits

At Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $181,100 and $272,100, and your base pay will depend on your skills, qualifications, experience, and location. Apple employees also have the opportunity to become an Apple shareholder through participation in Apple's discretionary employee stock programs. Apple employees are eligible for discretionary restricted stock unit awards, and can purchase Apple stock at a discount if voluntarily participating in Apple's Employee Stock Purchase Plan. You'll also receive benefits including: Comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and for formal education related to advancing your career at Apple, reimbursement for certain educational expenses including tuition. Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation.

Apple is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics.

Apple accepts applications to this posting on an ongoing basis.

