
Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Amazon

Cupertino, California
Posted 1 week ago


Full Job Description

Senior Software Engineer
Annapurna Labs designs silicon and software that accelerate innovation. Our custom chips, accelerators, and software stacks enable us to tackle unprecedented technical challenges and deliver solutions that help customers change the world. AWS Neuron is the complete software stack powering AWS Trainium (Trn2/Trn3), our cloud-scale machine learning accelerators, and we are seeking a Senior Software Engineer to join our ML Distributed Training team.

In this role, you will be responsible for the development, enablement, and performance optimization of large-scale ML model training across diverse model families. This includes massive-scale pre-training and post-training of LLMs with dense and Mixture-of-Experts architectures, transformer- and diffusion-based multimodal models, and reinforcement-learning workloads. You will work at the intersection of ML research and high-performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers, and AWS solution architects to deliver cost-effective, performant machine-learning solutions on AWS Trainium-based systems.

Key job responsibilities:
  • Design, implement, and optimize distributed training solutions for large-scale ML models running on Trainium instances.
  • Extend and optimize popular distributed training frameworks, including FSDP (Fully Sharded Data Parallel), torchtitan, and Hugging Face libraries, for the Neuron ecosystem.
  • Develop and optimize mixed-precision and low-precision training techniques. Work with BF16, FP8, and emerging numerical formats to maximize training throughput while maintaining model accuracy and convergence quality.
  • Profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. Partner with hardware, compiler, and runtime teams to influence system design and unlock new capabilities.
  • Work directly with AWS solution architects and customers to deploy and optimize training workloads at scale.
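The BF16 numerics mentioned in the responsibilities above can be illustrated with a small, hypothetical sketch (plain Python, not Neuron or Trainium code): bfloat16 keeps float32's 8-bit exponent but only 8 bits of significand, which is why it preserves dynamic range while trading away precision.

```python
import struct

def to_bf16(x: float) -> float:
    """Reduce a Python float to bfloat16 precision by truncating the
    low 16 bits of its float32 encoding (round-toward-zero here;
    real hardware usually rounds to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# bfloat16 keeps float32's exponent range but only ~3 decimal digits:
print(to_bf16(3.14159265))   # 3.140625
print(to_bf16(1e30))         # ~1e30: huge values survive, unlike FP16
```

In real mixed-precision training, master weights typically stay in FP32 while activations and gradients use BF16; the snippet only demonstrates the number format, not a full training recipe.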

About the team:
Annapurna Labs was a startup acquired by AWS in 2015 and is now fully integrated. If AWS is an infrastructure company, think of Annapurna Labs as the infrastructure provider of AWS. Our org covers multiple disciplines including silicon engineering, hardware design and verification, software, and operations. AWS Nitro, ENA, EFA, Graviton and F1 EC2 instances, AWS Neuron, the Inferentia and Trainium ML accelerators, and scalable NVMe storage are some of the products we have delivered over the last few years.
