The Diffuse Project is dedicated to advancing our understanding of protein motion through diffuse scattering, a signal in X-ray crystallography that is currently under-utilized or ignored and that will unlock our ability to measure protein dynamics. We are bringing together a diverse team of researchers, software developers, and beamline scientists to accomplish our mission. We are committed to the Open Science principles of making all of our work, software, and data open and FAIR at every stage. The Diffuse Project is generously funded by, and is part of, the Astera Institute. You can read more about The Diffuse Project here, and Astera’s mission, vision, and programming here.
The Diffuse Project is seeking a Machine Learning Infrastructure Engineer to lead the development of robust, scalable backend systems that power machine learning–driven discoveries in structural biology. You will work at the intersection of scientific research and software engineering, partnering with researchers to train, test, and deploy ML models directly on experimental data (electron density/structure factors) from X-ray crystallography and cryo-EM.
This role is ideal for someone with deep experience in ML infrastructure and scientific computing who thrives in a collaborative and product-minded environment. This is a 6-month assignment with potential for extension.
Architect, build, and maintain ML infrastructure pipelines for model training, validation, and deployment across diverse experimental datasets in collaboration with scientists
Design and manage data ingestion and preprocessing workflows for structural biology data (PDBs, cryo-EM maps, diffraction patterns, etc.) in collaboration with scientists
Develop and maintain backend services and APIs that support modular access to models, datasets, and experiment metadata
Support GPU/accelerated training on local HPC clusters or cloud platforms
Implement data versioning, model tracking, and reproducibility tools
Collaborate with ML researchers and experimentalists to streamline the integration of new algorithms, datasets, and evaluation metrics
Ability to work effectively in a multidisciplinary team environment
Strong programming skills in Python, ideally with experience in PyTorch
Deep understanding of machine learning infrastructure, including model training pipelines, GPU utilization, experiment tracking, and deployment
Proficiency in backend development (e.g., REST APIs, containerization with Docker, workflow management, and data engineering tools)
Experience with distributed compute environments
Solid understanding of scientific computing workflows, version control, and reproducibility principles
At least two years of experience working on ML models
(Bonus) Familiarity with structural biology data formats
(Bonus) Experience designing systems for diffusion-based models
W-2, fixed-term employment, 6-month assignment. Potential extension based on performance and business needs.
This role is remote, with access to our office in Emeryville, CA. Some travel may be required from time to time for in-person collaboration and work.
The posted salary range is based on location in the Bay Area. The successful candidate will receive a competitive compensation package, commensurate with their experience and location.