bluedoor data·Job Postings API·bluedoor.sh ↗

Machine Learning Infrastructure Engineer

Ifm Us · Sunnyvale, CA · On Site · Active · $150,000–$450,000 / year · Lever

Job facts

Company: Ifm Us
Title: Machine Learning Infrastructure Engineer
Normalized title: -
Department / team: Engineering
Location: Sunnyvale, CA, United States
Work model: On Site
Employment type: Full Time
Salary: $150,000–$450,000 / year
Status: active
ATS provider: Lever
Posted / first seen: 2025-07-18 / 2026-05-29
Changed / last seen: 2026-06-02 / 2026-06-06

Related slices

Company jobs: Active postings from Ifm Us.
Company breakdowns: Role, location, ATS, and work model facets for this company.
ATS provider jobs: Active postings observed through Lever.
Provider filtered search: The same provider as a filtered job collection.
City jobs: Active postings in Sunnyvale.
Work model jobs: Active On Site postings.
Lifecycle events: Open, update, close, and reopen events for this posting.
Original posting: Canonical source or apply URL captured from the ATS.

Linked records

Company: Ifm Us
Source: 4d111a77-38db-4b88-84a8-24f761a495a9
ATS provider: Lever

Description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side-by-side with world-class researchers and engineers to:

• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
• Implement distributed optimizers from mathematical specs
• Build robust config + launch systems across multi-node, multi-GPU clusters
• Own experiment tracking, metrics logging, and job monitoring for external visibility
• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

• Distributed Framework Ownership: Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
• Optimizer Implementation: Translate mathematical optimizer specs into distributed implementations.
• Launch Config & Debugging: Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
• Metrics & Monitoring: Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
• Infra Engineering: Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

• 5+ years of experience in ML systems, infra, or distributed training
• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
• Strong software engineering fundamentals (Python, systems design, testing)
• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
• Ability to implement algorithms across GPUs/nodes based on mathematical specs
• Experience working on an ML platform/infrastructure and/or distributed inference optimization team
• Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation
• Familiarity with performance profiling, kernel fusion, or memory optimization
• Open-source contributions or published research (MLSys, ICML, NeurIPS)
• CUDA or Triton kernel experience
• Experience with large-scale pre-training
• Experience building custom training pipelines at scale and modifying them for custom needs
• Deep familiarity with training infrastructure and performance tuning

Full job record

Job ID: 7aeaeb8a2e8c0d3ff9a9c50b92bafcef5c0d46b9
Org ID: bb7fb7ce-62b9-4ed3-9327-02a3c7b7e5d0
Source ID: 4d111a77-38db-4b88-84a8-24f761a495a9
Board ID: 4d111a77-38db-4b88-84a8-24f761a495a9
Provider: lever
Provider Job Key: 5edf0fd9-2f47-4f1f-bf12-a787ebf9934e
Title: Machine Learning Infrastructure Engineer
Normalized Title:
Status: active
Active: yes
Location Text: Sunnyvale, CA
Department:
Team: Engineering
Employment Type: Full-time
Workplace Type: on_site
Remote Policy:
Country: United States
Region: CA
City: Sunnyvale
Salary Raw: USD 150000-450000 per-year-salary
Salary Min: 150,000
Salary Max: 450,000
Salary Currency: USD
Salary Period: year
Source URL: https://jobs.lever.co/ifm-us/5edf0fd9-2f47-4f1f-bf12-a787ebf9934e
Apply URL: https://jobs.lever.co/ifm-us/5edf0fd9-2f47-4f1f-bf12-a787ebf9934e/apply
First Seen At: 2026-05-29 06:59:53Z
Last Seen At: 2026-06-06 20:14:05Z
Last Checked At: 2026-06-06 20:14:05Z
Last Changed At: 2026-06-02 10:41:24Z
Inactive At:
Source Posted At: 2025-07-18 05:41:27Z
Source Updated At:
Raw Payload Uri: s3://job-postings-prod-raw-590183727216/raw/provider=lever/board=ifm-us/date=2026-06-06/2026-06-06T20-14-04-180Z-dba991fe17ae8dd61e2db3cfb8af8d8d910a473e10cffaf0af12daa6be784167.json
Event Fields
{
  "content_hash": "e2ef138ac16c0d97a82ee36b5e3c2520111394cdccbd71cee2aca9909238225f",
  "source_hash": "d365c90d8b5c2f593dd4c52ca14aca46193d4cbf522f4af5f004a57f63f83b26",
  "last_changed_at": "2026-06-02T10:41:24.749Z",
  "active_status": "active"
}
Parsed Structured
{
  "language": "en",
  "location": {
    "raw": "Sunnyvale, CA",
    "city": "Sunnyvale",
    "region": "CA",
    "country": "United States",
    "is_remote": false,
    "confidence": 0.9
  },
  "salary_max": 450000,
  "salary_min": 150000,
  "inferred_at": "2026-06-06T20:14:05.520Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Sunnyvale, CA",
      "city": "Sunnyvale",
      "region": "CA",
      "country": "United States",
      "is_remote": false,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": null,
  "salary_period": "year",
  "workplace_type": "on_site",
  "salary_currency": "USD"
}
Extensions
{}
Native Structured
{
  "lists": [],
  "country": "US",
  "createdAt": 1752817287843,
  "updatedAt": null,
  "categories": {
    "team": "Engineering",
    "location": "Sunnyvale, CA",
    "commitment": "Full-time",
    "allLocations": [
      "Sunnyvale, CA"
    ]
  },
  "salaryRange": {
    "max": 450000,
    "min": 150000,
    "currency": "USD",
    "interval": "per-year-salary"
  },
  "workplaceType": "onsite"
}
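
The native Lever fields above line up with the flattened values in the full job record: createdAt is an epoch timestamp in milliseconds that matches Source Posted At, the salaryRange fields match Salary Raw and Salary Period, and workplaceType "onsite" appears as on_site. The sketch below reproduces that mapping for this one record. The function name normalize, the output key names, and the two lookup tables are hypothetical and cover only the values seen on this page; how bluedoor actually performs the normalization is not stated here.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the "Native Structured" block of this record.
native = {
    "createdAt": 1752817287843,  # epoch milliseconds
    "salaryRange": {
        "max": 450000,
        "min": 150000,
        "currency": "USD",
        "interval": "per-year-salary",
    },
    "workplaceType": "onsite",
}

# Hypothetical lookup tables: they hold only the pairs observed on this page.
INTERVAL_TO_PERIOD = {"per-year-salary": "year"}
WORKPLACE_TYPES = {"onsite": "on_site"}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def normalize(payload: dict) -> dict:
    """Map native fields onto flat fields like those in the full job record."""
    salary = payload["salaryRange"]
    # Integer millisecond arithmetic avoids float rounding on the timestamp.
    posted = EPOCH + timedelta(milliseconds=payload["createdAt"])
    return {
        "source_posted_at": posted.strftime("%Y-%m-%d %H:%M:%SZ"),
        "salary_raw": (
            f"{salary['currency']} {salary['min']}-{salary['max']} "
            f"{salary['interval']}"
        ),
        "salary_min": salary["min"],
        "salary_max": salary["max"],
        "salary_currency": salary["currency"],
        "salary_period": INTERVAL_TO_PERIOD[salary["interval"]],
        "workplace_type": WORKPLACE_TYPES[payload["workplaceType"]],
    }


record = normalize(native)
for field, value in record.items():
    print(f"{field}: {value}")
```

Unknown interval or workplace values raise a KeyError in this sketch rather than being guessed, since the page shows only one value for each.
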
Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/7aeaeb8a2e8c0d3ff9a9c50b92bafcef5c0d46b9?include=description
GET https://api.bluedoor.sh/job-postings/v1/orgs/bb7fb7ce-62b9-4ed3-9327-02a3c7b7e5d0
GET https://api.bluedoor.sh/job-postings/v1/sources/4d111a77-38db-4b88-84a8-24f761a495a9
GET https://api.bluedoor.sh/job-postings/v1/jobs/7aeaeb8a2e8c0d3ff9a9c50b92bafcef5c0d46b9/events
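
The four requests above follow one pattern: a base path plus the job, org, or source ID from the full job record. As a minimal sketch, the URLs can be assembled from those IDs as shown below. The helper name record_urls and the dictionary keys are hypothetical. The sketch only builds the URLs and does not send requests, because this page does not say what authentication or headers the API expects.

```python
from urllib.parse import urlencode

# Base path taken from the GET lines on this page.
BASE_URL = "https://api.bluedoor.sh/job-postings/v1"

# IDs copied from the full job record.
JOB_ID = "7aeaeb8a2e8c0d3ff9a9c50b92bafcef5c0d46b9"
ORG_ID = "bb7fb7ce-62b9-4ed3-9327-02a3c7b7e5d0"
SOURCE_ID = "4d111a77-38db-4b88-84a8-24f761a495a9"


def record_urls(job_id: str, org_id: str, source_id: str) -> dict:
    """Build the four URLs that reproduce a job page (hypothetical helper)."""
    query = urlencode({"include": "description"})
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?{query}",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


urls = record_urls(JOB_ID, ORG_ID, SOURCE_ID)
for name, url in urls.items():
    print(f"GET {url}")
```

Each URL can then be passed to any HTTP client once the API's access requirements are known.
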