Senior Distributed Systems Engineer
Ifm Us · Sunnyvale, CA · On Site · Active · $200,000–$400,000 / year · Lever
Job facts
| Field | Value |
|---|---|
| Company | Ifm Us |
| Title | Senior Distributed Systems Engineer |
| Normalized title | - |
| Department / team | Engineering |
| Location | Sunnyvale, CA, United States |
| Work model | On Site |
| Employment type | - |
| Salary | $200,000–$400,000 / year |
| Status | active |
| ATS provider | Lever |
| Posted / first seen | 2026-03-03 / 2026-05-29 |
| Changed / last seen | 2026-06-02 / 2026-06-23 |
Linked records
| Field | Value |
|---|---|
| Company | Ifm Us |
| Source | 4d111a77-38db-4b88-84a8-24f761a495a9 |
| ATS provider | Lever |
Description
About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads
Expected Technical Depth
· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
· You can explain why communication degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime
Application Requirements
· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.
Visa Sponsorship
This position is eligible for visa sponsorship.
Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401K Plan
· Generous paid time off, sick leave and holidays
· Paid Parental Leave
· Employee Assistance Program
· Life insurance and disability
Full job record
| Field | Value |
|---|---|
| Job ID | 99b1a53555716798a783e306190b4b7ed74c4848 |
| Org ID | bb7fb7ce-62b9-4ed3-9327-02a3c7b7e5d0 |
| Source ID | 4d111a77-38db-4b88-84a8-24f761a495a9 |
| Board ID | 4d111a77-38db-4b88-84a8-24f761a495a9 |
| Provider | lever |
| Provider Job Key | 0d6c2a0b-1a84-4ddc-ae0b-a13eb5730812 |
| Title | Senior Distributed Systems Engineer |
| Normalized Title | — |
| Status | active |
| Active | yes |
| Location Text | Sunnyvale, CA |
| Department | — |
| Team | Engineering |
| Employment Type | — |
| Workplace Type | on_site |
| Remote Policy | — |
| Country | United States |
| Region | CA |
| City | Sunnyvale |
| Salary Raw | USD 200000-400000 per-year-salary |
| Salary Min | 200,000 |
| Salary Max | 400,000 |
| Salary Currency | USD |
| Salary Period | year |
| Source URL | https://jobs.lever.co/ifm-us/0d6c2a0b-1a84-4ddc-ae0b-a13eb5730812 |
| Apply URL | https://jobs.lever.co/ifm-us/0d6c2a0b-1a84-4ddc-ae0b-a13eb5730812/apply |
| First Seen At | 2026-05-29 06:59:53Z |
| Last Seen At | 2026-06-23 07:55:59Z |
| Last Checked At | 2026-06-23 07:55:59Z |
| Last Changed At | 2026-06-02 10:41:24Z |
| Inactive At | — |
| Source Posted At | 2026-03-03 01:10:02Z |
| Source Updated At | — |
| Raw Payload Uri | s3://job-postings-prod-raw-590183727216/raw/provider=lever/board=ifm-us/date=2026-06-23/2026-06-23T07-55-58-773Z-0ab7124e577c71437276926999ed4d9ae62c5beae65d219654cae179b652c2af.json |
Event Fields
{
"content_hash": "254fa5160b2a63eec7c5438cbc459d6448b6ad65d43cc9ed90ae15c1ccd11dd9",
"source_hash": "6a884e6406a9d316395900d7cdb2a3e81ab4719b41cbb0ae7514b715e091dd2c",
"last_changed_at": "2026-06-02T10:41:24.749Z",
"active_status": "active"
}
Parsed Structured
{
"dedupe": null,
"language": "en",
"location": {
"raw": "Sunnyvale, CA",
"city": "Sunnyvale",
"region": "CA",
"country": "United States",
"is_remote": false,
"confidence": 0.9
},
"salary_max": 400000,
"salary_min": 200000,
"inferred_at": "2026-06-23T07:55:59.802Z",
"launch_scope": {
"reason": "english_us_canada",
"included": true,
"language": "en",
"location": {
"raw": "Sunnyvale, CA",
"city": "Sunnyvale",
"region": "CA",
"country": "United States",
"is_remote": false,
"confidence": 0.9
},
"countries": [
"United States"
]
},
"remote_policy": null,
"salary_period": "year",
"workplace_type": "on_site",
"salary_currency": "USD"
}
Extensions
{}
Native Structured
{
"lists": [],
"country": "US",
"createdAt": 1772500202035,
"updatedAt": null,
"categories": {
"team": "Engineering",
"location": "Sunnyvale, CA",
"allLocations": [
"Sunnyvale, CA"
]
},
"salaryRange": {
"max": 400000,
"min": 200000,
"currency": "USD",
"interval": "per-year-salary"
},
"workplaceType": "onsite"
}
Get this page with API
Rendered from the bluedoor Job Postings API. Reproduce it:
GET https://api.bluedoor.sh/job-postings/v1/jobs/99b1a53555716798a783e306190b4b7ed74c4848?include=description
GET https://api.bluedoor.sh/job-postings/v1/orgs/bb7fb7ce-62b9-4ed3-9327-02a3c7b7e5d0
GET https://api.bluedoor.sh/job-postings/v1/sources/4d111a77-38db-4b88-84a8-24f761a495a9
GET https://api.bluedoor.sh/job-postings/v1/jobs/99b1a53555716798a783e306190b4b7ed74c4848/events
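The four requests above could be scripted roughly as follows. This is a minimal sketch using only the Python standard library, not an official client. The helper names `build_urls` and `fetch_json` are hypothetical, the `Accept` header is an assumption, and the page does not say whether the API needs an API key or other authentication, so none is sent here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base path taken from the request lines on this page.
BASE = "https://api.bluedoor.sh/job-postings/v1"

# Identifiers copied from the "Full job record" table.
JOB_ID = "99b1a53555716798a783e306190b4b7ed74c4848"
ORG_ID = "bb7fb7ce-62b9-4ed3-9327-02a3c7b7e5d0"
SOURCE_ID = "4d111a77-38db-4b88-84a8-24f761a495a9"


def build_urls(job_id, org_id, source_id):
    """Return the four URLs listed on the page, in the same order."""
    return [
        f"{BASE}/jobs/{job_id}?" + urlencode({"include": "description"}),
        f"{BASE}/orgs/{org_id}",
        f"{BASE}/sources/{source_id}",
        f"{BASE}/jobs/{job_id}/events",
    ]


def fetch_json(url, timeout=10.0):
    """GET one URL and decode the body as JSON.

    Assumption: the endpoint answers unauthenticated requests with a
    JSON body. Add whatever credentials the API actually requires.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


URLS = build_urls(JOB_ID, ORG_ID, SOURCE_ID)
```

Calling `fetch_json(url)` for each entry in `URLS` would then retrieve the job, organization, source, and event records, provided the assumptions above hold.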