ML Infra Engineer (Supercomputing)

Physicalintelligence · San Francisco · On Site · Active · Ashby

Job facts

Field	Value
Company	Physicalintelligence
Title	ML Infra Engineer (Supercomputing)
Normalized title	-
Department / team	Machine Learning / Machine Learning
Location	San Francisco, CA, United States
Work model	On Site
Employment type	Full Time
Salary	-
Status	active
ATS provider	Ashby
Posted / first seen	— / 2026-05-29
Changed / last seen	2026-05-29 / 2026-06-06

Linked records

Company	Physicalintelligence
Source	2c3ebdb4-5d1a-4bdc-9b66-cfbb7d577518
ATS provider	Ashby

Description

Physical Intelligence builds general-purpose AI for the physical world. Training our models requires orchestrating thousands of accelerators across a heterogeneous fleet of GPU and TPU clusters, spanning different hardware generations, cloud providers, and cluster topologies. Today, researchers often need to know which cluster to target, what resources are available, and how to configure their jobs accordingly. That doesn't scale. We need a scheduling and compute layer that makes the right placement decision automatically, routing jobs to the best cluster based on availability, hardware fit, cost, and priority, so researchers can focus entirely on the science.

This role owns that problem end-to-end: the scheduling systems, the placement logic, the cluster management layer, and the operational tooling that keeps it all running. This is not cloud DevOps. It's not about standing up clusters and walking away. It's a systems role for people who care about intelligent resource allocation, utilization, fault tolerance, and making large-scale distributed training seamless.

The Team

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. You will work closely with ML Infra (training systems), data platform, and research teams to ensure compute scheduling is never the bottleneck.

In This Role You Will

- Own Intelligent Job Scheduling and Placement: Design and build multi-tenant scheduling systems that automatically place training jobs on the best available cluster based on hardware requirements, topology, availability, cost, and priority. Support fair resource sharing across teams and projects with quota management, priority tiers, and preemption policies. Abstract away cluster differences so researchers submit jobs without needing to know where they will land.
- Scale Multi-cluster Orchestration: Build the control plane that manages the job lifecycle across diverse clusters (mixed GPU/TPU, multi-generation hardware, on-prem/cloud) and enables seamless job migration, failover, and re-scheduling.
- Optimize Accelerator Utilization and Efficiency: Monitor and optimize GPU/TPU utilization across the entire fleet. Implement priority, preemption, queueing, and fairness policies that balance research velocity with cost efficiency.
- Ensure Scaling and Stability: Implement fault detection, automatic recovery, and resilience for long-running multi-node training jobs. Manage health checking, node management, and scaling to thousands of accelerators.
- Support Inference and Robot Deployment: Extend scheduling and orchestration to inference workloads, including deploying models to edge devices on physical robots.
- Enhance Observability and Developer Experience: Build the dashboards, alerting, SLOs, and debugging tools necessary for researchers to understand job status and for the team to ensure high scheduling quality and cluster reliability.

What We Hope You’ll Bring

We’re intentionally flexible on exact background, but strong candidates usually have:

- Strong software engineering fundamentals
- Experience building or operating job scheduling / resource management systems at scale
- Experience with large-scale compute clusters (GPU and/or TPU)
- Familiarity with schedulers and orchestration systems (SLURM, Kubernetes, GKE, K3S, or internal equivalents)
- Comfort reasoning about resource allocation, bin-packing, priority scheduling, and multi-tenancy
- Understanding of how ML training workloads behave: long-running, multi-node, sensitive to stragglers, topology-dependent
- A bias toward owning systems end-to-end, from design to operation
- Enjoyment of working closely with researchers and unblocking fast-moving projects

Bonus Points If You Have

- Experience building multi-cluster or federated scheduling systems
- Experience with TPU infrastructure (GCP TPU slices, Multislice, GKE)
- Background in cluster resource managers (Borg, YARN, Mesos, or custom schedulers)
- Linux systems engineering, networking, and infrastructure-as-code
- NCCL/collective communication and topology-aware placement
- Experience with capacity planning and cloud cost optimization at scale
- Familiarity with JAX, PyTorch, or similar ML frameworks at the runtime/systems level

In this role you will help scale and optimize our training systems and core model code. You’ll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You’ll work closely with researchers and model engineers to translate ideas into experiments, and those experiments into production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

The Team

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. The team works closely with research, data, and platform engineers to ensure models can scale from prototype to production-grade training runs.

In This Role You Will

- Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.
- Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.
- Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
- Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
- Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.
- Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale.
- Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics.

What We Hope You’ll Bring

- Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms.
- Hands-on large-scale training experience in JAX (preferred) or PyTorch.
- Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines.
- Experience managing training workloads on cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).
- Ability to debug and optimize performance bottlenecks across the training stack.
- Strong cross-functional communication and ownership mindset.

Bonus Points If You Have

- Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels).
- Experience operating close to hardware (GPU/TPU performance tuning).
- Background in robotics, multimodal models, or large-scale foundation models.
- Experience designing abstractions that balance researcher flexibility with system reliability.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Full job record

Job ID	8c12a523d15545cb960c4ccc8641aabb49da76f5
Org ID	cd906f47-5869-4ca1-a998-6ececc4415d9
Source ID	2c3ebdb4-5d1a-4bdc-9b66-cfbb7d577518
Board ID	2c3ebdb4-5d1a-4bdc-9b66-cfbb7d577518
Provider	ashby
Provider Job Key	0307dd1e-ac14-47fd-b80c-e3ca6e82ae46
Title	ML Infra Engineer (Supercomputing)
Normalized Title	—
Status	active
Active	yes
Location Text	San Francisco
Department	Machine Learning
Team	Machine Learning
Employment Type	full_time
Workplace Type	on_site
Remote Policy	—
Country	United States
Region	CA
City	San Francisco
Salary Raw	—
Salary Min	—
Salary Max	—
Salary Currency	—
Salary Period	—
Source URL	https://jobs.ashbyhq.com/physicalintelligence/0307dd1e-ac14-47fd-b80c-e3ca6e82ae46
Apply URL	https://jobs.ashbyhq.com/physicalintelligence/0307dd1e-ac14-47fd-b80c-e3ca6e82ae46/application
First Seen At	2026-05-29 05:24:33Z
Last Seen At	2026-06-06 19:46:56Z
Last Checked At	2026-06-06 19:46:56Z
Last Changed At	2026-05-29 05:24:33Z
Inactive At	—
Source Posted At	—
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=ashby/board=physicalintelligence/date=2026-06-06/2026-06-06T19-46-54-171Z-1d9f1a9fc8809c4fa1f05644a5470438736cab824fa3c2de3c66be2affe2875b.json

Event Fields

{
  "content_hash": "9c50e317adf9f098a3fd85ede07dfd61c2a503e75550f136e445a3e8c87e137e",
  "source_hash": "29a3c322030f6b438073de7d62228872fffd2706b4f658fa0a83e6f865dbf913",
  "last_changed_at": "2026-05-29T05:24:33.691Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "San Francisco",
    "city": "San Francisco",
    "region": "CA",
    "country": "United States",
    "is_remote": false,
    "confidence": 0.75
  },
  "salary_max": null,
  "salary_min": null,
  "inferred_at": "2026-06-06T19:46:56.762Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "San Francisco",
      "city": "San Francisco",
      "region": "CA",
      "country": "United States",
      "is_remote": false,
      "confidence": 0.75
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": null,
  "salary_period": null,
  "workplace_type": "on_site",
  "salary_currency": null
}

Extensions

{}

Native Structured

{
  "id": "0307dd1e-ac14-47fd-b80c-e3ca6e82ae46",
  "team": "Machine Learning",
  "title": "ML Infra Engineer (Supercomputing)",
  "jobUrl": "https://jobs.ashbyhq.com/physicalintelligence/0307dd1e-ac14-47fd-b80c-e3ca6e82ae46",
  "address": null,
  "applyUrl": "https://jobs.ashbyhq.com/physicalintelligence/0307dd1e-ac14-47fd-b80c-e3ca6e82ae46/application",
  "isListed": true,
  "isRemote": false,
  "location": "San Francisco",
  "updatedAt": null,
  "apiVersion": "ashby-non-user-graphql-v1",
  "department": "Machine Learning",
  "publishedAt": null,
  "workplaceType": "OnSite",
  "employmentType": "FullTime",
  "secondaryLocations": []
}

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/8c12a523d15545cb960c4ccc8641aabb49da76f5?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/cd906f47-5869-4ca1-a998-6ececc4415d9

GET https://api.bluedoor.sh/job-postings/v1/sources/2c3ebdb4-5d1a-4bdc-9b66-cfbb7d577518

GET https://api.bluedoor.sh/job-postings/v1/jobs/8c12a523d15545cb960c4ccc8641aabb49da76f5/events
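
The four requests above can be assembled programmatically. The following is a minimal Python sketch, not an official client. It assumes the trailing `JSON` shown after each URL on the rendered page is a format label rather than part of the path. The helper names (`build_url`, `page_requests`, `make_request`) are hypothetical, and the bearer-token `Authorization` header is an assumption, since this page does not say how an API key is sent. The sketch only builds request objects; it does not contact the API.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Base path taken from the GET lines listed on this page.
BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def build_url(path, **params):
    """Join the base URL, a resource path, and optional query parameters."""
    url = f"{BASE_URL}/{path.strip('/')}"
    return f"{url}?{urlencode(params)}" if params else url


def page_requests(job_id, org_id, source_id):
    """Return the four GET URLs this page lists, keyed by what they fetch."""
    return {
        "job": build_url(f"jobs/{job_id}", include="description"),
        "org": build_url(f"orgs/{org_id}"),
        "source": build_url(f"sources/{source_id}"),
        "events": build_url(f"jobs/{job_id}/events"),
    }


def make_request(url, api_key=None):
    """Build (but do not send) a GET request for one of the URLs."""
    headers = {"Accept": "application/json"}
    if api_key:
        # ASSUMPTION: bearer-token auth. The page does not document the scheme.
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(url, headers=headers, method="GET")


if __name__ == "__main__":
    urls = page_requests(
        job_id="8c12a523d15545cb960c4ccc8641aabb49da76f5",
        org_id="cd906f47-5869-4ca1-a998-6ececc4415d9",
        source_id="2c3ebdb4-5d1a-4bdc-9b66-cfbb7d577518",
    )
    for name, url in urls.items():
        print(name, url)
```

To actually fetch a record, a request built by `make_request` could be passed to `urllib.request.urlopen` once the real authentication scheme has been confirmed in the API docs.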