Staff AI Infrastructure Engineer

Biohub · Redwood City, CA (Hybrid) · Active · $241,000–$331,000 / year · Greenhouse

Job facts

Field	Value
Company	Biohub
Title	Staff AI Infrastructure Engineer
Normalized title	-
Department / team	AI Compute Platform
Location	Redwood City, CA, United States
Work model	Hybrid
Employment type	-
Salary	$241,000–$331,000 / year
Status	active
ATS provider	Greenhouse
Posted / first seen	2026-04-02 / 2026-05-29
Changed / last seen	2026-06-03 / 2026-06-06

Linked records

Company	Biohub
Source	4db9d1c7-a618-4c41-a07b-92fa342ad8fa
ATS provider	Greenhouse

Description

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and frontier experimental capabilities under one roof. We're building a general-purpose system to accelerate scientific discovery, integrating frontier AI models, biological foundation models, and lab capabilities, with the ultimate goal of curing disease. Our technology powers scientists around the world, translating AI capabilities into tools that accelerate research everywhere.

The Team

The AI Cluster Production Engineering team is part of the AI Compute Platform organization at Biohub, a non-profit research lab committed to open science and open-source AI. We own the design, operation, and reliability of large-scale multi-GPU AI clusters that power frontier AI biology research: protein language models, genomic foundation models, and scientific reasoning systems built to be shared, not monetized. Our clusters run Slurm on Kubernetes infrastructure and support everything from day-to-day AI researcher workflows to multi-node hero training runs at thousands of GPUs. The team works at the intersection of AI tooling, distributed systems, HPC, and frontier AI, debugging deep AI infrastructure problems and building AI systems critical to the entire AI organization.

The Opportunity

CZ Biohub's mission is to cure or prevent all human disease. Achieving that requires training frontier-scale AI biology models, and that demands reliable, high-performance compute infrastructure. This is production engineering work at a frontier AI lab, with the twist that the mission is biology and the science is open. You'll keep GPU clusters running at high utilization, debug the toughest distributed systems failures, and build the operational foundations for scaling to multi-thousand GPU hero runs.
The technical problems are genuinely hard (e.g., multi-node distributed training, InfiniBand fabrics, large-scale storage, Slurm at scale) inside an organization where the work is aimed at helping people, not optimizing ad revenue.

What You'll Do

Own reliability, observability, and incident response for multi-site GPU clusters running Slurm on Kubernetes. Build the systems, automation, and processes that keep clusters healthy and that enable fast, efficient recovery when things break.
Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers. Build the tooling and operational patterns that make these failures easier to detect, diagnose, and prevent.
Design and execute GPU cluster scaling plans, systematically validating storage, networking, interconnect, and scheduler behavior as clusters grow to support larger training runs.
Build automation and tooling to manage cluster operations at scale: capacity planning, GPU utilization monitoring, workload manager policy management, and pod lifecycle automation.
Drive configuration-as-code practices, ensuring cluster state is reproducible, auditable, and managed through version-controlled pipelines.
Collaborate directly with AI researchers and hero run leads to understand training workload patterns and design infrastructure that meets frontier-scale requirements.
Own the vendor relationship on technical issues: escalating SEV1s, coordinating across multiple partners and network backbone teams, and holding them accountable to root/proximate cause analysis and SLAs.
Contribute to capacity planning: projecting GPU demand, managing cluster expansion across GPU generations, and coordinating multi-cluster strategy.
Improve operational resilience by reducing mean time to detect and resolve incidents, reducing toil through automation, and developing runbooks that scale the team's operational knowledge beyond any individual.
What You'll Bring

8+ years of AI/ML infrastructure engineering experience, with deep expertise in at least one of: HPC/Slurm cluster operations, Kubernetes at scale, distributed systems debugging, or GPU compute infrastructure.
Strong Linux systems fundamentals: networking (TCP/IP, InfiniBand, RDMA, MTU/MSS/PMTUD), storage (NFS, VAST, WEKA, POSIX semantics), kernel internals (cgroups, namespaces, eBPF, sysctls).
Hands-on experience with Kubernetes and cloud-native infrastructure: pod lifecycle, CNI plugins (Cilium preferred), StatefulSets, Helm, ArgoCD, or equivalent GitOps tooling.
Experience with HPC workload managers, Slurm strongly preferred (QoS, partitions, preemption, accounting, Sunk/CoreWeave patterns a plus).
Debugging instinct: ability to form hypotheses quickly, design controlled experiments, and root cause complex multi-system failures under pressure. You enjoy finding the hard bugs.
Proficiency in Python and Bash for automation and tooling. Go, Rust, or C/C++ a plus.
Experience with observability stacks: Prometheus/VictoriaMetrics, Grafana, DCGM metrics, distributed tracing. You know how to instrument systems you don't control.
Excellent communication: you can write a crisp incident summary for researchers, a technical escalation to a vendor CTO, and a system design doc for teammates, all in the same day.
Bonus: experience with distributed AI training infrastructure (NCCL, PyTorch DDP, multi-node job debugging, checkpoint/restart patterns, container environments for large-scale training).

Compensation

The Redwood City, CA base pay range for a new hire in this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience, as evaluated throughout the interview process.
Better Together

As we grow, we’re excited to strengthen in-person connections and cultivate a collaborative, team-oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team’s manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.

Benefits for the Whole You

We’re thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible.

A generous employer match on employee 401(k) contributions to support planning for the future.
Paid time off to volunteer at an organization of your choice.
Funding for select family-forming benefits.
Relocation support for employees who need assistance moving.

If you’re interested in a role but your previous experience doesn’t perfectly align with each qualification in the job description, we still encourage you to apply, as you may be the perfect fit for this or another role.

Full job record

Job ID	ac1098ea15443bacfbf7780de456886fbeccba8c
Org ID	ffc1b481-3321-4001-ad91-ecf4f19245a9
Source ID	4db9d1c7-a618-4c41-a07b-92fa342ad8fa
Board ID	4db9d1c7-a618-4c41-a07b-92fa342ad8fa
Provider	greenhouse
Provider Job Key	7775820
Title	Staff AI Infrastructure Engineer
Normalized Title	—
Status	active
Active	yes
Location Text	Redwood City, CA (Hybrid)
Department	AI Compute Platform
Team	—
Employment Type	—
Workplace Type	hybrid
Remote Policy	hybrid
Country	United States
Region	CA
City	Redwood City
Salary Raw	base pay range for a new hire in this role is $241,000 - $331,000
Salary Min	241,000
Salary Max	331,000
Salary Currency	USD
Salary Period	year
Source URL	https://job-boards.greenhouse.io/biohub/jobs/7775820
Apply URL	https://job-boards.greenhouse.io/biohub/jobs/7775820
First Seen At	2026-05-29 22:41:12Z
Last Seen At	2026-06-06 20:17:46Z
Last Checked At	2026-06-06 20:17:46Z
Last Changed At	2026-06-03 10:45:19Z
Inactive At	—
Source Posted At	2026-04-02 19:24:46Z
Source Updated At	2026-06-02 20:06:41Z
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=greenhouse/board=biohub/date=2026-06-06/2026-06-06T20-17-46-674Z-7109dc94f2a18f37b37b244e3c825c6372bd41f556ef0d07dd9e326213acdfd1.json

Event Fields

{
  "content_hash": "e0cfc20318b917377aedb9f573adf203afdbabbd23e800ab5b3ee338ef4668e5",
  "source_hash": "c2cc6c4fd70d49ad22732a8b5a375fbcad6e151bdb682f2b7239f19e13929f85",
  "last_changed_at": "2026-06-03T10:45:19.790Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "Redwood City, CA (Hybrid)",
    "city": "Redwood City",
    "region": "CA",
    "country": "United States",
    "is_remote": false,
    "confidence": 0.9
  },
  "salary_max": 331000,
  "salary_min": 241000,
  "inferred_at": "2026-06-06T20:17:46.791Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Redwood City, CA (Hybrid)",
      "city": "Redwood City",
      "region": "CA",
      "country": "United States",
      "is_remote": false,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": "hybrid",
  "salary_period": "year",
  "workplace_type": "hybrid",
  "salary_currency": "USD"
}

Extensions

{}

Native Structured

{
  "title": "Staff AI Infrastructure Engineer",
  "offices": [
    {
      "id": 58664,
      "name": "Chan Zuckerberg Initiative",
      "location": "Redwood City, California, United States",
      "child_ids": [],
      "parent_id": null
    }
  ],
  "language": "en",
  "location": {
    "name": "Redwood City, CA (Hybrid)"
  },
  "metadata": [
    {
      "id": 167699,
      "name": "Careers Page - Dept.",
      "value": "Artificial Intelligence",
      "value_type": "single_select"
    },
    {
      "id": 176564,
      "name": "Careers Page - Team",
      "value": [
        "CZ Biohub Network"
      ],
      "value_type": "multi_select"
    },
    {
      "id": 177535,
      "name": "Brief Description",
      "value": "You’ll own reliability, observability, and incident response for our AI research Clusters — multi-site GPU clusters running Slurm/Sunk on Kubernetes. You will build resilient AI Clusters and be the last line of defense when production hero runs are at risk.",
      "value_type": "long_text"
    }
  ],
  "updated_at": "2026-06-02T16:06:41-04:00",
  "departments": [
    {
      "id": 347394,
      "name": "AI Compute Platform",
      "child_ids": [],
      "parent_id": 39963
    }
  ],
  "company_name": "Biohub",
  "requisition_id": 3401797,
  "first_published": "2026-04-02T15:24:46-04:00",
  "application_deadline": null
}

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/ac1098ea15443bacfbf7780de456886fbeccba8c?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/ffc1b481-3321-4001-ad91-ecf4f19245a9

GET https://api.bluedoor.sh/job-postings/v1/sources/4db9d1c7-a618-4c41-a07b-92fa342ad8fa

GET https://api.bluedoor.sh/job-postings/v1/jobs/ac1098ea15443bacfbf7780de456886fbeccba8c/events
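
The four requests above could be scripted roughly as follows. This is a minimal sketch, not the API's official client: the helper names `build_urls` and `fetch_json` are hypothetical, it assumes the `include` value is `description` and that responses are JSON, and it leaves authentication to the caller because the required header is not stated on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base path taken from the GET lines on this page.
BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def build_urls(job_id, org_id, source_id):
    """Return the four GET URLs for a posting, keyed by a short label.

    The labels ("job", "org", "source", "events") are illustrative only.
    """
    query = urlencode({"include": "description"})  # assumed parameter value
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?{query}",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


def fetch_json(url, headers=None, timeout=10):
    """GET a URL and decode the body as JSON (assumed response format).

    Pass whatever authentication header the API requires via `headers`;
    this sketch does not guess its name.
    """
    request = Request(url, headers=headers or {}, method="GET")
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `build_urls` with the Job ID, Org ID, and Source ID from the full job record yields the same four URLs listed above; each can then be passed to `fetch_json` together with the caller's own credentials.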