Senior Site Reliability Engineer - AI Infrastructure

Andromeda · Global Remote / San Francisco, CA · Remote · Active · Ashby

Job facts

Field	Value
Company	Andromeda
Title	Senior Site Reliability Engineer - AI Infrastructure
Normalized title	-
Department / team	Engineering / Engineering
Location	San Francisco, CA, United States
Work model	Remote / Remote
Employment type	Full Time
Salary	-
Status	active
ATS provider	Ashby
Posted / first seen	— / 2026-05-29
Changed / last seen	2026-05-29 / 2026-06-06

Linked records

Company	Andromeda
Source	82090970-642e-47c5-99bd-35d641779fd1
ATS provider	Ashby

Description

Senior Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, not dissimilar to the flows of capital in the world’s financial markets. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems. We’re looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from network fabric → kernel → framework.

What You’ll Own

GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.

What We’re Looking For

GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.
Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run: NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don't need to write the models, but you need to understand what's happening at the systems level when a 1,000-GPU training run stalls.
Linux & Systems Internals: Expert-level Linux knowledge, including kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, and performance profiling at the syscall and hardware level.
Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.
Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.
Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.

Strong Candidates May Have

Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.
Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.
Cluster Buildout & Hardware: Hands-on involvement in physical cluster design: rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.
Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We're growing and need people who raise the bar for everyone around them.

Why You’ll Love It Here

This is a high-impact, senior builder’s role. You’ll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You’ll influence technical direction and help define what world-class AI infrastructure operations look like.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Full job record

Job ID	84f3f1d2e93a3d45731c37ef018861903e568cd7
Org ID	ca14e017-3161-4f6f-b107-4115fd34e62b
Source ID	82090970-642e-47c5-99bd-35d641779fd1
Board ID	82090970-642e-47c5-99bd-35d641779fd1
Provider	ashby
Provider Job Key	c0129afa-ddfb-47e1-9968-b3dbf27b5d55
Title	Senior Site Reliability Engineer - AI Infrastructure
Normalized Title	—
Status	active
Active	yes
Location Text	Global Remote / San Francisco, CA
Department	Engineering
Team	Engineering
Employment Type	full_time
Workplace Type	remote
Remote Policy	remote
Country	United States
Region	CA
City	San Francisco
Salary Raw	—
Salary Min	—
Salary Max	—
Salary Currency	—
Salary Period	—
Source URL	https://jobs.ashbyhq.com/andromeda/c0129afa-ddfb-47e1-9968-b3dbf27b5d55
Apply URL	https://jobs.ashbyhq.com/andromeda/c0129afa-ddfb-47e1-9968-b3dbf27b5d55/application
First Seen At	2026-05-29 06:04:24Z
Last Seen At	2026-06-06 09:26:06Z
Last Checked At	2026-06-06 09:26:06Z
Last Changed At	2026-05-29 06:04:24Z
Inactive At	—
Source Posted At	—
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=ashby/board=andromeda/date=2026-06-06/2026-06-06T09-26-00-497Z-9b71a2f109ce98375c24379878e82782d30f21411c01d57595702f8f382e797f.json

Event Fields

{
  "content_hash": "d0cff5550b63ac28ca35b6c40d641c67799924adf140b39e0d980368c90430e7",
  "source_hash": "1957b3998032eaf37028a260ebd6f13bf754e08fb9e1afcc98dcc120923cccf0",
  "last_changed_at": "2026-05-29T06:04:24.135Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "Global Remote / San Francisco, CA",
    "city": "San Francisco",
    "region": "CA",
    "country": "United States",
    "is_remote": true,
    "confidence": 0.9
  },
  "salary_max": null,
  "salary_min": null,
  "inferred_at": "2026-06-06T09:26:06.428Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Global Remote / San Francisco, CA",
      "city": "San Francisco",
      "region": "CA",
      "country": "United States",
      "is_remote": true,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": "remote",
  "salary_period": null,
  "workplace_type": "remote",
  "salary_currency": null
}

Extensions

{}

Native Structured

{
  "id": "c0129afa-ddfb-47e1-9968-b3dbf27b5d55",
  "team": "Engineering",
  "title": "Senior Site Reliability Engineer - AI Infrastructure ",
  "jobUrl": "https://jobs.ashbyhq.com/andromeda/c0129afa-ddfb-47e1-9968-b3dbf27b5d55",
  "address": null,
  "applyUrl": "https://jobs.ashbyhq.com/andromeda/c0129afa-ddfb-47e1-9968-b3dbf27b5d55/application",
  "isListed": true,
  "isRemote": true,
  "location": "Global Remote / San Francisco, CA",
  "updatedAt": null,
  "apiVersion": "ashby-non-user-graphql-v1",
  "department": "Engineering",
  "publishedAt": null,
  "workplaceType": "Remote",
  "employmentType": "FullTime",
  "secondaryLocations": []
}

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/84f3f1d2e93a3d45731c37ef018861903e568cd7?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/ca14e017-3161-4f6f-b107-4115fd34e62b

GET https://api.bluedoor.sh/job-postings/v1/sources/82090970-642e-47c5-99bd-35d641779fd1

GET https://api.bluedoor.sh/job-postings/v1/jobs/84f3f1d2e93a3d45731c37ef018861903e568cd7/events
