Site Reliability Engineer - AI Infrastructure

Andromeda · Global Remote / San Francisco, CA · Remote · Active · Ashby

Job facts

Field	Value
Company	Andromeda
Title	Site Reliability Engineer - AI Infrastructure
Normalized title	-
Department / team	Engineering / Engineering
Location	San Francisco, CA, United States
Work model	Remote / Remote
Employment type	Full Time
Salary	-
Status	active
ATS provider	Ashby
Posted / first seen	— / 2026-05-29
Changed / last seen	2026-05-29 / 2026-06-06

Linked records

Company	Andromeda
Source	82090970-642e-47c5-99bd-35d641779fd1
ATS provider	Ashby

Description

Site Reliability Engineer - AI Infrastructure
Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, not dissimilar to the flows of capital in the world's financial markets.

We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You’ll Do

- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
- Build automation and tooling to streamline cluster deployments and integrations.
- Debug customer issues across networking, storage, scheduling, and system layers.
- Improve reliability and scalability of both training and inference infrastructure.
- Design and implement monitoring, alerting, and observability for critical systems.
- Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
- Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For

- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
- Strong Linux systems and networking fundamentals.
- Deep experience with Kubernetes and container orchestration at scale.
- Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
- Strong automation and scripting skills (Python, Go, or Bash).
- Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
- Track record of operating production systems and leading incident response.

Nice to Have

- Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).
- Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
- Customer-facing support or consulting experience.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Full job record

Job ID	3a30c77faa9739d4f9dc8771ea3f412b89f7950d
Org ID	ca14e017-3161-4f6f-b107-4115fd34e62b
Source ID	82090970-642e-47c5-99bd-35d641779fd1
Board ID	82090970-642e-47c5-99bd-35d641779fd1
Provider	ashby
Provider Job Key	af010369-e891-4700-9aa7-d90670919dbb
Title	Site Reliability Engineer - AI Infrastructure
Normalized Title	—
Status	active
Active	yes
Location Text	Global Remote / San Francisco, CA
Department	Engineering
Team	Engineering
Employment Type	full_time
Workplace Type	remote
Remote Policy	remote
Country	United States
Region	CA
City	San Francisco
Salary Raw	—
Salary Min	—
Salary Max	—
Salary Currency	—
Salary Period	—
Source URL	https://jobs.ashbyhq.com/andromeda/af010369-e891-4700-9aa7-d90670919dbb
Apply URL	https://jobs.ashbyhq.com/andromeda/af010369-e891-4700-9aa7-d90670919dbb/application
First Seen At	2026-05-29 06:04:24Z
Last Seen At	2026-06-06 09:26:06Z
Last Checked At	2026-06-06 09:26:06Z
Last Changed At	2026-05-29 06:04:24Z
Inactive At	—
Source Posted At	—
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=ashby/board=andromeda/date=2026-06-06/2026-06-06T09-26-00-497Z-9b71a2f109ce98375c24379878e82782d30f21411c01d57595702f8f382e797f.json

Event Fields

{
  "content_hash": "f511462c52bbb220b1e983b64faacc817a8a45a1b394d56f1fcf5aef57032fa8",
  "source_hash": "a6d7a00884b4765bb004e02baef68fe72e94edc2bc3f2f36410f1816b87b8971",
  "last_changed_at": "2026-05-29T06:04:24.135Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "Global Remote / San Francisco, CA",
    "city": "San Francisco",
    "region": "CA",
    "country": "United States",
    "is_remote": true,
    "confidence": 0.9
  },
  "salary_max": null,
  "salary_min": null,
  "inferred_at": "2026-06-06T09:26:06.430Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Global Remote / San Francisco, CA",
      "city": "San Francisco",
      "region": "CA",
      "country": "United States",
      "is_remote": true,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": "remote",
  "salary_period": null,
  "workplace_type": "remote",
  "salary_currency": null
}

Extensions

{}

Native Structured

{
  "id": "af010369-e891-4700-9aa7-d90670919dbb",
  "team": "Engineering",
  "title": "Site Reliability Engineer - AI Infrastructure",
  "jobUrl": "https://jobs.ashbyhq.com/andromeda/af010369-e891-4700-9aa7-d90670919dbb",
  "address": null,
  "applyUrl": "https://jobs.ashbyhq.com/andromeda/af010369-e891-4700-9aa7-d90670919dbb/application",
  "isListed": true,
  "isRemote": true,
  "location": "Global Remote / San Francisco, CA",
  "updatedAt": null,
  "apiVersion": "ashby-non-user-graphql-v1",
  "department": "Engineering",
  "publishedAt": null,
  "workplaceType": "Remote",
  "employmentType": "FullTime",
  "secondaryLocations": []
}

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/3a30c77faa9739d4f9dc8771ea3f412b89f7950d?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/ca14e017-3161-4f6f-b107-4115fd34e62b

GET https://api.bluedoor.sh/job-postings/v1/sources/82090970-642e-47c5-99bd-35d641779fd1

GET https://api.bluedoor.sh/job-postings/v1/jobs/3a30c77faa9739d4f9dc8771ea3f412b89f7950d/events
