bluedoor data · Job Postings API · bluedoor.sh

Sr. Site Reliability Engineer (SRE)

Moonlite · Chicago, IL or Remote · Remote · Active · $165,000–$225,000 / year · Greenhouse

Job facts

Company: Moonlite
Title: Sr. Site Reliability Engineer (SRE)
Normalized title: -
Department / team: Engineering
Location: Chicago, IL, United States
Work model: Remote / Remote
Employment type: -
Salary: $165,000–$225,000 / year
Status: active
ATS provider: Greenhouse
Posted / first seen: 2025-10-31 / 2026-05-29
Changed / last seen: 2026-05-29 / 2026-06-06

Related slices

Company jobs: Active postings from Moonlite.
Company breakdowns: Role, location, ATS, and work model facets for this company.
ATS provider jobs: Active postings observed through Greenhouse.
Provider filtered search: The same provider as a filtered job collection.
City jobs: Active postings in Chicago.
Department jobs: Active postings in Engineering.
Work model jobs: Active Remote postings.
Lifecycle events: Open, update, close, and reopen events for this posting.
Original posting: Canonical source or apply URL captured from the ATS.

Linked records

Company: Moonlite
Source: c34d04ea-24cb-4178-adbb-d118ee2dabff
ATS provider: Greenhouse

Description

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you’ll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires deep understanding of Kubernetes internals, custom resource definitions (CRDs), storage and network integrations, and building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing the automation, observability, and operational practices.

Job Responsibilities

Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenancy isolation and advanced networking policies. Configure CNI plugins and network segmentation for research workloads.
Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and other custom scheduling logic for GPU workload placement and utilization optimization.
Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.

Requirements

Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production.
Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.

Preferred Qualifications

Experience building custom Kubernetes operators or controllers for infrastructure orchestration
Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
Experience with Kubernetes cluster federation or multi-cluster management
Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
Familiarity with configuration management at scale and GitOps practices
Understanding of security best practices for Kubernetes and bare-metal infrastructure
Experience operating infrastructure in regulated industries or co-located data center environments
Background supporting research institutions, technical computing environments, or enterprise AI infrastructure

Key Technologies

Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems

Why Moonlite

Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package combining a competitive base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

#li-remote

Full job record

Job ID: 08c00d745958221925d7707e007b6ed2d545568e
Org ID: 05c08557-fba3-4578-bcad-3ad7c44782b8
Source ID: c34d04ea-24cb-4178-adbb-d118ee2dabff
Board ID: c34d04ea-24cb-4178-adbb-d118ee2dabff
Provider: greenhouse
Provider Job Key: 4971694008
Title: Sr. Site Reliability Engineer (SRE)
Normalized Title: -
Status: active
Active: yes
Location Text: Chicago, IL or Remote
Department: Engineering
Team: -
Employment Type: -
Workplace Type: remote
Remote Policy: remote
Country: United States
Region: IL
City: Chicago
Salary Raw: compensation range for this role is $165,000 – $225,000, which includes both base salary and equity
Salary Min: 165,000
Salary Max: 225,000
Salary Currency: USD
Salary Period: year
Source URL: https://job-boards.greenhouse.io/moonlite/jobs/4971694008
Apply URL: https://job-boards.greenhouse.io/moonlite/jobs/4971694008
First Seen At: 2026-05-29 23:01:56Z
Last Seen At: 2026-06-06 07:34:58Z
Last Checked At: 2026-06-06 07:34:58Z
Last Changed At: 2026-05-29 23:01:56Z
Inactive At: -
Source Posted At: 2025-10-31 15:09:07Z
Source Updated At: 2026-05-21 17:46:30Z
Raw Payload Uri: s3://job-postings-prod-raw-590183727216/raw/provider=greenhouse/board=moonlite/date=2026-06-06/2026-06-06T07-34-58-229Z-6a80e12c4a642ca83f7314eecff60daa32d21fb2c6d8f61a28136617dc667bbb.json
Event Fields
{
  "content_hash": "7268e67d0f63bceef4ae0692f4786d63319296ac93e33d8ee12aeee8bc9119a7",
  "source_hash": "0d899f214135efc4ecf2c52a76882e15df19ff0b48192c94991524a1f930552c",
  "last_changed_at": "2026-05-29T23:01:56.603Z",
  "active_status": "active"
}
Parsed Structured
{
  "language": "en",
  "location": {
    "raw": "Chicago, IL",
    "city": "Chicago",
    "region": "IL",
    "country": "United States",
    "is_remote": true,
    "confidence": 0.9
  },
  "salary_max": 225000,
  "salary_min": 165000,
  "inferred_at": "2026-06-06T07:34:58.299Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Chicago, IL",
      "city": "Chicago",
      "region": "IL",
      "country": "United States",
      "is_remote": true,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": "remote",
  "salary_period": "year",
  "workplace_type": "remote",
  "salary_currency": "USD"
}
Extensions
{}
Native Structured
{
  "title": "Sr. Site Reliability Engineer (SRE)",
  "offices": [
    {
      "id": 4026241008,
      "name": "United States (Remote)",
      "location": null,
      "child_ids": [],
      "parent_id": null
    }
  ],
  "language": "en",
  "location": {
    "name": "Chicago, IL or Remote"
  },
  "metadata": [],
  "updated_at": "2026-05-21T13:46:30-04:00",
  "departments": [
    {
      "id": 4030427008,
      "name": "Engineering",
      "child_ids": [],
      "parent_id": null
    }
  ],
  "company_name": "Moonlite",
  "requisition_id": 4360896008,
  "first_published": "2025-10-31T11:09:07-04:00",
  "application_deadline": null
}
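The offset timestamps in Native Structured (updated_at, first_published) appear to be the same instants as Source Updated At and Source Posted At, expressed in UTC. The sketch below shows one way to check that with the Python standard library. The function name to_utc_string is hypothetical and is not part of the bluedoor API; how the service itself normalizes timestamps is not stated on this page.

```python
from datetime import datetime, timezone


def to_utc_string(iso_timestamp: str) -> str:
    """Convert an ISO 8601 timestamp with a UTC offset to 'YYYY-MM-DD HH:MM:SSZ'.

    Hypothetical helper for illustration; the output format mirrors the
    'Source Posted At' / 'Source Updated At' fields shown in this record.
    """
    parsed = datetime.fromisoformat(iso_timestamp)  # keeps the -04:00 offset
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")


# Values copied from the Native Structured payload above.
updated_utc = to_utc_string("2026-05-21T13:46:30-04:00")
published_utc = to_utc_string("2025-10-31T11:09:07-04:00")

print(updated_utc)    # 2026-05-21 17:46:30Z
print(published_utc)  # 2025-10-31 15:09:07Z
```

Both results match the Source Updated At and Source Posted At values in the full job record.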
Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/08c00d745958221925d7707e007b6ed2d545568e?include=description
GET https://api.bluedoor.sh/job-postings/v1/orgs/05c08557-fba3-4578-bcad-3ad7c44782b8
GET https://api.bluedoor.sh/job-postings/v1/sources/c34d04ea-24cb-4178-adbb-d118ee2dabff
GET https://api.bluedoor.sh/job-postings/v1/jobs/08c00d745958221925d7707e007b6ed2d545568e/events
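The four requests above can be assembled from the Job ID, Org ID, and Source ID in the full job record. The following is a minimal sketch using only the Python standard library. The helper names record_urls and fetch_json are hypothetical, and the sketch assumes the endpoints return JSON and need no authentication headers; this page does not say whether an API key is required, so that part may need adjusting.

```python
import json
import urllib.request

# Base path taken from the GET requests listed above.
BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def record_urls(job_id: str, org_id: str, source_id: str) -> dict:
    """Build the four GET URLs that reproduce this page (hypothetical helper)."""
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?include=description",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body.

    Assumption: no authentication is sent; the real API may require it.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Identifiers copied from the full job record on this page.
urls = record_urls(
    job_id="08c00d745958221925d7707e007b6ed2d545568e",
    org_id="05c08557-fba3-4578-bcad-3ad7c44782b8",
    source_id="c34d04ea-24cb-4178-adbb-d118ee2dabff",
)

for name, url in urls.items():
    print(name, url)
```

Calling fetch_json(urls["job"]) would then issue the first request; it is left out of the snippet so the example runs without network access.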