Senior Site Reliability Engineer

Iru · Miami · Hybrid · Active · Lever

Job facts

Field	Value
Company	Iru
Title	Senior Site Reliability Engineer
Normalized title	-
Department / team	R&D / Engineering
Location	Miami, FL, United States
Work model	Hybrid / Hybrid
Employment type	Full Time
Salary	-
Status	active
ATS provider	Lever
Posted / first seen	2026-04-10 / 2026-05-29
Changed / last seen	2026-05-29 / 2026-06-06

Linked records

Company	Iru
Source	280f567e-c20d-4c1f-9d80-e5a1aa4d714f
ATS provider	Lever

Description

About Iru

Iru is the AI-powered security & IT platform used by the world’s fastest-growing companies to secure their users, apps, and devices. Built for the AI era, Iru unifies identity & access, endpoint security & management, and compliance automation, collapsing the stack and giving IT & security time and control back.

Iru is backed by some of the smartest investors in tech: General Catalyst, Tiger Global, Felicis, Greycroft, and First Round Capital. In July 2024, Iru raised $100 million from General Catalyst, valuing the company at $850 million. Customers include Notion, Cursor, Lovable, Replit, and Mercor, and Iru partners with industry leaders such as ServiceNow and AWS. Iru was named to Forbes’ America’s Best Startup Employers 2025 list for employee engagement and satisfaction.

The Opportunity

We are looking for a Senior SRE to own how we detect, respond to, and learn from incidents, and to drive consistent observability across services and teams. This role sits at the intersection of reliability engineering and cross-team enablement: you will work alongside our Infrastructure team to complement their platform-building work with a sharp focus on operational excellence and measurable reliability. You will partner with engineering and platform teams to reduce MTTD and MTTR, and to make reliability measurable, repeatable, and ultimately team-owned.

Benefits & Perks

- Competitive salary
- Hybrid work environment (3 days in office per week)
- 100% individual and dependent medical + dental + vision coverage
- 401(k) with a 4% company match
- 20 days PTO
- Iru Wellness Week the first week in July
- Equity for full-time employees
- In-office lunch stipend provided
- Up to 16 weeks of paid leave for new parents
- Paid Family and Medical Leave
- Modern Health mental health benefits for individuals and dependents
- Fertility benefits
- Working Advantage employee discounts
- Onsite fitness center
- Free parking
- Exciting opportunities for career growth

We are excited to be serving a significant need for a fast-growing market, and are proud of the high-performing team we have brought together so far. If you’re someone who wants to engage in new, exciting projects that will challenge your skills in the best way possible, we would love to connect with you.

At Iru, we believe in fostering an inclusive environment in which employees feel encouraged to share their unique perspectives, leverage their strengths, and act authentically. We know that diverse teams are strong teams, and welcome those from all backgrounds and varying experiences. Iru is proud to be an equal opportunity employer committed to diversity and inclusion in the workplace. Qualified applicants will be considered for employment without regard to race, color, religion, national origin, age, sex, sexual orientation, gender identity, physical or mental disability, protected veteran or military status or any other status protected by applicable law.

What You Will Do

- Lead and refine the incident lifecycle: detection, triage, communication, mitigation, resolution, and post-incident review.
- Define and maintain severity models, escalation paths, on-call expectations, and runbooks/playbooks, keeping them current and usable under pressure.
- Facilitate blameless postmortems; turn findings into tracked remediations and shared learning that reduces repeat incidents.
- Improve coordination during major incidents: roles, tooling, customer/stakeholder updates, and handoffs.
- Partner with security, support, and product on incident communications and regulatory or contractual obligations where applicable.

Observability Standardization & SLI/SLO Evangelism

- Establish and maintain organization-wide standards for metrics, logs, and traces in Datadog (including naming conventions, cardinality, retention, and sampling) so teams can instrument consistently and confidently.
- Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams; meet teams where they are, bootstrapping SLI/SLO programs for teams starting from scratch and improving rigor for teams that already have them, with the long-term goal of teams owning their own observability.
- Build and maintain reusable Datadog dashboard templates, monitor templates, and alerting patterns that teams can adopt and adapt, reducing the activation energy for doing observability well.
- Champion golden signals and RED/USE-style alerting philosophies; align alerts with user-impacting symptoms, not just low-level infrastructure noise.
- Partner with the Infrastructure team on observability stack decisions, multi-tenancy, cost controls, and data lifecycle.
- Continuously reduce alert noise through threshold tuning, ownership assignment, and on-call load management.

Reliability Culture

- Mentor engineers on operational excellence, safe deployment practices, and production readiness; help engineering teams grow their own reliability instincts.
- Contribute to capacity planning, chaos/game-day exercises, and reliability reviews for critical changes.
- Serve as a connective layer between the SRE and Infrastructure teams, aligning on tooling, standards, and shared goals.

Requirements

- Experience: 5+ years in SRE, production engineering, or equivalent, including on-call responsibility for customer-facing systems.
- Incidents: Proven experience running or significantly improving incident response (process, tooling, or both) in a distributed systems environment.
- Observability: Deep, hands-on experience with Datadog: building dashboards, monitors, and instrumentation standards across multiple teams or services. Experience with metrics, logging, and tracing at scale.
- SLI/SLO Programs: Demonstrated experience defining SLOs/SLIs and error budget policies in production; comfortable working with teams to codify the metrics their reliability posture is based on.
- Systems: Strong understanding of Linux, networking, distributed systems failure modes, and cloud or hybrid infrastructure (Kubernetes, load balancers, databases, queues).
- Automation: Proficiency in at least one of Go, Python, or similar for tooling and automation; comfort with IaC concepts (Terraform or equivalent).
- Communication: Clear written and verbal communication; ability to facilitate discussions during high-pressure incidents and deliberate postmortems alike.
- Collaboration: Track record of influencing without direct authority and driving adoption across engineering teams.

Nice to Have

- Experience with OpenTelemetry or similar vendor-neutral instrumentation strategies.
- Familiarity with PagerDuty, Incident.io, Opsgenie, or similar; Statuspage or equivalent for external communications.
- Experience in a hyper-growth startup environment.
- Experience in regulated or high-compliance environments.
- Contributions to internal developer platforms or shared reliability tooling.

What Success Looks Like

- Fewer repeated incidents and clearer, actionable postmortem outcomes that teams act on.
- Engineering teams across the org have well-defined SLIs/SLOs they own and actively use to drive reliability decisions.
- A shared Datadog observability layer with consistent signals, templated dashboards, and actionable alerts tied to user impact.
- Engineers know how to instrument, where to look, and how to respond, with sustainable, well-supported on-call.

Full job record

Job ID	90a8472b24b304df128e8dd54ec7efed2a630784
Org ID	e40a55cb-6f98-49ac-b5bd-35cb0dc5eedf
Source ID	280f567e-c20d-4c1f-9d80-e5a1aa4d714f
Board ID	280f567e-c20d-4c1f-9d80-e5a1aa4d714f
Provider	lever
Provider Job Key	609c2da4-46cc-4d00-a048-5819189a0bb8
Title	Senior Site Reliability Engineer
Normalized Title	—
Status	active
Active	yes
Location Text	Miami
Department	R&D
Team	Engineering
Employment Type	Full-Time
Workplace Type	hybrid
Remote Policy	hybrid
Country	United States
Region	FL
City	Miami
Salary Raw	—
Salary Min	—
Salary Max	—
Salary Currency	—
Salary Period	—
Source URL	https://jobs.lever.co/iru/609c2da4-46cc-4d00-a048-5819189a0bb8
Apply URL	https://jobs.lever.co/iru/609c2da4-46cc-4d00-a048-5819189a0bb8/apply
First Seen At	2026-05-29 07:00:32Z
Last Seen At	2026-06-06 19:42:37Z
Last Checked At	2026-06-06 19:42:37Z
Last Changed At	2026-05-29 07:00:32Z
Inactive At	—
Source Posted At	2026-04-10 20:27:21Z
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=lever/board=iru/date=2026-06-06/2026-06-06T19-42-36-168Z-db6b499f52013e7499a58980059b17434258927a49b53b0fd04083f8b1f50752.json

Event Fields

{
  "content_hash": "2b149c8578381b1130bf689e742d107ea5202a490fd0b065d7c0b5e68d89d1be",
  "source_hash": "e741b5bb1545f1920ede3d07079fe712519a2cc1a7848800c96d23dbd266bb01",
  "last_changed_at": "2026-05-29T07:00:32.584Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "Miami",
    "city": "Miami",
    "region": "FL",
    "country": "United States",
    "is_remote": false,
    "confidence": 0.75
  },
  "salary_max": null,
  "salary_min": null,
  "inferred_at": "2026-06-06T19:42:37.057Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Miami",
      "city": "Miami",
      "region": "FL",
      "country": "United States",
      "is_remote": false,
      "confidence": 0.75
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": "hybrid",
  "salary_period": null,
  "workplace_type": "hybrid",
  "salary_currency": null
}

Extensions

{}

Native Structured

{
  "lists": [
    {
      "text": "What You Will Do",
      "content": "<div>\n\n<li>Lead and refine the incident lifecycle: detection, triage, communication, mitigation, resolution, and post-incident review.</li>\n<li>Define and maintain severity models, escalation paths, on-call expectations, and runbooks/playbooks—keeping them current and usable under pressure.</li>\n<li>Facilitate blameless postmortems; turn findings into tracked remediations and shared learning that reduces repeat incidents.</li>\n<li>Improve coordination during major incidents: roles, tooling, customer/stakeholder updates, and handoffs.</li>\n<li>Partner with security, support, and product on incident communications and regulatory or contractual obligations where applicable.</li>\n\n</div>"
    },
    {
      "text": "Observability Standardization & SLI/SLO Evangelism",
      "content": "<div>\n\n<li>Establish and maintain organization-wide standards for metrics, logs, and traces in Datadog—including naming conventions, cardinality, retention, and sampling—so teams can instrument consistently and confidently.</li>\n<li>Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams; meet teams where they are—bootstrapping SLI/SLO programs for teams starting from scratch and improving rigor for teams that already have them, with the long-term goal of teams owning their own observability.</li>\n<li>Build and maintain reusable Datadog dashboard templates, monitor templates, and alerting patterns that teams can adopt and adapt—reducing the activation energy for doing observability well.</li>\n<li>Champion golden signals and RED/USE-style alerting philosophies; align alerts with user-impacting symptoms, not just low-level infrastructure noise.</li>\n<li>Partner with the Infrastructure team on observability stack decisions, multi-tenancy, cost controls, and data lifecycle.</li>\n<li>Continuously reduce alert noise through threshold tuning, ownership assignment, and on-call load management.</li>\n\n</div>"
    },
    {
      "text": "Reliability Culture",
      "content": "<div>\n\n<li>Mentor engineers on operational excellence, safe deployment practices, and production readiness; help engineering teams grow their own reliability instincts.</li>\n<li>Contribute to capacity planning, chaos/game-day exercises, and reliability reviews for critical changes.</li>\n<li>Serve as a connective layer between the SRE and Infrastructure teams—aligning on tooling, standards, and shared goals.</li>\n\n</div>"
    },
    {
      "text": "Requirements",
      "content": "<div>\n\n<li><strong>Experience:</strong> 5+ years in SRE, production engineering, or equivalent, including on-call responsibility for customer-facing systems.</li>\n<li><strong>Incidents:</strong> Proven experience running or significantly improving incident response (process, tooling, or both) in a distributed systems environment.</li>\n<li><strong>Observability:</strong> Deep, hands-on experience with Datadog—building dashboards, monitors, and instrumentation standards across multiple teams or services. Experience with metrics, logging, and tracing at scale.</li>\n<li><strong>SLI/SLO Programs:</strong> Demonstrated experience defining SLOs/SLIs and error budget policies in production; comfortable working with teams to codify the metrics their reliability posture is based on.</li>\n<li><strong>Systems:</strong> Strong understanding of Linux, networking, distributed systems failure modes, and cloud or hybrid infrastructure (Kubernetes, load balancers, databases, queues).</li>\n<li><strong>Automation:</strong> Proficiency in at least one of Go, Python, or similar for tooling and automation; comfort with IaC concepts (Terraform or equivalent).</li>\n<li><strong>Communication:</strong> Clear written and verbal communication; ability to facilitate discussions during high-pressure incidents and deliberate postmortems alike.</li>\n<li><strong>Collaboration:</strong> Track record of influencing without direct authority and driving adoption across engineering teams.</li>\n\n</div>"
    },
    {
      "text": "Nice to Have",
      "content": "<div>\n\n<li>Experience with OpenTelemetry or similar vendor-neutral instrumentation strategies.</li>\n<li>Familiarity with PagerDuty, <a href=\"http://Incident.io\">Incident.io</a>, Opsgenie, or similar; Statuspage or equivalent for external communications.</li>\n<li>Experience in a hyper-growth startup environment.</li>\n<li>Experience in regulated or high-compliance environments.</li>\n<li>Contributions to internal developer platforms or shared reliability tooling.</li>\n\n</div>"
    },
    {
      "text": "What Success Looks Like",
      "content": "<div>\n\n<li>Fewer repeated incidents and clearer, actionable postmortem outcomes that teams act on.</li>\n<li>Engineering teams across the org have well-defined SLIs/SLOs they own and actively use to drive reliability decisions.</li>\n<li>A shared Datadog observability layer with consistent signals, templated dashboards, and actionable alerts tied to user impact.</li>\n<li>Engineers know how to instrument, where to look, and how to respond—with sustainable, well-supported on-call.</li>\n\n</div>"
    }
  ],
  "country": "US",
  "createdAt": 1775852841532,
  "updatedAt": null,
  "categories": {
    "team": "Engineering",
    "location": "Miami",
    "commitment": "Full-Time",
    "department": "R&D",
    "allLocations": [
      "Miami"
    ]
  },
  "salaryRange": null,
  "workplaceType": "hybrid"
}
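
The `createdAt` value in the native payload appears to be a Unix timestamp in milliseconds, and it lines up with the "Source Posted At" field in the full job record. The sketch below shows that conversion; the function name `epoch_ms_to_utc` is hypothetical, and the output format is chosen only to match how timestamps are printed in this record.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(ms: int) -> str:
    """Format a Unix timestamp given in milliseconds as 'YYYY-MM-DD HH:MM:SSZ' (UTC)."""
    # Integer division drops the millisecond part, matching the
    # second-level precision used in the job record fields.
    seconds = ms // 1000
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%SZ")


# createdAt from the Native Structured payload above
print(epoch_ms_to_utc(1775852841532))  # prints 2026-04-10 20:27:21Z
```

The result agrees with the "Source Posted At" value of 2026-04-10 20:27:21Z, which suggests that field is derived from `createdAt`.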

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/90a8472b24b304df128e8dd54ec7efed2a630784?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/e40a55cb-6f98-49ac-b5bd-35cb0dc5eedf

GET https://api.bluedoor.sh/job-postings/v1/sources/280f567e-c20d-4c1f-9d80-e5a1aa4d714f

GET https://api.bluedoor.sh/job-postings/v1/jobs/90a8472b24b304df128e8dd54ec7efed2a630784/events
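
As a rough sketch, the four requests can be assembled from the Job ID, Org ID, and Source ID shown in the full job record. The helper name `job_record_urls` is hypothetical. It assumes the paths end at `include=description`, the org ID, the source ID, and `/events` with no further suffix. This page does not show how the API key is passed, so the sketch only builds URLs and sends nothing.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def job_record_urls(job_id: str, org_id: str, source_id: str) -> dict:
    """Build the GET URLs for a job, its org, its source, and its lifecycle events."""
    query = urlencode({"include": "description"})
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?{query}",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


urls = job_record_urls(
    job_id="90a8472b24b304df128e8dd54ec7efed2a630784",
    org_id="e40a55cb-6f98-49ac-b5bd-35cb0dc5eedf",
    source_id="280f567e-c20d-4c1f-9d80-e5a1aa4d714f",
)
for name, url in urls.items():
    print(name, url)
```

Any HTTP client can then issue the requests once the authentication scheme from the API docs is added.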
