Staff Site Reliability Engineer (SRE) | Dev Ops Engineer #4770

GRAIL · Menlo Park, CA · On Site · Active · $169,000–$224,000 / year · Lever

Job facts

Field	Value
Company	GRAIL
Title	Staff Site Reliability Engineer (SRE) | Dev Ops Engineer #4770
Normalized title	-
Department / team	Research & Development / Technology
Location	Menlo Park, CA, United States
Work model	On Site
Employment type	Full Time
Salary	$169,000–$224,000 / year
Status	active
ATS provider	Lever
Posted / first seen	2026-04-21 / 2026-05-29
Changed / last seen	2026-05-29 / 2026-06-06

Linked records

Company	GRAIL
Source	0b51bc78-9954-4d3f-b406-840a8771181c
ATS provider	Lever

Description

Our mission is to detect cancer early, when it can be cured. We are working to change the trajectory of cancer mortality and bring stakeholders together to adopt innovative, safe, and effective technologies that can transform cancer care. We are a healthcare company, pioneering new technologies to advance early cancer detection. We have built a multi-disciplinary organization of scientists, engineers, and physicians and we are using the power of next-generation sequencing (NGS), population-scale clinical studies, and state-of-the-art computer science and data science to overcome one of medicine’s greatest challenges. GRAIL is headquartered in the Bay Area of California, with locations in Washington, D.C., North Carolina, and the United Kingdom. It is supported by leading global investors and pharmaceutical, technology, and healthcare companies. For more information, please visit grail.com.

GRAIL is seeking a Staff Site Reliability / DevOps Engineer to lead the reliability, scalability, and security of our cloud-native platform. This role operates at the intersection of infrastructure engineering, platform strategy, and organizational leadership, supporting systems that power large-scale data processing and cutting-edge cancer detection technologies. You will define and drive infrastructure standards across teams, represent reliability and performance in architecture decisions, and build systems that scale well beyond your direct ownership. This is a highly technical, high-impact role combining hands-on engineering with cross-functional influence and mentorship.

Onsite Expectations

You will work on-site full-time at our office located in Menlo Park, California. Beginning in Fall 2026, you will work at our new headquarters in Sunnyvale, California.

The expected, full-time, annual base pay scale for this position is $169K - $224K. Actual base pay will consider skills, experience, and location.

This role may be eligible for other forms of compensation, including an annual bonus and/or incentives, subject to the terms of the applicable plans and Company discretion. This range reflects a good-faith estimate of the range that the Company reasonably expects to pay for the position upon hire; the actual compensation offered may vary depending on factors such as the candidate’s qualifications. Employees in this role are also eligible for GRAIL’s comprehensive and competitive benefits package, offered in accordance with our applicable plans and policies. This package currently includes flexible time-off or vacation; a 401(k) retirement plan with employer match; medical, dental, and vision coverage; and carefully selected mindfulness programs.

GRAIL is an equal employment opportunity employer, and we are committed to building a workplace where every individual can thrive, contribute, and grow. All qualified applicants will receive consideration for employment without regard to race, color, religion, national origin, sex, gender, gender identity, sexual orientation, age, disability, status as a protected veteran, or any other class or characteristic protected by applicable federal, state, and local laws. Additionally, GRAIL will consider for employment qualified applicants with arrest and conviction records in a manner consistent with applicable law and provide reasonable accommodations to qualified individuals with disabilities. Please contact us at [email protected] if you require an accommodation to apply for an open position. GRAIL maintains a drug-free workplace. We welcome job-seekers from all backgrounds to join us!

Responsibilities

- Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
- Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
- Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
- Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
- Establish and evolve observability platforms (metrics, logs, traces) and define SLO/SLI frameworks across teams
- Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
- Optimize infrastructure for cost, performance, and scalability, partnering closely with engineering and finance stakeholders
- Define and enforce DevOps, reliability, and security best practices across the organization
- Partner cross-functionally with engineering, data, QA, security, and IT teams to design resilient systems
- Mentor engineers and contribute to technical leadership through design reviews, standards, and knowledge sharing

These responsibilities summarize the role’s primary responsibilities and are not an exhaustive list. They may change at the company’s discretion.

What Success Looks Like in Your First Year

- Conduct a comprehensive assessment of the current infrastructure, drive infrastructure-as-code adoption to 95%+ across critical systems, and establish clear health and reliability baselines for the Kubernetes platform
- Standardize observability using modern tooling and implement an SLO/SLI framework adopted across multiple product teams, including defined SLAs for critical data systems
- Strengthen security and compliance posture across cloud environments by implementing consistent baselines, launching a compliance-as-code framework, and reducing mean time to resolution (MTTR) for production incidents
- Define, document, and drive adoption of engineering standards, best practices, and operational guidelines across platform and product teams
- Develop and align stakeholders on a forward-looking platform reliability and infrastructure roadmap
- Demonstrate measurable mentorship and technical leadership impact across the engineering organization
- Evaluate and provide recommendations on emerging infrastructure needs, including support for AI/ML and advanced data workloads

Required Qualifications

- BS in Computer Science, Engineering, or related field, or equivalent experience
- 8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
- Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
- Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
- Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
- Hands-on experience with Kubernetes and containerized systems in production environments
- Proficiency in scripting or programming for automation (e.g., Python, Go, Bash, or PowerShell)
- Experience with observability and monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)
- Strong understanding of networking, security, and distributed systems fundamentals
- Experience working in regulated environments and familiarity with frameworks such as ISO 27001, NIST, SOC 2, or HIPAA

Preferred Qualifications

- 10+ years of experience in SRE, DevOps, or infrastructure engineering
- Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
- Familiarity with GitOps practices (e.g., ArgoCD, Flux)
- Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake, BigQuery)
- Experience implementing SLO/SLI frameworks and reliability practices across multiple teams
- Strong background in cloud security, including IAM, zero-trust architecture, and secrets management
- Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
- Exposure to AI/ML or large-scale data infrastructure workloads
- Experience in healthcare, biotech, or other regulated industries
- Relevant cloud or Kubernetes certifications (e.g., AWS DevOps, CKA/CKS, GCP DevOps)

Physical Demands and Working Environment

- Standard office environment with hybrid flexibility
- Participation in on-call rotation and after-hours support for critical systems may be required
- Frequent collaboration with cross-functional and senior stakeholders
- Fast-paced, dynamic environment with emphasis on reliability, scalability, and innovation

Adaptability and Growth Expectation

As the organization evolves, responsibilities may expand or shift to meet business needs. This may include:

- Taking on additional technical or leadership responsibilities
- Participating in cross-functional initiatives and strategic projects
- Adapting to new technologies, tools, and methodologies
- Supporting other teams during periods of high demand

Full job record

Job ID	f80c55125f5447fa631d972637111fbae67033cd
Org ID	da017cf0-d8f4-4983-b827-bd93edb1aeae
Source ID	0b51bc78-9954-4d3f-b406-840a8771181c
Board ID	0b51bc78-9954-4d3f-b406-840a8771181c
Provider	lever
Provider Job Key	d9e5a9c8-bcef-4a73-a602-47d29198c398
Title	Staff Site Reliability Engineer (SRE) | Dev Ops Engineer #4770
Normalized Title	—
Status	active
Active	yes
Location Text	Menlo Park, CA
Department	Research & Development
Team	Technology
Employment Type	Full-Time
Workplace Type	on_site
Remote Policy	—
Country	United States
Region	CA
City	Menlo Park
Salary Raw	base pay scale for this position is $169K - $224K
Salary Min	169,000
Salary Max	224,000
Salary Currency	USD
Salary Period	year
Source URL	https://jobs.lever.co/grailbio/d9e5a9c8-bcef-4a73-a602-47d29198c398
Apply URL	https://jobs.lever.co/grailbio/d9e5a9c8-bcef-4a73-a602-47d29198c398/apply
First Seen At	2026-05-29 07:00:19Z
Last Seen At	2026-06-06 19:00:57Z
Last Checked At	2026-06-06 19:00:57Z
Last Changed At	2026-05-29 07:00:19Z
Inactive At	—
Source Posted At	2026-04-21 23:37:02Z
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=lever/board=grailbio/date=2026-06-06/2026-06-06T19-00-56-560Z-89cc73d7d17ade7023fae023de6f802b45431441e821f0f2b2ef7caedd11f9d1.json
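
The Salary Min and Salary Max fields above are consistent with the Salary Raw text, but the record does not say how they were derived. The following is a minimal sketch of one way such a range could be extracted; `parse_salary_range` and its regular expression are hypothetical and are not taken from the API.

```python
import re
from typing import Optional, Tuple

# Matches "$169K - $224K" as well as "$169,000–$224,000" (hyphen or en dash).
# Assumption: both bounds are written in dollars with an optional K/M suffix.
_RANGE = re.compile(
    r"\$(\d[\d,]*(?:\.\d+)?)([KkMm]?)\b\s*[-–]\s*\$(\d[\d,]*(?:\.\d+)?)([KkMm]?)\b"
)
_SCALE = {"": 1, "k": 1_000, "m": 1_000_000}


def _to_dollars(number: str, suffix: str) -> int:
    """Convert a captured number and optional suffix to whole dollars."""
    return int(round(float(number.replace(",", "")) * _SCALE[suffix.lower()]))


def parse_salary_range(raw: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) in dollars, or None when no range is found."""
    match = _RANGE.search(raw)
    if match is None:
        return None
    low = _to_dollars(match.group(1), match.group(2))
    high = _to_dollars(match.group(3), match.group(4))
    return low, high


if __name__ == "__main__":
    print(parse_salary_range("base pay scale for this position is $169K - $224K"))
```

Returning `None` instead of raising keeps postings without a stated range easy to handle, which seems a reasonable default for scraped text.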

Event Fields

{
  "content_hash": "4a6427bb4e93fa8db0db16b3032d399ff1456876a3f5d841a350d13269cfb08f",
  "source_hash": "139b3b8db1f27f94d03a7c529782c892c994021430b1578ff7a2715b1e52ce58",
  "last_changed_at": "2026-05-29T07:00:19.546Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "Menlo Park, CA",
    "city": "Menlo Park",
    "region": "CA",
    "country": "United States",
    "is_remote": false,
    "confidence": 0.9
  },
  "salary_max": 224000,
  "salary_min": 169000,
  "inferred_at": "2026-06-06T19:00:57.641Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "Menlo Park, CA",
      "city": "Menlo Park",
      "region": "CA",
      "country": "United States",
      "is_remote": false,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": null,
  "salary_period": "year",
  "workplace_type": "on_site",
  "salary_currency": "USD"
}

Extensions

{}

Native Structured

{
  "lists": [
    {
      "text": "Reponsibilities",
      "content": "<div>\n<ul data-start=\"1481\" data-end=\"2638\">\n<li data-section-id=\"yfgj5h\" data-start=\"1481\" data-end=\"1595\">Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure</li>\n<li data-section-id=\"1p2jm5t\" data-start=\"1596\" data-end=\"1712\">Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery</li>\n<li data-section-id=\"2ibzn5\" data-start=\"1713\" data-end=\"1825\">Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible</li>\n<li data-section-id=\"yavgnt\" data-start=\"1826\" data-end=\"1954\">Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management</li>\n<li data-section-id=\"11yjs04\" data-start=\"1955\" data-end=\"2070\">Establish and evolve observability platforms (metrics, logs, traces) and define SLO/SLI frameworks across teams</li>\n<li data-section-id=\"w34y48\" data-start=\"2071\" data-end=\"2186\">Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements</li>\n<li data-section-id=\"wzn25s\" data-start=\"2187\" data-end=\"2315\">Optimize infrastructure for cost, performance, and scalability, partnering closely with engineering and finance stakeholders</li>\n<li data-section-id=\"1rbupar\" data-start=\"2316\" data-end=\"2411\">Define and enforce DevOps, reliability, and security best practices across the organization</li>\n<li data-section-id=\"zhdbsi\" data-start=\"2412\" data-end=\"2521\">Partner cross-functionally with engineering, data, QA, security, and IT teams to design resilient systems</li>\n<li data-section-id=\"1ac0vnp\" data-start=\"2522\" data-end=\"2638\">Mentor engineers and contribute to technical leadership through design reviews, standards, and knowledge sharing</li>\n\n<p data-start=\"2640\" data-end=\"2787\"><em data-start=\"2640\" data-end=\"2787\">These responsibilities summarize the role’s primary responsibilities and are not an exhaustive list. They may change at the company’s discretion.</em></p>\n<h3 data-section-id=\"1valr74\" data-start=\"2794\" data-end=\"2843\"><span role=\"text\"><strong data-start=\"2797\" data-end=\"2843\">What Success Looks Like in Your First Year</strong></span></h3>\n<ul data-start=\"2845\" data-end=\"3947\">\n<li data-section-id=\"8hptjx\" data-start=\"2845\" data-end=\"3066\">Conduct a comprehensive assessment of the current infrastructure, drive infrastructure-as-code adoption to 95%+ across critical systems, and establish clear health and reliability baselines for the Kubernetes platform</li>\n<li data-section-id=\"18bngct\" data-start=\"3067\" data-end=\"3240\">Standardize observability using modern tooling and implement an SLO/SLI framework adopted across multiple product teams, including defined SLAs for critical data systems</li>\n<li data-section-id=\"1449u9s\" data-start=\"3241\" data-end=\"3462\">Strengthen security and compliance posture across cloud environments by implementing consistent baselines, launching a compliance-as-code framework, and reducing mean time to resolution (MTTR) for production incidents</li>\n<li data-section-id=\"1v80089\" data-start=\"3463\" data-end=\"3606\">Define, document, and drive adoption of engineering standards, best practices, and operational guidelines across platform and product teams</li>\n<li data-section-id=\"d8uly2\" data-start=\"3607\" data-end=\"3710\">Develop and align stakeholders on a forward-looking platform reliability and infrastructure roadmap</li>\n<li data-section-id=\"9wa3el\" data-start=\"3711\" data-end=\"3816\">Demonstrate measurable mentorship and technical leadership impact across the engineering organization</li>\n<li data-section-id=\"of3kjk\" data-start=\"3817\" data-end=\"3947\">Evaluate and provide recommendations on emerging infrastructure needs, including support for AI/ML and advanced data workloads</li>\n\n</ul></ul></div>"
    },
    {
      "text": "Required Qualifications",
      "content": "<div>\n<ul data-start=\"3986\" data-end=\"4959\">\n<li data-section-id=\"k7oxl3\" data-start=\"3986\" data-end=\"4069\">BS in Computer Science, Engineering, or related field, or equivalent experience</li>\n<li data-section-id=\"ilmp6d\" data-start=\"4070\" data-end=\"4161\">8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering</li>\n<li data-section-id=\"ga8ep0\" data-start=\"4162\" data-end=\"4252\">Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)</li>\n<li data-section-id=\"scnxka\" data-start=\"4253\" data-end=\"4353\">Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)</li>\n<li data-section-id=\"vkuyuf\" data-start=\"4354\" data-end=\"4451\">Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)</li>\n<li data-section-id=\"14di64g\" data-start=\"4452\" data-end=\"4544\">Hands-on experience with Kubernetes and containerized systems in production environments</li>\n<li data-section-id=\"kahizm\" data-start=\"4545\" data-end=\"4643\">Proficiency in scripting or programming for automation (e.g., Python, Go, Bash, or PowerShell)</li>\n<li data-section-id=\"o1mzaf\" data-start=\"4644\" data-end=\"4750\">Experience with observability and monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)</li>\n<li data-section-id=\"1pjlxxg\" data-start=\"4751\" data-end=\"4837\">Strong understanding of networking, security, and distributed systems fundamentals</li>\n<li data-section-id=\"1mdgv73\" data-start=\"4838\" data-end=\"4959\">Experience working in regulated environments and familiarity with frameworks such as ISO 27001, NIST, SOC 2, or HIPAA</li>\n\n</ul></div>"
    },
    {
      "text": "Preferred Qualifications",
      "content": "<div>\n<ul data-start=\"4999\" data-end=\"5830\">\n<li data-section-id=\"1w2iinm\" data-start=\"4999\" data-end=\"5072\">10+ years of experience in SRE, DevOps, or infrastructure engineering</li>\n<li data-section-id=\"1fqtb0b\" data-start=\"5073\" data-end=\"5161\">Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale</li>\n<li data-section-id=\"1a1a2zh\" data-start=\"5162\" data-end=\"5220\">Familiarity with GitOps practices (e.g., ArgoCD, Flux)</li>\n<li data-section-id=\"1vz2oq6\" data-start=\"5221\" data-end=\"5320\">Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake, BigQuery)</li>\n<li data-section-id=\"1s9elm5\" data-start=\"5321\" data-end=\"5415\">Experience implementing SLO/SLI frameworks and reliability practices across multiple teams</li>\n<li data-section-id=\"wlky7q\" data-start=\"5416\" data-end=\"5519\">Strong background in cloud security, including IAM, zero-trust architecture, and secrets management</li>\n<li data-section-id=\"17ulhci\" data-start=\"5520\" data-end=\"5606\">Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)</li>\n<li data-section-id=\"g6nc5s\" data-start=\"5607\" data-end=\"5673\">Exposure to AI/ML or large-scale data infrastructure workloads</li>\n<li data-section-id=\"10pyhge\" data-start=\"5674\" data-end=\"5742\">Experience in healthcare, biotech, or other regulated industries</li>\n<li data-section-id=\"19ee7kt\" data-start=\"5743\" data-end=\"5830\">Relevant cloud or Kubernetes certifications (e.g., AWS DevOps, CKA/CKS, GCP DevOps)</li>\n\n</ul></div>"
    },
    {
      "text": "Physical Demands and Working Environment ",
      "content": "<div>\n\n<li data-section-id=\"suqz05\" data-start=\"5884\" data-end=\"5939\">Standard office environment with hybrid flexibility</li>\n<li data-section-id=\"v4sn0v\" data-start=\"5940\" data-end=\"6038\">Participation in on-call rotation and after-hours support for critical systems may be required</li>\n<li data-section-id=\"14numxj\" data-start=\"6039\" data-end=\"6111\">Frequent collaboration with cross-functional and senior stakeholders</li>\n<li data-section-id=\"1727gq0\" data-start=\"6112\" data-end=\"6205\">Fast-paced, dynamic environment with emphasis on reliability, scalability, and innovation</li>\n\n</div>"
    },
    {
      "text": "Adaptability and Growth Expectation",
      "content": "<div>\n<h3 data-section-id=\"kj71q7\" data-start=\"6212\" data-end=\"6254\"><strong>As the organization evolves, responsibilities may expand or shift to meet business needs. This may include:</strong></h3>\n<ul data-start=\"6365\" data-end=\"6619\">\n<li data-section-id=\"1k0m9pv\" data-start=\"6365\" data-end=\"6430\">Taking on additional technical or leadership responsibilities</li>\n<li data-section-id=\"1h8ciw8\" data-start=\"6431\" data-end=\"6503\">Participating in cross-functional initiatives and strategic projects</li>\n<li data-section-id=\"1hmqax0\" data-start=\"6504\" data-end=\"6562\">Adapting to new technologies, tools, and methodologies</li>\n<li data-section-id=\"1nazd2w\" data-start=\"6563\" data-end=\"6619\">Supporting other teams during periods of high demand</li>\n\n</ul></div>"
    }
  ],
  "country": "US",
  "createdAt": 1776814622731,
  "updatedAt": null,
  "categories": {
    "team": "Technology",
    "location": "Menlo Park, CA",
    "commitment": "Full-Time",
    "department": "Research & Development",
    "allLocations": [
      "Menlo Park, CA"
    ]
  },
  "salaryRange": null,
  "workplaceType": "onsite"
}
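
The `createdAt` value in the payload above looks like a Unix timestamp in milliseconds. Under that assumption it converts to the same instant as the Source Posted At field in the full job record. This is a small sketch of the conversion; the helper name `epoch_ms_to_record_time` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_record_time(ms: int) -> str:
    """Format epoch milliseconds like the record's timestamps (UTC, seconds)."""
    # timedelta with integer milliseconds avoids float rounding of ms / 1000.
    moment = EPOCH + timedelta(milliseconds=ms)
    return moment.strftime("%Y-%m-%d %H:%M:%SZ")


if __name__ == "__main__":
    print(epoch_ms_to_record_time(1776814622731))  # 2026-04-21 23:37:02Z
```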

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/f80c55125f5447fa631d972637111fbae67033cd?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/da017cf0-d8f4-4983-b827-bd93edb1aeae

GET https://api.bluedoor.sh/job-postings/v1/sources/0b51bc78-9954-4d3f-b406-840a8771181c

GET https://api.bluedoor.sh/job-postings/v1/jobs/f80c55125f5447fa631d972637111fbae67033cd/events
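
The four requests above can be sketched in Python using only the standard library. The paths are built from the Job ID, Org ID, and Source ID shown on this page. Two things are assumptions: that responses are JSON, and how authentication works, which this page does not describe. Any required headers are therefore left to the caller. `record_urls` and `fetch_json` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def record_urls(job_id: str, org_id: str, source_id: str) -> Dict[str, str]:
    """Build the four endpoint URLs for one posting."""
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?include=description",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


def fetch_json(url: str, headers: Optional[Dict[str, str]] = None,
               timeout: float = 10.0) -> Any:
    """GET a URL and decode the body as JSON (assumed response format).

    Pass whatever auth headers the API requires; none are assumed here.
    """
    request = urllib.request.Request(url, headers=headers or {}, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    urls = record_urls(
        job_id="f80c55125f5447fa631d972637111fbae67033cd",
        org_id="da017cf0-d8f4-4983-b827-bd93edb1aeae",
        source_id="0b51bc78-9954-4d3f-b406-840a8771181c",
    )
    for name, url in urls.items():
        print(name, url)
```

Keeping URL construction separate from the network call means the URLs can be checked without making any request.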
