Applied AI Software Engineer

Canvasmedical · San Francisco, CA / Remote · Hybrid · Active · $300,000–$400,000 / year · Lever

Job facts

Field	Value
Company	Canvasmedical
Title	Applied AI Software Engineer
Normalized title	-
Department / team	Engineering / Engineering - AI
Location	San Francisco, CA, United States
Work model	Hybrid / Hybrid
Employment type	Full Time
Salary	$300,000–$400,000 / year
Status	active
ATS provider	Lever
Posted / first seen	2025-06-04 / 2026-05-29
Changed / last seen	2026-05-29 / 2026-06-06

Linked records

Company	Canvasmedical
Source	764803a8-0d55-4d53-a795-3f2466cc4f11
ATS provider	Lever

Description

Canvas Medical is the electronic medical records (EMR) and payments development platform for healthcare. We build modern, elegant front- and back-end tooling to enable new ways for developers and clinicians to collaborate to solve healthcare’s toughest challenges. Canvas is institutionally backed by some of the greatest technology investors in the world (who have funded notable health tech companies such as GoodRx, Oscar Health, and Hims & Hers Health).

The Role

We’re hiring an Applied AI Software Engineer to lead evaluations for agents in development and the post-deployment fleet of agents operating in Canvas to automate work for our customers. You will help develop agents in Canvas using state-of-the-art foundation model inference and fine-tuning APIs along with our server-side SDK. The server-side SDK provides extensive tools and virtually all the context necessary for excellent agent performance. You’ll be responsible for designing and running rigorous evaluation experiments that measure performance, safety, and reliability across a wide variety of clinical, operational, and financial use cases.

This role is ideal for someone with deep experience evaluating LLM-based agents at scale. You’ll create high-fidelity unit evals and end-to-end evaluations, define expert-determined ground truth outcomes, and manage iterations across model variants, prompts, tool use, and context window configurations. Your work will directly inform model selection, fine-tuning, and go/no-go decisions for AI features used in production settings.

You’ll collaborate with product, ML engineering, and clinical informatics teams to ensure that Canvas's AI agents are not only capable, but trustworthy and robust under real-world healthcare constraints. You will also work with technical product marketers and developer advocates to help our developer community and the broader market understand the uniquely differentiated value of agents in Canvas.

Canvas Medical provides equal employment opportunities to all employees and applicants for employment without regard to race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Who You Are

You have extensive hands-on experience evaluating LLM-based systems, including multi-agent architectures and prompt-based pipelines.
You are deeply familiar with foundation model APIs (OpenAI, Claude, Gemini, etc.) and how to systematically benchmark agent performance using those models in applied settings.
You care about correctness and reproducibility and have built or contributed to frameworks for automated evals, annotation pipelines, and experiment tracking.
You bring structure to ambiguity and know how to define “correctness” in complex, nuanced domains.
You are comfortable collaborating across engineering, product, and clinical subject matter experts.
You are not afraid of complexity and are energized by the rigor required in healthcare deployments.

What You’ll Do

Design and execute large-scale evaluation plans for LLM-based agents performing clinical documentation, scheduling, billing, communications, and general workflow automation tasks.
Build end-to-end test harnesses that validate model behavior under different configurations (prompt templates, context sources, tool availability, etc.).
Partner with clinicians to define accurate expected outcomes (gold standard) for performance comparisons in domains of clinical consequence, and partner with subject matter experts in non-clinical domains.
Run and replicate experiments across multiple models, parameters, and interaction types to determine optimal configurations.
Deploy and maintain ongoing sampling for post-deployment governance of agent fleets.
Analyze results and summarize tradeoffs clearly for product and engineering stakeholders, as well as for technical stakeholders among our customers and the broader market.
Take ownership of internal eval tooling and infrastructure, ensuring speed, rigor, and reproducibility.
Identify and recommend candidates for reinforcement fine-tuning or retrieval augmentation based on gaps identified in evals.

What Success Looks Like at 90 Days

An expanded set of robust evaluation suites exists for all major AI features currently in development and in production.
We have well-defined correctness criteria for each workflow and a reliable source of expert-determined outcome objects.
Product and engineering teams have integrated your evaluation tools into their daily workflows.
Evaluation results are clearly documented and reproducible, enabling trust in the performance trajectory.
You have effectively engaged your marketing counterparts to translate your work into key messages to the market and to Canvas customers.

Qualifications

5+ years of experience in applied machine learning or AI engineering, with a focus on evaluation and benchmarking.
Proficiency with foundation model APIs and experience orchestrating complex agent behaviors via prompts or tools.
Experience designing and running high-throughput evaluation pipelines, ideally including human-in-the-loop or expert-labeled benchmarks.
Superlative Python engineering skills and familiarity with experiment management tools and data engineering toolsets in general including, yes, SQL and database management.
Familiarity with clinical or healthcare data is a strong plus.
Experience with reinforcement fine-tuning, model monitoring, or RLHF is a plus.

Research shows that women and other minority groups might avoid applying if they don’t meet 100% of the qualifications. We encourage you to apply even if you don’t meet everything listed in the job posting.

Full job record

Job ID	4f77a96e6b51a66f16c3d5db0c72fdc5cfb8c2b3
Org ID	857ad535-bdee-47cd-ab1b-9a081f41eb91
Source ID	764803a8-0d55-4d53-a795-3f2466cc4f11
Board ID	764803a8-0d55-4d53-a795-3f2466cc4f11
Provider	lever
Provider Job Key	188bcb78-cf6d-4ceb-a6de-f5b0556bc8df
Title	Applied AI Software Engineer
Normalized Title	—
Status	active
Active	yes
Location Text	San Francisco, CA / Remote
Department	Engineering
Team	Engineering - AI
Employment Type	Full Time
Workplace Type	hybrid
Remote Policy	hybrid
Country	United States
Region	CA
City	San Francisco
Salary Raw	USD 300000-400000 per-year-salary
Salary Min	300,000
Salary Max	400,000
Salary Currency	USD
Salary Period	year
Source URL	https://jobs.lever.co/canvasmedical/188bcb78-cf6d-4ceb-a6de-f5b0556bc8df
Apply URL	https://jobs.lever.co/canvasmedical/188bcb78-cf6d-4ceb-a6de-f5b0556bc8df/apply
First Seen At	2026-05-29 06:57:51Z
Last Seen At	2026-06-06 07:56:06Z
Last Checked At	2026-06-06 07:56:06Z
Last Changed At	2026-05-29 06:57:51Z
Inactive At	—
Source Posted At	2025-06-04 04:52:28Z
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=lever/board=canvasmedical/date=2026-06-06/2026-06-06T07-56-05-812Z-40895189d39c7fa816dbe2bf2e1350b9e053383e86c26bc6d66c5398944d912f.json
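
The Salary Raw value in the record above appears to pack the currency, range, and period into one string, which the Salary Min, Salary Max, Salary Currency, and Salary Period rows then split out. A minimal sketch of one way to do that split follows. The function name `parse_salary_raw` and the regular expression are illustrative assumptions based on this single record, not the pipeline's actual parser, and other postings may use shapes this pattern does not cover.

```python
import re

# ASSUMPTION: the raw string looks like "<CUR> <min>-<max> per-<period>-salary",
# as in this record. The pattern is inferred from one example only.
_SALARY_RAW = re.compile(
    r"^(?P<currency>[A-Z]{3}) (?P<min>\d+)-(?P<max>\d+) per-(?P<period>[a-z]+)-salary$"
)


def parse_salary_raw(raw: str):
    """Split a raw salary string into its parts, or return None if it does not match."""
    match = _SALARY_RAW.match(raw.strip())
    if match is None:
        return None
    return {
        "currency": match.group("currency"),
        "min": int(match.group("min")),
        "max": int(match.group("max")),
        "period": match.group("period"),
    }


parsed = parse_salary_raw("USD 300000-400000 per-year-salary")
```

Returning `None` on a mismatch, rather than raising, keeps an unparseable salary from discarding the rest of a record.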

Event Fields

{
  "content_hash": "4359e3912164d0abd727a6d96e2674b84b41fbdb3029994c22ebbaf2c7d77747",
  "source_hash": "e94139edc48953db0c76ab3ea5ed5863dcd1185cde450ab1feb65577f965601b",
  "last_changed_at": "2026-05-29T06:57:51.714Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "San Francisco, CA / Remote",
    "city": "San Francisco",
    "region": "CA",
    "country": "United States",
    "is_remote": true,
    "confidence": 0.9
  },
  "salary_max": 400000,
  "salary_min": 300000,
  "inferred_at": "2026-06-06T07:56:06.007Z",
  "launch_scope": {
    "reason": "english_us_canada",
    "included": true,
    "language": "en",
    "location": {
      "raw": "San Francisco, CA / Remote",
      "city": "San Francisco",
      "region": "CA",
      "country": "United States",
      "is_remote": true,
      "confidence": 0.9
    },
    "countries": [
      "United States"
    ]
  },
  "remote_policy": "hybrid",
  "salary_period": "year",
  "workplace_type": "hybrid",
  "salary_currency": "USD"
}

Extensions

{}

Native Structured

{
  "lists": [
    {
      "text": "Who You Are",
      "content": "<li>You have extensive hands-on experience evaluating LLM-based systems, including multi-agent architectures and prompt-based pipelines.</li><li>You are deeply familiar with foundation model APIs (OpenAI, Claude, Gemini, etc.) and how to systematically benchmark agent performance using those models in applied settings.</li><li>You care about correctness and reproducibility and have built or contributed to frameworks for automated evals, annotation pipelines, and experiment tracking.</li><li>You bring structure to ambiguity and know how to define “correctness” in complex, nuanced domains.</li><li>You are comfortable collaborating across engineering, product, and clinical subject matter experts.</li><li>You are not afraid of complexity and are energized by the rigor required in healthcare deployments.</li>"
    },
    {
      "text": "What You’ll Do",
      "content": "<li>Design and execute large-scale evaluation plans for LLM-based agents performing clinical documentation, scheduling, billing, communications, and general workflow automation tasks.</li><li>Build end-to-end test harnesses that validate model behavior under different configurations (prompt templates, context sources, tool availability, etc.).</li><li>Partner with clinicians to define accurate expected outcomes (gold standard) for performance comparisons in domains of clinical consequence, and partner with other subject matter experts in other non-clinical domains.</li><li>Run and replicate experiments across multiple models, parameters, and interaction types to determine optimal configurations.</li><li>Deploy and maintain ongoing sampling for post-deployment governance of agent fleets.</li><li>Analyze results and summarize tradeoffs in clarity for product and engineering stakeholders, as well as for technical stakeholders among our customers and the broader market.</li><li>Take ownership over internal eval tooling and infrastructure, ensuring speed, rigor, and reproducibility.</li><li>Identify and recommend candidates for reinforcement fine-tuning or retrieval augmentation based on gaps identified in evals.</li>"
    },
    {
      "text": "What Success Looks Like at 90 Days",
      "content": "<li>An expanded set of robust evaluation suites exists for all major AI features currently in development and in production.</li><li>We have well-defined correctness criteria for each workflow and a reliable source of expert-determined outcome objects.</li><li>Product and engineering teams have integrated your evaluation tools into their daily workflows.</li><li>Evaluation results are clearly documented and reproducible, enabling trust in the performance trajectory.</li><li>Your have effectively engaged your marketing counterparts to translate your work into key messages to the market and to Canvas customers.</li>"
    },
    {
      "text": "Qualifications",
      "content": "<li>5+ years of experience in applied machine learning or AI engineering, with a focus on evaluation and benchmarking.</li><li>Proficiency with foundation model APIs and experience orchestrating complex agent behaviors via prompts or tools.</li><li>Experience designing and running high-throughput evaluation pipelines, ideally including human-in-the-loop or expert-labeled benchmarks.</li><li>Superlative Python engineering skills and familiarity with experiment management tools and data engineering toolsets in general including, yes, SQL and database management.</li><li>Familiarity with clinical or healthcare data is a strong plus.</li><li>Experience with reinforcement fine-tuning, model monitoring, or RLHF is a plus.</li><li>Research shows that women and other minority groups might avoid applying if they don’t meet 100% of the qualifications. We encourage you to apply even if you don’t meet everything listed in the job posting.</li>"
    }
  ],
  "country": "US",
  "createdAt": 1749012748108,
  "updatedAt": null,
  "categories": {
    "team": "Engineering - AI",
    "location": "San Francisco, CA / Remote",
    "commitment": "Full Time",
    "department": "Engineering",
    "allLocations": [
      "San Francisco, CA / Remote"
    ]
  },
  "salaryRange": {
    "max": 400000,
    "min": 300000,
    "currency": "USD",
    "interval": "per-year-salary"
  },
  "workplaceType": "hybrid"
}
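
The `createdAt` value in the native payload is a Unix timestamp in milliseconds, and it corresponds to the Source Posted At field in the full job record. A short sketch of the conversion, using only the standard library (the helper name `epoch_ms_to_utc` is hypothetical):

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(ms: int) -> str:
    """Format a millisecond Unix timestamp as 'YYYY-MM-DD HH:MM:SSZ' in UTC."""
    # Integer division drops the millisecond part instead of rounding it.
    seconds = ms // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")


posted_at = epoch_ms_to_utc(1749012748108)  # "2025-06-04 04:52:28Z"
```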

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/4f77a96e6b51a66f16c3d5db0c72fdc5cfb8c2b3?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/857ad535-bdee-47cd-ab1b-9a081f41eb91

GET https://api.bluedoor.sh/job-postings/v1/sources/764803a8-0d55-4d53-a795-3f2466cc4f11

GET https://api.bluedoor.sh/job-postings/v1/jobs/4f77a96e6b51a66f16c3d5db0c72fdc5cfb8c2b3/events
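
As a rough illustration, the four requests could be issued from Python along these lines. The URL shapes mirror the requests above, read as plain paths with `include=description` as the only query parameter. The helper names `build_urls` and `fetch_json` are hypothetical, and the bearer-token `Authorization` header is an assumption: this page does not say how the API key is sent, so check the API docs before relying on it.

```python
import json
import urllib.request

BASE = "https://api.bluedoor.sh/job-postings/v1"


def build_urls(job_id: str, org_id: str, source_id: str) -> dict[str, str]:
    """Return the four record URLs for one posting, keyed by record type."""
    return {
        "job": f"{BASE}/jobs/{job_id}?include=description",
        "org": f"{BASE}/orgs/{org_id}",
        "source": f"{BASE}/sources/{source_id}",
        "events": f"{BASE}/jobs/{job_id}/events",
    }


def fetch_json(url: str, api_key: str) -> dict:
    """GET one URL and decode the JSON body."""
    # ASSUMPTION: bearer-token auth. The real header name and scheme may differ.
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


urls = build_urls(
    job_id="4f77a96e6b51a66f16c3d5db0c72fdc5cfb8c2b3",
    org_id="857ad535-bdee-47cd-ab1b-9a081f41eb91",
    source_id="764803a8-0d55-4d53-a795-3f2466cc4f11",
)
```

Calling `fetch_json(urls["job"], api_key)` with a valid key would then retrieve the job record. No request is made until `fetch_json` is called.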
