bluedoor data · Job Postings API · bluedoor.sh


Working Student (m/f/d) LLM Agent Evaluation & Benchmarking

Agile Robots Se · Germany, Munich (HQ) · Active · Personio

Job facts

| Field | Value |
| --- | --- |
| Company | Agile Robots Se |
| Title | Working Student (m/f/d) LLM Agent Evaluation & Benchmarking |
| Normalized title | - |
| Department / team | AI Platform / Internships & Working Students |
| Location | Germany, Munich (HQ) |
| Work model | - |
| Employment type | Part Time |
| Salary | - |
| Status | active |
| ATS provider | Personio |
| Posted / first seen | 2026-05-28 / 2026-05-30 |
| Changed / last seen | 2026-05-30 / 2026-06-06 |

Related slices

| Page | What it contains |
| --- | --- |
| Company jobs | Active postings from Agile Robots Se. |
| Company breakdowns | Role, location, ATS, and work model facets for this company. |
| ATS provider jobs | Active postings observed through Personio. |
| Provider filtered search | The same provider as a filtered job collection. |
| Department jobs | Active postings in AI Platform. |
| Lifecycle events | Open, update, close, and reopen events for this posting. |
| Original posting | Canonical source or apply URL captured from the ATS. |

Linked records

Company: Agile Robots Se
Source: bcb1fbae-6077-4ee3-833d-67baf488bf90
ATS provider: Personio

Description

About the role

We are looking for a Working Student (m/f/d) LLM Agent Evaluation & Benchmarking. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.

Your Responsibilities

- Harness Development: Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.
- Task Suite Design: Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.
- Model Evaluation: Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.
- Eval Reporting: Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.
- Improvement Feedback: Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.

Essential Skills

- Academic Background: Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.
- Python Engineering: Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.
- Eval Frameworks: Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.
- Agent Concepts: Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.
- Experimental Design: Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.

Beneficial Skills

- Data Analysis: Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.
- Reporting Tools: Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.
- Agent Frameworks: Familiarity with agent orchestration frameworks such as LangChain or LangGraph.

What we offer

- Practical learning opportunities to complement your studies.
- Dynamic high-tech company combined with financial soundness and world class investors.
- Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.
- Corporate Benefits Program that covers health, mobility and learning with 100 € net per month.
- Modern office facilities with a rooftop terrace overlooking Munich, free drinks & fruits, and regular company events contribute to a good working environment.

Full job record

Job ID: ec6389829cf44e1b4ed891fe2d53c4e8265280d6
Org ID: cbcab16d-d77f-4aae-95e4-f537194009c8
Source ID: bcb1fbae-6077-4ee3-833d-67baf488bf90
Board ID: bcb1fbae-6077-4ee3-833d-67baf488bf90
Provider: personio
Provider Job Key: 2650461
Title: Working Student (m/f/d) LLM Agent Evaluation & Benchmarking
Normalized Title:
Status: active
Active: yes
Location Text: Germany, Munich (HQ)
Department: AI Platform
Team: Internships & Working Students
Employment Type: part_time
Workplace Type:
Remote Policy:
Country:
Region:
City:
Salary Raw:
Salary Min:
Salary Max:
Salary Currency:
Salary Period:
Source URL: https://agile-robots-se.jobs.personio.de/job/2650461?language=en
Apply URL: https://agile-robots-se.jobs.personio.de/job/2650461?language=en
First Seen At: 2026-05-30 06:05:02Z
Last Seen At: 2026-06-06 07:54:12Z
Last Checked At: 2026-06-06 07:54:12Z
Last Changed At: 2026-05-30 06:05:02Z
Inactive At:
Source Posted At: 2026-05-28 16:48:33Z
Source Updated At:
Raw Payload Uri: s3://job-postings-prod-raw-590183727216/raw/provider=personio/board=agile-robots-se.de/date=2026-06-06/2026-06-06T07-54-11-954Z-784a826e6f3211779d98f861e26fa61d9ac5e980c0b6fc37cf918f969809edb8.json
Event Fields
{
  "content_hash": "11d96d7cd68ee275bfea4c7abd164cb40975c5e45ec4134def18b668752d257a",
  "source_hash": "e25533344f53c0970e04deab33292957115bd2889965f16108d76c0e66a9deeb",
  "last_changed_at": "2026-05-30T06:05:02.379Z",
  "active_status": "active"
}
Parsed Structured
{
  "language": "en",
  "location": {
    "raw": "Germany, Munich (HQ)",
    "city": null,
    "region": null,
    "country": null,
    "is_remote": false,
    "confidence": 0.8
  },
  "salary_max": null,
  "salary_min": null,
  "inferred_at": "2026-06-06T07:54:12.965Z",
  "launch_scope": {
    "reason": "personio_production_catalog",
    "included": true,
    "location": {
      "raw": "Germany, Munich (HQ)",
      "city": null,
      "region": null,
      "country": null,
      "is_remote": false,
      "confidence": 0.8
    },
    "countries": []
  },
  "remote_policy": null,
  "salary_period": null,
  "workplace_type": null,
  "salary_currency": null
}
Extensions
{}
Native Structured
{
  "id": "2650461",
  "name": "Working Student (m/f/d) LLM Agent Evaluation & Benchmarking",
  "office": "Germany, Munich (HQ)",
  "keywords": [],
  "schedule": "part-time",
  "createdAt": "2026-05-28T16:48:33+00:00",
  "seniority": "student",
  "department": "AI Platform",
  "occupation": "general_and_other_it_software",
  "subcompany": "Agile Robots SE",
  "employmentType": "working_student",
  "jobDescriptions": [
    {
      "name": "About the role",
      "value": "<p style=\"font-family:Arial;font-size:14px;\">We are looking for a <strong>Working Student (m/f/d) LLM Agent Evaluation & Benchmarking</strong>. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.</p>"
    },
    {
      "name": "Your Responsibilities",
      "value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Harness Development:</strong> Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Task Suite Design:</strong> Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Model Evaluation:</strong> Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Eval Reporting:</strong> Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Improvement Feedback:</strong> Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.</li></ul>"
    },
    {
      "name": "Essential Skills",
      "value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Academic Background:</strong> Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Python Engineering:</strong> Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Eval Frameworks:</strong> Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Agent Concepts:</strong> Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Experimental Design:</strong> Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.</li></ul>"
    },
    {
      "name": "Beneficial Skills",
      "value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Data Analysis:</strong> Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Reporting Tools:</strong> Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Agent Frameworks:</strong> Familiarity with agent orchestration frameworks such as LangChain or LangGraph.</li></ul>"
    },
    {
      "name": "What we offer",
      "value": "<ul style=\"border:0px solid;font-family:Inter, '-apple-system', 'system-ui', 'Segoe UI', Roboto, 'Helvetica Neue', 'Open Sans', 'system-ui', '-apple-system', 'Segoe UI', Roboto, Ubuntu, Cantarell, 'Noto Sans', sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji';margin:8px 0px;padding:0px 0px 0px 24px;color:rgb(66,66,66);font-size:14px;font-style:normal;font-weight:400;text-transform:none;background-color:rgb(255,255,255);\"><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Practical learning opportunities to complement your studies.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Dynamic high-tech company combined with financial soundness and world class investors.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Corporate Benefits Program that covers health, mobility and learning with 100 € net per month.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Modern office facilities with a rooftop terrace overlooking Munich, free drinks & fruits, and regular company events contribute to a good working environment.</li></ul>"
    }
  ],
  "occupationCategory": "it_software",
  "recruitingCategory": "Internships & Working Students"
}
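The `jobDescriptions` values in the native payload are HTML fragments, while the Description section of this page is plain text. A minimal sketch of one way to flatten such a fragment into text lines, using only the Python standard library, might look like the following. The names `TextExtractor` and `section_to_lines` are hypothetical, and this is not necessarily how bluedoor derives its own description text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of each <p> or <li> element as one line (hypothetical helper)."""

    BLOCK_TAGS = ("p", "li")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # A new block element starts a new output line.
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Inline tags such as <strong> are ignored; only their text is kept.
        self._buf.append(data)

    def flush(self):
        # Join buffered chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.items.append(text)
        self._buf = []


def section_to_lines(html):
    """Return one plain-text line per <p>/<li> in an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text outside a block element
    return parser.items


# Two list items copied from the "Beneficial Skills" value above.
SAMPLE = (
    '<ul><li style="font-family:Arial;font-size:14px;"><strong>Reporting Tools:</strong> '
    "Familiarity with data visualization and reporting using tools such as Plotly, "
    "Streamlit, or notebooks.</li>"
    '<li style="font-family:Arial;font-size:14px;"><strong>Agent Frameworks:</strong> '
    "Familiarity with agent orchestration frameworks such as LangChain or LangGraph.</li></ul>"
)

for line in section_to_lines(SAMPLE):
    print(line)
```

Applied to each entry of `jobDescriptions`, the entry's `name` could serve as the section heading and the returned lines as its body.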
Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/ec6389829cf44e1b4ed891fe2d53c4e8265280d6?include=description
GET https://api.bluedoor.sh/job-postings/v1/orgs/cbcab16d-d77f-4aae-95e4-f537194009c8
GET https://api.bluedoor.sh/job-postings/v1/sources/bcb1fbae-6077-4ee3-833d-67baf488bf90
GET https://api.bluedoor.sh/job-postings/v1/jobs/ec6389829cf44e1b4ed891fe2d53c4e8265280d6/events
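The four requests above could be issued from a script roughly as follows. This is a sketch under assumptions: the page does not say whether the API needs authentication or what the response bodies look like, so `fetch_json` takes optional headers and simply assumes a JSON response. The helper names `record_urls` and `fetch_json` are hypothetical; only the URL paths and IDs come from this page.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE = "https://api.bluedoor.sh/job-postings/v1"

# IDs copied from the "Full job record" section of this page.
JOB_ID = "ec6389829cf44e1b4ed891fe2d53c4e8265280d6"
ORG_ID = "cbcab16d-d77f-4aae-95e4-f537194009c8"
SOURCE_ID = "bcb1fbae-6077-4ee3-833d-67baf488bf90"


def record_urls(job_id, org_id, source_id):
    """Build the four GET URLs listed on this page for one posting."""
    query = urlencode({"include": "description"})
    return [
        f"{BASE}/jobs/{job_id}?{query}",
        f"{BASE}/orgs/{org_id}",
        f"{BASE}/sources/{source_id}",
        f"{BASE}/jobs/{job_id}/events",
    ]


def fetch_json(url, headers=None, timeout=10.0):
    """GET a URL and decode the body as JSON.

    Assumption: the endpoint returns JSON. Any authentication header the
    API may require is not documented on this page and must be passed in
    by the caller via `headers`.
    """
    request = urllib.request.Request(url, headers=headers or {}, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


URLS = record_urls(JOB_ID, ORG_ID, SOURCE_ID)
for url in URLS:
    print("GET", url)
```

Calling `fetch_json(URLS[0])` would then retrieve the job record with its description, provided the endpoint is reachable without credentials or the right headers are supplied.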