Working Student (m/f/d) LLM Agent Evaluation & Benchmarking
Agile Robots SE · Germany, Munich (HQ) · Active · Personio
Job facts
| Field | Value |
|---|---|
| Company | Agile Robots SE |
| Title | Working Student (m/f/d) LLM Agent Evaluation & Benchmarking |
| Normalized title | - |
| Department / team | AI Platform / Internships & Working Students |
| Location | Germany, Munich (HQ) |
| Work model | - |
| Employment type | Part Time |
| Salary | - |
| Status | active |
| ATS provider | Personio |
| Posted / first seen | 2026-05-28 / 2026-05-30 |
| Changed / last seen | 2026-05-30 / 2026-06-06 |
Linked records
| Field | Value |
|---|---|
| Company | Agile Robots SE |
| Source | bcb1fbae-6077-4ee3-833d-67baf488bf90 |
| ATS provider | Personio |
Description
About the role
We are looking for a Working Student (m/f/d) LLM Agent Evaluation & Benchmarking. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.
Your Responsibilities
- Harness Development: Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.
- Task Suite Design: Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.
- Model Evaluation: Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.
- Eval Reporting: Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.
- Improvement Feedback: Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.
Essential Skills
- Academic Background: Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.
- Python Engineering: Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.
- Eval Frameworks: Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.
- Agent Concepts: Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.
- Experimental Design: Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.
Beneficial Skills
- Data Analysis: Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.
- Reporting Tools: Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.
- Agent Frameworks: Familiarity with agent orchestration frameworks such as LangChain or LangGraph.
What we offer
- Practical learning opportunities to complement your studies.
- A dynamic high-tech company with financial soundness and world-class investors.
- An interdisciplinary, international team with 60+ nationalities in a collaborative work environment.
- A Corporate Benefits Program that covers health, mobility, and learning with 100 € net per month.
- Modern office facilities with a rooftop terrace overlooking Munich, free drinks and fruit, and regular company events.
Full job record
| Field | Value |
|---|---|
| Job ID | ec6389829cf44e1b4ed891fe2d53c4e8265280d6 |
| Org ID | cbcab16d-d77f-4aae-95e4-f537194009c8 |
| Source ID | bcb1fbae-6077-4ee3-833d-67baf488bf90 |
| Board ID | bcb1fbae-6077-4ee3-833d-67baf488bf90 |
| Provider | personio |
| Provider Job Key | 2650461 |
| Title | Working Student (m/f/d) LLM Agent Evaluation & Benchmarking |
| Normalized Title | — |
| Status | active |
| Active | yes |
| Location Text | Germany, Munich (HQ) |
| Department | AI Platform |
| Team | Internships & Working Students |
| Employment Type | part_time |
| Workplace Type | — |
| Remote Policy | — |
| Country | — |
| Region | — |
| City | — |
| Salary Raw | — |
| Salary Min | — |
| Salary Max | — |
| Salary Currency | — |
| Salary Period | — |
| Source URL | https://agile-robots-se.jobs.personio.de/job/2650461?language=en |
| Apply URL | https://agile-robots-se.jobs.personio.de/job/2650461?language=en |
| First Seen At | 2026-05-30 06:05:02Z |
| Last Seen At | 2026-06-06 07:54:12Z |
| Last Checked At | 2026-06-06 07:54:12Z |
| Last Changed At | 2026-05-30 06:05:02Z |
| Inactive At | — |
| Source Posted At | 2026-05-28 16:48:33Z |
| Source Updated At | — |
| Raw Payload Uri | s3://job-postings-prod-raw-590183727216/raw/provider=personio/board=agile-robots-se.de/date=2026-06-06/2026-06-06T07-54-11-954Z-784a826e6f3211779d98f861e26fa61d9ac5e980c0b6fc37cf918f969809edb8.json |
Event Fields
{
"content_hash": "11d96d7cd68ee275bfea4c7abd164cb40975c5e45ec4134def18b668752d257a",
"source_hash": "e25533344f53c0970e04deab33292957115bd2889965f16108d76c0e66a9deeb",
"last_changed_at": "2026-05-30T06:05:02.379Z",
"active_status": "active"
}
Parsed Structured
{
"language": "en",
"location": {
"raw": "Germany, Munich (HQ)",
"city": null,
"region": null,
"country": null,
"is_remote": false,
"confidence": 0.8
},
"salary_max": null,
"salary_min": null,
"inferred_at": "2026-06-06T07:54:12.965Z",
"launch_scope": {
"reason": "personio_production_catalog",
"included": true,
"location": {
"raw": "Germany, Munich (HQ)",
"city": null,
"region": null,
"country": null,
"is_remote": false,
"confidence": 0.8
},
"countries": []
},
"remote_policy": null,
"salary_period": null,
"workplace_type": null,
"salary_currency": null
}
Extensions
{}
Native Structured
{
"id": "2650461",
"name": "Working Student (m/f/d) LLM Agent Evaluation & Benchmarking",
"office": "Germany, Munich (HQ)",
"keywords": [],
"schedule": "part-time",
"createdAt": "2026-05-28T16:48:33+00:00",
"seniority": "student",
"department": "AI Platform",
"occupation": "general_and_other_it_software",
"subcompany": "Agile Robots SE",
"employmentType": "working_student",
"jobDescriptions": [
{
"name": "About the role",
"value": "<p style=\"font-family:Arial;font-size:14px;\">We are looking for a <strong>Working Student (m/f/d) LLM Agent Evaluation & Benchmarking</strong>. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.</p>"
},
{
"name": "Your Responsibilities",
"value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Harness Development:</strong> Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Task Suite Design:</strong> Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Model Evaluation:</strong> Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Eval Reporting:</strong> Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Improvement Feedback:</strong> Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.</li></ul>"
},
{
"name": "Essential Skills",
"value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Academic Background:</strong> Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Python Engineering:</strong> Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Eval Frameworks:</strong> Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Agent Concepts:</strong> Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Experimental Design:</strong> Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.</li></ul>"
},
{
"name": "Beneficial Skills",
"value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Data Analysis:</strong> Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Reporting Tools:</strong> Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Agent Frameworks:</strong> Familiarity with agent orchestration frameworks such as LangChain or LangGraph.</li></ul>"
},
{
"name": "What we offer",
"value": "<ul style=\"border:0px solid;font-family:Inter, '-apple-system', 'system-ui', 'Segoe UI', Roboto, 'Helvetica Neue', 'Open Sans', 'system-ui', '-apple-system', 'Segoe UI', Roboto, Ubuntu, Cantarell, 'Noto Sans', sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji';margin:8px 0px;padding:0px 0px 0px 24px;color:rgb(66,66,66);font-size:14px;font-style:normal;font-weight:400;text-transform:none;background-color:rgb(255,255,255);\"><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Practical learning opportunities to complement your studies.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Dynamic high-tech company combined with financial soundness and world class investors.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Corporate Benefits Program that covers health, mobility and learning with 100 € net per month.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Modern office facilities with a rooftop terrace overlooking Munich, free drinks & fruits, and regular company events contribute to a good working environment.</li></ul>"
}
],
"occupationCategory": "it_software",
"recruitingCategory": "Internships & Working Students"
}
Get this page with API
Rendered from the bluedoor Job Postings API. Reproduce it:
GET https://api.bluedoor.sh/job-postings/v1/jobs/ec6389829cf44e1b4ed891fe2d53c4e8265280d6?include=description (JSON)
GET https://api.bluedoor.sh/job-postings/v1/orgs/cbcab16d-d77f-4aae-95e4-f537194009c8 (JSON)
GET https://api.bluedoor.sh/job-postings/v1/sources/bcb1fbae-6077-4ee3-833d-67baf488bf90 (JSON)
GET https://api.bluedoor.sh/job-postings/v1/jobs/ec6389829cf44e1b4ed891fe2d53c4e8265280d6/events (JSON)
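The four requests above follow one pattern: a shared base URL plus the job, org, and source identifiers from the full job record. A minimal Python sketch of that pattern is shown here. The function name `build_urls` and the dictionary labels are illustrative assumptions, not part of the API. The sketch only assembles the URLs and sends no requests, because this page does not state whether the API requires authentication.

```python
from urllib.parse import urlencode

# Base URL copied from the requests listed on this page.
BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def build_urls(job_id: str, org_id: str, source_id: str) -> dict[str, str]:
    """Return the four GET URLs listed on this page, keyed by an illustrative label."""
    job_query = urlencode({"include": "description"})
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?{job_query}",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


# Identifiers taken from the "Full job record" table.
urls = build_urls(
    job_id="ec6389829cf44e1b4ed891fe2d53c4e8265280d6",
    org_id="cbcab16d-d77f-4aae-95e4-f537194009c8",
    source_id="bcb1fbae-6077-4ee3-833d-67baf488bf90",
)

for label, url in urls.items():
    print(f"{label}: GET {url}")
```

Any HTTP client can then issue the GET requests. Required headers or API keys, if any, would need to be checked against the provider's own documentation.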