
Working Student (m/f/d) LLM Agent Evaluation & Benchmarking

Agile Robots SE · Germany, Munich (HQ) · Active · Personio

Job facts

Field	Value
Company	Agile Robots SE
Title	Working Student (m/f/d) LLM Agent Evaluation & Benchmarking
Normalized title	-
Department / team	AI Platform / Internships & Working Students
Location	Germany, Munich (HQ)
Work model	-
Employment type	Part Time
Salary	-
Status	active
ATS provider	Personio
Posted / first seen	2026-05-28 / 2026-05-30
Changed / last seen	2026-05-30 / 2026-06-06

Related slices

Page	What it contains
Company jobs	Active postings from Agile Robots SE.
Company breakdowns	Role, location, ATS, and work model facets for this company.
ATS provider jobs	Active postings observed through Personio.
Provider filtered search	The same provider as a filtered job collection.
Department jobs	Active postings in AI Platform.
Lifecycle events	Open, update, close, and reopen events for this posting.
Original posting	Canonical source or apply URL captured from the ATS.

Linked records

Company	Agile Robots SE
Source	bcb1fbae-6077-4ee3-833d-67baf488bf90
ATS provider	Personio

Description

About the role

We are looking for a Working Student (m/f/d) LLM Agent Evaluation & Benchmarking. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.

Your Responsibilities

Harness Development: Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.
Task Suite Design: Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.
Model Evaluation: Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.
Eval Reporting: Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.
Improvement Feedback: Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.

Essential Skills

Academic Background: Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.
Python Engineering: Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.
Eval Frameworks: Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.
Agent Concepts: Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.
Experimental Design: Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.

Beneficial Skills

Data Analysis: Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.
Reporting Tools: Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.
Agent Frameworks: Familiarity with agent orchestration frameworks such as LangChain or LangGraph.

What we offer

Practical learning opportunities to complement your studies.
Dynamic high-tech company combined with financial soundness and world-class investors.
Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.
Corporate Benefits Program that covers health, mobility and learning with 100 € net per month.
Modern office facilities with a rooftop terrace overlooking Munich, free drinks & fruits, and regular company events contribute to a good working environment.

Full job record

Job ID	ec6389829cf44e1b4ed891fe2d53c4e8265280d6
Org ID	cbcab16d-d77f-4aae-95e4-f537194009c8
Source ID	bcb1fbae-6077-4ee3-833d-67baf488bf90
Board ID	bcb1fbae-6077-4ee3-833d-67baf488bf90
Provider	personio
Provider Job Key	2650461
Title	Working Student (m/f/d) LLM Agent Evaluation & Benchmarking
Normalized Title	—
Status	active
Active	yes
Location Text	Germany, Munich (HQ)
Department	AI Platform
Team	Internships & Working Students
Employment Type	part_time
Workplace Type	—
Remote Policy	—
Country	—
Region	—
City	—
Salary Raw	—
Salary Min	—
Salary Max	—
Salary Currency	—
Salary Period	—
Source URL	https://agile-robots-se.jobs.personio.de/job/2650461?language=en
Apply URL	https://agile-robots-se.jobs.personio.de/job/2650461?language=en
First Seen At	2026-05-30 06:05:02Z
Last Seen At	2026-06-06 07:54:12Z
Last Checked At	2026-06-06 07:54:12Z
Last Changed At	2026-05-30 06:05:02Z
Inactive At	—
Source Posted At	2026-05-28 16:48:33Z
Source Updated At	—
Raw Payload Uri	s3://job-postings-prod-raw-590183727216/raw/provider=personio/board=agile-robots-se.de/date=2026-06-06/2026-06-06T07-54-11-954Z-784a826e6f3211779d98f861e26fa61d9ac5e980c0b6fc37cf918f969809edb8.json

Event Fields

{
  "content_hash": "11d96d7cd68ee275bfea4c7abd164cb40975c5e45ec4134def18b668752d257a",
  "source_hash": "e25533344f53c0970e04deab33292957115bd2889965f16108d76c0e66a9deeb",
  "last_changed_at": "2026-05-30T06:05:02.379Z",
  "active_status": "active"
}

Parsed Structured

{
  "language": "en",
  "location": {
    "raw": "Germany, Munich (HQ)",
    "city": null,
    "region": null,
    "country": null,
    "is_remote": false,
    "confidence": 0.8
  },
  "salary_max": null,
  "salary_min": null,
  "inferred_at": "2026-06-06T07:54:12.965Z",
  "launch_scope": {
    "reason": "personio_production_catalog",
    "included": true,
    "location": {
      "raw": "Germany, Munich (HQ)",
      "city": null,
      "region": null,
      "country": null,
      "is_remote": false,
      "confidence": 0.8
    },
    "countries": []
  },
  "remote_policy": null,
  "salary_period": null,
  "workplace_type": null,
  "salary_currency": null
}

Extensions

{}

Native Structured

{
  "id": "2650461",
  "name": "Working Student (m/f/d) LLM Agent Evaluation & Benchmarking",
  "office": "Germany, Munich (HQ)",
  "keywords": [],
  "schedule": "part-time",
  "createdAt": "2026-05-28T16:48:33+00:00",
  "seniority": "student",
  "department": "AI Platform",
  "occupation": "general_and_other_it_software",
  "subcompany": "Agile Robots SE",
  "employmentType": "working_student",
  "jobDescriptions": [
    {
      "name": "About the role",
      "value": "<p style=\"font-family:Arial;font-size:14px;\">We are looking for a <strong>Working Student (m/f/d) LLM Agent Evaluation & Benchmarking</strong>. In this role, you will design and build an agent-agnostic benchmarking harness, run comparative evaluations across frontier and local models, and translate findings into prompt, guard, and tool-schema improvements.</p>"
    },
    {
      "name": "Your Responsibilities",
      "value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Harness Development:</strong> Design and build an agent-agnostic benchmarking harness that executes versioned task suites against frontier and local models with reproducible, version-controlled runs.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Task Suite Design:</strong> Define and maintain evaluation task suites that measure task success, grounding accuracy, latency, and cost across the agent portfolio.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Model Evaluation:</strong> Run periodic head-to-head evaluations across models to produce structured comparisons that support model selection decisions.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Eval Reporting:</strong> Analyze evaluation outputs and produce reports and visualizations that communicate findings clearly to agent owners.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Improvement Feedback:</strong> Translate evaluation findings into concrete changes to prompts, guard logic, and tool schemas in collaboration with agent owners.</li></ul>"
    },
    {
      "name": "Essential Skills",
      "value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Academic Background:</strong> Currently enrolled in a Master's programme in Computer Science, Machine Learning, or Data Science.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Python Engineering:</strong> Ability to write well-structured Python for tooling and automation, including test frameworks such as pytest, dependency management, reproducible execution, and basic CI pipeline configuration.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Eval Frameworks:</strong> Working familiarity with LLM evaluation frameworks such as LangSmith, Ragas, Inspect AI, or lm-evaluation-harness.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Agent Concepts:</strong> Working understanding of LLM APIs (OpenAI, Anthropic, Ollama), prompt structure, and how multi-step agent systems are built and instrumented.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Experimental Design:</strong> Ability to design controlled comparisons, define success metrics, and interpret results across multiple evaluation conditions.</li></ul>"
    },
    {
      "name": "Beneficial Skills",
      "value": "<ul><li style=\"font-family:Arial;font-size:14px;\"><strong>Data Analysis:</strong> Familiarity with statistical comparison methods and data handling using numpy, pandas, and scikit-learn.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Reporting Tools:</strong> Familiarity with data visualization and reporting using tools such as Plotly, Streamlit, or notebooks.</li><li style=\"font-family:Arial;font-size:14px;\"><strong>Agent Frameworks:</strong> Familiarity with agent orchestration frameworks such as LangChain or LangGraph.</li></ul>"
    },
    {
      "name": "What we offer",
      "value": "<ul style=\"border:0px solid;font-family:Inter, '-apple-system', 'system-ui', 'Segoe UI', Roboto, 'Helvetica Neue', 'Open Sans', 'system-ui', '-apple-system', 'Segoe UI', Roboto, Ubuntu, Cantarell, 'Noto Sans', sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji';margin:8px 0px;padding:0px 0px 0px 24px;color:rgb(66,66,66);font-size:14px;font-style:normal;font-weight:400;text-transform:none;background-color:rgb(255,255,255);\"><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Practical learning opportunities to complement your studies.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Dynamic high-tech company combined with financial soundness and world class investors.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Corporate Benefits Program that covers health, mobility and learning with 100 € net per month.</li><li style=\"border:0px solid;font-family:Arial, Helvetica, sans-serif;list-style-type:disc;margin:0px;font-size:14px;\">Modern office facilities with a rooftop terrace overlooking Munich, free drinks & fruits, and regular company events contribute to a good working environment.</li></ul>"
    }
  ],
  "occupationCategory": "it_software",
  "recruitingCategory": "Internships & Working Students"
}

Get this page with API

Rendered from the bluedoor Job Postings API. Reproduce it:

GET https://api.bluedoor.sh/job-postings/v1/jobs/ec6389829cf44e1b4ed891fe2d53c4e8265280d6?include=description

GET https://api.bluedoor.sh/job-postings/v1/orgs/cbcab16d-d77f-4aae-95e4-f537194009c8

GET https://api.bluedoor.sh/job-postings/v1/sources/bcb1fbae-6077-4ee3-833d-67baf488bf90

GET https://api.bluedoor.sh/job-postings/v1/jobs/ec6389829cf44e1b4ed891fe2d53c4e8265280d6/events
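The four requests listed above can be scripted. The following is a minimal Python sketch, not an official client: it builds the endpoint URLs from the Job ID, Org ID, and Source ID in the job record and adds a small fetch helper using only the standard library. The helper names `endpoint_urls` and `fetch_json` are hypothetical, and so is the authentication scheme: this page does not state how the API key is sent, so the Bearer `Authorization` header below is an assumption to check against the API docs.

```python
import json
from typing import Any
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.bluedoor.sh/job-postings/v1"

# Identifiers copied from the "Full job record" section of this page.
JOB_ID = "ec6389829cf44e1b4ed891fe2d53c4e8265280d6"
ORG_ID = "cbcab16d-d77f-4aae-95e4-f537194009c8"
SOURCE_ID = "bcb1fbae-6077-4ee3-833d-67baf488bf90"


def endpoint_urls(job_id: str, org_id: str, source_id: str) -> dict[str, str]:
    """Build the four GET URLs shown on this page for one posting (hypothetical helper)."""
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?" + urlencode({"include": "description"}),
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


def fetch_json(url: str, api_key: str, timeout: float = 10.0) -> Any:
    """Fetch one endpoint and decode its JSON body (hypothetical helper).

    ASSUMPTION: the API key is sent as a Bearer token in the Authorization
    header; the page does not document the auth scheme.
    """
    request = Request(
        url,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Accept": "application/json",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        # json.load accepts the bytes returned by response.read()
        return json.load(response)


if __name__ == "__main__":
    # Print the URLs only; no network call is made without an API key.
    for name, url in endpoint_urls(JOB_ID, ORG_ID, SOURCE_ID).items():
        print(f"{name:7} {url}")
```

Keeping URL construction separate from fetching lets the URLs be checked offline; `fetch_json` is then called once per entry with a real key, for example `fetch_json(endpoint_urls(JOB_ID, ORG_ID, SOURCE_ID)["events"], api_key)`.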
