A curated collection of the world’s most advanced benchmark datasets for evaluating Large Language Model (LLM) Agents.
Updated: Dec 21, 2025
🧠 Discover and evaluate advanced benchmark datasets for Large Language Model (LLM) agents, supporting performance assessment on real-world tasks.