feat: Benchmark Talk2Scholars (T2S) Agent Using Open-Source Life Science QA and Scientific Reasoning Datasets #194

@ansh-info

Description

Is your feature request related to a problem? Please describe.

To ensure the efficacy of the Talk2Scholars (T2S) agent in addressing life science queries and performing scientific reasoning, it's essential to benchmark its performance against established open-source datasets. Currently, there's no systematic evaluation in place to assess T2S's capabilities in these domains.

Describe the solution you'd like

Implement a benchmarking framework for T2S that includes:

  1. Dataset Selection: Identify and curate relevant open-source benchmarks such as:

    • BioASQ-QA: A dataset containing biomedical questions with reference answers, reflecting real information needs of biomedical experts.
    • SciQA: A scientific QA benchmark leveraging the Open Research Knowledge Graph (ORKG), encompassing questions from various research fields.
    • LAB-Bench: A dataset evaluating language models on practical biology research tasks, including literature search and data analysis.
    • SciKnowEval: A framework assessing large language models across multiple levels of scientific knowledge in domains like biology and chemistry.
  2. Evaluation Metrics: Define appropriate metrics such as accuracy, precision, recall, and F1-score to assess T2S's performance on these datasets.

  3. Benchmarking Pipeline: Develop an automated pipeline to run T2S on the selected datasets, collect results, and compute evaluation metrics.

  4. Reporting: Generate comprehensive reports detailing T2S's performance, highlighting strengths and areas for improvement.
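The evaluation loop implied by steps 2 and 3 could be sketched as below. Note that `query_t2s`, `QAExample`, and the toy dataset are hypothetical stand-ins for illustration only; the real benchmark would call the T2S agent and load examples from BioASQ-QA, SciQA, LAB-Bench, or SciKnowEval:

```python
"""Minimal sketch of a T2S benchmarking loop (assumptions: a yes/no QA
subset and an agent callable; neither reflects the actual T2S API)."""

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAExample:
    question: str
    reference: str  # gold answer, e.g. "yes" / "no"


def query_t2s(question: str) -> str:
    """Hypothetical stand-in for invoking the T2S agent."""
    return "yes" if "cancer" in question.lower() else "no"


def evaluate(
    dataset: List[QAExample],
    predict: Callable[[str], str],
    positive: str = "yes",
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 over a binary QA set."""
    tp = fp = fn = correct = 0
    for ex in dataset:
        pred = predict(ex.question).strip().lower()
        gold = ex.reference.strip().lower()
        if pred == gold:
            correct += 1
        if pred == positive and gold == positive:
            tp += 1
        elif pred == positive and gold != positive:
            fp += 1
        elif pred != positive and gold == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(dataset),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy examples; a real run would iterate over the curated benchmark splits.
dataset = [
    QAExample("Is BRCA1 associated with breast cancer risk?", "yes"),
    QAExample("Is aspirin a monoclonal antibody?", "no"),
    QAExample("Does TP53 act as a tumor suppressor in many cancers?", "yes"),
]
print(evaluate(dataset, query_t2s))
```

For free-form answers (as in BioASQ-QA), exact match would be replaced by metrics such as ROUGE or LLM-as-judge scoring, but the same loop structure applies.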

Describe alternatives you've considered

  • Manual evaluation of T2S's responses, which is time-consuming and may lack consistency.
  • Using proprietary datasets, which may not be accessible or reproducible for the broader community.

Additional context

Benchmarking T2S against these open-source datasets will provide insights into its current capabilities and guide future enhancements. It will also facilitate comparisons with other models and promote transparency in its development.
