Is your feature request related to a problem? Please describe.
To ensure the efficacy of the Talk2Scholars (T2S) agent in addressing life science queries and performing scientific reasoning, it's essential to benchmark its performance against established open-source datasets. Currently, there's no systematic evaluation in place to assess T2S's capabilities in these domains.
Describe the solution you'd like
Implement a benchmarking framework for T2S that includes:
- Dataset Selection: Identify and curate relevant open-source benchmarks (a loading sketch follows this list), such as:
  - BioASQ-QA: A dataset of biomedical questions with reference answers, reflecting real information needs of biomedical experts.
  - SciQA: A scientific QA benchmark built on the Open Research Knowledge Graph (ORKG), covering questions from various research fields.
  - LAB-Bench: A dataset evaluating language models on practical biology research tasks, including literature search and data analysis.
  - SciKnowEval: A framework assessing large language models across multiple levels of scientific knowledge in domains such as biology and chemistry.
- Evaluation Metrics: Define appropriate metrics, such as accuracy, precision, recall, and F1-score, to assess T2S's performance on these datasets (see the metrics sketch below).
- Benchmarking Pipeline: Develop an automated pipeline that runs T2S on the selected datasets, collects results, and computes the evaluation metrics (see the pipeline sketch below).
- Reporting: Generate comprehensive reports detailing T2S's performance, highlighting strengths and areas for improvement.
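
For the dataset-selection step, one option is a thin loader that normalizes each benchmark into a common question/answer record. This is a minimal sketch assuming a local BioASQ-style JSON dump; the field names (`body`, `ideal_answer`) follow the published BioASQ format, but each of the other benchmarks would need its own adapter:

```python
# Minimal sketch of a dataset-loading layer. The unified QAExample schema
# and the local-JSON layout are assumptions; BioASQ-QA, SciQA, LAB-Bench,
# and SciKnowEval each ship their own format and need their own adapter.
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

@dataclass
class QAExample:
    question: str
    reference_answer: str
    source: str  # which benchmark the example came from

def load_bioasq(path: Path) -> Iterator[QAExample]:
    """Adapter for a BioASQ-style JSON dump (field names per BioASQ releases)."""
    data = json.loads(path.read_text())
    for item in data.get("questions", []):
        yield QAExample(
            question=item["body"],
            # "ideal_answer" is a list of reference answers; take the first.
            reference_answer=item.get("ideal_answer", [""])[0],
            source="BioASQ-QA",
        )
```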
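For the metrics, free-text QA answers are typically scored with exact match plus SQuAD-style token-level precision/recall/F1 rather than classification metrics; for multiple-choice items (e.g. parts of LAB-Bench), plain accuracy over the selected options would be the better fit. A minimal sketch:

```python
# Sketch of QA-style metrics: exact match plus token-overlap
# precision/recall/F1 between a predicted and a reference answer.
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))

def token_f1(pred: str, ref: str) -> dict[str, float]:
    pred_tokens, ref_tokens = normalize(pred), normalize(ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }
```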
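The pipeline and reporting steps could then tie these pieces together. `ask_t2s` below is a hypothetical placeholder for however the T2S agent is actually invoked, and the sketch reuses `QAExample`, `exact_match`, and `token_f1` from the blocks above:

```python
# Sketch of the automated pipeline: run the agent over each example,
# score it, aggregate per-dataset averages, and emit a markdown report.
from pathlib import Path
from statistics import mean

def ask_t2s(question: str) -> str:
    # Hypothetical hook into the T2S agent; replace with the real call.
    raise NotImplementedError

def run_benchmark(examples: list[QAExample]) -> dict[str, float]:
    rows = []
    for ex in examples:
        pred = ask_t2s(ex.question)
        scores = token_f1(pred, ex.reference_answer)
        scores["exact_match"] = exact_match(pred, ex.reference_answer)
        rows.append(scores)
    if not rows:
        return {}
    return {k: mean(r[k] for r in rows) for k in rows[0]}

def write_report(results: dict[str, dict[str, float]], path: str = "report.md") -> None:
    """results maps a dataset name to its averaged metrics."""
    lines = ["# T2S Benchmark Report", "",
             "| Dataset | EM | Precision | Recall | F1 |",
             "|---|---|---|---|---|"]
    for name, m in results.items():
        lines.append(f"| {name} | {m['exact_match']:.3f} | {m['precision']:.3f} "
                     f"| {m['recall']:.3f} | {m['f1']:.3f} |")
    Path(path).write_text("\n".join(lines))
```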
Describe alternatives you've considered
- Manual evaluation of T2S's responses, which is time-consuming and may lack consistency.
- Using proprietary datasets, which may not be accessible or reproducible for the broader community.
Additional context
Benchmarking T2S against these open-source datasets will provide insights into its current capabilities and guide future enhancements. It will also facilitate comparisons with other models and promote transparency in its development.