feat: Benchmark Talk2Scholars (T2S) Agent Using Open-Source Life Science QA and Scientific Reasoning Datasets #194

@ansh-info

Description

Is your feature request related to a problem? Please describe.

To ensure the efficacy of the Talk2Scholars (T2S) agent in addressing life science queries and performing scientific reasoning, it's essential to benchmark its performance against established open-source datasets. Currently, there's no systematic evaluation in place to assess T2S's capabilities in these domains.

Describe the solution you'd like

Implement a benchmarking framework for T2S that includes:

  1. Dataset Selection: Identify and curate relevant open-source benchmarks such as:

    • BioASQ-QA: A dataset containing biomedical questions with reference answers, reflecting real information needs of biomedical experts.
    • SciQA: A scientific QA benchmark leveraging the Open Research Knowledge Graph (ORKG), encompassing questions from various research fields.
    • LAB-Bench: A dataset evaluating language models on practical biology research tasks, including literature search and data analysis.
    • SciKnowEval: A framework assessing large language models across multiple levels of scientific knowledge in domains like biology and chemistry.
  2. Evaluation Metrics: Define appropriate metrics such as accuracy, precision, recall, and F1-score to assess T2S's performance on these datasets.

  3. Benchmarking Pipeline: Develop an automated pipeline to run T2S on the selected datasets, collect results, and compute evaluation metrics.

  4. Reporting: Generate comprehensive reports detailing T2S's performance, highlighting strengths and areas for improvement.
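The evaluation loop implied by steps 2 and 3 could be sketched as below. Note that `query_t2s`, `QAExample`, and the toy dataset are hypothetical stand-ins for illustration only; the real benchmark would call the T2S agent and load examples from BioASQ-QA, SciQA, LAB-Bench, or SciKnowEval:

```python
"""Minimal sketch of a T2S benchmarking loop (assumptions: a yes/no QA
subset and an agent callable; neither reflects the actual T2S API)."""

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAExample:
    question: str
    reference: str  # gold answer, e.g. "yes" / "no"


def query_t2s(question: str) -> str:
    """Hypothetical stand-in for invoking the T2S agent."""
    return "yes" if "cancer" in question.lower() else "no"


def evaluate(
    dataset: List[QAExample],
    predict: Callable[[str], str],
    positive: str = "yes",
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 over a binary QA set."""
    tp = fp = fn = correct = 0
    for ex in dataset:
        pred = predict(ex.question).strip().lower()
        gold = ex.reference.strip().lower()
        if pred == gold:
            correct += 1
        if pred == positive and gold == positive:
            tp += 1
        elif pred == positive and gold != positive:
            fp += 1
        elif pred != positive and gold == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(dataset),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy examples; a real run would iterate over the curated benchmark splits.
dataset = [
    QAExample("Is BRCA1 associated with breast cancer risk?", "yes"),
    QAExample("Is aspirin a monoclonal antibody?", "no"),
    QAExample("Does TP53 act as a tumor suppressor in many cancers?", "yes"),
]
print(evaluate(dataset, query_t2s))
```

For free-form answers (as in BioASQ-QA), exact match would be replaced by metrics such as ROUGE or LLM-as-judge scoring, but the same loop structure applies.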

Describe alternatives you've considered

  • Manual evaluation of T2S's responses, which is time-consuming and may lack consistency.
  • Using proprietary datasets, which may not be accessible or reproducible for the broader community.

Additional context

Benchmarking T2S against these open-source datasets will provide insights into its current capabilities and guide future enhancements. It will also facilitate comparisons with other models and promote transparency in its development.
