
Topic/prompt evaluation #15

Open

muhammadzaman12 wants to merge 20 commits into msrmain from Topic/PromptEvaluation

Conversation

muhammadzaman12 (Collaborator) commented Jan 19, 2024

Task: prompt the AI with questions from the provided prompt_libray file and use the AI to evaluate the generated responses.

What's been done:

  • Iterated over the questions and prompted the AI
  • Recorded each response and its response time
  • Evaluated each response with a follow-up prompt
  • Evaluated and compared two responses
  • Recorded the question, response, evaluation, and response time in an output file (a rough sketch of this flow appears after the note below)

TODO:

  • Experiment with additional AI-assisted evaluation techniques (in progress)

Note:
This pull request includes a Jupyter Notebook file and a functions.py file. The primary logic is implemented in the functions.py file, while the Jupyter Notebook executes and showcases the functionality.
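
For orientation, a rough sketch of the flow described above (not the actual functions.py implementation; process_prompt matches the helper referenced later in this PR, while evaluate_answer and the output columns are placeholders):

```python
# Rough sketch of the evaluation loop described in this PR, not the actual
# functions.py implementation. `process_prompt` mirrors the helper referenced
# elsewhere in this PR; `evaluate_answer` and the output columns are placeholders.
import pandas as pd

def run_evaluation(questions, process_prompt, evaluate_answer,
                   output_path="evaluation_output.csv"):
    results = []
    for question in questions:
        # Prompt the AI and record the answer plus the measured response time.
        answer, response_time, _ = process_prompt(question)
        # Evaluate the generated answer with a follow-up prompt.
        evaluation = evaluate_answer(question, answer)
        results.append({
            "Question": question,
            "Answer": answer,
            "Response Time (s)": response_time,
            "Evaluation": evaluation,
        })
    # Record question, answer, evaluation, and response time in an output file.
    df = pd.DataFrame(results)
    df.to_csv(output_path, index=False)
    return df
```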

muhammadzaman12 removed the request for review from ciesko, January 22, 2024 22:38
muhammadzaman12 marked this pull request as ready for review, January 22, 2024 23:12
LandMagic's comment was marked as resolved (it is quoted in the reply below).

muhammadzaman12 (Collaborator Author) replied:

> I'm getting errors when I try to run the notebook locally at this step:
>
> Process each row and store the results in a new DataFrame
>
> The error is: Error Question 0 processed unsuccessfully: Error in making the API request.
>
> I get that for each question. I also tried it with the app running in the background. The prior steps all process successfully. Do I need to do something different for testing?

I should probably include a readme file. This does need the backend running locally; it uses http://127.0.0.1:5000 as the default base URL, or you can specify a URL in the env file as BASE_URL. Can you confirm whether the backend is receiving a POST request at the /conversation endpoint?
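
For context, a minimal sketch of the client side of that call, assuming a request payload shape (only the default http://127.0.0.1:5000, the BASE_URL env variable, and the /conversation endpoint are taken from the comment above):

```python
# Sketch of how the base URL is resolved and the backend is called.
# BASE_URL and the /conversation endpoint come from the discussion above;
# the payload shape shown here is an assumption, not the repo's actual schema.
import os
import time
import requests

BASE_URL = os.getenv("BASE_URL", "http://127.0.0.1:5000")

def post_conversation(question: str, timeout: float = 120.0):
    start = time.time()
    response = requests.post(
        f"{BASE_URL}/conversation",
        json={"messages": [{"role": "user", "content": question}]},  # assumed payload
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json(), time.time() - start
```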

LandMagic's comment was marked as resolved (it is quoted in the reply below).

muhammadzaman12 (Collaborator Author) commented Jan 23, 2024

> I should probably include a readme file. This does need the backend running locally; it uses http://127.0.0.1:5000 as the default base URL, or you can specify a URL in the env file as BASE_URL. Can you confirm whether the backend is receiving a POST request at the /conversation endpoint?

> With the application running, I get: INFO:werkzeug:127.0.0.1 - - [23/Jan/2024 13:28:01] "POST /conversation HTTP/1.1" 200 -
>
> However, I'm getting this error as well:
>
>     Debugging middleware caught exception in streamed response at a point where response headers were already sent.
>     Traceback (most recent call last):
>       File "C:\Users\ccohn\source\repos\msrchat\backend\orchestrators\Orchestrator.py", line 398, in stream_with_data
>         response["model"] = lineJson["model"]
>                             ^^^^^^^^^^^^^^^^^
>     KeyError: 'model'
>
> Do I need to use a certain endpoint/model/preview version, something something?

Ahh, I completely forgot: for this to work, the value of AZURE_OPENAI_STREAM should be set to False in the project's env file.
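
For anyone else setting this up locally, the two env settings mentioned in this thread would look roughly like this (the rest of the project's env file is omitted, and the values shown are just the local defaults discussed here):

```
# Local-run settings mentioned in this thread (other variables omitted).
BASE_URL=http://127.0.0.1:5000
AZURE_OPENAI_STREAM=False
```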

LandMagic's comment was marked as a duplicate.

"metadata": {},
"outputs": [],
"source": [
"comparison_output_dataset = pd.DataFrame(results, columns=['Question', 'Answer 1', 'Answer 2', 'Response Time 1 (s)', 'Response Time 2 (s)', 'Evaluation',])"
lakshmicas (Collaborator) commented Jan 24, 2024

What is the expectation from this comparison? There is no change in the system between the first run and this one, right? So what does this result tell us?

muhammadzaman12 (Collaborator Author) replied Jan 24, 2024

I ran with the assumption that each prompt/request would generate a different response in prod. I will, however, restructure the script to run comparisons against previous runs of the script rather than within the same session/run.

bscheurm (Collaborator) replied Jan 24, 2024

Same session/run isn't necessarily a concern. The concern is that the AI pipeline/orchestrator is the same and would return essentially the same answer each time, making the comparisons less useful.

Review thread on this code:

        prompt (str): The user prompt to be processed and compared.
    """
    try:
        answer_1, response_time_1, _ = process_prompt(prompt)

A collaborator commented:

This appears to be running the same AI pipeline/orchestrator twice on the same input prompt. And since the temperature is 0, the responses will most likely be the same or very close. Is there a plan to make the orchestrator configurable so we can compare the results of changing the system prompt and/or search strategy, etc.?

muhammadzaman12 (Collaborator Author) replied:

Will rewrite the script to do the comparison against answers from the previous run rather than within the same run/session.

A collaborator replied:

It will still be the same AI orchestrator/pipeline though, so the answers to compare will be very close if not identical. The real value in the comparison is to be able to evaluate against differences in the pipeline itself, e.g., a different prompt, a different search strategy, parameter tweaks, etc.
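
For reference, a rough sketch of the cross-run comparison proposed in this thread, assuming the current and previous runs are saved as CSVs that share a 'Question' column (file and column names are placeholders; as the reviewers note, this only becomes meaningful once the two runs come from different pipeline configurations):

```python
# Sketch of comparing a current run against a saved previous run. File and
# column names are placeholders; as the reviewers point out, the comparison
# is only useful when the two runs used different pipeline configurations.
import pandas as pd

def compare_runs(current_path="evaluation_output.csv",
                 previous_path="previous_evaluation_output.csv"):
    current = pd.read_csv(current_path)
    previous = pd.read_csv(previous_path)
    merged = current.merge(previous, on="Question",
                           suffixes=(" (current)", " (previous)"))
    # The follow-up evaluation prompt would compare the paired answers here;
    # this sketch only lines the two runs up side by side.
    return merged[["Question", "Answer (current)", "Answer (previous)"]]
```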

LandMagic (Collaborator) commented:

Just a couple of notes -

  1. I was able to run this, but it took close to 20 minutes to evaluate the 5 questions through all the steps. That might be ok for dev, but do we want this on production code? (even if it's 4 minutes per prompt, that's kind of long)

  2. If you run the notebook more than once, it does not overwrite existing files and throws an error. That might not matter, it's easy to update those file names, but I'm not sure what the expectation is and if the user should be able to run this multiple times.

muhammadzaman12 (Collaborator Author) commented Jan 24, 2024

> Just a couple of notes -
>
>   1. I was able to run this, but it took close to 20 minutes to evaluate the 5 questions through all the steps. That might be ok for dev, but do we want this on production code? (even if it's 4 minutes per prompt, that's kind of long)
>   2. If you run the notebook more than once, it does not overwrite existing files and throws an error. That might not matter, it's easy to update those file names, but I'm not sure what the expectation is and if the user should be able to run this multiple times.
  1. A single POST request takes upwards of a minute for me when running the project locally; I tried hitting it through Postman and got the same result. I'm not sure optimizing response times is within the scope of my task; I was under the assumption that would change in prod.
  2. That only happens if you have the output files open when re-running the notebook (a possible workaround is sketched below).
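
One possible way to avoid that collision entirely (a suggestion, not what the notebook currently does) is to write each run to a timestamped file:

```python
# Sketch: write each run's results to a timestamped file so re-running the
# notebook never collides with an existing (or open) output file. This is a
# suggestion, not the notebook's current behavior.
from datetime import datetime
import pandas as pd

def save_results(df: pd.DataFrame, prefix: str = "evaluation_output") -> str:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{prefix}_{timestamp}.csv"
    df.to_csv(path, index=False)
    return path
```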

LandMagic self-assigned this Jan 25, 2024
raise Exception(f"Error in parsing the API response.\n{e}")


def paraphrase_question(prompt: str):
A collaborator commented:

What is the purpose for paraphrasing?

A collaborator replied:

The reason for the question is that the Azure OpenAI endpoints used by the main app (and by extension the eval scripts) already do query rewriting.

