
Topic/prompt evaluation #15

Open

muhammadzaman12 wants to merge 20 commits into msrmain from Topic/PromptEvaluation

Conversation

muhammadzaman12 (Collaborator) commented Jan 19, 2024

Task: prompt the AI with questions from the provided prompt_libray file and use the AI to evaluate the generated responses.

What's been done:

  • Iterated over the questions and prompted the AI
  • Recorded each response and its response time
  • Evaluated each response with a follow-up prompt
  • Evaluated and compared two responses
  • Recorded the question, response, evaluation, and response time in an output file (a rough sketch of this flow appears after the note below)

TODO:

  • Experiment with additional AI-assisted evaluation techniques (in progress)

Note:
This pull request includes a Jupyter Notebook file and a functions.py file. The primary logic is implemented in the functions.py file, while the Jupyter Notebook executes and showcases the functionality.
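
For orientation, a rough sketch of the flow described above (not the actual functions.py implementation; process_prompt matches the helper referenced later in this PR, while evaluate_answer and the output columns are placeholders):

```python
# Rough sketch of the evaluation loop described in this PR, not the actual
# functions.py implementation. `process_prompt` mirrors the helper referenced
# elsewhere in this PR; `evaluate_answer` and the output columns are placeholders.
import pandas as pd

def run_evaluation(questions, process_prompt, evaluate_answer,
                   output_path="evaluation_output.csv"):
    results = []
    for question in questions:
        # Prompt the AI and record the answer plus the measured response time.
        answer, response_time, _ = process_prompt(question)
        # Evaluate the generated answer with a follow-up prompt.
        evaluation = evaluate_answer(question, answer)
        results.append({
            "Question": question,
            "Answer": answer,
            "Response Time (s)": response_time,
            "Evaluation": evaluation,
        })
    # Record question, answer, evaluation, and response time in an output file.
    df = pd.DataFrame(results)
    df.to_csv(output_path, index=False)
    return df
```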

muhammadzaman12 removed the request for review from ciesko, January 22, 2024 22:38
muhammadzaman12 marked this pull request as ready for review, January 22, 2024 23:12
LandMagic's comment was marked as resolved (it is quoted in the reply below).

muhammadzaman12 (Collaborator Author) replied:

> I'm getting errors when I try to run the notebook locally at this step:
>
> Process each row and store the results in a new DataFrame
>
> The error is: Error Question 0 processed unsuccessfully: Error in making the API request.
>
> I get that for each question. I also tried it with the app running in the background. The prior steps all process successfully. Do I need to do something different for testing?

I should probably include a readme file. This does need the backend running locally; it uses http://127.0.0.1:5000 as the default base URL, or you can specify a URL in the env file as BASE_URL. Can you confirm whether the backend is receiving a POST request at the /conversation endpoint?
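
For context, a minimal sketch of the client side of that call, assuming a request payload shape (only the default http://127.0.0.1:5000, the BASE_URL env variable, and the /conversation endpoint are taken from the comment above):

```python
# Sketch of how the base URL is resolved and the backend is called.
# BASE_URL and the /conversation endpoint come from the discussion above;
# the payload shape shown here is an assumption, not the repo's actual schema.
import os
import time
import requests

BASE_URL = os.getenv("BASE_URL", "http://127.0.0.1:5000")

def post_conversation(question: str, timeout: float = 120.0):
    start = time.time()
    response = requests.post(
        f"{BASE_URL}/conversation",
        json={"messages": [{"role": "user", "content": question}]},  # assumed payload
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json(), time.time() - start
```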

LandMagic's comment was marked as resolved (it is quoted in the reply below).

muhammadzaman12 (Collaborator Author) commented Jan 23, 2024

> I should probably include a readme file. This does need the backend running locally; it uses http://127.0.0.1:5000 as the default base URL, or you can specify a URL in the env file as BASE_URL. Can you confirm whether the backend is receiving a POST request at the /conversation endpoint?

> With the application running, I get: INFO:werkzeug:127.0.0.1 - - [23/Jan/2024 13:28:01] "POST /conversation HTTP/1.1" 200 -
>
> However, I'm getting this error as well:
>
>     Debugging middleware caught exception in streamed response at a point where response headers were already sent.
>     Traceback (most recent call last):
>       File "C:\Users\ccohn\source\repos\msrchat\backend\orchestrators\Orchestrator.py", line 398, in stream_with_data
>         response["model"] = lineJson["model"]
>                             ^^^^^^^^^^^^^^^^^
>     KeyError: 'model'
>
> Do I need to use a certain endpoint/model/preview version, something something?

Ahh, I completely forgot: for this to work, the value of AZURE_OPENAI_STREAM should be set to False in the project's env file.
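
For anyone else setting this up locally, the two env settings mentioned in this thread would look roughly like this (the rest of the project's env file is omitted, and the values shown are just the local defaults discussed here):

```
# Local-run settings mentioned in this thread (other variables omitted).
BASE_URL=http://127.0.0.1:5000
AZURE_OPENAI_STREAM=False
```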

LandMagic's comment was marked as a duplicate.

"metadata": {},
"outputs": [],
"source": [
"comparison_output_dataset = pd.DataFrame(results, columns=['Question', 'Answer 1', 'Answer 2', 'Response Time 1 (s)', 'Response Time 2 (s)', 'Evaluation',])"
lakshmicas (Collaborator) commented Jan 24, 2024

What is the expectation from this comparison? There is no change in the system between the first run and this one, right? So what does this result tell us?

muhammadzaman12 (Collaborator Author) replied Jan 24, 2024

I ran with the assumption that each prompt/request would generate a different response in prod. I will, however, restructure the script to run comparisons against previous runs of the script rather than within the same session/run.

bscheurm (Collaborator) replied Jan 24, 2024

Same session/run isn't necessarily a concern. The concern is that the AI pipeline/orchestrator is the same and would return essentially the same answer each time, making the comparisons less useful.

Review thread on this code:

        prompt (str): The user prompt to be processed and compared.
    """
    try:
        answer_1, response_time_1, _ = process_prompt(prompt)

A collaborator commented:

This appears to be running the same AI pipeline/orchestrator twice on the same input prompt. And since the temperature is 0, the responses will most likely be the same or very close. Is there a plan to make the orchestrator configurable so we can compare the results of changing the system prompt and/or search strategy, etc.?

muhammadzaman12 (Collaborator Author) replied:

Will rewrite the script to do the comparison against answers from the previous run rather than within the same run/session.

A collaborator replied:

It will still be the same AI orchestrator/pipeline though, so the answers to compare will be very close if not identical. The real value in the comparison is to be able to evaluate against differences in the pipeline itself, e.g., a different prompt, a different search strategy, parameter tweaks, etc.
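
For reference, a rough sketch of the cross-run comparison proposed in this thread, assuming the current and previous runs are saved as CSVs that share a 'Question' column (file and column names are placeholders; as the reviewers note, this only becomes meaningful once the two runs come from different pipeline configurations):

```python
# Sketch of comparing a current run against a saved previous run. File and
# column names are placeholders; as the reviewers point out, the comparison
# is only useful when the two runs used different pipeline configurations.
import pandas as pd

def compare_runs(current_path="evaluation_output.csv",
                 previous_path="previous_evaluation_output.csv"):
    current = pd.read_csv(current_path)
    previous = pd.read_csv(previous_path)
    merged = current.merge(previous, on="Question",
                           suffixes=(" (current)", " (previous)"))
    # The follow-up evaluation prompt would compare the paired answers here;
    # this sketch only lines the two runs up side by side.
    return merged[["Question", "Answer (current)", "Answer (previous)"]]
```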

LandMagic (Collaborator) commented:

Just a couple of notes -

  1. I was able to run this, but it took close to 20 minutes to evaluate the 5 questions through all the steps. That might be ok for dev, but do we want this on production code? (even if it's 4 minutes per prompt, that's kind of long)

  2. If you run the notebook more than once, it does not overwrite existing files and throws an error. That might not matter, it's easy to update those file names, but I'm not sure what the expectation is and if the user should be able to run this multiple times.

muhammadzaman12 (Collaborator Author) commented Jan 24, 2024

> Just a couple of notes -
>
>   1. I was able to run this, but it took close to 20 minutes to evaluate the 5 questions through all the steps. That might be ok for dev, but do we want this on production code? (even if it's 4 minutes per prompt, that's kind of long)
>   2. If you run the notebook more than once, it does not overwrite existing files and throws an error. That might not matter, it's easy to update those file names, but I'm not sure what the expectation is and if the user should be able to run this multiple times.
  1. A single POST request takes upwards of a minute for me when running the project locally; I tried hitting it through Postman and got the same result. I'm not sure optimizing response times is within the scope of my task; I was under the assumption that would change in prod.
  2. That only happens if you have the output files open when re-running the notebook (a possible workaround is sketched below).
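
One possible way to avoid that collision entirely (a suggestion, not what the notebook currently does) is to write each run to a timestamped file:

```python
# Sketch: write each run's results to a timestamped file so re-running the
# notebook never collides with an existing (or open) output file. This is a
# suggestion, not the notebook's current behavior.
from datetime import datetime
import pandas as pd

def save_results(df: pd.DataFrame, prefix: str = "evaluation_output") -> str:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = f"{prefix}_{timestamp}.csv"
    df.to_csv(path, index=False)
    return path
```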

LandMagic self-assigned this Jan 25, 2024
raise Exception(f"Error in parsing the API response.\n{e}")


def paraphrase_question(prompt: str):
A collaborator commented:

What is the purpose for paraphrasing?

A collaborator replied:

The reason for the question is that the Azure OpenAI endpoints used by the main app (and by extension the eval scripts) already do query rewriting.

