Conversation
…c/PromptEvaluation
Added response comparison
I should probably include a README file. This does require the backend to be running locally; it uses http://127.0.0.1:5000 as the default base URL, or you can specify one in the .env file as BASE_URL. Can you confirm whether the backend is receiving a POST request at the /conversation endpoint?
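For reference, a minimal sketch of how the eval script could reach that endpoint, assuming a simple `messages` payload (the request body shape is an assumption; only the default base URL and the `/conversation` path come from this thread):

```python
# Minimal sketch: POST a prompt to the locally running backend.
# The payload shape is assumed; the default base URL and endpoint path
# are the ones mentioned in this thread.
import os

import requests
from dotenv import load_dotenv

load_dotenv()
BASE_URL = os.getenv("BASE_URL", "http://127.0.0.1:5000")


def ask_backend(prompt: str) -> dict:
    """Send a single user prompt to the /conversation endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}  # assumed shape
    response = requests.post(f"{BASE_URL}/conversation", json=payload, timeout=60)
    response.raise_for_status()
    return response.json()
```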
Update notebook with question limit
…c/PromptEvaluation
Ahh, I completely forgot: for this to work, AZURE_OPENAI_STREAM must be set to False in the project's .env file.
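A rough sketch of a guard the eval script could add so this is caught up front, presumably because the script parses a single JSON response rather than a stream (the variable name comes from the project's .env; the check itself is illustrative):

```python
# Illustrative guard: fail fast if streaming is still enabled.
import os

from dotenv import load_dotenv

load_dotenv()
if os.getenv("AZURE_OPENAI_STREAM", "").lower() != "false":
    raise RuntimeError(
        "Set AZURE_OPENAI_STREAM=False in the project's .env file before running the evaluation."
    )
```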
evaluation/prompt_eval.ipynb
Outdated
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "comparison_output_dataset = pd.DataFrame(results, columns=['Question', 'Answer 1', 'Answer 2', 'Response Time 1 (s)', 'Response Time 2 (s)', 'Evaluation',])" |
What is the expectation from this comparison? There is no change in the system between the first run and this one, right? So what does this result tell us?
I was working under the assumption that each prompt/request would generate a different response in prod. I will, however, restructure the script to run comparisons against previous runs of the script rather than within the same session/run.
Same session/run isn't necessarily a concern. The concern is that the AI pipeline/orchestrator is the same and would return essentially the same answer each time, making the comparisons less useful.
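One possible shape for the cross-run comparison being discussed, assuming results are persisted between runs (the file name and helper functions are illustrative, not taken from the PR):

```python
# Illustrative only: persist each run's results so the next run can compare
# against them, instead of comparing two answers from the same session.
import os
from typing import Optional

import pandas as pd

RESULTS_FILE = "evaluation/previous_run_results.csv"  # hypothetical path


def load_previous_run() -> Optional[pd.DataFrame]:
    """Return the previous run's results, or None on the first run."""
    if os.path.exists(RESULTS_FILE):
        return pd.read_csv(RESULTS_FILE)
    return None


def save_current_run(df: pd.DataFrame) -> None:
    """Persist this run's results for the next comparison."""
    df.to_csv(RESULTS_FILE, index=False)
```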
evaluation/functions.py
Outdated
        - prompt (str): The user prompt to be processed and compared.
    """
    try:
        answer_1, response_time_1, _ = process_prompt(prompt)
This appears to be running the same AI pipeline/orchestrator twice on the same input prompt. And since the temperature is 0 the responses will most likely be the same or very close. Is there a plan to make the orchestrator configurable so we can compare the results of changing the system prompt and/or search strategy etc.?
Will rewrite the script to compare against answers from the previous run rather than within the same run/session.
It will still be the same AI orchestrator/pipeline though, so the answers to compare will be very close if not identical. The real value in the comparison is to be able to evaluate against differences in the pipeline itself, e.g., a different prompt, a different search strategy, parameter tweaks, etc.
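To make the ask concrete, a sketch of the kind of configurability being suggested: run the same prompt through two different pipeline configurations and compare the answers. The config dicts and the extra argument to process_prompt are assumptions; the current process_prompt takes only the prompt.

```python
# Illustrative sketch: compare two pipeline configurations on the same prompt.
# The config dict keys and the (prompt, config) signature are assumptions.
def compare_configs(prompt: str, config_a: dict, config_b: dict, process_prompt):
    """Run one prompt through two configurations and collect answers/timings."""
    rows = []
    for label, config in (("A", config_a), ("B", config_b)):
        answer, response_time, _ = process_prompt(prompt, config)  # assumed signature
        rows.append({"config": label, "answer": answer, "response_time_s": response_time})
    return rows


# Example of configurations worth comparing (illustrative values):
# config_a = {"system_prompt": "baseline", "search_strategy": "vector"}
# config_b = {"system_prompt": "revised", "search_strategy": "hybrid"}
```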
Just a couple of notes -
evaluation/functions.py
Outdated
    raise Exception(f"Error in parsing the API response.\n{e}")


def paraphrase_question(prompt: str):
What is the purpose of paraphrasing?
The reason for the question is that the Azure OpenAI endpoints used by the main app (and by extension the eval scripts) already do query rewriting.
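For context, a purely illustrative sketch of what a paraphrasing helper typically looks like, i.e., an LLM call that rewords the question without changing its meaning; the deployment name, environment variables, and prompt text are assumptions, not the implementation in this PR:

```python
# Illustrative only: reword a question via an LLM call. Deployment name,
# env variable names, and the system prompt are assumptions.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)


def paraphrase_question(prompt: str) -> str:
    """Ask the model to rephrase the question while preserving its meaning."""
    completion = client.chat.completions.create(
        model=os.environ.get("AZURE_OPENAI_MODEL", "gpt-4o"),  # assumed deployment
        temperature=0,
        messages=[
            {"role": "system", "content": "Rephrase the user's question without changing its meaning."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content
```

If the main app's query rewriting already covers this, the step may indeed be redundant.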
…c/PromptEvaluation
Remove eval func calls
Task: prompt the AI with questions from the provided prompt_libray file and use AI to evaluate the generated response
What's been done:
TODO:
Note:
This pull request includes a Jupyter Notebook file and a functions.py file. The primary logic is implemented in the functions.py file, while the Jupyter Notebook executes and showcases the functionality.
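A high-level sketch of that flow, assuming the prompt library is a plain-text file with one question per line and that process_prompt comes from the functions.py in this PR; the evaluator is passed in as a callable since its exact signature isn't shown here:

```python
# Rough sketch of the overall evaluation loop described above. The file format
# and the evaluator callable are assumptions; process_prompt is from this PR.
import pandas as pd

from functions import process_prompt


def run_evaluation(prompt_file: str, evaluate_response) -> pd.DataFrame:
    """Send each question to the backend and record the AI-graded result."""
    with open(prompt_file, encoding="utf-8") as f:
        questions = [line.strip() for line in f if line.strip()]
    rows = []
    for question in questions:
        answer, response_time, _ = process_prompt(question)
        verdict = evaluate_response(question, answer)  # hypothetical evaluator
        rows.append((question, answer, response_time, verdict))
    return pd.DataFrame(rows, columns=["Question", "Answer", "Response Time (s)", "Evaluation"])
```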