Conversation
Signed-off-by: Manali Latkar <manali.latkar@ibm.com>
Force-pushed from d9e3136 to 3a0bc7e.
MAX_WORDS_BEST_PERFORMANCE = 2000
MAX_WORDS_DEGRADED_PERFORMANCE = 21500
mkumatag left a comment:
Let us push the design doc somewhere in this repo itself so that it becomes handy while reviewing the PR
default_max_input_length = 6000
default_prompt_template_token_count = 250
default_max_summary_length = 1000
default_max_file_size_mb = 10
10MB is a really huge file; have you tested with a 10MB file?
We are handling file size separately for PDF and text files; more details are added in the proposal PR. A rough sketch of per-type limits follows below.
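For illustration, a minimal sketch of what per-type size limits could look like; the mapping name, supported extensions, and the specific byte limits here are assumptions, not the values chosen in the proposal doc.

```python
import os

# Hypothetical per-type limits; the specific values are illustrative only.
MAX_FILE_SIZE_BYTES_BY_TYPE = {
    ".txt": 1 * 1024 * 1024,   # plain text packs far more words per byte
    ".pdf": 10 * 1024 * 1024,  # PDFs also carry layout, fonts, and images
}

def validate_file_size(filename: str, content: bytes) -> None:
    """Reject files whose size exceeds the limit for their extension."""
    ext = os.path.splitext(filename)[1].lower()
    limit = MAX_FILE_SIZE_BYTES_BY_TYPE.get(ext)
    if limit is None:
        raise ValueError(f"Unsupported file type: {ext}")
    if len(content) > limit:
        raise ValueError(f"{ext} files are limited to {limit // (1024 * 1024)}MB")
```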
default_temperature = 0.0
default_max_input_length = 6000
default_prompt_template_token_count = 250
default_max_summary_length = 1000
Wanted a big enough number for our maximum input size, which is 21500 words: condensing 21500 words into 1000 words still yields a good summary, while anything longer than 1000 words wouldn't really be a summary.
default_max_summary_length = 1000
default_max_file_size_mb = 10
default_summarization_temperature = 0.3
default_summarization_stop_words = "\n\n,Note,Word Count,Revised Summary"
what is the purpose of these stop words?
@mkumatag When I tested this, the LLM was responding with extra data that started with some of these stop words. Once I started sending them, the response was always clean.
this could be because of the prompt: Summarize the following text in approximately {summary_length} words
we are supposed to take the summary length as input from the user, so we have to include it in the prompt for the output to stick to that limit.
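For context, a hedged sketch of how stop sequences can be passed to a vLLM OpenAI-compatible completions endpoint; the helper name, request shape, and timeout are assumptions standing in for this PR's query_vllm_completions, not its actual implementation.

```python
import requests

def query_completions(endpoint: str, prompt: str, model: str, max_tokens: int,
                      temperature: float, stop: list[str]) -> str:
    """Query a vLLM OpenAI-compatible /v1/completions endpoint. The `stop`
    sequences end generation before the model can append trailers such as
    "Note" or "Word Count" after the summary."""
    resp = requests.post(
        f"{endpoint}/v1/completions",
        json={
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "stop": stop,  # e.g. ["\n\n", "Note", "Word Count", "Revised Summary"]
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"].strip()
```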
if __name__ == "__main__":
    initialize_models()
    uvicorn.run(app, host="0.0.0.0", port=8000)
Also, how will the user run this service alone vs. with the rag service?
Currently this needs to be run along with the rag service, as per Sebastian's slide. Running it alone would require vllm as a prerequisite; once we figure out the custom usecase/template problem, we can address it, I believe.
I'm talking about an individual summarisation service itself; maybe we will explore that after merging this.
@@ -0,0 +1,200 @@
import os
add a README in this directory
# Log warning if text exceeds the best performance threshold
if word_count > MAX_WORDS_BEST_PERFORMANCE:
    logger.info(f"Input text exceeds maximum word limit of {MAX_WORDS_BEST_PERFORMANCE} words. Performance may be degraded.")
I'm not sure how helpful this logging message alone, without displaying anything else, is going to be for the admin who is running this solution.
@mkumatag should we send a "degraded_performance": True in the response?
I think we should log other details to identify the request, like what type of content the user has passed & the word count.
Q: What will be the admin/user action on this? IMO we should consider collecting this via some perf counter for monitoring purposes; exposing such information to end users won't be very helpful. If necessary, we should limit the input character count so that we can be efficient and return an error if it is exceeded.
when we say degraded_performance, do we mean qualitatively? shouldn't max_model_len = 32k take care of this?
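A minimal sketch of the richer log line suggested above, assuming hypothetical content_type and word_count values extracted from the request:

```python
import logging

logger = logging.getLogger("summarizer")

def log_degraded_input(content_type: str, word_count: int, threshold: int) -> None:
    """Warn with enough context (content type, word count) for an admin to
    tie the log line back to a specific request."""
    if word_count > threshold:
        logger.warning(
            "Input exceeds best-performance threshold: "
            "content_type=%s word_count=%d threshold=%d",
            content_type, word_count, threshold,
        )
```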
    detail="Server busy. Try again shortly."
)
try:
    summary = query_vllm_completions(llm_endpoint, prompt, llm_model, settings.llm_max_tokens, settings.summarization_temperature)
llm_max_tokens is set to 512 tokens, which is too low if you are considering 1000 output words.
@mkumatag you are right, I had changed the max_tokens while testing larger inputs but missed adding the change here. I will probably come up with a max_tokens value based on the summary length given by the user (one possible derivation is sketched below).
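One hedged way to derive max_tokens from the user-supplied summary length; the ~1.3 tokens-per-word ratio and the 20% headroom below are assumptions, not values from this PR.

```python
# Rough tokens-per-word ratio for English text; real tokenizers vary by model.
TOKENS_PER_WORD = 1.3

def max_tokens_for_summary(summary_length_words: int, headroom: float = 1.2) -> int:
    """Size the completion budget from the user-requested summary length,
    with headroom so the model is not cut off mid-sentence."""
    return int(summary_length_words * TOKENS_PER_WORD * headroom)

print(max_tokens_for_summary(1000))  # ~1560 tokens, well above the current 512
```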
file_content = file.file.read()

# Validate file size
if len(file_content) > MAX_FILE_SIZE_BYTES:
I think the file size should vary based on file type; .txt would consume less space compared to pdf. Please do some exercise to come up with valid file sizes.
I have addressed the separate max file sizes in my document proposal.
except requests.exceptions.RequestException as e:
    concurrency_limiter.release()
    error_details = str(e)
    if e.response is not None:
        error_details += f", Response Text: {e.response.text}"
    logger.error(f"Error calling vLLM API: {error_details}")

    raise HTTPException(
        status_code=500,
        detail=error_details
    )
except Exception as e:
    concurrency_limiter.release()
    logger.error(f"Error calling vLLM API: {e}")
    raise HTTPException(
        status_code=500,
        detail=str(e)
    )
Can we move this block to llm_utils itself, like how it's handled in the other query methods?
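A minimal sketch of that consolidation, assuming the error translation moves into llm_utils; the request shape mirrors the handler quoted above. Releasing the concurrency limiter would then stay at the call site, e.g. in a finally block.

```python
# llm_utils.py -- hypothetical consolidation of the error handling above
import logging

import requests
from fastapi import HTTPException

logger = logging.getLogger("summarizer")

def query_vllm_completions(endpoint: str, prompt: str, model: str,
                           max_tokens: int, temperature: float) -> str:
    """Query vLLM and translate any failure into an HTTPException, so the
    route handlers no longer need their own try/except blocks."""
    try:
        resp = requests.post(
            f"{endpoint}/v1/completions",
            json={"model": model, "prompt": prompt,
                  "max_tokens": max_tokens, "temperature": temperature},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
    except requests.exceptions.RequestException as e:
        error_details = str(e)
        if e.response is not None:
            error_details += f", Response Text: {e.response.text}"
        logger.error(f"Error calling vLLM API: {error_details}")
        raise HTTPException(status_code=500, detail=error_details)
```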

This PR adds a separate microservice for summarization. The input to be summarized can be given as a file or as plaintext, and a summary length has to be given as well.
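For illustration, a hedged example of calling such a service from a client; the /summarize route, field names, and port are assumptions, not confirmed from this PR.

```python
import requests

# Assumed route and field names -- adjust to the actual API once merged.
with open("notes.txt", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/summarize",
        data={"summary_length": 200},
        files={"file": ("notes.txt", f, "text/plain")},
    )
print(resp.json())
```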