Add endpoint design doc for Digitize Document service #304
dharaneeshvrd wants to merge 4 commits into IBM:main
cc @ryanmarc
c281874 to
e6eb0db
Compare
| 200 OK | ||
| { | ||
| "docling_document": "", |
docling is implementation specific. I'd suggest keeping the response generic from the underlying processing - for example, if we were to change from using docling to something else then this property wouldn't make sense.
Added a generic sample response, ptal.
With REST, best practice would be to use nouns for the endpoints. Looking through what's outlined here I would suggest something like this:
| Method | Path | Description |
|---|---|---|
| POST | /documents | Submit one or more documents for async processing (with optional ingestion flag) |
| DELETE | /documents | Bulk delete documents (requires confirm=true) |
| GET | /documents/{document_id} | Get metadata/status of a single document |
| DELETE | /documents/{document_id} | Delete a single document |
| GET | /documents/{document_id}/content | Retrieve processed content of a document |
| GET | /documents/jobs/{job_id} | Retrieve status and results of a document processing job |
Agreed. The endpoints now use nouns, and documents is used when dealing with documents after ingestion.
However, I still want to keep /v1/digitizations for the conversion task, since nothing is stored or maintained (CRUD) by the API server, and /v1/ingestions for ingesting documents. Ingestion is a major operation, so I want it to be its own endpoint rather than something driven by a flag.
Ptal at the revised endpoint table.
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
1c0a570 to
8505e16
Compare
Remove the job_id since only one ingestion is allowed
Format the response codes in tabular format
Update stretch endpoints with more details
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
8505e16 to
43e013c
Compare
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
@yussufsh I have addressed your comments, ptal
| | Status Code | Description | Details | | ||
| | :--- | :--- | :--- | | ||
| | **200 OK** | Success | Returns metadata of the pdf's id requested. | | ||
| | **400 Bad Request** | No Data | No ingested documents matching the id. | |
Should we return 200 OK with an empty response instead of 400 Bad Request when there is no data?
| **With latest=True** | ||
| ``` | ||
| { |
With latest=True, we are sending the following:
{
"status": "partial",
"created_at": "2026-01-10T10:00:00Z",
"total_pages": 10,
"total_tables": 5,
"documents": {
"pdf1": { ...
}
Whereas it should be:
[
{
"status": "partial",
"created_at": "2026-01-10T10:00:00Z",
"total_pages": 10,
"total_tables": 5,
"documents": {
"pdf1": { ...
}
]
so that both responses have the same shape. Clients should not have to change their parsing logic based on parameter values.
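To illustrate the point about a stable response shape, here is a minimal sketch in which the handler always returns a list and `latest` only narrows it to one element. The function name `build_ingestion_response` and the `"completed"` status value are assumptions made for illustration; they do not come from the design doc.

```python
from typing import Any


def build_ingestion_response(
    submissions: list[dict[str, Any]], latest: bool = False
) -> list[dict[str, Any]]:
    """Return ingestion statuses as a list, whether or not `latest` is set."""
    if not submissions:
        return []
    if latest:
        # created_at values are ISO 8601 UTC strings in one fixed format,
        # so comparing them as plain strings orders them chronologically.
        newest = max(submissions, key=lambda s: s["created_at"])
        return [newest]
    return list(submissions)


# Hypothetical stored submissions; "completed" is an assumed status value.
submissions = [
    {"status": "completed", "created_at": "2026-01-09T08:00:00Z"},
    {"status": "partial", "created_at": "2026-01-10T10:00:00Z"},
]

all_statuses = build_ingestion_response(submissions)
latest_only = build_ingestion_response(submissions, latest=True)
```

With this shape a client can iterate over the result in both cases and never needs to branch on the query parameter it sent.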
| - If there are more than one submissions available, will return all | ||
| **Query Params:** | ||
| - latest - bool - Optional param to return the latest ingestion status |
Maybe we should consider adding pagination support as a stretch goal: if a user ingests hundreds of documents over time, this response will become very large. A paginated response would also make the UI implementation much easier once we have a UI.
The same applies to GET /v1/documents.
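One possible shape for such pagination is plain limit/offset slicing, sketched below. The parameter names `limit` and `offset` and the envelope fields (`items`, `total`, `next_offset`) are assumptions for illustration, not part of the proposed API.

```python
from typing import Any


def paginate(items: list[Any], limit: int = 20, offset: int = 0) -> dict[str, Any]:
    """Return one page of `items` plus the metadata a client needs to fetch the next."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    page = items[offset:offset + limit]
    # next_offset is None on the last page so clients know when to stop.
    next_offset = offset + limit if offset + limit < len(items) else None
    return {
        "items": page,
        "total": len(items),
        "limit": limit,
        "offset": offset,
        "next_offset": next_offset,
    }


# Hypothetical document ids standing in for ingested documents.
docs = [f"doc-{i}" for i in range(45)]
first_page = paginate(docs, limit=20, offset=0)
last_page = paginate(docs, limit=20, offset=40)
```

Offset-based paging is the simplest option to bolt onto a list endpoint; cursor-based paging would be more robust if documents can be deleted while a client is paging.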
| "total_tables": 5, | ||
| "documents": { | ||
| "pdf1": { | ||
| "status": "chunking", // possible values: conversion, processing, chunking, indexing |
Consider changing the first possible value to converting so that it matches the form of the other three.
| - Create a LOCK file in /var/lib/ai-services/applications/<app-name>/cache/ | ||
| - Create /var/lib/ai-services/applications/<app-name>/cache/status.json to manage/view the status of ingestion | ||
| - End the request 202 Accepted response | ||
| - Background ingestion process should write the status into status.json like following information |
@dharaneeshvrd could you add a sample status.json here?
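As a starting point for that discussion, here is a rough sketch of the LOCK file and status.json steps quoted above. The status fields are copied from the response example elsewhere in this thread; the helper names, the exclusive-create locking, and the write-then-rename update are assumptions, and the demo uses a temporary directory rather than the real cache path.

```python
import json
import os
import tempfile
from pathlib import Path


def acquire_lock(cache_dir: Path) -> bool:
    """Create the LOCK file atomically; return False if one already exists."""
    try:
        fd = os.open(cache_dir / "LOCK", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def write_status(cache_dir: Path, status: dict) -> None:
    """Write status.json via a temp file and rename, so readers never see a partial file."""
    tmp = cache_dir / "status.json.tmp"
    tmp.write_text(json.dumps(status, indent=2))
    os.replace(tmp, cache_dir / "status.json")


def release_lock(cache_dir: Path) -> None:
    """Remove the LOCK file once the background ingestion has finished."""
    (cache_dir / "LOCK").unlink(missing_ok=True)


# Demo in a temporary directory standing in for the application's cache directory.
cache_dir = Path(tempfile.mkdtemp())
first_attempt = acquire_lock(cache_dir)   # no LOCK yet, so this succeeds
second_attempt = acquire_lock(cache_dir)  # LOCK exists, so a second ingestion is refused
write_status(cache_dir, {
    "status": "partial",
    "created_at": "2026-01-10T10:00:00Z",
    "total_pages": 10,
    "total_tables": 5,
    "documents": {"pdf1": {"status": "chunking"}},
})
saved = json.loads((cache_dir / "status.json").read_text())
release_lock(cache_dir)
```

Creating the lock with `O_CREAT | O_EXCL` makes the "only one ingestion at a time" check and the file creation a single step, which avoids a race between two concurrent POST requests.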
| ``` | ||
| { | ||
| "result": ... // str/dict - str in case of md/text output_format & dict in case of json output format. |
Could you add actual sample responses here for both the md/text and json output formats?
| ## Proposal: | ||
| As per the requirement from PM, need to convert cli into microservice which expose REST endpoints to do the digitize document service’s tasks, which are to |
This section can be written as:
Recognizing the need for a more scalable and accessible architecture, we are moving to convert the current CLI into a microservice that offers REST endpoints for digitize‑document tasks. The microservice will support the following capabilities:
File Conversion
- Convert the source file (e.g., PDF) into the required output format.
- Return the converted result through the API.
Document Ingestion Workflow
After conversion, the document will be ingested via a structured processing pipeline that includes:
- Extracting and processing text and tables.
- Chunking the extracted content.
- Generating embeddings and indexing them into the vector database (VDB).
External Service Exposure
- The microservice will be made accessible for external, end‑user consumption.
- Port 4000 may be used for hosting and exposure.
| | **POST** | `/v1/digitizations` | Synchronous PDF conversion. | `multipart/form-data` | `200 OK` | | ||
| | **POST** | `/v1/ingestions` | Asynchronous document ingestion on a background process. | `multipart/form-data` | `202 Accepted` | | ||
| | **GET** | `/v1/ingestions` | Retrieve the status of currently ingested documents. | `application/json` | `200 OK` | | ||
| | **GET** | `/v1/documents` | Retrieve the list of currently ingested documents with its metadata. | `application/json` | `200 OK` | |
This sounds like we are already maintaining the state of the documents inside the service. Given that, I feel it's worth considering @ryanmarc's suggestion, plus a few additions.
Something like the table below would keep the APIs consistent, simple, and meaningful. Let me know your opinion:
| Method | Path | Description |
|---|---|---|
| POST | /documents | Upload a document into the system for processing. Returns: document_id |
| DELETE | /documents | Bulk delete documents (requires query parameter confirm=true). |
| GET | /documents/{document_id} | Retrieve metadata and current status for a specific document. |
| DELETE | /documents/{document_id} | Delete a specific document. |
| POST | /documents/{document_id}/digitizations | Trigger digitization for the specified document, with optional flags: ingest, async. Returns a job_id for tracking progress. |
| GET | /documents/{document_id}/content | Retrieve the processed content of a document. Returns only when no active processing job exists for that document. |
| GET | /documents/jobs/{job_id} | Retrieve the status and results of a processing job. Supports both digitization and ingestion job types. |
There was a problem hiding this comment.
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/documents | Uploads files for async processing. Ingestion by default. Accepts digitize_only and output_format as optional flags. Returns a job_id to track the processing. |
| GET | /v1/documents/jobs | Retrieves information about all the jobs. |
| GET | /v1/documents/jobs/{job_id} | Retrieves information about the given job_id. |
| GET | /v1/documents | Retrieves all the documents with their metadata. |
| GET | /v1/documents/{doc_id} | Retrieves metadata of a specific document. |
| GET | /v1/documents/{doc_id}/content | Retrieves digitized content of the specific document. |
| DELETE | /v1/documents/{doc_id} | Removes a specific document from the VDB and clears its cache. |
| DELETE | /v1/documents | Bulk deletes all documents from the VDB and cache. Requires param confirm=True to proceed with the deletion. |
@mkumatag Can you please take a look at this list? It's more or less similar.
This looks much, much better. Minor comment:
/v1/documents - use a param named operation with values ingest/digitize
I was treating ingest as the default operation to reduce the input needed from the user.
POST /v1/documents => Starts Ingestion
POST /v1/documents?digitize_only=True => Starts Digitization
Do we really need it?
What I really meant is:
POST /v1/documents => Starts Ingestion (operation=ingest, the default)
POST /v1/documents?operation=digitize => Starts Digitization
POST /v1/documents?operation=store => the same param could be reused later for just saving the file
That said, I'm open to either of these solutions.
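For comparison with the boolean flag, the `operation` variant could be parsed along these lines. This is only a sketch: the `Operation` enum and `parse_operation` helper are hypothetical names, and `store` is left out because it is only mentioned as a possible future value.

```python
from enum import Enum


class Operation(str, Enum):
    INGEST = "ingest"
    DIGITIZE = "digitize"


def parse_operation(query: dict[str, str]) -> Operation:
    """Map the `operation` query param to an Operation, defaulting to ingest."""
    raw = query.get("operation", Operation.INGEST.value)
    try:
        return Operation(raw.lower())
    except ValueError:
        allowed = ", ".join(op.value for op in Operation)
        raise ValueError(
            f"unsupported operation {raw!r}; expected one of: {allowed}"
        ) from None


default_op = parse_operation({})                           # POST /v1/documents
digitize_op = parse_operation({"operation": "digitize"})   # POST /v1/documents?operation=digitize
```

Adding a new value later, such as `store`, would then be a one-line change to the enum rather than a second boolean flag.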
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>