Add endpoint design doc for Digitize Document service by dharaneeshvrd · Pull Request #304 · IBM/project-ai-services

dharaneeshvrd · 2026-02-09T17:28:37Z

No description provided.

dharaneeshvrd · 2026-02-09T17:29:27Z

ryanmarc · 2026-02-10T17:10:01Z

docs/proposals/digitize_documents_endpoints.md

+
+200 OK 
+{ 
+"docling_document": "", 


`docling` is implementation-specific. I'd suggest keeping the response independent of the underlying processing. For example, if we were to switch from docling to something else, this property would no longer make sense.

Added a generic sample response, ptal.

ryanmarc · 2026-02-10T18:30:00Z

docs/proposals/digitize_documents_endpoints.md

With REST, best practice would be to use nouns for the endpoints. Looking through what's outlined here I would suggest something like this:

| Method | Path | Description |
| :--- | :--- | :--- |
| POST | `/documents` | Submit one or more documents for async processing (with optional `ingestion` flag) |
| DELETE | `/documents` | Bulk delete documents (requires `confirm=true`) |
| GET | `/documents/{document_id}` | Get metadata/status of a single document |
| DELETE | `/documents/{document_id}` | Delete a single document |
| GET | `/documents/{document_id}/content` | Retrieve processed content of a document |
| GET | `/documents/jobs/{job_id}` | Retrieve status and results of a document processing job |

Agreed. The endpoints now use nouns, and `documents` is used when dealing with documents after ingestion.
However, I still want to keep `/v1/digitizations` for the conversion task, since nothing is stored or maintained (CRUD) by the API server, and `/v1/ingestions` for ingesting documents. Ingestion is a major operation, so I don't want to trigger it through a flag; I want it to be part of the endpoint.
PTAL at the revised endpoint table.


dharaneeshvrd · 2026-02-13T10:10:45Z

@yussufsh I have addressed your comments, ptal

manalilatkar · 2026-02-13T10:12:09Z

docs/proposals/digitize_documents_endpoints.md

+| Status Code | Description | Details |
+| :--- | :--- | :--- |
+| **200 OK** | Success | Returns metadata of the pdf's id requested. |
+| **400 Bad Request** | No Data | No ingested documents matching the id. |


Should we send a `200 OK` with an empty response instead of `400 Bad Request` when there is no data?

manalilatkar · 2026-02-13T10:18:47Z

docs/proposals/digitize_documents_endpoints.md

+
+**With latest=True**
+```
+{


With `latest=True`, we are sending the following:

{ "status": "partial", "created_at": "2026-01-10T10:00:00Z", "total_pages": 10, "total_tables": 5, "documents": { "pdf1": { ... } } }

Whereas it should be:

[ { "status": "partial", "created_at": "2026-01-10T10:00:00Z", "total_pages": 10, "total_tables": 5, "documents": { "pdf1": { ... } } } ]

so that both responses have the same shape. Clients should not have to alter their parsing logic based on parameter values.
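To illustrate the point, here is a minimal sketch in which the response is always a list and `latest` only narrows it to one element. The function name `list_ingestions`, the in-memory `submissions` list, and the `completed` status value are hypothetical; only `status` and `created_at` are taken from the sample above.

```python
from typing import Any


def list_ingestions(submissions: list[dict[str, Any]], latest: bool = False) -> list[dict[str, Any]]:
    """Return ingestion submissions as a list, whether or not `latest` is set."""
    if not submissions:
        return []
    if latest:
        # UTC timestamps in the same ISO 8601 format sort chronologically as strings.
        newest = max(submissions, key=lambda s: s["created_at"])
        return [newest]
    return list(submissions)


# Hypothetical data: two submissions made on consecutive days.
submissions = [
    {"status": "completed", "created_at": "2026-01-09T08:00:00Z"},
    {"status": "partial", "created_at": "2026-01-10T10:00:00Z"},
]
all_items = list_ingestions(submissions)
latest_only = list_ingestions(submissions, latest=True)
```

With this shape, a client iterates over the result in both cases and never branches on the query parameter.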

manalilatkar · 2026-02-13T10:25:06Z

docs/proposals/digitize_documents_endpoints.md

+- If there are more than one submissions available, will return all
+
+**Query Params:**
+- latest - bool  - Optional param to return the latest ingestion status


Maybe we should consider adding pagination support as a stretch goal: if a user ingests hundreds of documents over time, this response will become very large. A paginated response would also make the UI implementation much easier once we have a UI.
The same applies to `GET /v1/documents`.
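As a rough sketch of what offset-based pagination could look like, assuming hypothetical `offset` and `limit` query parameters and a hypothetical response envelope (none of these names are in the design doc yet):

```python
from typing import Any


def paginate(items: list[Any], offset: int = 0, limit: int = 20) -> dict[str, Any]:
    """Slice `items` into one page and report enough metadata to fetch the next one."""
    if offset < 0 or limit < 1:
        raise ValueError("offset must be >= 0 and limit must be >= 1")
    return {
        "items": items[offset:offset + limit],
        "offset": offset,
        "limit": limit,
        "total": len(items),
    }


# Hypothetical document ids standing in for ingested documents.
documents = [f"doc-{i}" for i in range(45)]
first_page = paginate(documents, offset=0, limit=20)
last_page = paginate(documents, offset=40, limit=20)
```

Returning `total` alongside the page lets a UI render page controls without a second request.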

manalilatkar · 2026-02-13T10:26:15Z

docs/proposals/digitize_documents_endpoints.md

+    "total_tables": 5,
+    "documents": {
+        "pdf1": { 
+            "status": "chunking", // possible values: conversion, processing, chunking, indexing  


Consider changing the first possible value to `converting`, for consistency with the other three.
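For example, the stage names could be pinned down in one place so the naming stays consistent. This is only a sketch; the class name `DocumentStage` is hypothetical, and `converting` is the suggested rename, not the value currently in the doc.

```python
from enum import Enum


class DocumentStage(str, Enum):
    """Per-document stages, all named with the same verb form."""

    CONVERTING = "converting"  # suggested rename of "conversion"
    PROCESSING = "processing"
    CHUNKING = "chunking"
    INDEXING = "indexing"


stage_values = [stage.value for stage in DocumentStage]
```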

manalilatkar · 2026-02-13T10:30:05Z

docs/proposals/digitize_documents_endpoints.md

+    - Create a LOCK file in /var/lib/ai-services/applications/<app-name>/cache/
+    - Create /var/lib/ai-services/applications/<app-name>/cache/status.json to manage/view the status of ingestion 
+    - End the request 202 Accepted response
+- Background ingestion process should write the status into status.json like following information 


@dharaneeshvrd could you add a sample status.json here?
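To make the request concrete, here is a hedged sketch of the LOCK and `status.json` steps quoted above. The function name `start_ingestion` is hypothetical, the `status.json` fields are an assumption borrowed from the sample response elsewhere in this doc, and a temporary directory stands in for the real cache path.

```python
import json
import os
import tempfile
from pathlib import Path


def start_ingestion(cache_dir: Path, status: dict) -> bool:
    """Try to claim the ingestion lock; write status.json only if the claim succeeds."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    lock_path = cache_dir / "LOCK"
    try:
        # O_EXCL makes creation fail if LOCK already exists, so only one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    (cache_dir / "status.json").write_text(json.dumps(status, indent=2))
    return True


# Hypothetical status.json content, not the author's actual format.
sample_status = {
    "status": "partial",
    "created_at": "2026-01-10T10:00:00Z",
    "total_pages": 10,
    "total_tables": 5,
    "documents": {"pdf1": {"status": "chunking"}},
}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    first = start_ingestion(cache, sample_status)   # claims the lock
    second = start_ingestion(cache, sample_status)  # rejected while LOCK exists
    written = json.loads((cache / "status.json").read_text())
```

The second call returning `False` is where the service would reject a concurrent ingestion request.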

manalilatkar · 2026-02-13T10:48:41Z

docs/proposals/digitize_documents_endpoints.md

+
+```
+{
+    "result": ... // str/dict - str in case of md/text output_format & dict in case of json output format.


Could you add actual sample responses here for both md/text and json?
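Something along these lines would help readers see the difference. This is a sketch only: `build_response` is a hypothetical helper, and the markdown string and the `pages` structure are placeholder content, not the service's real output.

```python
from typing import Any, Union


def build_response(content: Union[str, dict[str, Any]], output_format: str) -> dict[str, Any]:
    """Wrap converted content in the `result` field, checking it matches the format."""
    if output_format in ("md", "text"):
        if not isinstance(content, str):
            raise TypeError("md/text output expects a string")
    elif output_format == "json":
        if not isinstance(content, dict):
            raise TypeError("json output expects a dict")
    else:
        raise ValueError(f"unsupported output_format: {output_format}")
    return {"result": content}


# Placeholder content for illustration only.
md_response = build_response("# Title\n\nFirst paragraph.", "md")
json_response = build_response({"pages": [{"number": 1, "text": "First paragraph."}]}, "json")
```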

mkumatag · 2026-02-16T03:54:21Z

docs/proposals/digitize_documents_endpoints.md

+
+## Proposal: 
+
+As per the requirement from PM, need to convert cli into microservice which expose REST endpoints to do the digitize document service’s tasks, which are to


This section can be written as:

Recognizing the need for a more scalable and accessible architecture, we are moving to convert the current CLI into a microservice that offers REST endpoints for digitize‑document tasks. The microservice will support the following capabilities:

- File Conversion
    - Convert the source file (e.g., PDF) into the required output format.
    - Return the converted result through the API.
- Document Ingestion Workflow
    - After conversion, the document will be ingested via a structured processing pipeline that includes:
        - Extracting and processing text and tables.
        - Chunking the extracted content.
        - Generating embeddings and indexing them into the vector database (VDB).
- External Service Exposure
    - The microservice will be made accessible for external, end‑user consumption.
    - Port 4000 may be used for hosting and exposure.
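For readers unfamiliar with the chunking step in the workflow above, here is a toy sketch. It assumes fixed-size word windows with overlap, which is only one possible strategy and not necessarily what the service uses; the function name and parameters are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 5, overlap: int = 1) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_words("one two three four five six seven eight nine", chunk_size=4, overlap=1)
```

Each chunk would then be embedded and indexed into the VDB in the following step.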

mkumatag · 2026-02-16T04:18:19Z

docs/proposals/digitize_documents_endpoints.md

+| **POST** | `/v1/digitizations` | Synchronous PDF conversion. | `multipart/form-data` | `200 OK` |
+| **POST** | `/v1/ingestions` | Asynchronous document ingestion on a background process. | `multipart/form-data` | `202 Accepted` |
+| **GET** | `/v1/ingestions` | Retrieve the status of currently ingested documents. | `application/json` | `200 OK` |
+| **GET** | `/v1/documents` | Retrieve the list of currently ingested documents with its metadata. | `application/json` | `200 OK` |


This sounds like we are already maintaining the state of the documents inside the service. With this implementation, I feel it's worth considering @ryanmarc's suggestion plus a few additions.

Something like the list below would keep the APIs consistent, simple, and meaningful. Let me know your opinion:

| Method | Path | Description |
| :--- | :--- | :--- |
| POST | /documents | Upload a document into the system for processing. Returns: document_id |
| DELETE | /documents | Bulk delete documents (requires query parameter confirm=true). |
| GET | /documents/{document_id} | Retrieve metadata and current status for a specific document. |
| DELETE | /documents/{document_id} | Delete a specific document. |
| POST | /documents/{document_id}/digitizations | Trigger digitization for the specified document, with optional flags: ingest, async. Returns a job_id for tracking progress. |
| GET | /documents/{document_id}/content | Retrieve the processed content of a document. Returns only when no active processing job exists for that document. |
| GET | /documents/jobs/{job_id} | Retrieve the status and results of a processing job. Supports both digitization and ingestion job types. |

| Method | Endpoint | Description |
| :--- | :--- | :--- |
| POST | `/v1/documents` | Uploads files for async processing. Ingestion by default. Accepts `digitize_only` and `output_format` as optional flags. Returns `job_id` to track the processing. |
| GET | `/v1/documents/jobs` | Retrieves information about all the jobs. |
| GET | `/v1/documents/jobs/{job_id}` | Retrieves information about `job_id`. |
| GET | `/v1/documents` | Retrieves all the documents with their metadata. |
| GET | `/v1/documents/{doc_id}` | Retrieves metadata of a specific document. |
| GET | `/v1/documents/{doc_id}/content` | Retrieves digitized content of the specific document. |
| DELETE | `/v1/documents/{doc_id}` | Removes a specific document from the VDB and clears its cache. |
| DELETE | `/v1/documents` | Bulk deletes all documents from the VDB and cache. Requires param `confirm=True` to proceed with the deletion. |

@mkumatag Can you please take a look at this list? It's more or less similar.
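One detail worth checking in this list is that `/v1/documents/jobs` and `/v1/documents/{doc_id}` overlap. A minimal standard-library sketch of the routing, with hypothetical handler names and no particular web framework assumed:

```python
import re
from typing import Optional

# (method, path pattern, handler name); handler names are hypothetical.
ROUTES = [
    ("POST", r"/v1/documents", "submit_documents"),
    ("GET", r"/v1/documents/jobs", "list_jobs"),
    ("GET", r"/v1/documents/jobs/(?P<job_id>[^/]+)", "get_job"),
    ("GET", r"/v1/documents", "list_documents"),
    ("GET", r"/v1/documents/(?P<doc_id>[^/]+)", "get_document"),
    ("GET", r"/v1/documents/(?P<doc_id>[^/]+)/content", "get_document_content"),
    ("DELETE", r"/v1/documents/(?P<doc_id>[^/]+)", "delete_document"),
    ("DELETE", r"/v1/documents", "delete_all_documents"),
]


def resolve(method: str, path: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (handler name, path params) for the first matching route, else None."""
    for route_method, pattern, handler in ROUTES:
        if route_method != method:
            continue
        match = re.fullmatch(pattern, path)
        if match:
            return handler, match.groupdict()
    return None
```

In this sketch the static `jobs` route has to be registered before the `{doc_id}` route, otherwise `GET /v1/documents/jobs` resolves as a document lookup with `doc_id="jobs"`. It also means no document id can ever be the literal string `jobs`.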

This looks much, much better. Minor comments:

`/v1/documents` - param name: `operation`, values: `ingest`/`digitize`

We treat ingest as the default operation to reduce input from the user:
`POST /v1/documents` => Starts Ingestion
`POST /v1/documents?digitize_only=True` => Starts Digitization
Do we really need it?

What I really meant is:
`POST /v1/documents` => Starts Ingestion (`operation=ingest`, the default)
`POST /v1/documents?operation=digitize` => Starts Digitization

`POST /v1/documents?operation=store` - tomorrow, the same param could take a new value for just saving the file.

Having said that, I'm open to either of these solutions.
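To show how the `operation` variant might be parsed, here is a small sketch using only the standard library. `parse_operation` and `ALLOWED_OPERATIONS` are hypothetical names, and `store` is deliberately left out since it is only a possible future value.

```python
from urllib.parse import parse_qs, urlsplit

# Values discussed in this thread; "store" would be added here later if adopted.
ALLOWED_OPERATIONS = {"ingest", "digitize"}


def parse_operation(url: str) -> str:
    """Read the `operation` query parameter, defaulting to `ingest` when absent."""
    query = parse_qs(urlsplit(url).query)
    operation = query.get("operation", ["ingest"])[0]
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return operation


default_op = parse_operation("/v1/documents")
digitize_op = parse_operation("/v1/documents?operation=digitize")
```

Supporting a new operation then means adding one value to the set, rather than introducing another boolean flag next to `digitize_only`.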


dharaneeshvrd requested review from mkumatag and yussufsh February 9, 2026 17:29

dharaneeshvrd marked this pull request as draft February 10, 2026 09:00

dharaneeshvrd force-pushed the digitize-endpoint-design branch 2 times, most recently from c281874 to e6eb0db Compare February 10, 2026 17:01

dharaneeshvrd marked this pull request as ready for review February 10, 2026 17:01

dharaneeshvrd requested review from manalilatkar, manju956 and ryanmarc February 10, 2026 17:01

ryanmarc reviewed Feb 10, 2026


yussufsh requested changes Feb 12, 2026


Add endpoint design doc for Digitize Document service

ea56390

Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>

dharaneeshvrd force-pushed the digitize-endpoint-design branch from 1c0a570 to 8505e16 Compare February 13, 2026 09:57

Rename the endpoints to use nouns

43e013c

Remove the job_id since only one ingestion is allowed
Format the response codes in tabular format
Update stretch endpoints with more details
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>

dharaneeshvrd force-pushed the digitize-endpoint-design branch from 8505e16 to 43e013c Compare February 13, 2026 09:58

Add sample response for sync conversion

e03f5cf

Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>

manalilatkar reviewed Feb 13, 2026


mkumatag requested changes Feb 16, 2026


Refactor endpoints

9c0e7b6

Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>


6 participants