Add endpoint design doc for Digitize Document service #304

Open
dharaneeshvrd wants to merge 4 commits into IBM:main from dharaneeshvrd:digitize-endpoint-design

@dharaneeshvrd
Member

No description provided.

@dharaneeshvrd
Member Author

cc @ryanmarc

dharaneeshvrd marked this pull request as draft on February 10, 2026 09:00
dharaneeshvrd force-pushed the digitize-endpoint-design branch 2 times, most recently from c281874 to e6eb0db on February 10, 2026 17:01
dharaneeshvrd marked this pull request as ready for review on February 10, 2026 17:01

200 OK
{
"docling_document": "",
Collaborator

docling is implementation-specific. I'd suggest keeping the response independent of the underlying processing. For example, if we were to switch from docling to something else, this property would no longer make sense.

Member Author

Added a generic sample response, ptal.

Collaborator

With REST, best practice would be to use nouns for the endpoints. Looking through what's outlined here I would suggest something like this:

| Method | Path | Description |
| :--- | :--- | :--- |
| POST | `/documents` | Submit one or more documents for async processing (with optional ingestion flag) |
| DELETE | `/documents` | Bulk delete documents (requires confirm=true) |
| GET | `/documents/{document_id}` | Get metadata/status of a single document |
| DELETE | `/documents/{document_id}` | Delete a single document |
| GET | `/documents/{document_id}/content` | Retrieve processed content of a document |
| GET | `/documents/jobs/{job_id}` | Retrieve status and results of a document processing job |
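
To make the `confirm=true` requirement on bulk delete concrete, here is a minimal Python sketch. The function name `bulk_delete`, the in-memory `store`, and the 400 status returned when confirmation is missing are all assumptions for illustration; none of them come from the actual service.

```python
from typing import Dict, Tuple


def bulk_delete(documents: Dict[str, dict], confirm: bool = False) -> Tuple[int, dict]:
    """Delete every document, but only when the caller passes confirm=True."""
    if not confirm:
        # Refuse the destructive call. The 400 status code is an assumption.
        return 400, {"error": "bulk delete requires confirm=true"}
    deleted = len(documents)
    documents.clear()
    return 200, {"deleted": deleted}


# Hypothetical in-memory store standing in for the real document storage.
store = {"doc-1": {"name": "a.pdf"}, "doc-2": {"name": "b.pdf"}}
rejected = bulk_delete(store)                # no confirmation: nothing is removed
accepted = bulk_delete(store, confirm=True)  # confirmed: the store is emptied
```

The guard keeps an accidental `DELETE /documents` from wiping everything, which seems to be the intent behind the flag.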

Member Author

Agreed. The endpoints now use nouns, and `documents` is used when dealing with documents after ingestion.
I still want to keep /v1/digitizations for the conversion task, since nothing is stored or maintained (CRUD) by the API server, and /v1/ingestions for ingesting documents. Ingestion is a major operation, so I don't want to trigger it with a flag; I want it to be part of the endpoint.
Ptal at the revised endpoint table.

Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
dharaneeshvrd force-pushed the digitize-endpoint-design branch from 1c0a570 to 8505e16 on February 13, 2026 09:57
Remove the job_id since only one ingestion is allowed
Format the response codes in tabular format
Update stretch endpoints with more details

Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
dharaneeshvrd force-pushed the digitize-endpoint-design branch from 8505e16 to 43e013c on February 13, 2026 09:58
Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>
@dharaneeshvrd
Member Author

@yussufsh I have addressed your comments, ptal

| Status Code | Description | Details |
| :--- | :--- | :--- |
| **200 OK** | Success | Returns metadata of the pdf's id requested. |
| **400 Bad Request** | No Data | No ingested documents matching the id. |
Member

Should we send a 200 OK with an empty response instead of a 400 Bad Request when there is no data?


**With latest=True**
```
{
Member

With latest=True, we are sending the following:

{
    "status": "partial",
    "created_at": "2026-01-10T10:00:00Z",
    "total_pages": 10,
    "total_tables": 5,
    "documents": {
        "pdf1": { ...
}

Whereas it should be:

[ 
{
    "status": "partial",
    "created_at": "2026-01-10T10:00:00Z",
    "total_pages": 10,
    "total_tables": 5,
    "documents": {
        "pdf1": { ...
}
]

so that both responses have the same shape. Clients should not have to change their parsing logic based on parameter values.

- If there are more than one submissions available, will return all

**Query Params:**
- latest - bool - Optional param to return the latest ingestion status
Member

Maybe we should add pagination support as a stretch goal: if a user ingests hundreds of documents over time, this response will become very large. A paginated response would also be much easier for a UI to consume once we have one.
The same applies to GET /v1/documents.

"total_tables": 5,
"documents": {
"pdf1": {
"status": "chunking", // possible values: conversion, processing, chunking, indexing
Member

Consider changing the first possible value to `converting` so that it is consistent with the other three.

- Create a LOCK file in /var/lib/ai-services/applications/<app-name>/cache/
- Create /var/lib/ai-services/applications/<app-name>/cache/status.json to manage/view the status of ingestion
- End the request 202 Accepted response
- Background ingestion process should write the status into status.json like following information
Member

@dharaneeshvrd could you add a sample status.json here?


```
{
"result": ... // str/dict - str in case of md/text output_format & dict in case of json output format.
Member

Could you add actual sample responses here for both md/text and json?
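
Until real samples are added, the rule quoted in the snippet (a string `result` for md/text, a dict `result` for json) can be sketched as below. This is only an illustration of the shape: `build_response` is a hypothetical helper, and the text and dict values are placeholders, not output from the real converter.

```python
from typing import Union


def build_response(text: str, structured: dict, output_format: str) -> dict:
    """Wrap converted content so that `result` is a str for md/text and a dict for json."""
    result: Union[str, dict]
    if output_format in ("md", "text"):
        result = text
    elif output_format == "json":
        result = structured
    else:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    return {"result": result}


# Placeholder content, invented for the sketch.
placeholder_text = "# Title\n\nBody text."
placeholder_dict = {"title": "Title", "paragraphs": ["Body text."]}

md_response = build_response(placeholder_text, placeholder_dict, "md")
json_response = build_response(placeholder_text, placeholder_dict, "json")
```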


## Proposal:

As per the requirement from PM, need to convert cli into microservice which expose REST endpoints to do the digitize document service’s tasks, which are to
Member

This section can be written as:

Recognizing the need for a more scalable and accessible architecture, we are moving to convert the current CLI into a microservice that offers REST endpoints for digitize‑document tasks. The microservice will support the following capabilities:

File Conversion

  • Convert the source file (e.g., PDF) into the required output format.
  • Return the converted result through the API.

Document Ingestion Workflow

After conversion, the document will be ingested via a structured processing pipeline that includes:

  • Extracting and processing text and tables.
  • Chunking the extracted content.
  • Generating embeddings and indexing them into the vector database (VDB).

External Service Exposure

  • The microservice will be made accessible for external, end‑user consumption.
  • Port 4000 may be used for hosting and exposure.
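
The ingestion workflow listed above (convert, extract and process, chunk, embed and index) could be modeled as an ordered list of stages, with the current stage recorded as each document's status. The sketch below is a toy under stated assumptions: the stage names, the final `completed` status, the fixed-size `chunk_text` helper, and the list used as a stand-in for the vector database are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]

# Stand-in for the vector database: (chunk, fake embedding) pairs.
vector_db: List[Tuple[str, int]] = []


def chunk_text(text: str, size: int = 20) -> List[str]:
    """Split text into fixed-size chunks (the size is an arbitrary assumption)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def index_chunks(chunks: List[str]) -> int:
    """Store each chunk with a fake embedding (its length) and return the count."""
    vector_db.extend((chunk, len(chunk)) for chunk in chunks)
    return len(chunks)


def run_ingestion(doc: str, raw: str, stages: List[Stage], status: Dict[str, str]) -> Any:
    """Run the stages in order, recording the active stage as the document status."""
    payload: Any = raw
    for name, stage in stages:
        status[doc] = name  # updated before the stage runs
        payload = stage(payload)
    status[doc] = "completed"  # assumed terminal status
    return payload


stages: List[Stage] = [
    ("converting", lambda raw: raw.strip()),             # pretend conversion
    ("processing", lambda text: " ".join(text.split())),  # normalize whitespace
    ("chunking", chunk_text),
    ("indexing", index_chunks),
]

status: Dict[str, str] = {}
indexed = run_ingestion("pdf1", "  Hello   digitize  service, this is sample text.  ", stages, status)
```

Keeping the stage name and the status value identical means a status lookup can report progress without a separate mapping.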

| **POST** | `/v1/digitizations` | Synchronous PDF conversion. | `multipart/form-data` | `200 OK` |
| **POST** | `/v1/ingestions` | Asynchronous document ingestion on a background process. | `multipart/form-data` | `202 Accepted` |
| **GET** | `/v1/ingestions` | Retrieve the status of currently ingested documents. | `application/json` | `200 OK` |
| **GET** | `/v1/documents` | Retrieve the list of currently ingested documents with its metadata. | `application/json` | `200 OK` |
Member

This sounds like we are already maintaining the state of the documents inside the service. Given that, I feel it's worth considering @ryanmarc's suggestion plus a few additions below.

Something like the following would keep the APIs consistent, simple, and meaningful. Let me know your opinion:

| Method | Path | Description |
| :--- | :--- | :--- |
| POST | `/documents` | Upload a document into the system for processing. Returns: document_id |
| DELETE | `/documents` | Bulk delete documents (requires query parameter confirm=true). |
| GET | `/documents/{document_id}` | Retrieve metadata and current status for a specific document. |
| DELETE | `/documents/{document_id}` | Delete a specific document. |
| POST | `/documents/{document_id}/digitizations` | Trigger digitization for the specified document, with optional flags: ingest, async. Returns a job_id for tracking progress. |
| GET | `/documents/{document_id}/content` | Retrieve the processed content of a document. Returns only when no active processing job exists for that document. |
| GET | `/documents/jobs/{job_id}` | Retrieve the status and results of a processing job. Supports both digitization and ingestion job types. |

Member Author

| Method | Endpoint | Description |
| :--- | :--- | :--- |
| POST | `/v1/documents` | Uploads files for async processing. Ingestion by default. Accepts digitize_only and output_format as optional flags. Returns a job_id to track the processing. |
| GET | `/v1/documents/jobs` | Retrieves information about all jobs. |
| GET | `/v1/documents/jobs/{job_id}` | Retrieves information about the job with the given job_id. |
| GET | `/v1/documents` | Retrieves all documents with their metadata. |
| GET | `/v1/documents/{doc_id}` | Retrieves metadata of a specific document. |
| GET | `/v1/documents/{doc_id}/content` | Retrieves the digitized content of a specific document. |
| DELETE | `/v1/documents/{doc_id}` | Removes a specific document from the vdb and clears its cache. |
| DELETE | `/v1/documents` | Bulk deletes all documents from the vdb and cache. Requires the param confirm=True to proceed with the deletion. |

@mkumatag Can you please take a look at this list? It's more or less similar.
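
The submit-then-poll flow implied by this list (POST returns a job_id, and the job endpoint reports progress) might look roughly like the in-memory sketch below. There is no HTTP layer here, and the `DocumentService` class, its method names, and the `accepted` and `completed` status values are assumptions made for illustration only.

```python
import uuid
from typing import Dict, List, Optional


class DocumentService:
    """In-memory stand-in for the service; no HTTP, storage, or vdb involved."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}
        self.documents: Dict[str, dict] = {}

    def submit(self, filenames: List[str], digitize_only: bool = False) -> dict:
        """Models POST /v1/documents: register the files and return a job_id."""
        job_id = uuid.uuid4().hex
        doc_ids = []
        for name in filenames:
            doc_id = uuid.uuid4().hex
            self.documents[doc_id] = {"name": name}
            doc_ids.append(doc_id)
        self.jobs[job_id] = {
            "operation": "digitization" if digitize_only else "ingestion",
            "status": "accepted",  # assumed status value
            "documents": doc_ids,
        }
        return {"job_id": job_id}

    def finish(self, job_id: str) -> None:
        """Stand-in for the background worker completing a job."""
        self.jobs[job_id]["status"] = "completed"  # assumed status value

    def get_job(self, job_id: str) -> Optional[dict]:
        """Models GET /v1/documents/jobs/{job_id}; returns None for unknown ids."""
        return self.jobs.get(job_id)


service = DocumentService()
job_id = service.submit(["a.pdf", "b.pdf"])["job_id"]
status_before = service.get_job(job_id)["status"]
service.finish(job_id)
status_after = service.get_job(job_id)["status"]
```

A client would call the job endpoint repeatedly until the status stops changing, then fetch the document metadata or content.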

Member

This looks much much better, minor comments:

/v1/documents - param name: operation and values: ingest/digitize

Member Author

I'm treating ingest as the default operation to reduce input from the user.
POST /v1/documents => Starts Ingestion
POST /v1/documents?digitize_only=True => Starts Digitization
Do we really need it?

Member

What I really meant is:
POST /v1/documents => Starts Ingestion (operation=ingest, the default)
POST /v1/documents?operation=digitize => Starts Digitization

POST /v1/documents?operation=store - tomorrow you can use the same param with a new value for just saving the file

Having said that, I'm open to either of these solutions.
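
As a rough illustration of the `operation` parameter with `ingest` as the default, the query handling could be sketched as follows. The `Operation` enum and `parse_operation` helper are hypothetical names; `store` is deliberately left out because it is only mentioned as a possible future value.

```python
from enum import Enum
from typing import Mapping


class Operation(str, Enum):
    INGEST = "ingest"
    DIGITIZE = "digitize"


def parse_operation(query: Mapping[str, str]) -> Operation:
    """Read the `operation` query param, falling back to ingest when it is absent."""
    raw = query.get("operation", Operation.INGEST.value)
    try:
        return Operation(raw.lower())
    except ValueError:
        allowed = ", ".join(op.value for op in Operation)
        raise ValueError(f"unsupported operation {raw!r}; expected one of: {allowed}") from None


default_op = parse_operation({})                           # POST /v1/documents
digitize_op = parse_operation({"operation": "digitize"})   # POST /v1/documents?operation=digitize
```

Adding a new value later only needs a new enum member, which is the extensibility argument for a single param over a boolean flag such as `digitize_only`.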

Signed-off-by: Dharaneeshwaran Ravichandran <dharaneeshwaran.ravichandran@ibm.com>