[FEA]: Support embedding of custom content fields in the embedding stage #1078

@drobison00

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Currently preventing usage

Please provide a clear description of the problem this feature solves

Summary

Enhance the embedding pipeline stage to support embedding arbitrary custom content fields specified in the embedding job.
When a custom_content_field is provided, the embedding stage will extract the text from that field, generate embeddings, and store the results in a specified result_target_field on the same metadata object.

Describe the feature, and optionally a solution or implementation and any alternatives

Requirements

1. EmbedTask schema updates

  • Add two optional fields to the EmbedTask object:
    • custom_content_field: Optional[str] — name of the field on the content metadata that contains the text to embed.
    • result_target_field: Optional[str] — name of the field on the content metadata where the resulting embedding vector should be stored.
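A minimal sketch of the schema change, assuming a plain dataclass stands in for the real EmbedTask definition. The existing EmbedTask fields are omitted, and the example values udf_summary and udf_summary_embedding are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbedTask:
    """Illustrative stand-in for the real EmbedTask schema (assumption)."""

    # Existing EmbedTask fields are omitted; only the two proposed
    # optional fields are shown. Both default to None so that existing
    # jobs keep working without changes.
    custom_content_field: Optional[str] = None
    result_target_field: Optional[str] = None


# A job that relies on today's behavior sets neither field.
default_task = EmbedTask()

# A job that embeds a custom field names both the source and the target.
# "udf_summary" and "udf_summary_embedding" are hypothetical field names.
custom_task = EmbedTask(
    custom_content_field="udf_summary",
    result_target_field="udf_summary_embedding",
)
```

Defaulting both fields to None is what keeps the change backward compatible: a task that omits them is indistinguishable from a task built before the change.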

2. Embedding stage logic

When processing an embedding job:

  1. Check if custom_content_field is provided.
  2. If not provided → existing behavior (embed default content field).
  3. If provided:
    • Attempt to locate the field on the content metadata.
    • Validate that its value is text (string or convertible to string).
    • Run embedding on that text.
    • Store the resulting embedding vector in the field specified by result_target_field.
    • If the target field does not exist, create it on the metadata object.
  4. Even when custom_content_field is provided, also generate the default embeddings for primitive data types.
  5. The goal is to support embedding generation for custom UDF fields while maintaining backward compatibility.
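The steps above could look roughly like the sketch below for a single record. Everything here is an assumption made for illustration: the metadata is modeled as a plain dict, the default text lives under a hypothetical "content" key with its vector under "embedding", and fake_embedder is a placeholder for the real model call. Logging is left out to keep the flow visible, and only str values are treated as text.

```python
from typing import Any, Callable, Dict, List, Optional

Embedder = Callable[[str], List[float]]


def fake_embedder(text: str) -> List[float]:
    """Placeholder for the real embedding model call (assumption)."""
    return [float(len(text))]


def embed_record(
    metadata: Dict[str, Any],
    embedder: Embedder,
    custom_content_field: Optional[str] = None,
    result_target_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Embed the default content and, if requested, one custom field."""
    # Steps 2 and 4: the default embedding is produced whether or not a
    # custom field was requested. "content" and "embedding" are
    # hypothetical key names.
    content = metadata.get("content")
    if isinstance(content, str):
        metadata["embedding"] = embedder(content)

    # Step 1: nothing more to do when no custom field was requested.
    # Assumption: a missing result_target_field also disables the custom
    # path, since the issue does not say where to store the vector then.
    if custom_content_field is None or result_target_field is None:
        return metadata

    # Step 3: locate and validate the custom field, then embed it.
    value = metadata.get(custom_content_field)
    if isinstance(value, str):
        # Assigning to the key creates the target field if it is absent.
        metadata[result_target_field] = embedder(value)
    return metadata


record = embed_record(
    {"content": "hello", "udf_summary": "abc"},
    fake_embedder,
    custom_content_field="udf_summary",
    result_target_field="udf_summary_embedding",
)
```

Running the default path first and returning early keeps the existing behavior on a single code path, so jobs without custom_content_field are unaffected.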

3. Error handling

  • If custom_content_field is specified but not found → log warning, skip embedding for that record.
  • If the field exists but is non-textual → log warning, skip embedding for that record.
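One possible shape for these two checks is a small helper that returns the text to embed, or None when the record should be skipped. This is a sketch under stated assumptions: the helper name get_custom_text and the logger name are hypothetical, the metadata is modeled as a dict, and "convertible to string" is interpreted narrowly as plain numbers, which the issue does not spell out.

```python
import logging
from typing import Any, Dict, Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("embed_stage_sketch")


def get_custom_text(metadata: Dict[str, Any], field: str) -> Optional[str]:
    """Return the text to embed, or None if the record should be skipped."""
    # Case 1: the requested field is not present on the metadata.
    if field not in metadata:
        logger.warning(
            "custom_content_field %r not found; skipping record", field
        )
        return None

    value = metadata[field]
    if isinstance(value, str):
        return value

    # Assumption: plain numbers count as "convertible to string".
    # bool is excluded because it is a subclass of int in Python.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)

    # Case 2: the field exists but holds something non-textual.
    logger.warning(
        "custom_content_field %r holds non-textual %s; skipping record",
        field,
        type(value).__name__,
    )
    return None
```

Returning None instead of raising lets the stage skip only the custom embedding for the affected record and continue with the rest of the batch.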

Additional context

No response
