From 5258c29b88076949181d39db8b13d5536015e4d9 Mon Sep 17 00:00:00 2001 From: jillnogold Date: Wed, 18 Feb 2026 09:42:12 +0200 Subject: [PATCH 1/3] standardization --- docs/api/mlrun.package/index.rst | 3 +- .../packagers/custom_packagers/index.md | 30 +- .../itemization-tutorial.ipynb | 63 +-- .../custom_packagers/pack-only-tutorial.ipynb | 53 ++- .../round-trip-tutorial.ipynb | 44 +- docs/concepts/packagers/index.md | 432 +---------------- docs/concepts/packagers/packagers-overview.md | 443 ++++++++++++++++++ .../packagers/packagers-tutorial.ipynb | 66 ++- docs/development/model-training-tracking.md | 1 - 9 files changed, 618 insertions(+), 517 deletions(-) create mode 100644 docs/concepts/packagers/packagers-overview.md diff --git a/docs/api/mlrun.package/index.rst b/docs/api/mlrun.package/index.rst index 3d01d7ba878..06efdccefd8 100644 --- a/docs/api/mlrun.package/index.rst +++ b/docs/api/mlrun.package/index.rst @@ -35,7 +35,8 @@ To create a custom packager, see :ref:`custom-packagers-tutorials`. errors -.. rubric:: Built-in packager modules +**Built-in packager modules** + MLRun includes the following built-in packager modules. All built-in packagers subclass :py:class:`~mlrun.package.packagers.default_packager.DefaultPackager` and diff --git a/docs/concepts/packagers/custom_packagers/index.md b/docs/concepts/packagers/custom_packagers/index.md index 67bb8e481ba..74457dc089d 100644 --- a/docs/concepts/packagers/custom_packagers/index.md +++ b/docs/concepts/packagers/custom_packagers/index.md @@ -1,7 +1,7 @@ (custom-packagers-tutorials)= # Creating custom packagers -MLRun's {ref}`built-in packagers ` cover common Python types — scalars, +MLRun's {ref}`built-in packagers ` cover common Python types — scalars, collections, NumPy arrays, and Pandas DataFrames. But when your function produces or consumes a type that isn't built-in, the default behavior is to **pickle** the object with `cloudpickle` (or any configured pickling module of your choice). 
Pickle files are opaque, @@ -14,16 +14,24 @@ tables) instead of pickle blobs. **Reminder**: Packing applies to function **outputs** (return values → artifacts) and unpacking applies to function **inputs** (artifacts → typed Python objects). +**In this section** +- [When to write a custom packager](#when-to-write-a-custom-packager) +- [Choosing a base class: `DefaultPackager` vs `Packager`](#choosing-a-base-class-defaultpackager-vs-packager) +- [The four patterns](#the-four-patterns) +- [Step-by-step guide](#step-by-step-guide) +- [Tutorials](#tutorials) + + ## When to write a custom packager Write a custom packager when: - **Your type isn't handled by a built-in packager** — for example, a PIL Image, - a LangChain prompt template, or a domain-specific data class. + a LangChain prompt template, or a domain-specific data class - **You want human-readable serialization** — save as JSON, PNG, CSV, etc. - instead of an opaque pickle file. + instead of an opaque pickle file - **You need bundling support** — your type is a collection that should be - decomposed into individual artifacts when unbundled with `"*key"`. + decomposed into individual artifacts when unbundled with `"*key"` ## Choosing a base class: `DefaultPackager` vs `Packager` @@ -94,7 +102,7 @@ class MyTypePackager(DefaultPackager): ... ### 2. Set class variables -At the minimum, set the type your packager handles and the default artifact type: +At a minimum, set the type your packager handles and the default artifact type: ```python class MyTypePackager(DefaultPackager): @@ -195,10 +203,10 @@ the artifact type is automatically excluded from unpacking validation, so no ext configuration is needed. Real-world examples: - **Pack-only**: PIL Image → PNG (no need to reconstruct the original PIL object - from the logged PNG). + from the logged PNG) - **Unpack-only**: reading a legacy serialization format that should no longer be written (e.g. 
`unpack_v1` for backward compatibility while new outputs use - `pack_v2`). + `pack_v2`) ### 5. Clean up temporary files @@ -233,8 +241,8 @@ project.add_custom_packager(packager="my_module.MyTypePackager", is_mandatory=Tr The `is_mandatory` flag controls what happens when the packager fails to import on a remote worker: -- `True` — the run fails immediately with an import error. -- `False` — the packager is silently skipped and the fallback pickle behavior is used. +- `True` — the run fails immediately with an import error +- `False` — the packager is silently skipped and the fallback pickle behavior is used To remove a registered packager: @@ -255,10 +263,10 @@ are several ways to achieve this: ``` * **Build into the function image** — include the packager source in the function's - build so it is baked into the container image. + build so it is baked into the container image * **Shared storage** — place the packager module on a shared volume and configure the - function's working directory to point there. + function's working directory to point there If the packager module is missing at runtime, the run fails immediately when `is_mandatory=True`, or falls back to pickle when `is_mandatory=False`. diff --git a/docs/concepts/packagers/custom_packagers/itemization-tutorial.ipynb b/docs/concepts/packagers/custom_packagers/itemization-tutorial.ipynb index a05372c3eb1..af33d0f50af 100644 --- a/docs/concepts/packagers/custom_packagers/itemization-tutorial.ipynb +++ b/docs/concepts/packagers/custom_packagers/itemization-tutorial.ipynb @@ -6,22 +6,31 @@ "metadata": {}, "source": [ "(itemization-tutorial)=\n", - "# Creating a Custom Packager - Itemization (bundling & unbundling) Tutorial\n", + "# Creating a custom packager - itemization (bundling & unbundling) tutorial\n", "\n", "This tutorial walks through creating a custom packager that supports\n", - "**bundling and unbundling**. The packager handles `EvalSuite` \u2014 a custom\n", + "**bundling and unbundling**. 
The packager handles `EvalSuite` — a custom\n", "collection type that wraps `dict[str, pd.DataFrame]`, where each key is an\n", "evaluation name and each value is a results DataFrame.\n", "\n", "You will learn how to:\n", - "- Define a custom collection type (`EvalSuite`).\n", + "- Define a custom collection type (`EvalSuite`)\n", "- Implement `unbundle()` so the `\"*\"` log-hint prefix decomposes the suite into\n", - " individual dataset artifacts.\n", - "- Implement `bundle()` to reconstruct an `EvalSuite` from a dict of DataFrames.\n", + " individual dataset artifacts\n", + "- Implement `bundle()` to reconstruct an `EvalSuite` from a dict of DataFrames\n", "- Implement `pack_file()` / `unpack_file()` for saving/loading the suite as a\n", - " single artifact.\n", + " single artifact\n", "- Test all three flows: unbundled output, single-file output, and consuming a\n", - " packed suite as a typed input in a downstream handler." + " packed suite as a typed input in a downstream handler\n", + "\n", + "**In this section**\n", + "- [The problem](#the-problem)\n", + "- [Setup](#setup)\n", + "- [Create the custom packager](#create-the-custom-packager)\n", + "- [Test 1: unbundling with \"*\"](#test-1-unbundling-with-)\n", + "- [Test 2: packing as a single file (no unbundling)](#test-2-packing-as-a-single-file-no-unbundling)\n", + "- [Test 3: consuming unbundled artifacts as an EvalSuite input](#test-3-consuming-unbundled-artifacts-as-an-evalsuite-input)\n", + "- [Summary of all three patterns](#summary-of-all-three-patterns)" ] }, { @@ -34,13 +43,13 @@ "Imagine a function that evaluates an AI system on multiple benchmarks and returns\n", "all results in a single object. Without bundling support you have two options:\n", "\n", - "1. **Return the whole object as one artifact** \u2014 opaque, hard to browse individual\n", - " benchmarks in the UI.\n", - "2. 
**Manually break it apart** \u2014 return each DataFrame separately and wire up the\n", - " log hints by hand.\n", + "- **Return the whole object as one artifact** — opaque, hard to browse individual\n", + " benchmarks in the UI\n", + "- **Manually break it apart** — return each DataFrame separately and wire up the\n", + " log hints by hand\n", "\n", "With a bundling packager, you can return the single `EvalSuite` object and use\n", - "`\"*eval_suite\"` in the log hint. The packager automatically unbundles it \u2014 each\n", + "`\"*eval_suite\"` in the log hint. The packager automatically unbundles it — each\n", "benchmark becomes its own `DatasetArtifact`. And when a downstream function needs\n", "the full suite, the packager re-bundles the individual artifacts back into an\n", "`EvalSuite`." @@ -53,7 +62,7 @@ "source": [ "## Setup\n", "\n", - "We'll create a project and define the `EvalSuite` type." + "In this tutorial, you'll create a project and define the `EvalSuite` type." ] }, { @@ -61,7 +70,7 @@ "id": "86681a4581209e58", "metadata": {}, "source": [ - "### Create Project" + "### Create project" ] }, { @@ -92,7 +101,7 @@ "### Define the EvalSuite type\n", "\n", "`EvalSuite` is a thin wrapper around `dict[str, pd.DataFrame]`. It inherits\n", - "from `dict` so it behaves like a normal dictionary but has its own type \u2014 which\n", + "from `dict` so it behaves like a normal dictionary but has its own type — which\n", "is what the packager matches on." ] }, @@ -131,7 +140,7 @@ "id": "d98a7068961dfacd", "metadata": {}, "source": [ - "### Write a Function\n", + "### Write a function\n", "\n", "This handler simulates running three benchmarks and returns the results as\n", "an `EvalSuite`." 
@@ -206,16 +215,16 @@ "id": "a9648c6898587737", "metadata": {}, "source": [ - "## Create the Custom Packager\n", + "## Create the custom packager\n", "\n", "### Write the packager\n", "\n", "The packager supports three modes:\n", - "- **As a single file** (`pack_file` / `unpack_file`) \u2014 saves all DataFrames to a\n", - " single JSON file.\n", - "- **Unbundling** (`unbundle`) \u2014 decomposes the suite into its inner dict so each\n", - " DataFrame is packed separately by the Pandas packager.\n", - "- **Bundling** (`bundle`) \u2014 reconstructs an `EvalSuite` from a dict of DataFrames." + "- **As a single file** (`pack_file` / `unpack_file`) — saves all DataFrames to a\n", + " single JSON file\n", + "- **Unbundling** (`unbundle`) — decomposes the suite into its inner dict so each\n", + " DataFrame is packed separately by the Pandas packager\n", + "- **Bundling** (`bundle`) — reconstructs an `EvalSuite` from a dict of DataFrames" ] }, { @@ -316,7 +325,7 @@ " Unbundle an EvalSuite into its inner dict.\n", "\n", " Each value (a pd.DataFrame) will be packed separately by the\n", - " Pandas packager \u2014 so each benchmark becomes its own DatasetArtifact.\n", + " Pandas packager — so each benchmark becomes its own DatasetArtifact.\n", "\n", " :param bundled_object: The EvalSuite to unbundle.\n", "\n", @@ -348,7 +357,7 @@ "- `BUNDLE_FROM_DICT = True` tells the packager manager that `EvalSuite` can\n", " be constructed from a `dict`. This enables both the `\"*\"` unbundling operator\n", " on output and dict-based input bundling.\n", - "- `unbundle()` simply returns `dict(bundled_object)` \u2014 the plain dict of\n", + "- `unbundle()` simply returns `dict(bundled_object)` — the plain dict of\n", " DataFrames. The manager then packs each DataFrame individually using the\n", " built-in Pandas packager (as `DatasetArtifact`s).\n", "- `bundle()` wraps a dict back into `EvalSuite(collection)`. 
The manager\n", @@ -387,7 +396,7 @@ "\n", "Using `\"*eval_suite\"` in the log hint triggers unbundling. The packager calls\n", "`unbundle()` to get the inner dict, then the manager packs each DataFrame\n", - "individually \u2014 creating three separate `DatasetArtifact`s." + "individually — creating three separate `DatasetArtifact`s." ] }, { @@ -454,7 +463,7 @@ "run_single = fn.run(\n", " local=True,\n", " params={\"num_samples\": 5},\n", - " returns=[\"eval_suite\"], # no * \u2192 single file artifact\n", + " returns=[\"eval_suite\"], # no * → single file artifact\n", ")" ] }, @@ -528,7 +537,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Pass the unbundled artifacts \u2014 the packager re-bundles them into an EvalSuite\n", + "# Pass the unbundled artifacts — the packager re-bundles them into an EvalSuite\n", "summarize_run = summarize_fn.run(\n", " local=True,\n", " inputs={\"suite\": run_unbundled.outputs[\"eval_suite\"]},\n", diff --git a/docs/concepts/packagers/custom_packagers/pack-only-tutorial.ipynb b/docs/concepts/packagers/custom_packagers/pack-only-tutorial.ipynb index c9e64848dfd..d3d6c754557 100644 --- a/docs/concepts/packagers/custom_packagers/pack-only-tutorial.ipynb +++ b/docs/concepts/packagers/custom_packagers/pack-only-tutorial.ipynb @@ -6,7 +6,7 @@ "metadata": {}, "source": [ "(pack-only-tutorial)=\n", - "# Creating a Custom Packager - Pack-only Tutorial\n", + "# Creating a custom packager - pack-only tutorial\n", "\n", "\n", "This tutorial walks through creating a **pack-only** custom packager for\n", @@ -14,12 +14,20 @@ "images are produced by functions but not consumed as typed inputs.\n", "\n", "You will learn how to:\n", - "- Subclass `DefaultPackager` and set the packable type.\n", - "- Implement `pack_file()` to save an image as a file artifact.\n", - "- Implement `pack_plot()` to create an inline-preview artifact.\n", - "- Clean up temporary files with `add_future_clearing_path()`.\n", - "- Add packing kwargs so users can choose the 
image format.\n", - "- Register and test the packager." + "- Subclass `DefaultPackager` and set the packable type\n", + "- Implement `pack_file()` to save an image as a file artifact\n", + "- Implement `pack_plot()` to create an inline-preview artifact\n", + "- Clean up temporary files with `add_future_clearing_path()`\n", + "- Add packing kwargs so users can choose the image format\n", + "- Register and test the packager\n", + "\n", + "**In this section**\n", + "- [The problem](#the-problem)\n", + "- [Setup](#setup)\n", + "- [Create the custom packager](#create-the-custom-packager)\n", + "- [Test 1: pack as plot (default)](#test-1-pack-as-plot-default)\n", + "- [Test 2: pack as file with format kwarg](#test-2-pack-as-file-with-format-kwarg)\n", + "- [Recap](#recap)" ] }, { @@ -32,10 +40,10 @@ "Without a custom packager, returning a `PIL.Image.Image` from an MLRun function\n", "causes the default packager to **pickle** it. The resulting `.pkl` file is:\n", "\n", - "- **Opaque** — you can't view the image in the MLRun UI or a file browser.\n", - "- **Fragile** — pickle files are tied to the exact Pillow version.\n", + "- **Opaque** — you can't view the image in the MLRun UI or a file browser\n", + "- **Fragile** — pickle files are tied to the exact Pillow version\n", "\n", - "We want images saved as standard PNG (or JPEG) files that anyone can view." + "Ideally, images are saved as standard PNG (or JPEG) files that anyone can view." ] }, { @@ -51,7 +59,7 @@ "id": "b2000001", "metadata": {}, "source": [ - "### Create Project" + "### Create project" ] }, { @@ -79,7 +87,7 @@ "id": "b2000002", "metadata": {}, "source": [ - "### Write a Function\n", + "### Write a function\n", "\n", "This handler generates a simple gradient image using Pillow — no external\n", "data or API keys required." 
@@ -136,12 +144,12 @@
    "id": "b2000003",
    "metadata": {},
    "source": [
-    "## Create the Custom Packager\n",
+    "## Create the custom packager\n",
     "\n",
     "### Write the packager\n",
     "\n",
     "The packager is written to a `.py` file so it can be imported by MLRun at runtime.\n",
-    "Let's walk through each piece."
+    "The following sections walk through each piece."
   ]
  },
  {
@@ -231,7 +239,22 @@
   "cell_type": "markdown",
   "id": "b1000008",
   "metadata": {},
-  "source": "Notice:\n\n- **`PACKABLE_OBJECT_TYPE = Image.Image`** tells the packager manager that this\n  packager handles PIL Images.\n- **`PACK_SUBCLASSES = True`** so subclasses (e.g. `JpegImageFile`) are also handled.\n- **`DEFAULT_PACKING_ARTIFACT_TYPE = ArtifactType.PLOT`** means that if the user\n  doesn't specify an artifact type, the image is packed as an inline preview.\n- **`pack_file()`** accepts a `format` kwarg — users can override it via log hints\n  like `\"image : file[format=jpeg]\"`. The temp directory is registered for cleanup.\n- **`pack_plot()`** converts the image to PNG bytes and wraps them in a\n  `PlotArtifact` for inline display in the MLRun UI.\n- This is a true **pack-only** packager — no `unpack_*` methods are needed.\n  `DefaultPackager` discovers packing and unpacking artifact types independently,\n  so `\"file\"` and `\"plot\"` are automatically available for packing only."
+  "source": [
+    "Notice:\n",
+    "\n",
+    "- **`PACKABLE_OBJECT_TYPE = Image.Image`** tells the packager manager that this\n",
+    "  packager handles PIL Images\n",
+    "- **`PACK_SUBCLASSES = True`** so subclasses (e.g. `JpegImageFile`) are also handled\n",
+    "- **`DEFAULT_PACKING_ARTIFACT_TYPE = ArtifactType.PLOT`** means that if the user\n",
+    "  doesn't specify an artifact type, the image is packed as an inline preview\n",
+    "- **`pack_file()`** accepts a `format` kwarg — users can override it via log hints\n",
+    "  like `\"image : file[format=jpeg]\"`. 
The temp directory is registered for cleanup\n",
+    "- **`pack_plot()`** converts the image to PNG bytes and wraps them in a\n",
+    "  `PlotArtifact` for inline display in the MLRun UI\n",
+    "- This is a true **pack-only** packager — no `unpack_*` methods are needed;\n",
+    "  `DefaultPackager` discovers packing and unpacking artifact types independently,\n",
+    "  so `\"file\"` and `\"plot\"` are automatically available for packing only\n"
+   ]
  },
  {
   "cell_type": "markdown",
diff --git a/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb b/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb
index 999105cc782..5c86816aae4 100644
--- a/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb
+++ b/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb
@@ -6,7 +6,7 @@
   "metadata": {},
   "source": [
    "(round-trip-tutorial)=\n",
-    "# Creating a Custom Packager - Round-trip (pack + unpack) Tutorial\n",
+    "# Creating a custom packager - round-trip (pack + unpack) tutorial\n",
    "\n",
    "This tutorial walks through creating a **round-trip** custom packager for\n",
    "`langchain_core.prompts.ChatPromptTemplate`. Round-trip means the packager handles\n",
@@ -14,15 +14,23 @@
    "a prompt template produced by one function can be consumed as a typed input by another.\n",
    "\n",
    "You will learn how to:\n",
-    "- Implement `pack_file()` to serialize a `ChatPromptTemplate` to a human-readable JSON file.\n",
-    "- Implement `unpack_file()` to load it back from a `DataItem`.\n",
-    "- Test the full round-trip: one handler produces a template, another consumes it.\n",
+    "- Implement `pack_file()` to serialize a `ChatPromptTemplate` to a human-readable JSON file\n",
+    "- Implement `unpack_file()` to load it back from a `DataItem`\n",
+    "- Test the full round-trip: one handler produces a template, another consumes it\n",
    "\n",
-    "```{admonition} Prerequisites\n",
+    "```{admonition} Prerequisite\n",
    "This tutorial requires `langchain-core`. 
Install it with:\n",
    "\n",
-    "    pip install langchain\n",
-    "```"
+    "    pip install langchain-core\n",
+    "```\n",
+    "\n",
+    "**In this section**\n",
+    "- [The problem](#the-problem)\n",
+    "- [Setup](#setup)\n",
+    "- [Create the custom packager](#create-the-custom-packager)\n",
+    "- [Test 1: pack a prompt template](#test-1-pack-a-prompt-template)\n",
+    "- [Test 2: round-trip — consume as typed input](#test-2-round-trip--consume-as-typed-input)\n",
+    "- [Recap](#recap)"
   ]
  },
 {
@@ -35,11 +43,11 @@
    "Without a custom packager, returning a `ChatPromptTemplate` from an MLRun function\n",
    "causes it to be **pickled**. The resulting `.pkl` file is:\n",
    "\n",
-    "- **Opaque** — you can't read or review the prompt structure.\n",
-    "- **Fragile** — pickle files break when LangChain versions change.\n",
+    "- **Opaque** — you can't read or review the prompt structure\n",
+    "- **Fragile** — pickle files break when LangChain versions change\n",
    "\n",
-    "We want the template saved as a readable **JSON** file that captures the message\n",
-    "structure, and we want to load it back as a `ChatPromptTemplate` in downstream\n",
+    "Instead, the template should be saved as a readable **JSON** file that captures the message\n",
+    "structure, and you can load it back as a `ChatPromptTemplate` in downstream\n",
    "functions."
   ]
  },
 {
@@ -56,7 +64,7 @@
   "id": "c2000001",
   "metadata": {},
   "source": [
-    "### Create Project"
+    "### Create project"
   ]
  },
 {
@@ -84,7 +92,7 @@
   "id": "c2000002",
   "metadata": {},
   "source": [
-    "### Write a Function\n",
+    "### Write a function\n",
    "\n",
    "This handler creates a `ChatPromptTemplate` with template variables (`{domain}`,\n",
    "`{response}`) and returns it. The packager will serialize it to JSON."
@@ -136,7 +144,7 @@
    "id": "c2000003",
    "metadata": {},
    "source": [
-    "## Create the Custom Packager\n",
+    "## Create the custom packager\n",
     "\n",
     "### Write the packager\n",
     "\n",
@@ -256,14 +264,14 @@
     "\n",
     "- **`DEFAULT_PACKING_ARTIFACT_TYPE = \"file\"`** and\n",
     "  **`DEFAULT_UNPACKING_ARTIFACT_TYPE = \"file\"`** — both packing and unpacking\n",
-    "  default to the `file` artifact type.\n",
+    "  default to the `file` artifact type\n",
     "- **`pack_file()`** extracts each message's role and template string into a\n",
-    "  clean JSON structure. This is human-readable — you can inspect the artifact\n",
-    "  in any JSON viewer.\n",
+    "  clean JSON structure; this is human-readable — you can inspect the artifact\n",
+    "  in any JSON viewer\n",
     "- **`unpack_file()`** reads the JSON and reconstructs the template using\n",
-    "  `ChatPromptTemplate.from_messages()` — the standard LangChain constructor.\n",
+    "  `ChatPromptTemplate.from_messages()` — the standard LangChain constructor\n",
     "- The `role_map` translates LangChain class names back to the short role\n",
-    "  strings (`\"system\"`, `\"human\"`, `\"ai\"`) that `from_messages()` expects."
+    "  strings (`\"system\"`, `\"human\"`, `\"ai\"`) that `from_messages()` expects"
    ]
   },
   {
diff --git a/docs/concepts/packagers/index.md b/docs/concepts/packagers/index.md
index 70a2dab1969..257568c4f95 100644
--- a/docs/concepts/packagers/index.md
+++ b/docs/concepts/packagers/index.md
@@ -6,439 +6,13 @@ Instead of manually handling `DataItem`s, artifact uploads, and serialization co
 you write standard Python functions with type hints and return values — and MLRun
 automatically handles input parsing, output logging, and artifact creation.
 
+**In this section**
+
 ```{toctree}
 :maxdepth: 1
 
+packagers-overview
 packagers-tutorial
 custom_packagers/index
 ```
 
-## What are packagers?
-
-Writing a function locally and running it remotely should feel identical. 
For example,
-running locally, your function accepts a DataFrame and returns a cleaned dataset — objects
-live in memory and everything just works. But when you move that same function to a remote
-job, Python objects can't be sent over the wire, and whatever the function returns simply
-disappears into the void.
-
-Packagers bridge this gap: they serialize inputs before they reach your code and
-capture outputs after it runs, handling all the MLRun-specific I/O behind the scenes so
-your function stays pure Python regardless of where it executes.
-
-Every MLRun function has two I/O touch-points:
-
-1. **Input parsing** — automatically cast `DataItem` inputs to the type-hinted Python type (e.g. `pd.DataFrame`,
-   `np.ndarray`, `dict`).
-2. **Output logging** — automatically serialize, log, and upload returned objects as artifacts or results based on log
-   hints.
-
-The flow looks like this:
-
-**Input flow:** `inputs={"data": "store://..."}` → `DataItem` → packager `unpack()` → typed Python object →
-your function
-
-**Output flow:** your function `return` → Python object → packager `pack()` → `Artifact` / `Result` → artifact store
-
-## Why use packagers?
-
-Packagers offer several advantages over manual artifact handling and the legacy context-based API.
-
-### Better and faster learning curve
-
-With packagers you don't need to learn about `Artifact`s, `DataItem`s, or the MLRun context object. You write standard
-Python with type hints and return values — MLRun wraps your existing code without changing it. 
-
-**Before** — manual artifact handling:
-
-```python
-import mlrun
-import pandas as pd
-
-
-def clean_data(context: mlrun.MLClientCtx, raw_data: mlrun.DataItem):
-    # Parse input manually
-    df = raw_data.as_df()
-
-    # Drop rows with missing values and duplicates
-    cleaned = df.dropna().drop_duplicates()
-    row_count = len(cleaned)
-
-    # Log outputs manually
-    context.log_result("row_count", row_count)
-    context.log_dataset("cleaned_data", df=cleaned, format="parquet")
-```
-
-**After** — with packagers:
-
-```python
-import pandas as pd
-
-
-def clean_data(raw_data: pd.DataFrame) -> tuple[int, pd.DataFrame]:
-    cleaned = raw_data.dropna().drop_duplicates()
-    return len(cleaned), cleaned
-```
-
-The function is pure Python — no MLRun imports, no manual serialization. When you run it with:
-
-```python
-fn.run(
-    handler="clean_data",
-    inputs={"raw_data": "store://my-raw-data"},
-    returns=["row_count", "cleaned_data : dataset"],
-)
-```
-
-MLRun automatically converts the `DataItem` to a DataFrame on input and logs the row count as a result and the
-cleaned DataFrame as a dataset artifact on output.
-
-### Uniformity of artifacts between users and projects
-
-ML engineers can define how artifacts are serialized once and have that convention enforced
-across every development notebook, CI pipeline, and production project in the organization.
-Because packagers standardize the serialization format, artifacts become truly portable — a
-DataFrame logged in one project can be consumed by a function in a completely different
-project without conversion steps or format mismatches (provided, of course, that access is
-allowed across these projects).
-
-In a pipeline, functions don't need to agree on file formats or know about MLRun's artifact
-API. 
The producer just `return`s the object and the consumer receives it as a typed -parameter: - -```python -# Producer — returns a DataFrame -def prepare_data(raw: pd.DataFrame) -> pd.DataFrame: - return raw.dropna() - - -# Consumer — receives a DataFrame directly -def train_model(data: pd.DataFrame): ... -``` - -No manual `data_item.as_df()` calls, no format negotiation — the same artifact flows -cleanly between functions, projects, and teams. - -### Adaptive to user needs - -MLRun provides common built-in packagers with rich options and configurations. For -example, you can control the output format with a single log hint string: - -```python -returns = ['data : dataset[format="parquet"]'] -``` - -or equivalently using the {py:class}`~mlrun.package.log_hint.LogHint` class for full -control: - -```python -from mlrun import LogHint - -returns = [ - LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "parquet"}) -] -``` - -Either form replaces manual pandas I/O and artifact construction. - -Beyond built-in packagers, MLRun supports **custom packagers** that you write and -register in your project to handle domain-specific types. -See the {ref}`custom packagers tutorials `. - -## How to use packagers - -### Parsing inputs with type hints - -Type hints on function parameters tell packagers what Python type each input -should be converted to. When you pass a value via `inputs={}`, it arrives as a -`DataItem`. The packager looks at the type hint and automatically converts it -to the declared type — `pd.DataFrame`, `dict`, `np.ndarray`, etc. - -```python -def my_handler(data: pd.DataFrame, config: dict): - # `data` is already a DataFrame — no .as_df() needed - # `config` is already a dict — no .get() / json.loads() needed - ... -``` - -```python -fn.run( - handler="my_handler", - inputs={"data": "store://my-dataset", "config": "store://my-config"}, -) -``` - -Packagers are enabled by default (`mlrun.mlconf.packagers.enabled = True`). 
-When enabled, the runtime automatically parses all type-hinted arguments -that are passed via `inputs={}`. To disable parsing for a specific run, -set `mlrun.mlconf.packagers.enabled = False`. - -### Logging outputs with log hints - -A log hint tells MLRun how to log a single returned value — what key to store it -under, what artifact type to use, and any serialization options. Log hints are -passed via the `returns` parameter on `function.run()`: - -```python -fn.run( - handler="train", - inputs={"dataset": "store://my-dataset"}, - returns=["accuracy", "X_test : dataset", "model : model"], -) -``` - -Each entry in the `returns` list is a log hint — either a `LogHint` object or a -string shortcut. The sections below cover artifact types, the LogHint class, and -the string shortcut format. - -#### Artifact types - -The artifact type is a string that determines how an object is serialized and -what metadata is stored. MLRun defines common types in `mlrun.package.ArtifactType`, -but custom packagers can implement any artifact type string they need — these are -just conventions that built-in packagers share: - -| Type | Description | Typical objects | -|------|-------------|-----------------| -| `result` | Scalar/simple value stored in run metadata | `int`, `float`, `str`, small `dict`/`list` | -| `dataset` | Tabular data logged as a `DatasetArtifact` | `pd.DataFrame` | -| `file` | Generic file upload | `np.ndarray`, `bytes`, large dicts | -| `model` | ML model artifact | scikit-learn models, torch models | -| `plot` | Visualization | matplotlib figures | -| `object` | Pickle serialization (fallback) | Any Python object | -| `path` | File/directory path | `str`, `pathlib.Path` | - -If you don't specify an artifact type, the packager for the object's type chooses -a sensible default. Custom packagers define their own defaults via -`DEFAULT_PACKING_ARTIFACT_TYPE`. 
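As a rough illustration only — this simplified chooser is not MLRun's actual code, and the real
defaults live in each packager class — the idea of per-type defaults can be sketched as:

```python
import pandas as pd


def default_artifact_type(obj) -> str:
    """Sketch: choose a default artifact type from the object's Python type."""
    if isinstance(obj, (bool, int, float, str)):
        return "result"  # scalars go into run metadata
    if isinstance(obj, pd.DataFrame):
        return "dataset"  # tabular data becomes a DatasetArtifact
    return "object"  # fallback: pickle serialization


print(default_artifact_type(0.93))  # result
print(default_artifact_type(pd.DataFrame({"a": [1]})))  # dataset
```

A custom packager plays the same role for its own type: its class-level default is used whenever
the log hint omits an explicit artifact type.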
-
-##### Asymmetric (pack-only / unpack-only) artifact types
-
-Packing and unpacking artifact types are discovered independently. A `DefaultPackager`
-subclass with `pack_foo` but no `unpack_foo` supports `"foo"` for packing only —
-`is_packable` accepts it but `is_unpackable` rejects it. The reverse also applies:
-`unpack_bar` without `pack_bar` means `"bar"` is unpack-only.
-
-Common scenarios:
-
-- **Pack-only** — saving plots as images, logging summary metrics as plain results,
-  rendering a model to an image (the PNG can't be deserialized back to the original
-  object).
-- **Unpack-only** — legacy/migration support (e.g. `unpack_v1` reads artifacts from
-  an older packager version while new writes always use `pack_v2`); cross-format
-  compatibility (e.g. a DataFrame packager can `unpack_csv` to read manually-logged
-  CSV artifacts but always `pack_parquet` for new outputs).
-
-#### The LogHint class
-
-A {py:class}`~mlrun.package.log_hint.LogHint` gives you full control over logging —
-artifact type, labels, extra data, metrics, and more:
-
-```python
-from mlrun import LogHint
-
-returns = [
-    LogHint(key="model", artifact_type="model", labels={"version": "1"}),
-    LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "csv"}),
-]
-```
-
-A `LogHint` has the following fields:
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `key` | `str` | **Required.** The artifact key to log the object under. |
-| `artifact_type` | `str \| None` | The artifact type (e.g. `"dataset"`, `"model"`, `"result"`). If `None`, the packager's default is used. |
-| `tag` | `str` | Tag for the artifact. Default: `""`. |
-| `itemized` | `bool \| int` | Unbundling control. `False` (default): log as one artifact. `True`: fully unbundle. `int`: unbundle to that depth. |
-| `packing_kwargs` | `dict` | Extra keyword arguments passed to the packager's `pack_<artifact type>()` method (e.g. `{"format": "parquet"}`). 
|
-| `labels` | `dict[str, str]` | Labels to add to the logged artifact. |
-| `extra_data` | `dict` | Extra data to attach to the artifact. Use `...` (Ellipsis) as a value to link to another package by key. |
-| `metrics` | `dict` | Metrics to log alongside a model artifact. Use `...` (Ellipsis) as a value to link to another package by key. |
-
-#### Linking artifacts
-
-When a function returns multiple values, you can **link** them together so that
-related outputs are attached to a primary artifact. For example, you might want a
-model artifact to carry its evaluation metrics and supporting artifacts (plots, test
-data) as part of its metadata. This is done through the `extra_data` and `metrics`
-fields of `LogHint`, using Python's `...` (Ellipsis) as a placeholder meaning
-"fill this in with the package that has this key."
-
-Consider a training function that returns a model alongside its metrics, a loss plot,
-and a test dataset:
-
-```python
-def train(dataset: pd.DataFrame):
-    # ... training logic ...
-    return my_model, some_result, loss_plot, test_dataset
-
-
-fn.run(
-    handler="train",
-    inputs={"dataset": "store://my-dataset"},
-    returns=[
-        LogHint(
-            key="my_model",
-            artifact_type="model",
-            metrics={"some_result": ...},
-            extra_data={"loss_plot": ..., "test_dataset": ...},
-        ),
-        "some_result : result",
-        "loss_plot : plot",
-        "test_dataset : dataset",
-    ],
-)
-```
-
-After all four values are packed, the packager manager resolves every `...`:
-
-- `"some_result"` is a result (scalar), so it is placed into the model's `metrics`.
-- `"loss_plot"` and `"test_dataset"` are artifacts, so they are placed into the
-  model's `extra_data`.
-
-The result is a model artifact with its evaluation metrics and supporting data
-attached directly — visible as a single unit in the MLRun UI.
-
-```{note}
-**Linking rules:**
-
-* `metrics` is available only on **model** artifacts and can link to **results** only
-  (scalar values in run metadata). 
-* `extra_data` works with any artifact type and can link to both artifacts and results. -* If a referenced key is not found among the packed outputs, the entry is removed and - a warning is logged. -* The order of items in `returns` does not matter — linking is resolved after all - packing is complete. -``` - -#### String shortcut - -The most common way to specify a log hint. A string shortcut has up to four parts: - -| Part | Syntax | Purpose | Example | -|------|--------|---------|---------| -| **Key** (required) | `""` | The artifact name | `"accuracy"` | -| **Artifact type** | `" : "` | Override the default type | `"data : dataset"` | -| **Packing kwargs** | `" : [k='v', ...]"` | Pass options to the packager | `'data : dataset[format="parquet"]'` | -| **Itemization prefix** | `"*"` or `"*"` | Unbundle a collection | `"*results"`, `"2*results"` | - -Examples and their `LogHint` equivalents: - -| String | Equivalent LogHint | -|--------|--------------------| -| `"accuracy"` | `LogHint(key="accuracy")` | -| `"data : dataset"` | `LogHint(key="data", artifact_type="dataset")` | -| `'data : dataset[format="parquet"]'` | `LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "parquet"})` | -| `"*results"` | `LogHint(key="results", itemized=True)` | -| `"2*results"` | `LogHint(key="results", itemized=2)` | - -(unbundling)= -#### Itemization (unbundling) - -Unbundling breaks a collection (list or dict) into separate artifacts, each logged individually. This is useful when a -function returns a dictionary of DataFrames and you want each one as its own dataset artifact. - -```python -def evaluate(data: pd.DataFrame) -> dict[str, pd.DataFrame]: - """Returns per-category evaluation results.""" - results = {} - for category in data["category"].unique(): - subset = data[data["category"] == category] - results[category] = compute_metrics(subset) - return results -``` - -Without unbundling, the entire dict is logged as a single artifact. 
With unbundling: - -```python -fn.run(handler="evaluate", inputs={"data": "store://eval-data"}, returns=["*results"]) -``` - -Each DataFrame in the dict becomes its own dataset artifact, keyed as `results_`. - -**Depth control** - -- `"*results"` or `itemized=True` — fully recursive unbundling. -- `"2*results"` or `itemized=2` — unbundle up to 2 levels deep. Nested collections beyond that depth are logged as - single artifacts. - -## Configuration - -Packager behavior is controlled by settings under `mlrun.mlconf.packagers`: - -| Setting | Default | Description | -|---------|---------|-------------| -| `enabled` | `True` | Master switch. When enabled, MLRun automatically wraps every function execution with the packager handler — parsing typed inputs and logging returned outputs. Set to `False` to disable all packager functionality. | -| `auto_unpack_inputs` | `False` | When `True`, inputs that have **no type hint** are still automatically unpacked if they were originally logged via packagers. When `False` (default), un-hinted inputs remain as raw `DataItem` objects. | -| `auto_pack_outputs` | `False` | When `True`, returned objects are packed even if no log hints were provided by the user. The artifact key follows the pattern `--` where `i` is enumerated. When `False` (default), returned objects without log hints are ignored. | -| `auto_pack_key` | `"artifact"` | The base key used in the auto-generated artifact name when `auto_pack_outputs` is enabled. | -| `pack_tuples` | `False` | When `True`, returned tuples are treated as a single tuple object and packed together. When `False` (default), each element of a returned tuple is packed as a separate output — enabling functions to return multiple items via `return a, b, c`. | -| `logging_worker` | `0` | In multi-worker runs, only the worker with this rank packs outputs and logs results/artifacts. Other workers skip logging to avoid overriding each other. Default is `0` (the main worker). 
| - -You can change these settings globally: - -```python -import mlrun - -mlrun.mlconf.packagers.auto_unpack_inputs = True -``` - -```{note} -You can also set these options via environment variables. Use the `MLRUN_` prefix -with `__` (double underscore) as the nesting separator: - - MLRUN_PACKAGERS__ENABLED=true - MLRUN_PACKAGERS__AUTO_PACK_OUTPUTS=true -``` - -## Built-in packagers - -MLRun includes packagers for common Python types. All built-in packagers are available automatically — no registration -needed. - -### Python standard library - -Handles `None`, `int`, `float`, `bool`, `str`, `dict`, `list`, `tuple`, `set`, -`frozenset`, `bytes`, `bytearray`, and `pathlib.Path`. - -API reference: {py:mod}`~mlrun.package.packagers.python_standard_library_packagers` - -### NumPy - -Handles `np.ndarray`, `np.number`, and collections of arrays (`list[np.ndarray]`, -`dict[str, np.ndarray]`). - -API reference: {py:mod}`~mlrun.package.packagers.numpy_packagers` - -### Pandas - -Handles `pd.DataFrame` and `pd.Series`. - -API reference: {py:mod}`~mlrun.package.packagers.pandas_packagers` - -### Default (fallback) - -Any unrecognized type is handled by the {py:class}`~mlrun.package.packagers.default_packager.DefaultPackager`, which -serializes objects using `cloudpickle` (or any pickling module configured). The default artifact type is `object`. - -(custom-packagers)= -## Creating a custom packager - -When a built-in packager doesn't handle your type (or you want human-readable serialization -instead of pickle), you can write a custom packager. The -{ref}`custom packagers guide ` walks through the full process — -choosing a base class, setting class variables, implementing pack/unpack methods, and -registering the packager in your project. 
- -```{note} -When running remotely, set the project source with `pull_at_runtime=True` -so the packager module can be imported on the remote worker: - - project.set_source(source="./", pull_at_runtime=True) -``` - - -**See also** -- {ref}`auto-logging-mlops` — framework-specific auto-logging with `apply_mlrun()` -- {ref}`working-with-data-and-model-artifacts` — manual artifact handling -- {py:mod}`mlrun.package` — API reference diff --git a/docs/concepts/packagers/packagers-overview.md b/docs/concepts/packagers/packagers-overview.md new file mode 100644 index 00000000000..19304deeb60 --- /dev/null +++ b/docs/concepts/packagers/packagers-overview.md @@ -0,0 +1,443 @@ +(packagers-overview)= +# Packagers overview + +Learn about built-in and custom packagers, and how to configure and use them. + +**In this section** +- [What are packagers?](#what-are-packagers) +- [Why use packagers?](#why-use-packagers) +- [How to use packagers](#how-to-use-packagers) +- [Configuration](#configuration) +- [Built-in packagers](#built-in-packagers) +- [Creating a custom packager](#creating-a-custom-packager) + + +## What are packagers? + +Writing a function locally and running it remotely should feel identical. For example, +running locally, your function accepts a DataFrame and returns a cleaned dataset — objects +live in memory and everything just works. But when you move that same function to a remote +job, Python objects can't be sent over the wire, and whatever the function returns simply +disappears into the void. + +Packagers bridge this gap: they serialize inputs before they reach your code and +capture outputs after it runs, handling all the MLRun-specific I/O behind the scenes so +your function stays pure Python regardless of where it executes. + +Every MLRun function has two I/O touch-points: + +- **Input parsing** — automatically cast `DataItem` inputs to the type-hinted Python type (e.g. 
`pd.DataFrame`, + `np.ndarray`, `dict`) +- **Output logging** — automatically serialize, log, and upload returned objects as artifacts or results based on log + hints + +The flow looks like this: + +**Input flow:** `inputs={"data": "store://..."}` → `DataItem` → packager `unpack()` → typed Python object → +your function + +**Output flow:** your function `return` → Python object → packager `pack()` → `Artifact` / `Result` → artifact store + +## Why use packagers? + +Packagers offer several advantages over manual artifact handling and the legacy context-based API. + +### Better and faster learning curve + +With packagers you don't need to learn about `Artifact`s, `DataItem`s, or the MLRun context object. You write standard +Python with type hints and returning values — MLRun wraps your existing code without changing it. + +**Before** — manual artifact handling: + +```python +import mlrun +import pandas as pd + + +def clean_data(context: mlrun.MLClientCtx, raw_data: mlrun.DataItem): + # Parse input manually + df = raw_data.as_df() + + # Drop rows with missing values and duplicates + cleaned = df.dropna().drop_duplicates() + row_count = len(cleaned) + + # Log outputs manually + context.log_result("row_count", row_count) + context.log_dataset("cleaned_data", df=cleaned, format="parquet") +``` + +**After** — with packagers: + +```python +import pandas as pd + + +def clean_data(raw_data: pd.DataFrame) -> tuple[int, pd.DataFrame]: + cleaned = raw_data.dropna().drop_duplicates() + return len(cleaned), cleaned +``` + +The function is pure Python — no MLRun imports, no manual serialization. When you run it with: + +```python +fn.run( + handler="clean_data", + inputs={"raw_data": "store://my-raw-data"}, + returns=["row_count", "cleaned_data : dataset"], +) +``` + +MLRun automatically converts the `DataItem` to a DataFrame on input and logs the row count as a result and the +cleaned DataFrame as a dataset artifact on output. 
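Because the packagers-based handler is pure Python, it can also be unit-tested without any MLRun machinery: there is no context object to mock and no `DataItem` stubs to build. A quick sketch (the sample data is made up for illustration):

```python
import pandas as pd


# Same handler as above — pure Python, no MLRun imports.
def clean_data(raw_data: pd.DataFrame) -> tuple[int, pd.DataFrame]:
    cleaned = raw_data.dropna().drop_duplicates()
    return len(cleaned), cleaned


# A plain in-memory DataFrame stands in for the `store://` input.
raw = pd.DataFrame({"a": [1.0, 1.0, None, 3.0], "b": ["x", "x", "y", "z"]})
row_count, cleaned = clean_data(raw)
assert row_count == 2  # one NaN row and one duplicate row dropped
assert cleaned["a"].tolist() == [1.0, 3.0]
```

The same function body runs unchanged in a test, a notebook, or a remote MLRun job.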
+
+### Uniformity of artifacts between users and projects
+
+ML engineers can establish a standardized method for artifact serialization once, ensuring consistent enforcement
+across every development notebook, CI pipeline, and production project in the organization.
+Because packagers standardize the serialization format, artifacts become truly portable — a
+DataFrame logged in one project can be consumed by a function in a completely different
+project without conversion steps or format mismatches (assuming, of course, that access is allowed across
+these projects).
+
+In a pipeline, functions don't need to agree on file formats or know about MLRun's artifact
+API. The producer just `return`s the object and the consumer receives it as a typed
+parameter:
+
+```python
+# Producer — returns a DataFrame
+def prepare_data(raw: pd.DataFrame) -> pd.DataFrame:
+    return raw.dropna()
+
+
+# Consumer — receives a DataFrame directly
+def train_model(data: pd.DataFrame): ...
+```
+
+No manual `data_item.as_df()` calls, no format negotiation — the same artifact flows
+cleanly between functions, projects, and teams.
+
+### Adaptive to user needs
+
+MLRun provides common built-in packagers with rich options and configurations. For
+example, you can control the output format with a single log hint string:
+
+```python
+returns = ['data : dataset[format="parquet"]']
+```
+
+or equivalently using the {py:class}`~mlrun.package.log_hint.LogHint` class for full
+control:
+
+```python
+from mlrun import LogHint
+
+returns = [
+    LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "parquet"})
+]
+```
+
+Either form replaces manual pandas I/O and artifact construction.
+
+Beyond built-in packagers, MLRun supports **custom packagers** that you write and
+register in your project to handle domain-specific types.
+See the {ref}`custom packagers tutorials `.
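To make the shortcut grammar concrete, the string form shown above can be thought of as shorthand for `LogHint` fields. The parser below is purely illustrative (it is not MLRun's internal code, and `parse_log_hint` is a hypothetical name), but it captures how the key, artifact type, packing kwargs, and itemization prefix map onto the class:

```python
import re


def parse_log_hint(hint: str) -> dict:
    """Illustrative parser for the log-hint shortcut grammar
    '<depth>*<key> : <type>[k=v, ...]' -- NOT MLRun's actual parser."""
    itemized = False
    star = re.match(r"(\d*)\*(.+)", hint)
    if star:  # leading '*' or '<depth>*' requests unbundling
        itemized = int(star.group(1)) if star.group(1) else True
        hint = star.group(2)
    key, _, type_part = (part.strip() for part in hint.partition(":"))
    artifact_type, kwargs = (type_part or None), {}
    if type_part and type_part.endswith("]"):  # trailing [...] holds packing kwargs
        artifact_type, _, raw = type_part.partition("[")
        for pair in raw.rstrip("]").split(","):
            k, _, v = pair.partition("=")
            kwargs[k.strip()] = v.strip().strip("'\"")
    return {"key": key, "artifact_type": artifact_type,
            "itemized": itemized, "packing_kwargs": kwargs}


assert parse_log_hint("accuracy") == {
    "key": "accuracy", "artifact_type": None,
    "itemized": False, "packing_kwargs": {},
}
assert parse_log_hint('data : dataset[format="parquet"]')["packing_kwargs"] == {
    "format": "parquet"
}
assert parse_log_hint("2*results")["itemized"] == 2
```

Every string shortcut is therefore exactly equivalent to some `LogHint`, which is why the two forms are interchangeable in `returns`.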
+ +## How to use packagers + +### Parsing inputs with type hints + +Type hints on function parameters tell packagers what Python type each input +should be converted to. When you pass a value via `inputs={}`, it arrives as a +`DataItem`. The packager looks at the type hint and automatically converts it +to the declared type — `pd.DataFrame`, `dict`, `np.ndarray`, etc. + +```python +def my_handler(data: pd.DataFrame, config: dict): + # `data` is already a DataFrame — no .as_df() needed + # `config` is already a dict — no .get() / json.loads() needed + ... +``` + +```python +fn.run( + handler="my_handler", + inputs={"data": "store://my-dataset", "config": "store://my-config"}, +) +``` + +Packagers are enabled by default (`mlrun.mlconf.packagers.enabled = True`). +When enabled, the runtime automatically parses all type-hinted arguments +that are passed via `inputs={}`. To disable parsing for a specific run, +set `mlrun.mlconf.packagers.enabled = False`. + +### Logging outputs with log hints + +A log hint tells MLRun how to log a single returned value — what key to store it +under, what artifact type to use, and any serialization options. Log hints are +passed via the `returns` parameter on `function.run()`: + +```python +fn.run( + handler="train", + inputs={"dataset": "store://my-dataset"}, + returns=["accuracy", "X_test : dataset", "model : model"], +) +``` + +Each entry in the `returns` list is a log hint — either a `LogHint` object or a +string shortcut. The sections below cover artifact types, the LogHint class, and +the string shortcut format. + +#### Artifact types + +The artifact type is a string that determines how an object is serialized and +what metadata is stored. 
MLRun defines common types in `mlrun.package.ArtifactType`, +but custom packagers can implement any artifact type string they need — these are +just conventions that built-in packagers share: + +| Type | Description | Typical objects | +|------|-------------|-----------------| +| `result` | Scalar/simple value stored in run metadata | `int`, `float`, `str`, small `dict`/`list` | +| `dataset` | Tabular data logged as a `DatasetArtifact` | `pd.DataFrame` | +| `file` | Generic file upload | `np.ndarray`, `bytes`, large dicts | +| `model` | ML model artifact | scikit-learn models, torch models | +| `plot` | Visualization | matplotlib figures | +| `object` | Pickle serialization (fallback) | Any Python object | +| `path` | File/directory path | `str`, `pathlib.Path` | + +If you don't specify an artifact type, the packager for the object's type chooses +a sensible default. Custom packagers define their own defaults via +`DEFAULT_PACKING_ARTIFACT_TYPE`. + +##### Asymmetric (pack-only / unpack-only) artifact types + +Packing and unpacking artifact types are discovered independently. A `DefaultPackager` +subclass with `pack_foo` but no `unpack_foo` supports `"foo"` for packing only — +`is_packable` accepts it but `is_unpackable` rejects it. The reverse also applies: +`unpack_bar` without `pack_bar` means `"bar"` is unpack-only. + +Common scenarios: + +- **Pack-only** — saving plots as images, logging summary metrics as plain results, + rendering a model to an image (the PNG can't be deserialized back to the original + object) +- **Unpack-only** — legacy/migration support (e.g. `unpack_v1` reads artifacts from + an older packager version while new writes always use `pack_v2`); cross-format + compatibility (e.g. 
a DataFrame packager can `unpack_csv` to read manually-logged + CSV artifacts but always `pack_parquet` for new outputs) + +#### The LogHint class + +A {py:class}`~mlrun.package.log_hint.LogHint` gives you full control over logging — +artifact type, labels, extra data, metrics, and more: + +```python +from mlrun import LogHint + +returns = [ + LogHint(key="model", artifact_type="model", labels={"version": "1"}), + LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "csv"}), +] +``` + +A `LogHint` has the following fields: + +| Field | Type | Description | +|-------|------|-------------| +| `key` | `str` | **Required.** The artifact key to log the object under. | +| `artifact_type` | `str \| None` | The artifact type (e.g. `"dataset"`, `"model"`, `"result"`). If `None`, the packager's default is used. | +| `tag` | `str` | Tag for the artifact. Default: `""`. | +| `itemized` | `bool \| int` | Unbundling control. `False` (default): log as one artifact. `True`: fully unbundle. `int`: unbundle to that depth. | +| `packing_kwargs` | `dict` | Extra keyword arguments passed to the packager's `pack_()` method (e.g. `{"format": "parquet"}`). | +| `labels` | `dict[str, str]` | Labels to add to the logged artifact. | +| `extra_data` | `dict` | Extra data to attach to the artifact. Use `...` (Ellipsis) as a value to link to another package by key. | +| `metrics` | `dict` | Metrics to log alongside a model artifact. Use `...` (Ellipsis) as a value to link to another package by key. | + +#### Linking artifacts + +When a function returns multiple values, you can **link** them together so that +related outputs are attached to a primary artifact. For example, you might want a +model artifact to carry its evaluation metrics and supporting artifacts (plots, test +data) as part of its metadata. 
This is done through the `extra_data` and `metrics` +fields of `LogHint`, using Python's `...` (Ellipsis) as a placeholder meaning +"fill this in with the package that has this key." + +Consider a training function that returns a model alongside its metrics, a loss plot, +and a test dataset: + +```python +def train(dataset: pd.DataFrame): + # ... training logic ... + return my_model, some_result, loss_plot, test_dataset + + +fn.run( + handler="train", + inputs={"dataset": "store://my-dataset"}, + returns=[ + LogHint( + key="my_model", + artifact_type="model", + metrics={"some_result": ...}, + extra_data={"loss_plot": ..., "test_dataset": ...}, + ), + "some_results : result", + "loss_plot : plot", + "test_dataset : dataset", + ], +) +``` + +After all four values are packed, the packager manager resolves every `...`: + +- `"some_result"` is a result (scalar), so it is placed into the model's `metrics` +- `"loss_plot"` and `"test_dataset"` are artifacts, so they are placed into the + model's `extra_data` + +The result is a model artifact with its evaluation metrics and supporting data +attached directly — visible as a single unit in the MLRun UI. + +```{note} +**Linking rules:** + +* `metrics` is available only on **model** artifacts and can link to **results** only + (scalar values in run metadata) +* `extra_data` works with any artifact type and can link to both artifacts and results +* If a referenced key is not found among the packed outputs, the entry is removed and + a warning is logged +* The order of items in `returns` does not matter — linking is resolved after all + packing is complete +``` + +#### String shortcut + +The most common way to specify a log hint. 
A string shortcut has up to four parts:
+
+| Part | Syntax | Purpose | Example |
+|------|--------|---------|---------|
+| **Key** (required) | `"<key>"` | The artifact name | `"accuracy"` |
+| **Artifact type** | `"<key> : <type>"` | Override the default type | `"data : dataset"` |
+| **Packing kwargs** | `"<key> : <type>[k='v', ...]"` | Pass options to the packager | `'data : dataset[format="parquet"]'` |
+| **Itemization prefix** | `"*<key>"` or `"<depth>*<key>"` | Unbundle a collection | `"*results"`, `"2*results"` |
+
+Examples and their `LogHint` equivalents:
+
+| String | Equivalent LogHint |
+|--------|--------------------|
+| `"accuracy"` | `LogHint(key="accuracy")` |
+| `"data : dataset"` | `LogHint(key="data", artifact_type="dataset")` |
+| `'data : dataset[format="parquet"]'` | `LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "parquet"})` |
+| `"*results"` | `LogHint(key="results", itemized=True)` |
+| `"2*results"` | `LogHint(key="results", itemized=2)` |
+
+(unbundling)=
+#### Itemization (unbundling)
+
+Unbundling breaks a collection (list or dict) into separate artifacts, each logged individually. This is useful when a
+function returns a dictionary of DataFrames and you want each one as its own dataset artifact.
+
+```python
+def evaluate(data: pd.DataFrame) -> dict[str, pd.DataFrame]:
+    """Returns per-category evaluation results."""
+    results = {}
+    for category in data["category"].unique():
+        subset = data[data["category"] == category]
+        results[category] = compute_metrics(subset)
+    return results
+```
+
+Without unbundling, the entire dict is logged as a single artifact. With unbundling:
+
+```python
+fn.run(handler="evaluate", inputs={"data": "store://eval-data"}, returns=["*results"])
+```
+
+Each DataFrame in the dict becomes its own dataset artifact, keyed as `results_<dict key>`.
+
+**Depth control**
+
+- `"*results"` or `itemized=True` — fully recursive unbundling.
+- `"2*results"` or `itemized=2` — unbundle up to 2 levels deep. 
Nested collections beyond that depth are logged as + single artifacts. + +## Configuration + +Packager behavior is controlled by settings under `mlrun.mlconf.packagers`: + +| Setting | Default | Description | +|---------|---------|-------------| +| `enabled` | `True` | Master switch. When enabled, MLRun automatically wraps every function execution with the packager handler — parsing typed inputs and logging returned outputs. Set to `False` to disable all packager functionality. | +| `auto_unpack_inputs` | `False` | When `True`, inputs that have **no type hint** are still automatically unpacked if they were originally logged via packagers. When `False` (default), un-hinted inputs remain as raw `DataItem` objects. | +| `auto_pack_outputs` | `False` | When `True`, returned objects are packed even if no log hints were provided by the user. The artifact key follows the pattern `--` where `i` is enumerated. When `False` (default), returned objects without log hints are ignored. | +| `auto_pack_key` | `"artifact"` | The base key used in the auto-generated artifact name when `auto_pack_outputs` is enabled. | +| `pack_tuples` | `False` | When `True`, returned tuples are treated as a single tuple object and packed together. When `False` (default), each element of a returned tuple is packed as a separate output — enabling functions to return multiple items via `return a, b, c`. | +| `logging_worker` | `0` | In multi-worker runs, only the worker with this rank packs outputs and logs results/artifacts. Other workers skip logging to avoid overriding each other. Default is `0` (the main worker). | + +You can change these settings globally: + +```python +import mlrun + +mlrun.mlconf.packagers.auto_unpack_inputs = True +``` + +```{note} +You can also set these options via environment variables. 
Use the `MLRUN_` prefix +with `__` (double underscore) as the nesting separator: + + MLRUN_PACKAGERS__ENABLED=true + MLRUN_PACKAGERS__AUTO_PACK_OUTPUTS=true +``` + +## Built-in packagers + +MLRun includes packagers for common Python types. All built-in packagers are available automatically — no registration +needed. + +### Python standard library + +Handles `None`, `int`, `float`, `bool`, `str`, `dict`, `list`, `tuple`, `set`, +`frozenset`, `bytes`, `bytearray`, and `pathlib.Path`. + +API reference: {py:mod}`~mlrun.package.packagers.python_standard_library_packagers` + +### NumPy + +Handles `np.ndarray`, `np.number`, and collections of arrays (`list[np.ndarray]`, +`dict[str, np.ndarray]`). + +API reference: {py:mod}`~mlrun.package.packagers.numpy_packagers` + +### Pandas + +Handles `pd.DataFrame` and `pd.Series`. + +API reference: {py:mod}`~mlrun.package.packagers.pandas_packagers` + +### Default (fallback) + +Any unrecognized type is handled by the {py:class}`~mlrun.package.packagers.default_packager.DefaultPackager`, which +serializes objects using `cloudpickle` (or any pickling module configured). The default artifact type is `object`. + +(custom-packagers)= +## Creating a custom packager + +When a built-in packager doesn't handle your type (or you want human-readable serialization +instead of pickle), you can write a custom packager. The +{ref}`custom packagers guide ` walks through the full process — +choosing a base class, setting class variables, implementing pack/unpack methods, and +registering the packager in your project. 
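To make the pickle-versus-readable trade-off concrete, here is the kind of serialization a custom packager might implement for a small config object: JSON on pack, typed reconstruction on unpack. `AgentConfig`, `pack_config`, and `unpack_config` are hypothetical names used for illustration; in a real packager this logic would live in `pack_file`/`unpack_file`-style methods on a `DefaultPackager` subclass, as the custom packagers guide shows:

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass
class AgentConfig:  # hypothetical domain type
    style: str
    max_words: int


def pack_config(obj: AgentConfig, path: str) -> None:
    # Human-readable JSON on disk, instead of an opaque pickle blob.
    with open(path, "w") as f:
        json.dump(asdict(obj), f)


def unpack_config(path: str) -> AgentConfig:
    # Rebuild the typed object that a hinted parameter would receive.
    with open(path) as f:
        return AgentConfig(**json.load(f))


path = os.path.join(tempfile.mkdtemp(), "config.json")
pack_config(AgentConfig(style="concise", max_words=40), path)
assert unpack_config(path) == AgentConfig(style="concise", max_words=40)
```

The stored file can be inspected, diffed, and consumed by non-Python tooling, which is exactly what the pickle fallback cannot offer.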
+ +```{note} +When running remotely, set the project source with `pull_at_runtime=True` +so the packager module can be imported on the remote worker: + + project.set_source(source="./", pull_at_runtime=True) +``` + + +**See also** +- {ref}`auto-logging-mlops` — framework-specific auto-logging with `apply_mlrun()` +- {ref}`working-with-data-and-model-artifacts` — manual artifact handling +- [mlrun.package](../../api/mlrun.package/index.rst) — API reference diff --git a/docs/concepts/packagers/packagers-tutorial.ipynb b/docs/concepts/packagers/packagers-tutorial.ipynb index 03821e5cc1d..796c55d302e 100644 --- a/docs/concepts/packagers/packagers-tutorial.ipynb +++ b/docs/concepts/packagers/packagers-tutorial.ipynb @@ -6,7 +6,7 @@ "metadata": {}, "source": [ "(packagers-tutorial)=\n", - "# Using Packagers to Automate I/O in a Gen-AI Agent Pipeline\n", + "# Using packagers to automate I/O in a gen AI agent pipeline\n", "\n", "This tutorial demonstrates how MLRun **packagers** automate input parsing and output logging in a mock\n", "scenario: evaluating multiple gen-AI agents on a set of test prompts.\n", @@ -15,11 +15,21 @@ "> The focus is on how packagers handle mixed output types and input parsing, not on the evaluation logic itself.\n", "\n", "You will learn how to:\n", - "- Log mixed output types (DataFrames, dicts, strings) using log hints.\n", - "- Use `LogHint` objects for fine-grained control (labels, artifact types).\n", - "- Itemize (unbundle) a dict of responses into separate artifacts with the `*` prefix.\n", - "- Pass a previously logged artifact as a typed input to a downstream function.\n", - "- Understand the difference between `params` (direct values) and `inputs` (DataItems parsed by packagers)." 
+ "- Log mixed output types (DataFrames, dicts, strings) using log hints\n", + "- Use `LogHint` objects for fine-grained control (labels, artifact types)\n", + "- Itemize (unbundle) a dict of responses into separate artifacts with the `*` prefix\n", + "- Pass a previously logged artifact as a typed input to a downstream function\n", + "- Understand the difference between `params` (direct values) and `inputs` (DataItems parsed by packagers)\n", + "\n", + "**In this section**\n", + "- [Setup](#setup)\n", + "- [Define the evaluation handler](#define-the-evaluation-handler)\n", + "- [Create the MLRun function](#create-the-mlrun-function)\n", + "- [Run with mixed log hints](#run-with-mixed-log-hints)\n", + "- [Inspect the results](#inspect-the-results)\n", + "- [Consuming packaged artifacts as inputs](#consuming-packaged-artifacts-as-inputs)\n", + "- [What the packagers did automatically](#what-the-packagers-did-automatically)\n", + "- [Next steps](#next-steps)" ] }, { @@ -57,7 +67,16 @@ "cell_type": "markdown", "id": "e5f6a7b8", "metadata": {}, - "source": "## Define the evaluation handler\n\nThe handler below simulates evaluating multiple gen-AI agents. Each \"agent\" is just a\nstring-formatting heuristic — no real LLM calls are made, so the notebook runs anywhere\nwithout API keys.\n\nNotice the function is **pure Python** — no MLRun imports, no `context` object.\nThe `returns` log hints (shown later) tell packagers how to log each returned value." + "source": [ + "## Define the evaluation handler\n", + "\n", + "The handler below simulates evaluating multiple gen AI agents. Each \"agent\" is just a\n", + "string-formatting heuristic — no real LLM calls are made, so the notebook runs anywhere\n", + "without API keys.\n", + "\n", + "Notice the function is **pure Python** — no MLRun imports, no `context` object.\n", + "The `returns` log hints (shown later) tell packagers how to log each returned value." 
+ ] }, { "cell_type": "code", @@ -76,7 +95,7 @@ " prompts: list,\n", ") -> tuple[pd.DataFrame, dict, dict, str]:\n", " \"\"\"\n", - " Evaluate simulated Gen-AI agents on a set of prompts.\n", + " Evaluate simulated gen AI agents on a set of prompts.\n", "\n", " :param agents_config: Mapping of agent name to its configuration dict.\n", " Each config has keys like 'style' and 'max_words'.\n", @@ -163,7 +182,24 @@ "cell_type": "markdown", "id": "c9d0e1f2", "metadata": {}, - "source": "## Run with mixed log hints\n\nNote that `agents_config` and `prompts` are passed via `params={}` — they arrive as\nplain Python objects (a dict and a list) directly, with **no packager involvement**.\nPackagers only parse values passed via `inputs={}`, which flow through `DataItem`s.\nWe'll see input parsing in action with the second handler below.\n\nThe `returns` list uses four different log-hint styles to demonstrate the full range of\npackagers output capabilities:\n\n| Return value | Log hint | What happens |\n|---|---|---|\n| `evaluation` (DataFrame) | `\"evaluation : dataset\"` | String shortcut — logged as a `DatasetArtifact` |\n| `best_agent` (dict) | `LogHint(key=\"best_agent\", labels={...})` | LogHint object — logged as a `result` (dict default) with custom labels |\n| `all_responses` (dict of lists) | `\"*all_responses\"` | Unbundled — each agent's response list becomes a separate artifact |\n| `summary` (str) | `\"summary\"` | Key only — artifact type inferred from the value type (`result` for `str`) |" + "source": [ + "## Run with mixed log hints\n", + "\n", + "Note that `agents_config` and `prompts` are passed via `params={}` — they arrive as\n", + "plain Python objects (a dict and a list) directly, with **no packager involvement**.\n", + "Packagers only parse values passed via `inputs={}`, which flow through `DataItem`s.\n", + "The second handler below shows input parsing in action.\n", + "\n", + "The `returns` list uses four different log-hint styles to demonstrate the 
full range of\n", + "packagers output capabilities:\n", + "\n", + "| Return value | Log hint | What happens |\n", + "|---|---|---|\n", + "| `evaluation` (DataFrame) | `\"evaluation : dataset\"` | String shortcut — logged as a `DatasetArtifact` |\n", + "| `best_agent` (dict) | `LogHint(key=\"best_agent\", labels={...})` | LogHint object — logged as a `result` (dict default) with custom labels |\n", + "| `all_responses` (dict of lists) | `\"*all_responses\"` | Unbundled — each agent's response list becomes a separate artifact |\n", + "| `summary` (str) | `\"summary\"` | Key only — artifact type inferred from the value type (`result` for `str`) |" + ] }, { "cell_type": "code", @@ -203,7 +239,7 @@ "## Inspect the results\n", "\n", "The run's `outputs` dictionary contains all logged artifacts and results.\n", - "Let's examine each one." + "Here's a look at each one." ] }, { @@ -316,7 +352,7 @@ "This is distinct from `params={}`, which pass plain JSON serializable Python values directly to the\n", "function with no packager involvement.\n", "\n", - "Let's define a second handler that takes the evaluation DataFrame as input and\n", + "Here's a second handler that takes the evaluation DataFrame as input and\n", "returns a filtered version." ] }, @@ -403,7 +439,7 @@ "id": "c7d8e9f0", "metadata": {}, "source": [ - "## What packagers did automatically\n", + "## What the packagers did automatically\n", "\n", "Here's a summary of what happened behind the scenes — and what you would have\n", "had to do manually without packagers:\n", @@ -432,10 +468,10 @@ "source": [ "## Next steps\n", "\n", - "- Read the full {ref}`packagers guide ` for details on all built-in packagers,\n", - " the `LogHint` fields, and artifact types.\n", + "- Read the full {ref}`packagers guide ` for details on all built-in packagers,\n", + " the `LogHint` fields, and artifact types\n", "- See the {ref}`custom packagers tutorials ` to learn how\n", - " to write packagers for your own types." 
+ " to write packagers for your own types" ] } ], diff --git a/docs/development/model-training-tracking.md b/docs/development/model-training-tracking.md index 9e384c2ecf0..5b5a078578d 100644 --- a/docs/development/model-training-tracking.md +++ b/docs/development/model-training-tracking.md @@ -10,7 +10,6 @@ Learn how to create a model training job, work with the input data and the model ../training/create-a-basic-training-job ../training/working-with-data-and-model-artifacts -../concepts/packagers/index ../concepts/auto-logging-mlops ../training/built-in-training-function ../hyper-params From d045a75f14c3a27b8bd65a1aa588333e52a37c8c Mon Sep 17 00:00:00 2001 From: jillnogold Date: Thu, 19 Feb 2026 12:17:16 +0200 Subject: [PATCH 2/3] more on pack and unpack --- mlrun/package/packagers/default_packager.py | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/mlrun/package/packagers/default_packager.py b/mlrun/package/packagers/default_packager.py index 201aa0a620a..f713869b50e 100644 --- a/mlrun/package/packagers/default_packager.py +++ b/mlrun/package/packagers/default_packager.py @@ -250,9 +250,11 @@ def pack_x(self, obj: Any, key: str, ...) -> Union[Tuple[Artifact, dict], dict]: Where 'x' is the artifact type, 'obj' is the object to pack, `key` is the key to name the artifact and `...` are additional, custom, log hint configurations. The returned values are the packed artifact and the instructions - for unpacking it, or in the case of result, the dictionary of the result with its key and value. configurations - are sent by the user and shouldn't be mandatory, meaning they should have a default value (otherwise, the user - has to add them to every log hint). + for unpacking it, or in the case of result, the dictionary of the result with its key and value. Returning + an artifact means that you can return any of the common subclasses of Artifact, including: + ModelArtifact, DatasetArtifact and LLMPromptArtifact. 
The default packing and unpacking class variables
+        should always be set. They are by default set to `"object"`. You can change them to any valid
+        artifact type.

     * **The abstract class method** :py:meth:`unpack`:

       The method is implemented to get a :py:meth:`DataItem` and send it to the relevant unpacking method by the
       artifact type using the following naming: `"unpack_<artifact_type>"`. (If the artifact type was not provided,

From e011eb05818bf988a3c8b1943bb45779de99d23c Mon Sep 17 00:00:00 2001
From: jillnogold
Date: Thu, 19 Feb 2026 12:54:52 +0200
Subject: [PATCH 3/3] review input

---
 .../custom-packagers-tutorial-overview.md     | 258 +++++++++++++++++
 .../packagers/custom_packagers/index.md       | 262 +-----------------
 .../round-trip-tutorial.ipynb                 |   2 +-
 docs/concepts/packagers/packagers-overview.md |   3 +-
 4 files changed, 262 insertions(+), 263 deletions(-)
 create mode 100644 docs/concepts/packagers/custom_packagers/custom-packagers-tutorial-overview.md

diff --git a/docs/concepts/packagers/custom_packagers/custom-packagers-tutorial-overview.md b/docs/concepts/packagers/custom_packagers/custom-packagers-tutorial-overview.md
new file mode 100644
index 00000000000..c98e083ecb5
--- /dev/null
+++ b/docs/concepts/packagers/custom_packagers/custom-packagers-tutorial-overview.md
@@ -0,0 +1,258 @@
+(custom-packagers-tutorial-overview)=
+# Custom packager tutorials overview
+Learn when custom packagers are required and how to create them.
+
+**In this section**
+- [When to write a custom packager](#when-to-write-a-custom-packager)
+- [Choosing a base class: `DefaultPackager` vs `Packager`](#choosing-a-base-class-defaultpackager-vs-packager)
+- [The four patterns](#the-four-patterns)
+- [Step-by-step guide](#step-by-step-guide)
+
+## When to write a custom packager
+
+Write a custom packager when:
+
+- **Your type isn't handled by a built-in packager** — for example, a PIL Image,
+  a LangChain prompt template, or a domain-specific data class
+- **You want human-readable serialization** — save as JSON, PNG, CSV, etc.
+  instead of an opaque pickle file
+- **You need bundling support** — your type is a collection that should be
+  decomposed into individual artifacts when unbundled with `"*key"`
+
+## Choosing a base class: `DefaultPackager` vs `Packager`
+
+MLRun provides two base classes for custom packagers. In most cases you should use
+`DefaultPackager`.
+
+### `DefaultPackager` (recommended)
+
+{py:class}`~mlrun.package.packagers.default_packager.DefaultPackager` is the recommended
+base class for custom packagers. It implements all the abstract methods from `Packager`
+with sensible default logic — routing pack/unpack calls to the right method by artifact
+type, validating arguments, and falling back to pickle when needed. Instead of overriding
+abstract methods, you configure behavior through **class variables** and implement
+**named methods** (`pack_<artifact_type>`, `unpack_<artifact_type>`).
+
+The class variables you can set:
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `PACKABLE_OBJECT_TYPE` | `...` (any) | The Python type this packager handles. Used by `is_packable` and `is_unpackable` to match objects and type hints. |
+| `PACK_SUBCLASSES` | `False` | When `True`, this packager also handles subclasses of `PACKABLE_OBJECT_TYPE`. |
+| `DEFAULT_PACKING_ARTIFACT_TYPE` | `"object"` | The artifact type to use when the user doesn't specify one in the log hint. 
|
+| `DEFAULT_UNPACKING_ARTIFACT_TYPE` | `"object"` | The artifact type to use when unpacking a `DataItem` that wasn't originally packed by this packager (e.g. a manually logged artifact). |
+| `BUNDLE_FROM_LIST` | `False` | When `True`, the type can be initialized from a `list` to serve as a bundle container. |
+| `BUNDLE_FROM_DICT` | `False` | When `True`, the type can be initialized from a `dict` to serve as a bundle container. |
+
+`DefaultPackager` **auto-discovers** supported artifact types by scanning for methods
+independently: `pack_*` methods define packing artifact types and `unpack_*` methods
+define unpacking artifact types. If your class has `pack_file` but no `unpack_file`,
+then `"file"` is available for packing only — `is_packable` accepts it but
+`is_unpackable` rejects it. The `"result"` type is always available for packing
+(logging scalar values as run metadata). The `"object"` (pickle) type is always
+available for both packing and unpacking.
+
+If needed, you can still override methods such as `is_packable`, `is_unpackable`,
+`get_default_packing_artifact_type`, and `get_default_unpacking_artifact_type` to customize
+the default behavior, for example when the default artifact type depends on runtime
+conditions rather than a fixed value.
+
+### `Packager` (full control)
+
+The base {py:class}`~mlrun.package.packager.Packager` class gives you complete control.
+You override `pack()` and `unpack()` directly and manage artifact-type routing, validation,
+and fallback behavior yourself. Use this only when `DefaultPackager`'s convention-based
+approach doesn't fit your needs.
+
+## The four patterns
+
+Custom packagers follow one of four patterns, depending on what your type needs:
+
+| Pattern | When to use | What to implement |
+|---------|-------------|-------------------|
+| **Pack-only** | The type is produced as output but never consumed as a typed input. The framework automatically excludes pack-only artifact types from unpacking validation. 
| `pack_*` methods only. |
+| **Unpack-only** | Legacy/migration support — reading artifacts from an older format while new writes use a different artifact type. | `unpack_*` methods only. |
+| **Round-trip (pack + unpack)** | The type needs to be saved *and* loaded back in a later function. | Both `pack_*` and `unpack_*` methods. |
+| **Bundling & unbundling** | The type is a collection that should decompose into separate artifacts when unbundled. | `pack_*`/`unpack_*` plus `bundle`/`unbundle` methods and `BUNDLE_FROM_LIST`/`BUNDLE_FROM_DICT` flags. |
+
+## Step-by-step guide
+
+### 1. Subclass `DefaultPackager`
+
+```python
+from mlrun.package import ArtifactType
+from mlrun.package.packagers.default_packager import DefaultPackager
+
+
+class MyTypePackager(DefaultPackager): ...
+```
+
+### 2. Set class variables
+
+At a minimum, set the type your packager handles and the default artifact type:
+
+```python
+class MyTypePackager(DefaultPackager):
+    PACKABLE_OBJECT_TYPE = MyType
+    DEFAULT_PACKING_ARTIFACT_TYPE = ArtifactType.FILE
+    DEFAULT_UNPACKING_ARTIFACT_TYPE = ArtifactType.FILE
+```
+
+If your type has subclasses that should also be handled by this packager, set
+`PACK_SUBCLASSES = True`.
+
+### 3. Implement `pack_<artifact_type>()` methods
+
+Each packing method serializes the object and returns a tuple of `(Artifact, instructions_dict)`.
+The `instructions_dict` carries metadata needed to reconstruct the object when unpacking. 
+ +```python +from mlrun import Artifact + + +def pack_file( + self, obj: MyType, key: str, file_format: str = "json" +) -> tuple[Artifact, dict]: + # Serialize to a temporary file + path = f"/tmp/{key}.{file_format}" + obj.save(path) + + # Create the artifact + artifact = Artifact(key=key, src_path=path) + + # Clean up the temp file after upload + self.add_future_clearing_path(path) + + # Return artifact + instructions for unpacking + return artifact, {"file_format": file_format} +``` + +```{note} +Inside a `pack_*` method you create and **return** an `Artifact` object — you do not +call `context.log_artifact()` or `context.log_dataset()`. The packager manager handles +the actual logging and uploading; the pack method's job is only to serialize the data and +describe the artifact. +``` + +The method name determines the artifact type: `pack_file` handles `artifact_type="file"`, +`pack_plot` handles `"plot"`, and so on. + +**Important:** Extra parameters like `file_format` above become **packing kwargs** that +users can pass via log hints: + +```python +returns = ['my_output : file[file_format="csv"]'] +``` + +All packing kwargs must have default values so users aren't forced to specify them. + +#### Result artifact type + +A **special case** of a `pack_()` method is the **result** artifact type — a scalar +or simple value (`int`, `float`, `str`, `bool`) stored directly in run metadata (visible +in `run.status.results` and in the MLRun UI without downloading anything). For result +types, the pack method returns a plain `dict` with the key and value instead of an +`(Artifact, instructions)` tuple: + +```python +def pack_result(self, obj: MyType, key: str) -> dict: + # Stored as run metadata, not as a file artifact + return {key: obj.score} +``` + +`DefaultPackager` already provides a generic `pack_result` implementation, so you only +need to override it if you want custom extraction logic (e.g. pulling a specific field +from your type). 
The `"result"` type is always available for packing.
+
+### 4. Implement `unpack_<artifact_type>()` methods
+
+Each unpacking method takes a {py:class}`~mlrun.datastore.base.DataItem` and the
+instructions that were stored during packing, and returns the reconstructed object:
+
+```python
+import mlrun
+
+
+def unpack_file(self, data_item: mlrun.DataItem, file_format: str = "json") -> MyType:
+    # Download the artifact to a local path
+    local_path = data_item.local()
+
+    # Reconstruct the object
+    return MyType.load(local_path, format=file_format)
+```
+
+Each instruction parameter (e.g. `file_format`) must be **optional** (have a default
+value) so that the method can also handle objects that were logged manually rather than
+through this packager.
+
+For pack-only packagers, you can skip implementing `unpack_*` methods entirely —
+the artifact type is automatically excluded from unpacking validation, so no extra
+configuration is needed. Real-world examples:
+
+- **Pack-only**: PIL Image → PNG (no need to reconstruct the original PIL object
+  from the logged PNG)
+- **Unpack-only**: reading a legacy serialization format that should no longer be
+  written (e.g. `unpack_v1` for backward compatibility while new outputs use
+  `pack_v2`)
+
+### 5. Clean up temporary files
+
+If your `pack_*` or `unpack_*` methods write files to disk, call
+`self.add_future_clearing_path(path)` so MLRun deletes them after the artifact is
+uploaded. This prevents temporary files from accumulating on the worker.
+
+### 6. Set the priority (optional)
+
+The `PRIORITY` class variable (integer 1–10, 1 = highest priority) controls which
+packager is selected when multiple packagers can handle the same type. Custom packagers
+default to priority **3**, which is higher than the built-in packagers at **5**. You
+rarely need to change this unless you have multiple custom packagers competing for the
+same type, which is not how packagers are intended to be used (`XPackager` should
+handle type `x`). 
+ +```python +class MyTypePackager(DefaultPackager): + PRIORITY = 2 # Higher priority than other custom packagers + ... +``` + +### 7. Register the packager in your project + +Use {py:meth}`~mlrun.projects.project.MlrunProject.add_custom_packager` to register +your packager: + +```python +project.add_custom_packager(packager="my_module.MyTypePackager", is_mandatory=True) +``` + +The `is_mandatory` flag controls what happens when the packager fails to import on a +remote worker: + +- `True` — the run fails immediately with an import error +- `False` — the packager is silently skipped and the fallback pickle behavior is used + +To remove a registered packager: + +```python +project.remove_custom_packager("my_module.MyTypePackager") +``` +(make-the-packager-importable-on-the-remote-worker)= +### 8. Make the packager importable on the remote worker + +When running remotely, the worker must be able to import your packager module. There +are several ways to achieve this: + +* **Pull at runtime** (simplest) — set the project source with `pull_at_runtime=True` + so the code is fetched before execution: + + ```python + project.set_source(source="./", pull_at_runtime=True) + ``` + +* **Build into the function image** — include the packager source in the function's + build so it is baked into the container image + +* **Shared storage** — place the packager module on a shared volume and configure the + function's working directory to point there + +If the packager module is missing at runtime, the run fails immediately when +`is_mandatory=True`, or falls back to pickle when `is_mandatory=False`. \ No newline at end of file diff --git a/docs/concepts/packagers/custom_packagers/index.md b/docs/concepts/packagers/custom_packagers/index.md index 74457dc089d..23be678be63 100644 --- a/docs/concepts/packagers/custom_packagers/index.md +++ b/docs/concepts/packagers/custom_packagers/index.md @@ -14,270 +14,10 @@ tables) instead of pickle blobs. 
**Reminder**: Packing applies to function **outputs** (return values → artifacts) and unpacking applies to function **inputs** (artifacts → typed Python objects). -**In this section** -- [When to write a custom packager](#when-to-write-a-custom-packager) -- [Choosing a base class: `DefaultPackager` vs `Packager`](#choosing-a-base-class-defaultpackager-vs-packager) -- [The four patterns](#the-four-patterns) -- [Step-by-step guide](#step-by-step-guide) -- [Tutorials](#tutorials) - - -## When to write a custom packager - -Write a custom packager when: - -- **Your type isn't handled by a built-in packager** — for example, a PIL Image, - a LangChain prompt template, or a domain-specific data class -- **You want human-readable serialization** — save as JSON, PNG, CSV, etc. - instead of an opaque pickle file -- **You need bundling support** — your type is a collection that should be - decomposed into individual artifacts when unbundled with `"*key"` - -## Choosing a base class: `DefaultPackager` vs `Packager` - -MLRun provides two base classes for custom packagers. In most cases you should use -`DefaultPackager`. - -### `DefaultPackager` (recommended) - -{py:class}`~mlrun.package.packagers.default_packager.DefaultPackager` is the recommended -base class for custom packagers. It implements all the abstract methods from `Packager` -with sensible default logic — routing pack/unpack calls to the right method by artifact -type, validating arguments, and falling back to pickle when needed. Instead of overriding -abstract methods, you configure behavior through **class variables** and implement -**named methods** (`pack_`, `unpack_`). - -The class variables you can set: - -| Variable | Default | Description | -|----------|---------|-------------| -| `PACKABLE_OBJECT_TYPE` | `...` (any) | The Python type this packager handles. Used by `is_packable` and `is_unpackable` to match objects and type hints. 
| -| `PACK_SUBCLASSES` | `False` | When `True`, this packager also handles subclasses of `PACKABLE_OBJECT_TYPE`. | -| `DEFAULT_PACKING_ARTIFACT_TYPE` | `"object"` | The artifact type to use when the user doesn't specify one in the log hint. | -| `DEFAULT_UNPACKING_ARTIFACT_TYPE` | `"object"` | The artifact type to use when unpacking a `DataItem` that wasn't originally packed by this packager (e.g. a manually logged artifact). | -| `BUNDLE_FROM_LIST` | `False` | When `True`, the type can be initialized from a `list` to serve as a bundle container. | -| `BUNDLE_FROM_DICT` | `False` | When `True`, the type can be initialized from a `dict` to serve as a bundle container. | - -`DefaultPackager` **auto-discovers** supported artifact types by scanning for methods -independently: `pack_*` methods define packing artifact types and `unpack_*` methods -define unpacking artifact types. If your class has `pack_file` but no `unpack_file`, -then `"file"` is available for packing only — `is_packable` accepts it but -`is_unpackable` rejects it. The `"result"` type is always available for packing -(logging scalar values as run metadata). The `"object"` (pickle) type is always -available for both packing and unpacking. - -If needed, you can still override methods like `is_packable`, `is_unpackable`, `get_default_packing_artifact_type`, -`get_default_unpacking_artifact_type` etc. to customize the default behavior. For example, when the default artifact -type depends on runtime conditions rather than a fixed value. - -### `Packager` (full control) - -The base {py:class}`~mlrun.package.packager.Packager` class gives you complete control. -You override `pack()` and `unpack()` directly and manage artifact-type routing, validation, -and fallback behavior yourself. Use this only when `DefaultPackager`'s convention-based -approach doesn't fit your needs. 
- -## The four patterns - -Custom packagers follow one of four patterns, depending on what your type needs: - -| Pattern | When to use | What to implement | -|---------|-------------|-------------------| -| **Pack-only** | The type is produced as output but never consumed as a typed input. The framework automatically excludes pack-only artifact types from unpacking validation. | `pack_*` methods only. | -| **Unpack-only** | Legacy/migration support — reading artifacts from an older format while new writes use a different artifact type. | `unpack_*` methods only. | -| **Round-trip (pack + unpack)** | The type needs to be saved *and* loaded back in a later function. | Both `pack_*` and `unpack_*` methods. | -| **Bundling & unbundling** | The type is a collection that should decompose into separate artifacts when unbundled. | `pack_*`/`unpack_*` plus `bundle`/`unbundle` methods and `BUNDLE_FROM_LIST`/`BUNDLE_FROM_DICT` flags. | - -## Step-by-step guide - -### 1. Subclass `DefaultPackager` - -```python -from mlrun.package import ArtifactType -from mlrun.package.packagers.default_packager import DefaultPackager - - -class MyTypePackager(DefaultPackager): ... -``` - -### 2. Set class variables - -At a minimum, set the type your packager handles and the default artifact type: - -```python -class MyTypePackager(DefaultPackager): - PACKABLE_OBJECT_TYPE = MyType - DEFAULT_PACKING_ARTIFACT_TYPE = ArtifactType.FILE - DEFAULT_UNPACKING_ARTIFACT_TYPE = ArtifactType.FILE -``` - -If your type has subclasses that should also be handled by this packager, set -`PACK_SUBCLASSES = True`. - -### 3. Implement `pack_()` methods - -Each packing method serializes the object and returns a tuple of `(Artifact, instructions_dict)`. -The `instructions_dict` carries metadata needed to reconstruct the object when unpacking. 
- -```python -from mlrun import Artifact - - -def pack_file( - self, obj: MyType, key: str, file_format: str = "json" -) -> tuple[Artifact, dict]: - # Serialize to a temporary file - path = f"/tmp/{key}.{file_format}" - obj.save(path) - - # Create the artifact - artifact = Artifact(key=key, src_path=path) - - # Clean up the temp file after upload - self.add_future_clearing_path(path) - - # Return artifact + instructions for unpacking - return artifact, {"file_format": file_format} -``` - -```{note} -Inside a `pack_*` method you create and **return** an `Artifact` object — you do not -call `context.log_artifact()` or `context.log_dataset()`. The packager manager handles -the actual logging and uploading; the pack method's job is only to serialize the data and -describe the artifact. -``` - -The method name determines the artifact type: `pack_file` handles `artifact_type="file"`, -`pack_plot` handles `"plot"`, and so on. - -**Important:** Extra parameters like `file_format` above become **packing kwargs** that -users can pass via log hints: - -```python -returns = ['my_output : file[file_format="csv"]'] -``` - -All packing kwargs must have default values so users aren't forced to specify them. - -#### Result artifact type - -A **special case** of a `pack_()` method is the **result** artifact type — a scalar -or simple value (`int`, `float`, `str`, `bool`) stored directly in run metadata (visible -in `run.status.results` and in the MLRun UI without downloading anything). For result -types, the pack method returns a plain `dict` with the key and value instead of an -`(Artifact, instructions)` tuple: - -```python -def pack_result(self, obj: MyType, key: str) -> dict: - # Stored as run metadata, not as a file artifact - return {key: obj.score} -``` - -`DefaultPackager` already provides a generic `pack_result` implementation, so you only -need to override it if you want custom extraction logic (e.g. pulling a specific field -from your type). 
The `"result"` type is always available for packing. - -### 4. Implement `unpack_()` methods - -Each unpacking method takes a {py:class}`~mlrun.datastore.base.DataItem` and the -instructions that were stored during packing, and returns the reconstructed object: - -```python -import mlrun - - -def unpack_file(self, data_item: mlrun.DataItem, file_format: str = "json") -> MyType: - # Download the artifact to a local path - local_path = data_item.local() - - # Reconstruct the object - return MyType.load(local_path, format=file_format) -``` - -Each instruction parameter (e.g. `file_format`) must be **optional** (have a default -value) so that the method can also handle objects that were logged manually rather than -through this packager. - -For pack-only packagers, you can skip implementing `unpack_*` methods entirely — -the artifact type is automatically excluded from unpacking validation, so no extra -configuration is needed. Real-world examples: - -- **Pack-only**: PIL Image → PNG (no need to reconstruct the original PIL object - from the logged PNG) -- **Unpack-only**: reading a legacy serialization format that should no longer be - written (e.g. `unpack_v1` for backward compatibility while new outputs use - `pack_v2`) - -### 5. Clean up temporary files - -If your `pack_*` or `unpack_*` methods write files to disk, call -`self.add_future_clearing_path(path)` so MLRun deletes them after the artifact is -uploaded. This prevents temporary files from accumulating on the worker. - -### 6. Set the priority (optional) - -The `PRIORITY` class variable (integer 1–10, 1 = highest priority) controls which -packager is selected when multiple packagers can handle the same type. Custom packagers -default to priority **3**, which is higher than the built-in packagers at **5**. You -rarely need to change this unless you have multiple custom packagers competing for the -same type - which is not how the packagers are intended to be used (`XPackager` should -handle type `x`). 
- -```python -class MyTypePackager(DefaultPackager): - PRIORITY = 2 # Higher priority than other custom packagers - ... -``` - -### 7. Register the packager in your project - -Use {py:meth}`~mlrun.projects.project.MlrunProject.add_custom_packager` to register -your packager: - -```python -project.add_custom_packager(packager="my_module.MyTypePackager", is_mandatory=True) -``` - -The `is_mandatory` flag controls what happens when the packager fails to import on a -remote worker: - -- `True` — the run fails immediately with an import error -- `False` — the packager is silently skipped and the fallback pickle behavior is used - -To remove a registered packager: - -```python -project.remove_custom_packager("my_module.MyTypePackager") -``` - -### 8. Make the packager importable on the remote worker - -When running remotely, the worker must be able to import your packager module. There -are several ways to achieve this: - -* **Pull at runtime** (simplest) — set the project source with `pull_at_runtime=True` - so the code is fetched before execution: - - ```python - project.set_source(source="./", pull_at_runtime=True) - ``` - -* **Build into the function image** — include the packager source in the function's - build so it is baked into the container image - -* **Shared storage** — place the packager module on a shared volume and configure the - function's working directory to point there - -If the packager module is missing at runtime, the run fails immediately when -`is_mandatory=True`, or falls back to pickle when `is_mandatory=False`. 
- -## Tutorials - -The tutorials below demonstrate each pattern with a complete, runnable example: - ```{toctree} :maxdepth: 1 +custom-packagers-tutorial-overview pack-only-tutorial round-trip-tutorial itemization-tutorial diff --git a/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb b/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb index 5c86816aae4..2583ab2917f 100644 --- a/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb +++ b/docs/concepts/packagers/custom_packagers/round-trip-tutorial.ipynb @@ -46,7 +46,7 @@ "- **Opaque** — you can't read or review the prompt structure\n", "- **Fragile** — pickle files break when LangChain versions change\n", "\n", - "Ibstead, the template should be saved as a readable **JSON** file that captures the message\n", + "Instead, the template should be saved as a readable **JSON** file that captures the message\n", "structure, and you can load it back as a `ChatPromptTemplate` in downstream\n", "functions." ] diff --git a/docs/concepts/packagers/packagers-overview.md b/docs/concepts/packagers/packagers-overview.md index 19304deeb60..68ad56db55e 100644 --- a/docs/concepts/packagers/packagers-overview.md +++ b/docs/concepts/packagers/packagers-overview.md @@ -434,9 +434,10 @@ When running remotely, set the project source with `pull_at_runtime=True` so the packager module can be imported on the remote worker: project.set_source(source="./", pull_at_runtime=True) +The custom packager needs to be available during runtime. See how to do that +in [Make the packager importable on the remote worker](./custom_packagers/index.md#make-the-packager-importable-on-the-remote-worker). ``` - **See also** - {ref}`auto-logging-mlops` — framework-specific auto-logging with `apply_mlrun()` - {ref}`working-with-data-and-model-artifacts` — manual artifact handling