Skip to content

Setup execute-kfp-localrunner on EC2 instance and add an option to provide the BASE_IMAGE URLs for qualification#76

Merged
mergify[bot] merged 2 commits intoopendatahub-io:mainfrom
shruthis4:KFPWorkflow
Dec 22, 2025
Merged

Setup execute-kfp-localrunner on EC2 instance and add an option to provide the BASE_IMAGE URLs for qualification#76
mergify[bot] merged 2 commits intoopendatahub-io:mainfrom
shruthis4:KFPWorkflow

Conversation

@shruthis4
Copy link
Contributor

@shruthis4 shruthis4 commented Dec 22, 2025

Description

  • Update the execute-kfp-localrunner on ec2 instance and also enable docling-vlm

  • Adds a gate mechanism to the execute-kfp-localrunners workflow to ensure CI jobs requiring secrets only run on the upstream repository.

How Has This Been Tested?

This has been tested with a draft PR created:#75

Merge criteria:

  • [ x] The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Configurable base-image inputs and environment-driven defaults for Python and Docling images.
  • Chores

    • CI flow enhanced to provision and run tests on ephemeral EC2 runners and to stop them afterward.
    • Added registry authentication and improved artifact/log upload on failures.
    • Standardized container image references across pipeline deployments for consistency.

✏️ Tip: You can customize this high-level summary in your review settings.

@shruthis4 shruthis4 requested a review from alimaredia December 22, 2025 20:29
@shruthis4 shruthis4 requested a review from a team as a code owner December 22, 2025 20:29
@coderabbitai
Copy link

coderabbitai bot commented Dec 22, 2025

Walkthrough

Introduce EC2-backed GitHub Actions runner flow for local pipeline tests, add workflow_dispatch image overrides and Quay registry env, make image constants environment-driven, and update compiled pipeline manifests to use a single quay.io/amaredia/aipcc-docling-image.

Changes

Cohort / File(s) Summary
Workflows
\.github/workflows/compile-kfp.yml, \.github/workflows/execute-kfp-localrunners.yml
Minor formatting in compile workflow; reworked execute workflow to add workflow_dispatch inputs (python_base_image, docling_base_image), env vars (QUAY_REGISTRY, INSTANCE_TYPE), and a multi-job flow: pr-check, launch-ec2-runner (provisions EC2 runner, outputs label & instance-id), test-local-pipelines (runs on EC2 runner with environment setup, Docker login, conditional image overrides, artifact upload on failure), and stop-ec2-runner (terminates runner).
Constants
kubeflow-pipelines/common/constants.py
Replaced hard-coded PYTHON_BASE_IMAGE and DOCLING_BASE_IMAGE with environment-driven values using os.getenv(..., "quay.io/amaredia/aipcc-docling-image").
Compiled Pipelines
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
Replaced multiple container image references for executor tasks (exec-create-pdf-splits, exec-docling-chunk, exec-docling-convert-*, exec-download-docling-models, exec-import-pdfs) with quay.io/amaredia/aipcc-docling-image.

Sequence Diagram

sequenceDiagram
    participant GH as GitHub Actions
    participant AWS as AWS OIDC / EC2 API
    participant EC2 as EC2 Runner (instance)
    participant Quay as Quay Registry
    participant Tests as Local Pipeline Tests

    GH->>GH: pr-check
    GH->>AWS: launch-ec2-runner (request instance, role via OIDC)
    AWS->>EC2: Provision instance
    EC2-->>GH: Return runner label & instance-id

    GH->>EC2: test-local-pipelines (dispatch job on runner)
    EC2->>EC2: Install system deps, Python, Docker, pip packages
    EC2->>Quay: Docker login (QUAY_REGISTRY)
    EC2->>EC2: Apply python_base_image/docling_base_image env (if provided)
    EC2->>Tests: Execute pipeline tests (local)
    Tests-->>EC2: Emit results & logs
    EC2-->>GH: Upload artifacts on failure

    GH->>AWS: stop-ec2-runner (terminate by label/instance-id)
    AWS->>EC2: Terminate instance
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main changes: setting up EC2 instance execution and adding BASE_IMAGE URL configuration options for the execute-kfp-localrunner workflow.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b02cb4 and 9634515.

📒 Files selected for processing (1)
  • .github/workflows/compile-kfp.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/compile-kfp.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
.github/workflows/compile-kfp.yml (1)

54-59: Quay login may be unnecessary for pipeline compilation.

The compile workflow runs Python scripts to generate YAML files. It doesn't pull or push container images. Unless the compilation step validates image existence, this login step adds latency without benefit.

Also, the indentation under with: uses 1 space (lines 57-59) instead of 2, which is inconsistent with the rest of the file.

🔎 Proposed fix for indentation
       - name: Log in to Quay Container Registry
         uses: docker/login-action@v3
         with:
-         registry: ${{ env.QUAY_REGISTRY }}
-         username: ${{ secrets.QUAY_USERNAME }}
-         password: ${{ secrets.QUAY_PASSWORD }}
+          registry: ${{ env.QUAY_REGISTRY }}
+          username: ${{ secrets.QUAY_USERNAME }}
+          password: ${{ secrets.QUAY_PASSWORD }}
.github/workflows/execute-kfp-localrunners.yml (2)

108-108: Inconsistent checkout action version.

Line 108 uses actions/checkout@v4 while line 62 uses actions/checkout@v6. Unify to the same version for consistency across jobs.

🔎 Proposed fix
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6

141-146: YAML indentation inconsistency under with:.

The indentation uses 1 space instead of 2 under the with: block, inconsistent with the rest of the file.

🔎 Proposed fix
       - name: Log in to Quay Container Registry
         uses: docker/login-action@v3
         with:
-         registry: ${{ env.QUAY_REGISTRY }}
-         username: ${{ secrets.QUAY_USERNAME }}
-         password: ${{ secrets.QUAY_PASSWORD }}
+          registry: ${{ env.QUAY_REGISTRY }}
+          username: ${{ secrets.QUAY_USERNAME }}
+          password: ${{ secrets.QUAY_PASSWORD }}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8537f16 and 8b02cb4.

📒 Files selected for processing (5)
  • .github/workflows/compile-kfp.yml
  • .github/workflows/execute-kfp-localrunners.yml
  • kubeflow-pipelines/common/constants.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: Summary
🔇 Additional comments (5)
kubeflow-pipelines/common/constants.py (1)

1-9: LGTM!

The environment-driven image configuration allows flexible overrides for CI qualification workflows while maintaining sensible defaults. The consolidation to a single image simplifies maintenance.

.github/workflows/execute-kfp-localrunners.yml (2)

35-48: Good addition: Clear PR feedback for fork contributors.

The pr-check job provides helpful messaging for contributors from forks who won't have access to secrets. This is a good UX improvement.


50-91: EC2 runner provisioning looks well-structured.

The multi-AZ configuration with proper tagging for CloudTrail tracking and job outputs for downstream consumption is well designed.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

304-304: Image consolidation applied correctly.

All executor images in this compiled pipeline are consistently updated to use quay.io/amaredia/aipcc-docling-image, aligning with the centralized image strategy in constants.py.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

386-386: Image consolidation applied consistently with VLM pipeline.

All executor images are updated to quay.io/amaredia/aipcc-docling-image, matching the VLM pipeline and central constants configuration.

Comment on lines +122 to +126
# Add current user to docker group
sudo usermod -aG docker $(whoami)

# Apply group membership without logout (temporary for this session)
sudo chmod 666 /var/run/docker.sock
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Security concern: Overly permissive Docker socket permissions.

chmod 666 /var/run/docker.sock makes the socket world-readable/writable, which is a security risk on shared or multi-tenant environments. While this is an ephemeral EC2 instance, the approach sets a bad precedent.

🔎 Suggested alternative
          # Add current user to docker group
          sudo usermod -aG docker $(whoami)
          
-          # Apply group membership without logout (temporary for this session)
-          sudo chmod 666 /var/run/docker.sock 
+          # Verify docker access works with sudo (group takes effect next login)
+          sudo docker version

Or run subsequent docker commands with sudo since this is an ephemeral instance anyway.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Add current user to docker group
sudo usermod -aG docker $(whoami)
# Apply group membership without logout (temporary for this session)
sudo chmod 666 /var/run/docker.sock
# Add current user to docker group
sudo usermod -aG docker $(whoami)
# Verify docker access works with sudo (group takes effect next login)
sudo docker version
🤖 Prompt for AI Agents
.github/workflows/execute-kfp-localrunners.yml lines 122-126: do not set
/var/run/docker.sock to 666; instead remove the chmod line and either run
subsequent docker commands with sudo (keep ephemeral security) or apply the new
docker group membership for the current shell using newgrp docker (or exec sg
docker "$SHELL" / command) after running sudo usermod -aG docker $(whoami);
ensure the docker group exists before adding the user and avoid making the
socket world-writable.

Comment on lines +167 to +191
stop-ec2-runner:
if: github.repository == 'opendatahub-io/data-processing' && always()
permissions:
id-token: write # This is required for OIDC (AWS auth)
contents: read
needs:
- launch-ec2-runner
- test-local-pipelines

runs-on: ubuntu-latest
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@61815dcd50bd041e203e49132bacad1fd04d2708 # v5.1.1
with:
role-to-assume: "arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/${{ vars.DATA_PROCESSING_IAM_ROLE }}"
aws-region: us-east-2
role-session-name: odh-data-processing # For tracking in CloudTrail

- name: Stop EC2 runner
uses: machulav/ec2-github-runner@a6dbcefcf8a31a861f5e078bb153ed332130c512 # v2.4.3
with:
mode: stop
github-token: "${{ secrets.DATA_PROCESSING_GH_PERSONAL_ACCESS_TOKEN }}"
label: ${{ needs.launch-ec2-runner.outputs.label }}
ec2-instance-id: ${{ needs.launch-ec2-runner.outputs.ec2-instance-id }}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/execute-kfp-localrunners.yml | sed -n '1,200p'

Repository: opendatahub-io/data-processing

Length of output: 9175


🏁 Script executed:

fd execute-kfp-localrunners.yml

Repository: opendatahub-io/data-processing

Length of output: 56


Add condition to stop-ec2-runner to skip if launch-ec2-runner fails.

The stop-ec2-runner job uses if: always() to run cleanup even on failure, but it receives empty label and ec2-instance-id outputs if launch-ec2-runner fails before emitting them. These inputs are required for the machulav/ec2-github-runner action in stop mode, and the action cannot properly terminate the instance with empty values. Add a check: if: github.repository == 'opendatahub-io/data-processing' && always() && needs.launch-ec2-runner.result == 'success' to skip cleanup only when outputs are guaranteed to be available.

🤖 Prompt for AI Agents
.github/workflows/execute-kfp-localrunners.yml lines 167-191: the
stop-ec2-runner job currently uses if: always() which causes it to run even when
launch-ec2-runner failed and its outputs (label and ec2-instance-id) are empty;
update the job-level if condition to require that needs.launch-ec2-runner.result
== 'success' in addition to the existing repository check and always() guard so
the job only runs when the launch job completed successfully and outputs are
present.

@mergify mergify bot merged commit 37b97a0 into opendatahub-io:main Dec 22, 2025
7 of 9 checks passed
@mergify
Copy link

mergify bot commented Dec 22, 2025

🎉 Auto-merged successfully!

All reviewers approved: 1
CI checks passed: All

Approved by:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments