Setup execute-kfp-localrunner on EC2 instance and add an option to provide the BASE_IMAGE URLs for qualification#76
Conversation
…ovide the BASE_IMAGE URLs for qualification
WalkthroughIntroduce EC2-backed GitHub Actions runner flow for local pipeline tests, add workflow_dispatch image overrides and Quay registry env, make image constants environment-driven, and update compiled pipeline manifests to use a single quay.io/amaredia/aipcc-docling-image. Changes
Sequence DiagramsequenceDiagram
participant GH as GitHub Actions
participant AWS as AWS OIDC / EC2 API
participant EC2 as EC2 Runner (instance)
participant Quay as Quay Registry
participant Tests as Local Pipeline Tests
GH->>GH: pr-check
GH->>AWS: launch-ec2-runner (request instance, role via OIDC)
AWS->>EC2: Provision instance
EC2-->>GH: Return runner label & instance-id
GH->>EC2: test-local-pipelines (dispatch job on runner)
EC2->>EC2: Install system deps, Python, Docker, pip packages
EC2->>Quay: Docker login (QUAY_REGISTRY)
EC2->>EC2: Apply python_base_image/docling_base_image env (if provided)
EC2->>Tests: Execute pipeline tests (local)
Tests-->>EC2: Emit results & logs
EC2-->>GH: Upload artifacts on failure
GH->>AWS: stop-ec2-runner (terminate by label/instance-id)
AWS->>EC2: Terminate instance
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
.github/workflows/compile-kfp.yml (1)
54-59: Quay login may be unnecessary for pipeline compilation.The compile workflow runs Python scripts to generate YAML files. It doesn't pull or push container images. Unless the compilation step validates image existence, this login step adds latency without benefit.
Also, the indentation under
with:uses 1 space (lines 57-59) instead of 2, which is inconsistent with the rest of the file.🔎 Proposed fix for indentation
- name: Log in to Quay Container Registry uses: docker/login-action@v3 with: - registry: ${{ env.QUAY_REGISTRY }} - username: ${{ secrets.QUAY_USERNAME }} - password: ${{ secrets.QUAY_PASSWORD }} + registry: ${{ env.QUAY_REGISTRY }} + username: ${{ secrets.QUAY_USERNAME }} + password: ${{ secrets.QUAY_PASSWORD }}.github/workflows/execute-kfp-localrunners.yml (2)
108-108: Inconsistent checkout action version.Line 108 uses
actions/checkout@v4while line 62 usesactions/checkout@v6. Unify to the same version for consistency across jobs.🔎 Proposed fix
- - uses: actions/checkout@v4 + - uses: actions/checkout@v6
141-146: YAML indentation inconsistency underwith:.The indentation uses 1 space instead of 2 under the
with:block, inconsistent with the rest of the file.🔎 Proposed fix
- name: Log in to Quay Container Registry uses: docker/login-action@v3 with: - registry: ${{ env.QUAY_REGISTRY }} - username: ${{ secrets.QUAY_USERNAME }} - password: ${{ secrets.QUAY_PASSWORD }} + registry: ${{ env.QUAY_REGISTRY }} + username: ${{ secrets.QUAY_USERNAME }} + password: ${{ secrets.QUAY_PASSWORD }}
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.github/workflows/compile-kfp.yml.github/workflows/execute-kfp-localrunners.ymlkubeflow-pipelines/common/constants.pykubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yamlkubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: Summary
🔇 Additional comments (5)
kubeflow-pipelines/common/constants.py (1)
1-9: LGTM!The environment-driven image configuration allows flexible overrides for CI qualification workflows while maintaining sensible defaults. The consolidation to a single image simplifies maintenance.
.github/workflows/execute-kfp-localrunners.yml (2)
35-48: Good addition: Clear PR feedback for fork contributors.The
pr-checkjob provides helpful messaging for contributors from forks who won't have access to secrets. This is a good UX improvement.
50-91: EC2 runner provisioning looks well-structured.The multi-AZ configuration with proper tagging for CloudTrail tracking and job outputs for downstream consumption is well designed.
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
304-304: Image consolidation applied correctly.All executor images in this compiled pipeline are consistently updated to use
quay.io/amaredia/aipcc-docling-image, aligning with the centralized image strategy inconstants.py.kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)
386-386: Image consolidation applied consistently with VLM pipeline.All executor images are updated to
quay.io/amaredia/aipcc-docling-image, matching the VLM pipeline and central constants configuration.
| # Add current user to docker group | ||
| sudo usermod -aG docker $(whoami) | ||
|
|
||
| # Apply group membership without logout (temporary for this session) | ||
| sudo chmod 666 /var/run/docker.sock |
There was a problem hiding this comment.
Security concern: Overly permissive Docker socket permissions.
chmod 666 /var/run/docker.sock makes the socket world-readable/writable, which is a security risk on shared or multi-tenant environments. While this is an ephemeral EC2 instance, the approach sets a bad precedent.
🔎 Suggested alternative
# Add current user to docker group
sudo usermod -aG docker $(whoami)
- # Apply group membership without logout (temporary for this session)
- sudo chmod 666 /var/run/docker.sock
+ # Verify docker access works with sudo (group takes effect next login)
+ sudo docker versionOr run subsequent docker commands with sudo since this is an ephemeral instance anyway.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Add current user to docker group | |
| sudo usermod -aG docker $(whoami) | |
| # Apply group membership without logout (temporary for this session) | |
| sudo chmod 666 /var/run/docker.sock | |
| # Add current user to docker group | |
| sudo usermod -aG docker $(whoami) | |
| # Verify docker access works with sudo (group takes effect next login) | |
| sudo docker version |
🤖 Prompt for AI Agents
.github/workflows/execute-kfp-localrunners.yml lines 122-126: do not set
/var/run/docker.sock to 666; instead remove the chmod line and either run
subsequent docker commands with sudo (keep ephemeral security) or apply the new
docker group membership for the current shell using newgrp docker (or exec sg
docker "$SHELL" / command) after running sudo usermod -aG docker $(whoami);
ensure the docker group exists before adding the user and avoid making the
socket world-writable.
| stop-ec2-runner: | ||
| if: github.repository == 'opendatahub-io/data-processing' && always() | ||
| permissions: | ||
| id-token: write # This is required for OIDC (AWS auth) | ||
| contents: read | ||
| needs: | ||
| - launch-ec2-runner | ||
| - test-local-pipelines | ||
|
|
||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - name: Configure AWS credentials | ||
| uses: aws-actions/configure-aws-credentials@61815dcd50bd041e203e49132bacad1fd04d2708 # v5.1.1 | ||
| with: | ||
| role-to-assume: "arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/${{ vars.DATA_PROCESSING_IAM_ROLE }}" | ||
| aws-region: us-east-2 | ||
| role-session-name: odh-data-processing # For tracking in CloudTrail | ||
|
|
||
| - name: Stop EC2 runner | ||
| uses: machulav/ec2-github-runner@a6dbcefcf8a31a861f5e078bb153ed332130c512 # v2.4.3 | ||
| with: | ||
| mode: stop | ||
| github-token: "${{ secrets.DATA_PROCESSING_GH_PERSONAL_ACCESS_TOKEN }}" | ||
| label: ${{ needs.launch-ec2-runner.outputs.label }} | ||
| ec2-instance-id: ${{ needs.launch-ec2-runner.outputs.ec2-instance-id }} |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n .github/workflows/execute-kfp-localrunners.yml | sed -n '1,200p'Repository: opendatahub-io/data-processing
Length of output: 9175
🏁 Script executed:
fd execute-kfp-localrunners.ymlRepository: opendatahub-io/data-processing
Length of output: 56
Add condition to stop-ec2-runner to skip if launch-ec2-runner fails.
The stop-ec2-runner job uses if: always() to run cleanup even on failure, but it receives empty label and ec2-instance-id outputs if launch-ec2-runner fails before emitting them. These inputs are required for the machulav/ec2-github-runner action in stop mode, and the action cannot properly terminate the instance with empty values. Add a check: if: github.repository == 'opendatahub-io/data-processing' && always() && needs.launch-ec2-runner.result == 'success' to skip cleanup only when outputs are guaranteed to be available.
🤖 Prompt for AI Agents
.github/workflows/execute-kfp-localrunners.yml lines 167-191: the
stop-ec2-runner job currently uses if: always() which causes it to run even when
launch-ec2-runner failed and its outputs (label and ec2-instance-id) are empty;
update the job-level if condition to require that needs.launch-ec2-runner.result
== 'success' in addition to the existing repository check and always() guard so
the job only runs when the launch job completed successfully and outputs are
present.
|
🎉 Auto-merged successfully! ✅ All reviewers approved: 1 Approved by: |
Description
Update the execute-kfp-localrunner on ec2 instance and also enable docling-vlm
Adds a gate mechanism to the
execute-kfp-localrunnersworkflow to ensure CI jobs requiring secrets only run on the upstream repository.How Has This Been Tested?
This has been tested with a draft PR created:#75
Merge criteria:
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.