
Cloud Composer CI/CD example #390

Merged: tims merged 3 commits into GoogleCloudPlatform:master from tims:master on Feb 3, 2020

Conversation

tims (Contributor) commented Jan 10, 2020

An example that uses Cloud Build to test Airflow DAGs and deploy them to Cloud Composer.

When deploying Cloud Composer environments, people often ask, "How do I test my DAGs?"
The answer is CI/CD:

  • DAG integrity / validation tests (check the DAG loads correctly; sketched below).
  • DAG execution tests (run the complete DAG).

Then deploy.

I've prepared an example of this using cloud build.
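
For the integrity check, a minimal sketch (the paths and folder names here are illustrative, not part of the example itself) is to point a local Airflow at the repo's DAG folder and assert there are no import errors:

    export AIRFLOW_HOME=/workspace/airflow
    # Start with an empty local metadata DB so Airflow can run
    airflow initdb
    # Fail the build if any DAG file fails to import or parse
    python -c "from airflow.models import DagBag; d = DagBag('/workspace/dags', include_examples=False); assert not d.import_errors, d.import_errors"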

PSO Eng Bug Id = 147466944

googlebot added the cla: yes label (All committers have signed a CLA) on Jan 10, 2020
pull-request-size bot added the size/L label (denotes a PR that changes 100-499 lines) on Jan 10, 2020
jaketf left a comment:

@tims unfortunately we are duplicating a lot of CI/CD-for-Composer work.
Apologies for not increasing visibility on this work outside of AMER.
I've been on this train for a while and have some further lessons learned.
I've been working in https://source.cloud.google.com/datapipelines-ci/datapipelines-ci/+/master and plan on open sourcing it as its own repo. It's very similar to this blog post I wrote with a partner, but less infrastructure heavy (not requiring a Spinnaker / GKE cluster).

I think CI/CD for Composer needs to cover more; it's really important to handle:

  • testing / deploying SQL queries used by BigQuery
  • custom conn_ids
  • testing / deploying custom plugins
  • actually integration testing against Composer, not just an Airflow image (we can do this by copying DAGs to the data directory and running list_dags; see the sketch below). Note that Composer is slightly divergent from OSS Airflow.

Additionally, I like to be prescriptive / opinionated about things like asserting dag_id == filename in the CI framework.
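
As a rough sketch of that integration check (the environment name, location, and test subfolder below are placeholders):

    # Copy candidate DAGs into the environment's data/ folder (not dags/, so nothing gets scheduled).
    gcloud composer environments storage data import \
        --environment my-composer-env --location us-central1 \
        --source dags/ --destination test-dags
    # Parse them with the environment's own Airflow, which surfaces any import errors.
    gcloud composer environments run my-composer-env --location us-central1 \
        list_dags -- -sd /home/airflow/gcs/data/test-dags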

Comment on lines +52 to +67
- -c
- |
  pwd
  ls -l
  # Init an empty airflow metadata db
  airflow initdb
  # Initialise variables to configure airflow DAGs
  airflow variables -s input gs://test-input-bucket/
  airflow variables -s output gs://test-output-bucket/
  # Run a backfill of our DAG for a specific date; add more dags here if required.
  airflow backfill echo_and_wait --start_date=2020-01-03 --end_date=2020-01-03
  ret=$?
  # Cat the task logs so we can see whether there was any error
  find /workspace/airflow/logs/ -type f -name '*.log' | sort | xargs cat
  # Fail if the backfill failed.
  exit $ret
jaketf:

This should be taken out into its own script.
Notably, this will also cause red-herring test failures because it won't handle the following (a sketch addressing these is below):

  • if the customer has any custom conn_ids
  • if the customer is using crypto / fernet key
  • if the customer is also deploying custom plugins.

PTAL https://source.cloud.google.com/datapipelines-ci/datapipelines-ci/+/master:composer/cloudbuild/bin/run_tests.sh
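
For illustration, a standalone version of this step could look roughly like the following (the connection IDs, fernet key handling, and paths are placeholders, not the contents of the linked script):

    #!/usr/bin/env bash
    # Hypothetical standalone test script; adjust paths and IDs to your repo layout.
    set -euo pipefail
    export AIRFLOW_HOME=/workspace/airflow
    export AIRFLOW__CORE__DAGS_FOLDER=/workspace/dags
    export AIRFLOW__CORE__PLUGINS_FOLDER=/workspace/plugins   # pick up custom plugins
    # Throwaway fernet key so encrypted connections/variables work in the test DB.
    export AIRFLOW__CORE__FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
    airflow initdb
    # Test-only stand-ins for any custom conn_ids the DAGs reference.
    airflow connections --add --conn_id my_api --conn_uri "http://example.com/"
    airflow variables -s input gs://test-input-bucket/
    airflow variables -s output gs://test-output-bucket/
    airflow backfill echo_and_wait --start_date=2020-01-03 --end_date=2020-01-03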

tims (Contributor Author):

I don't have access to that repo; permission denied.

tims (Contributor Author):

I thought it might be more readable in the same file. And since this only tests one DAG this way, it's more of an example than a best practice. This is just to demo how you might override variables before creating and executing a DAG run.

I definitely agree a more complete solution would be great. But it would be good to have something simple like this to link to until then. We can change to link to that when it's ready.

tims (Contributor Author):

I'd suggest we merge this for now and come up with the better solution next; we're much better off having velocity and revising than waiting for perfect.

tims (Contributor Author):

Also, I consider the main benefit of this example to be the syntax checking; it covers the most immediate need for customers. People tend to do very different things for end-to-end tests, and most don't do them at all; they just see what happens in prod. So it's more of a starting point, and it helps show how simple doing things with Cloud Build is.

jaketf:

I opened a PR a while back for the CI/CD-for-data-pipelines solution that didn't get much traction; it's very similar to what I'm working on.
You could link to my fork: https://github.com/jaketf/ci-cd-for-data-processing-workflow
I think that handling plugins / connections is a must even for a simple example.
IMHO the cloudbuild.yaml should be like a main function: very small, high-level, readable steps, with all implementation details like initdb, variables, etc. buried in scripts.

# Deploy the DAGs to your composer environment DAGs GCS folder
id: Deploy DAGs
args:
- -m
jaketf:

nit: I've had really flaky issues with deleting DAGs in Composer when paused-on-creation is not explicitly configured to True. I also consider this a best practice in case a DAG is manually copied here by accident and shouldn't actually be scheduled.

If this config is set, this script won't handle unpausing the DAG.

Note (though not necessary): I've seen a desire for a config that only deploys certain DAGs. Sometimes users need to run multiple deploys to sequence their DAG updates because of inter-DAG dependencies, or they have temporary utility DAGs they want to deploy, run, and undeploy without removing the DAG from their codebase.
I've solved this with a running_dags.txt (sketched below).
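
Roughly, the deploy step then only copies the DAG files listed in that file (the file format and bucket variable here are placeholders):

    # Deploy only the DAGs listed in running_dags.txt (one filename per line).
    while read -r dag_file; do
        gsutil cp "dags/${dag_file}" "gs://${COMPOSER_DAGS_BUCKET}/dags/"
    done < running_dags.txt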

tims (Contributor Author):

That should be an environment-wide setting, no?

tims (Contributor Author):

My preference would be to use the environment-wide setting so that all new DAGs are paused on creation (sketched below).

If a dag is updated, I don't think it should ever automatically unpause. I'd much prefer the user has to do that once manually.

But this is very user and environment specific. I don't think CI/CD should be particularly opinionated about it.
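
For reference, that environment-wide setting can be applied as an Airflow config override on the Composer environment, e.g. (the environment name and location are placeholders):

    # One-time environment setting, not a CI/CD step: pause all newly created DAGs by default.
    gcloud composer environments update my-composer-env --location us-central1 \
        --update-airflow-configs=core-dags_are_paused_at_creation=True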

jaketf:

that's fair...

tims (Contributor Author) commented Jan 15, 2020

> I've been working in https://source.cloud.google.com/datapipelines-ci/datapipelines-ci/+/master

I don't have access to that repo; permission denied.

jaketf commented Jan 24, 2020

@tims please try to access it again.
If you can take this run_tests script and use it, or something similar, to handle plugins, variables, and connections, I'd be much more comfortable merging this.

jaketf commented Jan 28, 2020

@tims
LMK if you have any questions about using the run_tests script I've linked.

FYI this is where I'm hoping to open source this as a refactor of the public solution.
GoogleCloudPlatform/ci-cd-for-data-processing-workflow#2

tims (Contributor Author) commented Feb 3, 2020

@jaketf Unfortunately I think there are a few issues with the run_tests script you linked to; it's more generic, but not generic enough.

It takes connections from an existing Composer environment, but testing should probably not use the same connections as a Composer env; they should be provided by the build, if any. It also assumes HTTP connections, which is not ideal.

It also assumes the variables are in a specifically named local file, presumably committed. If we're going to make it more generic than inlining them in the cloudbuild.yaml, these should be configurable (see the sketch below).

It expects a plugins dir, which a simple example does not require. Similarly, it expects a bigquery/sql dir.
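
For example, instead of hard-coding those inputs, the test entrypoint could take them as optional parameters (the argument names and file formats here are just illustrative):

    # Illustrative: let the build pass in variables / connections instead of assuming fixed files.
    VARIABLES_FILE=${1:-}      # optional JSON file of Airflow variables
    CONNECTIONS_FILE=${2:-}    # optional file of "conn_id conn_uri" pairs
    if [[ -n "${VARIABLES_FILE}" ]]; then
        airflow variables -i "${VARIABLES_FILE}"
    fi
    if [[ -n "${CONNECTIONS_FILE}" ]]; then
        while read -r conn_id conn_uri; do
            airflow connections --add --conn_id "${conn_id}" --conn_uri "${conn_uri}"
        done < "${CONNECTIONS_FILE}"
    fi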

I still think we should merge as is, and link to yours as the advanced solution when it's ready.

henryken (Contributor) left a comment:

LGTM

tims merged commit 7ab9e4b into GoogleCloudPlatform:master on Feb 3, 2020
tims (Contributor Author) commented Feb 3, 2020

@jaketf we can definitely open up a new PR on this when yours is ready to converge. This one should be kept as simple as possible, as an illustration for people rolling their own CI/CD (some customers are even only using Airflow, not Composer). And we can put the more complete solution in your repo.

jaketf commented Feb 3, 2020

@tims makes sense, glad you got this merged. Thank you, that's really valuable feedback on the run_tests script. I think there's huge value in making it a bit more configurable / modular. I've migrated this to a PR in the open; it will be a while until we can rewrite the solutions page, but I'd appreciate any review comments like this! https://github.com/GoogleCloudPlatform/ci-cd-for-data-processing-workflow/pull/2

