
Feature/ibis execution api #13

Open
egillax wants to merge 5 commits into OHDSI:develop from egillax:feature/ibis-execution-api

Conversation


egillax commented Feb 11, 2026

Implements an experimental native Ibis execution API for Circepy with lazy relation building and
explicit materialization actions.

  • Adds IbisExecutor with build(), to_pandas(), to_polars(), and write() methods
  • Adds convenience wrappers: build_ibis, to_polars, write_cohort
  • Introduces ExecutionOptions for execution/schema/materialization/SQL-capture controls
  • Ports and simplifies executor logic from Mitos into Circepy, using a registry-based factory
  • Adds tests for API behavior and write-option validation
  • Validated DuckDB parity against CirceR row counts across phenotype-library cohorts (local integration
    run)

Sorry for how large this is. The API is in circe/execution/ibis.py

There is some stuff in there I'm not completely satisfied with, but it's good enough for you to have a look, @azimov.


azimov commented Feb 18, 2026

@egillax So I have finally had a chance to test this with Databricks, and overall it seems positive. There is a lot of code to unpack, so I will need to do a deeper review later, but here are my initial notes (mainly so I don't forget, but also to stimulate discussion):

  • Maybe we need to implement some CohortGenerator-style functions for execution, or track what the cohort ids are? The current executor.write method seems to work well once, but will it work with repeated uses/incremental execution? I'm not sure what Ibis is doing under the hood, and I would like to investigate in more detail. My preferred behaviour: if the cohort definition for a given id changes, it automatically overwrites; if it's the same, it either throws an exception that can be handled (e.g. skips generation) or just carries on.

  • On this note - would we want to implement subset cohorts here? Maybe ibis can help with handling these relations?

  • Clearer parameterization - perhaps convenience classes on top of our R standards?

  • I'm not sure how often users should use polars/pandas objects when working with cohorts. I can see some use cases for it, but it won't scale well, and users generally don't think about the impact this sort of thing has.
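The overwrite/skip behaviour described in the first bullet could be driven by a checksum of the cohort definition. A minimal stdlib sketch, where `definition_checksum` and `decide_action` are hypothetical helpers, not part of the current API:

```python
import hashlib
import json


def definition_checksum(expression: dict) -> str:
    """Stable hash of a cohort definition (key order normalized)."""
    canonical = json.dumps(expression, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def decide_action(expression: dict, stored_checksums: dict, cohort_id: int) -> str:
    """Overwrite when the definition changed, skip when unchanged."""
    new = definition_checksum(expression)
    old = stored_checksums.get(cohort_id)
    if old is None:
        return "generate"
    if old != new:
        return "overwrite"
    return "skip"
```

On a real backend the stored checksums would live in a small bookkeeping table next to the cohort table, much like CohortGenerator's incremental mode.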

Test code I used to get it working:

from dotenv import load_dotenv
import os
import ibis
from circe.execution import ExecutionOptions, IbisExecutor
from examples.basic_cohort import create_diabetes_cohort

load_dotenv()

host = os.getenv("DATABRICKS_HOST")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
token = os.getenv("DATABRICKS_TOKEN")
scratch_schema = os.getenv("DATABRICKS_SCRATCH_SCHEMA")


# Normalize common user input issues.
host = host.removeprefix("https://").removeprefix("http://").rstrip("/")
if not http_path.startswith("/"):
    http_path = f"/{http_path}"

# Split catalog.schema for Unity Catalog-aware connections.
scratch_catalog = None
scratch_db = scratch_schema
if "." in scratch_schema:
    scratch_catalog, scratch_db = scratch_schema.split(".", 1)

cdm_schema = "<hidden>"

con = ibis.databricks.connect(
    server_hostname=host,
    http_path=http_path,
    token=token,
    catalog=scratch_catalog,
    schema=scratch_db,
)

options = ExecutionOptions(
    cdm_schema=cdm_schema,
    vocabulary_schema=cdm_schema,
    result_schema=scratch_schema,
    temp_emulation_schema=scratch_schema,
    cohort_id=2,
)

executor = IbisExecutor(conn=con, options=options)

cohort = create_diabetes_cohort()
executor.write(cohort, table="ibis_test_cohort", overwrite=False)

Overall this was pretty straightforward, and I will play around more to give you more feedback soon.

ibis-postgres = [
"ibis-framework[postgres]>=11.0.0; python_version >= '3.9'",
]
ibis-databricks = [
azimov (Collaborator):

Might be how I installed it, but I got some errors with polars/pandas after the install.

egillax (Author):

I think we could avoid depending on polars, which would hopefully fix that. Otherwise please let me know the particulars.
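If we go that way, polars could become an optional extra alongside the existing backend extras. A sketch following the pattern already in pyproject.toml; the extra name and version bound are assumptions:

```toml
[project.optional-dependencies]
polars = [
    "polars>=1.0.0; python_version >= '3.9'",
]
```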


class TestPackageStructure:
"""Test basic package structure and imports."""
azimov (Collaborator):

@egillax I think we need to agree on some IDE standards, as these look like auto-formatting changes. I'm just using the PyCharm defaults, but happy to change to something else if you're using neovim?

(All this code was written by the AI though)

egillax (Author):

I use neovim. But I actually didn't even open it for this - just codex and git diff when I wanted to look (and this PR).

I think this happened because I told my agent to format according to the standards in the repo, and it ran black on this file. So either my agent messed up by not using it correctly, or yours did.

But I agree - we should standardize and enforce conventions, perhaps with pre-commit hooks, CI, or both. I haven't used pre-commit hooks before, so I'm not sure if they're more trouble than they're worth.
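For reference, a minimal .pre-commit-config.yaml that runs black on every commit looks roughly like this (the pinned rev is only an example):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
      - id: black
```

After adding the file, each developer runs `pre-commit install` once; the same hooks can also be run in CI with `pre-commit run --all-files`.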

ExpressionInput = Union[CohortExpression, Mapping[str, Any], str, Path]


def load_expression(value: ExpressionInput) -> CohortExpression:
azimov (Collaborator):

No objection to this function name, but we should probably have a single API standard for loading the expressions in the .api file?

egillax (Author):

Yes, agreed. This pass was mostly ripping stuff out of Mitos and putting it here. I think the agent didn't look closely enough at the current API.

cohort_id: Optional[int] = None

materialize_stages: bool = False
materialize_codesets: bool = True
azimov (Collaborator):

One thing that would be good to implement is taking a checksum of a cohort and then re-using the materialization over and over. Not sure if that's possible in this branch, but it would be a pretty useful improvement over the Java version.
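A checksum-keyed cache along those lines could look like the stdlib sketch below; `MaterializationCache` and the `materialize` callable are illustrative names, not part of this branch:

```python
import hashlib
import json
from typing import Any, Callable


class MaterializationCache:
    """Re-use a materialized result as long as the cohort definition is unchanged."""

    def __init__(self) -> None:
        self._by_checksum: dict[str, Any] = {}

    def get_or_materialize(
        self, expression: dict, materialize: Callable[[dict], Any]
    ) -> Any:
        # Identical definitions hash to the same key, so materialize runs once.
        key = hashlib.sha256(
            json.dumps(expression, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._by_checksum:
            self._by_checksum[key] = materialize(expression)
        return self._by_checksum[key]
```

On a database backend the cached value would be a table name rather than an in-memory object, with the checksum stored alongside it.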

egillax (Author):

Yes, I think we can do that. Or do you mean between executions of a relation, or within one?


egillax commented Feb 19, 2026

> Maybe we need to implement some CohortGenerator-style functions for execution, or track what the cohort ids are? The current executor.write method seems to work well once, but will it work with repeated uses/incremental execution? I'm not sure what Ibis is doing under the hood, and I would like to investigate in more detail. My preferred behaviour: if the cohort definition for a given id changes, it automatically overwrites; if it's the same, it either throws an exception that can be handled (e.g. skips generation) or just carries on.

I think that sounds good.

> On this note - would we want to implement subset cohorts here? Maybe ibis can help with handling these relations?

I think that should be easy with ibis. I haven't actually used subsets before, so I'm not sure how it's done in CohortGenerator or what the API is. But with ibis, if you do build, you can still add any kind of operation to the expression tree and it should work: executor.build(cohort).filter(subset).write(). I need to play around with it.
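The build-then-compose idea can be illustrated with a tiny lazy pipeline in pure Python; this is only an analogy for how ibis expressions compose, and `LazyRelation` is a hypothetical class, not part of the PR:

```python
from typing import Callable, Iterable


class LazyRelation:
    """Lazy pipeline: composing filters only builds a plan; execute() runs it."""

    def __init__(self, rows: Iterable[dict], plan: tuple = ()) -> None:
        self._rows = list(rows)
        self._plan = plan

    def filter(self, predicate: Callable[[dict], bool]) -> "LazyRelation":
        # Returns a new relation with an extended plan; nothing is evaluated yet.
        return LazyRelation(self._rows, self._plan + (predicate,))

    def execute(self) -> list[dict]:
        out = list(self._rows)
        for predicate in self._plan:
            out = [r for r in out if predicate(r)]
        return out
```

In ibis the same shape holds: `build()` would return an unexecuted table expression, a subset is just another `.filter(...)` on it, and only `write()` or an explicit materialization triggers SQL execution on the backend.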

> I'm not sure how often users should use polars/pandas objects when working with cohorts. I can see some use cases for it, but it won't scale well, and users generally don't think about the impact this sort of thing has.

Agreed. These are already operations in ibis, so first of all there's no need to have them here. I was thinking too much of downstream use, where I want to stream the results to file/memory. It could be useful for testing, but then again they're already in ibis, so you can use them there. And we get rid of the polars dependency here.

> Clearer parameterization - perhaps convenience classes on top of our R standards?

I was actually working on a different design, since I felt this one had a lot of repetition. But that's mostly internal organization. I think you mean the user-facing API above? Could you explain?
