Tsm clustering spark #412
base: tsm
Changes from all commits: ce394b3, d562ea9, 374a1fd, 3e08ce5, 823c9bc
```diff
@@ -552,5 +552,8 @@ def run_stubgen(self):
         "geopandas",
         "shapely",
         "folium",
+        "pyspark",
+        "matplotlib",
+        "pandas"
     ],
 )
```
```diff
@@ -1 +1,2 @@
 from .tsm import TSM as TSM
+from .tsm import _get_or_create_spark as _get_or_create_spark
```
```diff
@@ -3,18 +3,35 @@
 from pathlib import Path
 from typing import Dict, Optional

 import polars as pl
+from pyspark.sql import DataFrame, SparkSession, Window
+import pyspark.sql.functions as F
+import pyspark.sql.types as T
```
Copilot AI commented on Feb 25, 2026:

This helper hard-codes `local[*]`, very large driver/executor memory, and shuffle/adaptive settings. In a library context this can override user/cluster configuration and cause failures on machines without these resources. Prefer returning an existing active session (or a plain `builder.getOrCreate()`), and avoid setting resource configs here (or make them optional parameters).

Suggested change:

```diff
-    """Return the active SparkSession or create a local one."""
-    return (
-        SparkSession.builder
-        .master("local[*]")
-        .appName("TSM")
-        .config("spark.driver.memory", "128g") \
-        .config("spark.executor.memory", "128g") \
-        .config("spark.sql.shuffle.partitions", "1600") \
-        .config("spark.sql.adaptive.enabled", "true") \
-        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
-        .getOrCreate()
-    )
+    """Return the active SparkSession or create one with default settings."""
+    active = SparkSession.getActiveSession()
+    if active is not None:
+        return active
+    return SparkSession.builder.appName("TSM").getOrCreate()
```
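Following the comment's last alternative (making resource configs optional parameters), one possible shape for such a helper is sketched below. The function name, parameters, and deferred import are illustrative assumptions, not part of the PR:

```python
from typing import Dict, Optional


def get_or_create_spark(app_name: str = "TSM",
                        conf: Optional[Dict[str, str]] = None):
    """Return the active SparkSession if one exists; otherwise build one.

    Resource settings (memory, shuffle partitions, ...) are caller-supplied
    via ``conf`` instead of being hard-coded in library code.
    """
    # Deferred import so this module can be imported without Spark installed.
    from pyspark.sql import SparkSession

    active = SparkSession.getActiveSession()
    if active is not None:
        return active

    builder = SparkSession.builder.appName(app_name)
    for key, value in (conf or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A caller who genuinely wants the old local settings can still pass them explicitly, e.g. `conf={"spark.driver.memory": "128g"}`, without the library imposing them on everyone.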
Copilot AI commented on Feb 25, 2026:

`groupBy()` takes varargs; passing `group` (a Python list) as a single argument will fail at runtime. Use argument unpacking (`groupBy(*group)`) so the grouping columns are applied correctly.

Suggested change:

```diff
-        lanes_df = self._df.groupBy(group).agg(
+        lanes_df = self._df.groupBy(*group).agg(
             F.count_distinct("lane").alias("n_lanes")
         )
         # --- window for per-detector and direction in ordered operations ----------------------
-        w = Window.partitionBy(group).orderBy("timestamp")
+        w = Window.partitionBy(*group).orderBy("timestamp")
```
Copilot AI commented on Feb 25, 2026:

`Window.partitionBy()` takes varargs; passing `group` (a list) will be treated as a single invalid column and will error. Use `Window.partitionBy(*group)`.

Suggested change:

```diff
-        w = Window.partitionBy(group).orderBy("timestamp")
+        w = Window.partitionBy(*group).orderBy("timestamp")
```
Copilot AI commented on Feb 25, 2026:

`unix_timestamp()` truncates to whole seconds, so `delta_t_s` will lose sub-second precision compared to the previous implementation (and can change cluster boundaries). Consider computing the difference in seconds using a higher-precision approach (e.g., casting timestamps to double/decimal, or using an interval and extracting seconds) so you preserve fractional seconds.

Suggested change:

```diff
-            F.unix_timestamp(F.col("timestamp")) - F.unix_timestamp(F.col("prev_timestamp")),
+            F.col("timestamp").cast("double") - F.col("prev_timestamp").cast("double"),
```
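The precision loss described above is easy to reproduce outside Spark with plain Python datetimes (the timestamps below are made up for illustration):

```python
from datetime import datetime, timezone

t0 = datetime(2026, 2, 25, 12, 0, 0, 250_000, tzinfo=timezone.utc)  # ...:00:00.25
t1 = datetime(2026, 2, 25, 12, 0, 0, 900_000, tzinfo=timezone.utc)  # ...:00:00.90

# unix_timestamp()-style truncation to whole seconds: the 0.65 s gap vanishes.
delta_truncated = int(t1.timestamp()) - int(t0.timestamp())
print(delta_truncated)  # 0

# A double-style difference (fractional epoch seconds) preserves the gap.
delta_double = t1.timestamp() - t0.timestamp()
print(round(delta_double, 3))  # 0.65
```

Two events 0.65 s apart thus look simultaneous under whole-second truncation, which is exactly how sub-second cluster boundaries could shift.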
Copilot AI commented on Feb 25, 2026:

`Window.partitionBy(group)` has the same issue here: `partitionBy` expects varargs, not a list. Use `Window.partitionBy(*group)` to avoid runtime errors.

Suggested change:

```diff
-            Window.partitionBy(group)
+            Window.partitionBy(*group)
```
Copilot AI commented on Feb 25, 2026:

Same as above: `groupBy()` expects varargs of column names; `group + ["cluster_local_id"]` is a list being passed as a single argument. Use `groupBy(*(group + ["cluster_local_id"]))` (or `groupBy(*group, "cluster_local_id")`).

Suggested change:

```diff
-        df.groupBy(group + ["cluster_local_id"])
+        df.groupBy(*(group + ["cluster_local_id"]))
```
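The list-vs-unpacking distinction behind these suggestions can be demonstrated without Spark at all. The toy `group_by` below mimics the signature shape of a strict varargs function (column names are illustrative):

```python
def group_by(*cols):
    """Toy stand-in for a varargs API such as DataFrame.groupBy."""
    return cols

group = ["detector", "direction"]

# Passing the list directly yields a one-element tuple containing the list:
print(group_by(group))    # (['detector', 'direction'],)

# Unpacking spreads each column name into its own argument:
print(group_by(*group))   # ('detector', 'direction')

# Extra columns can be appended before unpacking:
print(group_by(*(group + ["cluster_local_id"])))
# ('detector', 'direction', 'cluster_local_id')
```

A varargs callee that validates each positional argument as a column name will choke on the nested list in the first call, which is the runtime failure the reviewer anticipates.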
Copilot AI commented on Feb 25, 2026:

`F.try_divide` is not available in all supported PySpark versions. Since `install_requires` doesn't constrain a minimum PySpark version, this can break at runtime on older environments. Either add a minimum `pyspark` version constraint that guarantees `try_divide`, or implement the safe division with `when`/`otherwise` so it works across versions.
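One way to follow the comment's second option is a small shim that prefers `F.try_divide` when present and otherwise falls back to `when`. This is a sketch, assuming `F` is the `pyspark.sql.functions` module passed in by the caller (the helper name is hypothetical):

```python
def safe_divide(F, numerator, denominator):
    """Divide, yielding NULL on a zero denominator, across PySpark versions."""
    if hasattr(F, "try_divide"):  # present in newer PySpark releases
        return F.try_divide(numerator, denominator)
    # Fallback: when() with no matching branch defaults to NULL, mirroring
    # try_divide's null-on-zero behaviour. With Column arguments the division
    # expression is lazy, so it is never evaluated for zero denominators.
    return F.when(denominator != 0, numerator / denominator)
```

Note the fallback relies on Spark's lazy column expressions; with plain Python numbers the division would be evaluated eagerly.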
Copilot AI commented on Feb 25, 2026:

Calling `DataFrame.count()` inside `__repr__` triggers a Spark job and can be extremely expensive/unexpected (and may even fail if the session is stopped). Consider avoiding actions in `__repr__` (e.g., omit the row count or show a cached/known value only).

Suggested change:

```diff
-        if self._result is not None:
-            rows = self._result.count()
-        else:
-            rows = self._df.count()
-        return f"TSM(status={status}, rows={rows})"
+        return f"TSM(status={status})"
```
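An action-free `__repr__` along the lines of the suggestion might look like the following skeleton (the class body is simplified for illustration and is not the PR's actual implementation):

```python
class TSM:
    def __init__(self, df):
        self._df = df
        self._result = None  # populated once clustering has run

    def __repr__(self):
        # Deliberately no Spark action here: DataFrame.count() would launch
        # a full job and can fail outright if the session has been stopped.
        status = "fitted" if self._result is not None else "unfitted"
        return f"TSM(status={status})"

print(repr(TSM(df=None)))  # TSM(status=unfitted)
```

Reporting only locally-known state keeps `repr()` cheap enough for debuggers and notebooks, which call it implicitly and often.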
`pyspark` is added without a minimum version constraint, but the code relies on newer APIs (e.g., `pyspark.sql.functions.try_divide`). To prevent runtime incompatibilities, consider pinning/declaring a minimum supported PySpark version here.
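If the maintainers go the pinning route, the `install_requires` list could carry a version floor. The exact minimum shown here (3.5, where `try_divide` is believed to have been introduced) is an assumption to verify against the PySpark changelog:

```python
# Sketch of the dependency list with an assumed minimum PySpark version.
install_requires = [
    "geopandas",
    "shapely",
    "folium",
    "pyspark>=3.5",  # assumed floor: F.try_divide is believed to require >= 3.5
    "matplotlib",
    "pandas",
]
```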