Merged
16 changes: 7 additions & 9 deletions blog/27.md
@@ -3,15 +3,13 @@ title: "Stream4Graph: Incremental Computation on Dynamic Graphs"
date: "2025-3-11"
---

-![](/graph/1740982328260-3a0ff09e-920b-4f55-af14-326b5d0a358c.png)

> Author: Zhang Qi

It's well known that when we need to perform correlation analysis on data, we typically use SQL join operations. However, Cartesian product calculations during SQL joins require maintaining a large number of intermediate results, which significantly impacts overall data analysis performance. In contrast, graph-based approaches maintain data correlations, transforming correlation analysis into graph traversal operations and greatly reducing the cost of data analysis.

However, with the continuous growth in data scale and increasing demand for real-time processing, efficiently solving real-time computation problems on large-scale graph data has become increasingly urgent. Traditional computing engines such as Spark and Flink are gradually falling short of meeting the growing business demands for graph data processing. Therefore, designing a real-time processing engine tailored for large-scale graph data will bring significant advancements to big data processing technologies.

-Stream graph computing engine [GeaFlow](https://github.com/TuGraph-family/tugraph-analytics), which combines the technical advantages of graph processing and stream processing. It implements incremental computation capabilities on dynamic graphs, enhancing real-time performance in high-performance correlation analysis. In the following sections, we will introduce the characteristics of graph computing technology, how the industry addresses large-scale real-time graph computing challenges, and GeaFlow's performance in dynamic graph computation.
+Stream graph computing engine [Apache GeaFlow (Incubating)](https://github.com/apache/geaflow), which combines the technical advantages of graph processing and stream processing. It implements incremental computation capabilities on dynamic graphs, enhancing real-time performance in high-performance correlation analysis. In the following sections, we will introduce the characteristics of graph computing technology, how the industry addresses large-scale real-time graph computing challenges, and GeaFlow's performance in dynamic graph computation.

<!-- truncate -->

@@ -96,19 +94,19 @@ However, this approach has a significant drawback: it involves redundant computa

<font style="color:rgb(64, 64, 64);"></font>

-## 4. Incremental Dynamic Graph Computing: GeaFlow
+## 4. Incremental Dynamic Graph Computing: Apache GeaFlow (Incubating)

-We know that in traditional stream computing engines like Flink, the processing model allows the system to handle continuously incoming data events. When processing each event, Flink can evaluate changes and execute computations only on the changed parts. This means that in incremental computing, Flink focuses on the latest incoming data rather than the entire dataset. Inspired by Flink’s incremental computing, we developed the incremental graph computing system GeaFlow (also known as the stream graph computing engine), which effectively supports incremental graph iterative computation.
+We know that in traditional stream computing engines like Flink, the processing model allows the system to handle continuously incoming data events. When processing each event, Flink can evaluate changes and execute computations only on the changed parts. This means that in incremental computing, Flink focuses on the latest incoming data rather than the entire dataset. Inspired by Flink’s incremental computing, we developed the incremental graph computing system Apache GeaFlow (Incubating) (also known as the stream graph computing engine), which effectively supports incremental graph iterative computation.

<font style="color:rgb(64, 64, 64);"></font>

-How does GeaFlow implement incremental graph computing? First, real-time data is input into GeaFlow through connectors. GeaFlow generates internal node-edge structure data based on the real-time data and inserts this data into the underlying graph. Nodes involved in the real-time data within the current window are activated, triggering graph iterative computation.
+How does Apache GeaFlow (Incubating) implement incremental graph computing? First, real-time data is input into GeaFlow through connectors. GeaFlow generates internal node-edge structure data based on the real-time data and inserts this data into the underlying graph. Nodes involved in the real-time data within the current window are activated, triggering graph iterative computation.

Using the WCC algorithm as an example, for the connected components algorithm, in a time window, each edge’s src id and tar id vertices are activated. In the first iteration, their id information is sent to neighboring nodes. If a neighboring node receives the message and finds that it needs to update its information, it continues to notify its neighbors; otherwise, its iteration terminates.
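
The window-by-window WCC iteration described above can be sketched in a few lines. This is a minimal single-process illustration in Python, not GeaFlow's API: the class `IncrementalWCC`, the method `apply_window`, and the choice of the smallest vertex id as the component label are all assumptions made for the example, and it only handles edge insertions.

```python
from collections import defaultdict


class IncrementalWCC:
    """Toy model of incremental connected components (hypothetical, insert-only)."""

    def __init__(self):
        self.adj = defaultdict(set)  # undirected adjacency lists
        self.label = {}              # vertex id -> component label (smallest id seen)

    def apply_window(self, edges):
        """Insert one window of (src, tar) edges and iterate until no label changes."""
        active = set()
        for src, tar in edges:
            self.adj[src].add(tar)
            self.adj[tar].add(src)
            self.label.setdefault(src, src)
            self.label.setdefault(tar, tar)
            # Only the endpoints of edges in this window are activated.
            active.update((src, tar))

        iterations = 0
        while active:
            iterations += 1
            # Each active vertex sends its current label to its neighbors.
            inbox = defaultdict(list)
            for v in active:
                for n in self.adj[v]:
                    inbox[n].append(self.label[v])
            # A vertex that receives a smaller label updates itself and stays
            # active for the next round; all other vertices stop iterating.
            active = set()
            for v, msgs in inbox.items():
                best = min(msgs)
                if best < self.label[v]:
                    self.label[v] = best
                    active.add(v)
        return iterations
```

In this sketch, vertices that are far from the new edges and whose label does not change never become active, which is the saving over recomputing the whole graph for every window.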

![](https://intranetproxy.alipay.com/skylark/lark/0/2025/png/314644/1740471552771-36ee8f06-d58e-4cb7-914d-c44e151575a0.png)

-## 5. GeaFlow Architecture Overview
+## 5. Apache GeaFlow (Incubating) Architecture Overview

The GeaFlow engine consists of three main parts: DSL, Framework, and State. It also provides users with Stream API, Static Graph API, and Dynamic Graph API. The DSL layer is responsible for parsing and optimizing graph query languages like SQL+ISO/GQL, as well as schema inference. It also supports various Connectors such as Hive, Hudi, Kafka, and ODPS. The Framework layer handles runtime scheduling, fault tolerance, shuffle, and coordination of components. The State layer is responsible for storing underlying graph data and persistence, as well as performance optimizations like indexing and predicate pushdown.

@@ -218,6 +216,6 @@ The GeaFlow project is fully open-sourced. We have built some of the foundationa

## References

-1. GeaFlow Project:[https://github.com/TuGraph-family/tugraph-analytics](https://github.com/TuGraph-family/tugraph-analytics)
+1. Apache GeaFlow (Incubating) Project:[https://github.com/apache/geaflow](https://github.com/apache/geaflow)
2. soc-Livejournal Dataset:[https://snap.stanford.edu/data/soc-LiveJournal1.html](https://snap.stanford.edu/data/soc-LiveJournal1.html)
-3. GeaFlow Issues:[https://github.com/TuGraph-family/tugraph-analytics/issues](https://github.com/TuGraph-family/tugraph-analytics/issues)
+3. Apache GeaFlow (Incubating) Issues:[https://github.com/apache/geaflow/issues](https://github.com/apache/geaflow/issues)
19 changes: 4 additions & 15 deletions blog/28.md
@@ -3,23 +3,15 @@ title: Principles and Applications of Incremental Match in Streaming Graph Compu
date: 2025-6-3
---

-![](/graph/1743162676746-973d8e75-11b5-43d7-8832-724e7332b964.png)

## Problem Background
In streaming computing, data rarely arrives all at once but is continuously input and processed. Similarly, in graph computing/graph querying scenarios, vertices and edges are constantly read from data sources to construct graphs incrementally. In incremental graph queries, the graph evolves continuously, leading to different query results across graph versions. When new vertices/edges form an updated graph version, recomputing through the entire graph incurs high overhead and duplicates historical computations. Since historical data has already been processed, ideally only the delta-affected portions should be computed/queried without full-graph re-execution.

<!-- truncate -->

-<font style="color:rgb(51, 51, 51);">GQL (Graph Query Language)</font> <font style="color:rgb(0, 0, 0);">is an international standard developed by ISO for graph query languages,</font> <font style="color:rgb(51, 51, 51);">used to execute queries on graphs. Geaflow is an open-source streaming graph engine by Ant Group’s graph computing team, specializing in dynamically changing graph data and supporting large-scale, high-concurrency real-time graph computing scenarios.</font> This article introduces Geaflow’s approach to incremental GQL-based Match queries on dynamic graphs, aiming to execute queries solely on delta data while avoiding redundant full computations.

-![画板](https://intranetproxy.alipay.com/skylark/lark/0/2025/jpeg/23857192/1741574572676-ff7e2c56-14d0-470c-b21d-604f928c6ec9.jpeg)
+<font style="color:rgb(51, 51, 51);">GQL (Graph Query Language)</font> <font style="color:rgb(0, 0, 0);">is an international standard developed by ISO for graph query languages,</font> <font style="color:rgb(51, 51, 51);">used to execute queries on graphs. Apache GeaFlow (Incubating) is an open-source streaming graph engine, specializing in dynamically changing graph data and supporting large-scale, high-concurrency real-time graph computing scenarios.</font> This article introduces GeaFlow’s approach to incremental GQL-based Match queries on dynamic graphs, aiming to execute queries solely on delta data while avoiding redundant full computations.

## Current Challenges
-<font style="color:rgb(0, 0, 0);">The Geaflow engine adopts a vertex-centric framework, where each vertex sends messages iteratively. Vertices process received messages in subsequent iterations.</font> For GQL queries, traversal starts from initial vertices for pattern matching (e.g., from node `A` to `B` to `C`). In dynamic graphs, if only newly added vertices/edges trigger computation, results may be incomplete, as illustrated below:

-<div style="text-align: center;">
-<img src="https://intranetproxy.alipay.com/skylark/lark/0/2025/jpeg/23857192/1741576149930-b169b7da-0600-4fca-b6ad-5eadcfdbff5b.jpeg" alt='画板' height="281" width="486">
-</div>
+<font style="color:rgb(0, 0, 0);">The Apache GeaFlow (Incubating) engine adopts a vertex-centric framework, where each vertex sends messages iteratively. Vertices process received messages in subsequent iterations.</font> For GQL queries, traversal starts from initial vertices for pattern matching (e.g., from node `A` to `B` to `C`). In dynamic graphs, if only newly added vertices/edges trigger computation, results may be incomplete, as illustrated below:

The key issue is that **Vertex A1 cannot trigger computation if only the delta is considered**, yet it belongs to the incremental results. To resolve this, we propose a subgraph expansion method from new vertices. The query is divided into two phases:
1. **Evolve Phase**: Propagate `EvolveMessage` from new vertices to neighbors, adding recipients to the `EvolveVertices` set.
@@ -72,17 +64,14 @@ public void compute(Object vertexId, Iterator<MessageBox> messageIterator) {
}
```

-**Visualization:**
-![画板](https://intranetproxy.alipay.com/skylark/lark/0/2024/jpeg/23857192/1734590557540-5f3f4528-fa07-4208-8425-bc514ea5e06b.jpeg)

**Evolve Conditions:**
- Query iterations `>2` (no Evolve needed for ≤2 hops).
- Query iterations `≤ Threshold`.
- `windowId >1` (skip initial graph construction).
- No starting vertex filter in GQL (e.g., `Match(a:person where a.id=1)` excludes Evolve).
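
The four conditions above amount to a single boolean check made before the Evolve phase runs. The sketch below is illustrative only: the function `should_evolve` and its parameter names are hypothetical and are not taken from the GeaFlow code base.

```python
def should_evolve(query_iterations, threshold, window_id, has_start_vertex_filter):
    """Return True when the Evolve phase should run for the current window.

    All names here are assumptions made for illustration.
    """
    return (
        query_iterations > 2               # no Evolve needed for queries of <= 2 hops
        and query_iterations <= threshold  # skip overly deep queries
        and window_id > 1                  # the first window only builds the initial graph
        and not has_start_vertex_filter    # e.g. Match(a:person where a.id=1)
    )
```

For example, a query with 3 iterations in window 2, no starting vertex filter, and a threshold of 5 satisfies every condition, while the same query in window 1 does not.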

## Demo
-In Geaflow, configure incremental graphs via `windowSize` for vertex/edge tables:
+In Apache GeaFlow (Incubating), configure incremental graphs via `windowSize` for vertex/edge tables:

```sql
CREATE GRAPH modern (
@@ -160,4 +149,4 @@ In this demo, vertex window size is 20, and edge window size is 3, meaning each

## Conclusion and Outlook

-In dynamic/streaming graph scenarios, graph nodes and edges change in real time. When querying such graphs, we can often trigger computation only on the incremental part using historical information, avoiding full graph traversal. Geaflow uses a subgraph expansion-based incremental match method, applied within a vertex-centric distributed graph computing framework, to support incremental querying in dynamic graph scenarios. In the future, we aim to implement more complex incremental matching logic for advanced use cases.
+In dynamic/streaming graph scenarios, graph nodes and edges change in real time. When querying such graphs, we can often trigger computation only on the incremental part using historical information, avoiding full graph traversal. Apache GeaFlow (Incubating) uses a subgraph expansion-based incremental match method, applied within a vertex-centric distributed graph computing framework, to support incremental querying in dynamic graph scenarios. In the future, we aim to implement more complex incremental matching logic for advanced use cases.
16 changes: 8 additions & 8 deletions blog/29.md
@@ -1,5 +1,5 @@
---
-title: "Exploring GeaFlow's Temporal Capabilities — Breathing New Life into Time-Series Data!"
+title: "Exploring Apache GeaFlow (Incubating)'s Temporal Capabilities — Breathing New Life into Time-Series Data!"
date: 2025-6-25
---

@@ -24,9 +24,9 @@ In today's digital era, data has become a core resource driving decisions and in
- **Lack of Flexibility**
&nbsp;&nbsp;&nbsp;&nbsp;Many tools support only one type of analysis and cannot concurrently process real-time streams and historical data.

-&nbsp;&nbsp;&nbsp;&nbsp;To address these issues, GeaFlow innovatively introduces temporal graph computing. As a distributed stream-graph engine designed for dynamic data, GeaFlow efficiently tackles challenges posed by evolving datasets. For dynamically changing graph structures, users can seamlessly perform operations like graph traversal, pattern matching, and computations—meeting complex analytical needs. By integrating temporal dimensions with dynamic graph processing, GeaFlow offers a groundbreaking solution for real-time analytics, empowering users to extract deeper value from dynamic data.
+&nbsp;&nbsp;&nbsp;&nbsp;To address these issues, Apache GeaFlow (Incubating) innovatively introduces temporal graph computing. As a distributed stream-graph engine designed for dynamic data, GeaFlow efficiently tackles challenges posed by evolving datasets. For dynamically changing graph structures, users can seamlessly perform operations like graph traversal, pattern matching, and computations—meeting complex analytical needs. By integrating temporal dimensions with dynamic graph processing, GeaFlow offers a groundbreaking solution for real-time analytics, empowering users to extract deeper value from dynamic data.

-## What Is GeaFlow?
+## What Is Apache GeaFlow (Incubating)?

GeaFlow is a powerful distributed computing platform that combines graph computing and stream processing to handle dynamic graphs and temporal data efficiently. It supports complex graph algorithms and real-time analytics, making it ideal for dynamic scenarios. Key features include:

@@ -66,8 +66,8 @@ They complement each other:
- **Temporal Graphs Enhance Stream Analysis**
Timestamps enable complex operations like trend prediction and window-based analytics.

-### **4. GeaFlow’s Implementation**
-GeaFlow unifies stream and temporal graphs through:
+### **4. Apache GeaFlow (Incubating)’s Implementation**
+Apache GeaFlow (Incubating) unifies stream and temporal graphs through:
- **Timestamp Assignment**
Assigns *processing time* or *event time* to all data.
- **Dynamic Updates & Historical Retention**
@@ -176,10 +176,10 @@ a_id | e1_ts | b_id | e2_ts | c_id
- **Flexible**: SQL-like syntax lowers development barriers.
- **Scalable**: Handles massive dynamic graphs via incremental computation.

-## Core Highlights of GeaFlow’s Temporal Capabilities
+## Core Highlights of Apache GeaFlow (Incubating)’s Temporal Capabilities

### 1. Time-Aware Data Processing
-Timestamps enable precision. GeaFlow supports:
+Timestamps enable precision. Apache GeaFlow (Incubating) supports:
- **5-Minute Trend Analysis**: Track real-time interaction frequency shifts.
- **24-Hour Dynamic Patterns**: Identify long-term trends (e.g., user purchase behavior).

@@ -203,7 +203,7 @@ Optimized temporal algorithms:

Dynamic data holds immense value, and GeaFlow’s temporal capabilities unlock it. Whether you’re a novice or an expert, GeaFlow empowers you to harness time-series data.

-**Download GeaFlow today and explore the power of temporal analytics!**
+**Download Apache GeaFlow (Incubating) today and explore the power of temporal analytics!**

---
