diff --git a/blog/27.md b/blog/27.md
index f694dbf5f..5ae7dd933 100644
--- a/blog/27.md
+++ b/blog/27.md
@@ -3,15 +3,13 @@ title: "Stream4Graph: Incremental Computation on Dynamic Graphs"
date: "2025-3-11"
---
-
-
> Author: Zhang Qi
It's well known that when we need to perform correlation analysis on data, we typically use SQL join operations. However, Cartesian product calculations during SQL joins require maintaining a large number of intermediate results, which significantly impacts overall data analysis performance. In contrast, graph-based approaches maintain data correlations, transforming correlation analysis into graph traversal operations and greatly reducing the cost of data analysis.
However, with the continuous growth in data scale and increasing demand for real-time processing, efficiently solving real-time computation problems on large-scale graph data has become increasingly urgent. Traditional computing engines such as Spark and Flink are gradually falling short of meeting the growing business demands for graph data processing. Therefore, designing a real-time processing engine tailored for large-scale graph data will bring significant advancements to big data processing technologies.
-Stream graph computing engine [GeaFlow](https://github.com/TuGraph-family/tugraph-analytics), which combines the technical advantages of graph processing and stream processing. It implements incremental computation capabilities on dynamic graphs, enhancing real-time performance in high-performance correlation analysis. In the following sections, we will introduce the characteristics of graph computing technology, how the industry addresses large-scale real-time graph computing challenges, and GeaFlow's performance in dynamic graph computation.
+[Apache GeaFlow (Incubating)](https://github.com/apache/geaflow) is a stream graph computing engine that combines the technical advantages of graph processing and stream processing. It implements incremental computation on dynamic graphs, improving the real-time performance of correlation analysis. In the following sections, we will introduce the characteristics of graph computing technology, how the industry addresses large-scale real-time graph computing challenges, and GeaFlow's performance in dynamic graph computation.
@@ -96,19 +94,19 @@ However, this approach has a significant drawback: it involves redundant computa
-## 4. Incremental Dynamic Graph Computing: GeaFlow
+## 4. Incremental Dynamic Graph Computing: Apache GeaFlow (Incubating)
-We know that in traditional stream computing engines like Flink, the processing model allows the system to handle continuously incoming data events. When processing each event, Flink can evaluate changes and execute computations only on the changed parts. This means that in incremental computing, Flink focuses on the latest incoming data rather than the entire dataset. Inspired by Flink’s incremental computing, we developed the incremental graph computing system GeaFlow (also known as the stream graph computing engine), which effectively supports incremental graph iterative computation.
+In traditional stream computing engines like Flink, the processing model allows the system to handle continuously incoming data events. When processing each event, Flink can evaluate changes and execute computations only on the changed parts. This means that in incremental computing, Flink focuses on the latest incoming data rather than the entire dataset. Inspired by Flink’s incremental computing, we developed the incremental graph computing system Apache GeaFlow (Incubating), also known as the stream graph computing engine, which effectively supports incremental iterative graph computation.
-How does GeaFlow implement incremental graph computing? First, real-time data is input into GeaFlow through connectors. GeaFlow generates internal node-edge structure data based on the real-time data and inserts this data into the underlying graph. Nodes involved in the real-time data within the current window are activated, triggering graph iterative computation.
+How does Apache GeaFlow (Incubating) implement incremental graph computing? First, real-time data is input into GeaFlow through connectors. GeaFlow generates internal node-edge structure data based on the real-time data and inserts this data into the underlying graph. Nodes involved in the real-time data within the current window are activated, triggering graph iterative computation.
Take the weakly connected components (WCC) algorithm as an example. Within a time window, the source (src id) and target (tar id) vertices of each new edge are activated. In the first iteration, they send their id information to neighboring nodes. If a neighboring node receives a message and finds that it needs to update its own information, it in turn notifies its neighbors; otherwise, its iteration terminates.
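
The activation-and-propagation loop described above can be sketched in a few lines of Java. This is a minimal single-process sketch, not GeaFlow's API: the class `IncrementalWccSketch`, its methods, and the choice of the smallest vertex id as the component label are assumptions made for illustration, and it handles edge insertions only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of incremental WCC; not part of GeaFlow. */
class IncrementalWccSketch {
    // Undirected adjacency of the graph accumulated over all windows so far.
    private final Map<Integer, Set<Integer>> neighbors = new HashMap<>();
    // Component label per vertex (assumption: smallest vertex id in the component).
    private final Map<Integer, Integer> label = new HashMap<>();

    /** Applies one window of new edges {src, dst}; returns vertices whose label changed. */
    Set<Integer> applyWindow(List<int[]> newEdges) {
        Set<Integer> active = new HashSet<>();
        for (int[] e : newEdges) {
            for (int v : e) {
                neighbors.computeIfAbsent(v, k -> new HashSet<>());
                label.putIfAbsent(v, v);   // a new vertex starts as its own component
                active.add(v);             // only endpoints of new edges are activated
            }
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }

        Set<Integer> changed = new HashSet<>();
        while (!active.isEmpty()) {
            Set<Integer> next = new HashSet<>();
            for (int v : active) {
                int mine = label.get(v);
                // "Send" the label to each neighbor; a neighbor that improves
                // its label becomes active in the next iteration.
                for (int n : neighbors.get(v)) {
                    if (mine < label.get(n)) {
                        label.put(n, mine);
                        next.add(n);
                        changed.add(n);
                    }
                }
            }
            active = next;   // vertices that did not update stop iterating
        }
        return changed;
    }

    int componentOf(int v) {
        return label.get(v);
    }
}
```

Calling `applyWindow` once per window keeps the work tied to the vertices reachable from the new edges whose labels actually change; vertices in untouched components are never visited.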

-## 5. GeaFlow Architecture Overview
+## 5. Apache GeaFlow (Incubating) Architecture Overview
The GeaFlow engine consists of three main parts: DSL, Framework, and State. It also provides users with Stream API, Static Graph API, and Dynamic Graph API. The DSL layer is responsible for parsing and optimizing graph query languages like SQL+ISO/GQL, as well as schema inference. It also supports various Connectors such as Hive, Hudi, Kafka, and ODPS. The Framework layer handles runtime scheduling, fault tolerance, shuffle, and coordination of components. The State layer is responsible for storing underlying graph data and persistence, as well as performance optimizations like indexing and predicate pushdown.
@@ -218,6 +216,6 @@ The GeaFlow project is fully open-sourced. We have built some of the foundationa
## References
-1. GeaFlow Project:[https://github.com/TuGraph-family/tugraph-analytics](https://github.com/TuGraph-family/tugraph-analytics)
+1. Apache GeaFlow (Incubating) Project: [https://github.com/apache/geaflow](https://github.com/apache/geaflow)
2. soc-LiveJournal Dataset: [https://snap.stanford.edu/data/soc-LiveJournal1.html](https://snap.stanford.edu/data/soc-LiveJournal1.html)
-3. GeaFlow Issues:[https://github.com/TuGraph-family/tugraph-analytics/issues](https://github.com/TuGraph-family/tugraph-analytics/issues)
+3. Apache GeaFlow (Incubating) Issues: [https://github.com/apache/geaflow/issues](https://github.com/apache/geaflow/issues)
diff --git a/blog/28.md b/blog/28.md
index b0ce2f55d..b1453f7db 100644
--- a/blog/28.md
+++ b/blog/28.md
@@ -3,23 +3,15 @@ title: Principles and Applications of Incremental Match in Streaming Graph Compu
date: 2025-6-3
---
-
-
## Problem Background
In streaming computing, data rarely arrives all at once but is continuously input and processed. Similarly, in graph computing/graph querying scenarios, vertices and edges are constantly read from data sources to construct graphs incrementally. In incremental graph queries, the graph evolves continuously, leading to different query results across graph versions. When new vertices/edges form an updated graph version, recomputing through the entire graph incurs high overhead and duplicates historical computations. Since historical data has already been processed, ideally only the delta-affected portions should be computed/queried without full-graph re-execution.
-GQL (Graph Query Language) is an international standard developed by ISO for graph query languages, used to execute queries on graphs. Geaflow is an open-source streaming graph engine by Ant Group’s graph computing team, specializing in dynamically changing graph data and supporting large-scale, high-concurrency real-time graph computing scenarios. This article introduces Geaflow’s approach to incremental GQL-based Match queries on dynamic graphs, aiming to execute queries solely on delta data while avoiding redundant full computations.
-
-
+GQL (Graph Query Language) is an international standard developed by ISO for graph query languages, used to execute queries on graphs. Apache GeaFlow (Incubating) is an open-source streaming graph engine, specializing in dynamically changing graph data and supporting large-scale, high-concurrency real-time graph computing scenarios. This article introduces GeaFlow’s approach to incremental GQL-based Match queries on dynamic graphs, aiming to execute queries solely on delta data while avoiding redundant full computations.
## Current Challenges
-The Geaflow engine adopts a vertex-centric framework, where each vertex sends messages iteratively. Vertices process received messages in subsequent iterations. For GQL queries, traversal starts from initial vertices for pattern matching (e.g., from node `A` to `B` to `C`). In dynamic graphs, if only newly added vertices/edges trigger computation, results may be incomplete, as illustrated below:
-
-
-

-
+The Apache GeaFlow (Incubating) engine adopts a vertex-centric framework, where each vertex sends messages iteratively. Vertices process received messages in subsequent iterations. For GQL queries, traversal starts from initial vertices for pattern matching (e.g., from node `A` to `B` to `C`). In dynamic graphs, if only newly added vertices/edges trigger computation, results may be incomplete, as illustrated below:
The key issue is that **Vertex A1 cannot trigger computation if only the delta is considered**, yet it belongs to the incremental results. To resolve this, we propose a subgraph expansion method from new vertices. The query is divided into two phases:
1. **Evolve Phase**: Propagate `EvolveMessage` from new vertices to neighbors, adding recipients to the `EvolveVertices` set.
@@ -72,9 +64,6 @@ public void compute(Object vertexId, Iterator messageIterator) {
}
```
-**Visualization:**
-
-
**Evolve Conditions:**
- Query iterations `>2` (no Evolve needed for ≤2 hops).
- Query iterations `≤ Threshold`.
@@ -82,7 +71,7 @@ public void compute(Object vertexId, Iterator messageIterator) {
- No starting vertex filter in GQL (e.g., `Match(a:person where a.id=1)` excludes Evolve).
## Demo
-In Geaflow, configure incremental graphs via `windowSize` for vertex/edge tables:
+In Apache GeaFlow (Incubating), configure incremental graphs via `windowSize` for vertex/edge tables:
```sql
CREATE GRAPH modern (
@@ -160,4 +149,4 @@ In this demo, vertex window size is 20, and edge window size is 3, meaning each
## Conclusion and Outlook
-In dynamic/streaming graph scenarios, graph nodes and edges change in real time. When querying such graphs, we can often trigger computation only on the incremental part using historical information, avoiding full graph traversal. Geaflow uses a subgraph expansion-based incremental match method, applied within a vertex-centric distributed graph computing framework, to support incremental querying in dynamic graph scenarios. In the future, we aim to implement more complex incremental matching logic for advanced use cases.
\ No newline at end of file
+In dynamic/streaming graph scenarios, graph nodes and edges change in real time. When querying such graphs, we can often trigger computation only on the incremental part using historical information, avoiding full graph traversal. Apache GeaFlow (Incubating) uses a subgraph expansion-based incremental match method, applied within a vertex-centric distributed graph computing framework, to support incremental querying in dynamic graph scenarios. In the future, we aim to implement more complex incremental matching logic for advanced use cases.
\ No newline at end of file
diff --git a/blog/29.md b/blog/29.md
index b537fceaa..120fb71d0 100644
--- a/blog/29.md
+++ b/blog/29.md
@@ -1,5 +1,5 @@
---
-title: "Exploring GeaFlow's Temporal Capabilities — Breathing New Life into Time-Series Data!"
+title: "Exploring Apache GeaFlow (Incubating)'s Temporal Capabilities — Breathing New Life into Time-Series Data!"
date: 2025-6-25
---
@@ -24,9 +24,9 @@ In today's digital era, data has become a core resource driving decisions and in
- **Lack of Flexibility**
Many tools support only one type of analysis and cannot concurrently process real-time streams and historical data.
- To address these issues, GeaFlow innovatively introduces temporal graph computing. As a distributed stream-graph engine designed for dynamic data, GeaFlow efficiently tackles challenges posed by evolving datasets. For dynamically changing graph structures, users can seamlessly perform operations like graph traversal, pattern matching, and computations—meeting complex analytical needs. By integrating temporal dimensions with dynamic graph processing, GeaFlow offers a groundbreaking solution for real-time analytics, empowering users to extract deeper value from dynamic data.
+ To address these issues, Apache GeaFlow (Incubating) innovatively introduces temporal graph computing. As a distributed stream-graph engine designed for dynamic data, GeaFlow efficiently tackles challenges posed by evolving datasets. For dynamically changing graph structures, users can seamlessly perform operations like graph traversal, pattern matching, and computations—meeting complex analytical needs. By integrating temporal dimensions with dynamic graph processing, GeaFlow offers a groundbreaking solution for real-time analytics, empowering users to extract deeper value from dynamic data.
-## What Is GeaFlow?
+## What Is Apache GeaFlow (Incubating)?
GeaFlow is a powerful distributed computing platform that combines graph computing and stream processing to handle dynamic graphs and temporal data efficiently. It supports complex graph algorithms and real-time analytics, making it ideal for dynamic scenarios. Key features include:
@@ -66,8 +66,8 @@ They complement each other:
- **Temporal Graphs Enhance Stream Analysis**
Timestamps enable complex operations like trend prediction and window-based analytics.
-### **4. GeaFlow’s Implementation**
-GeaFlow unifies stream and temporal graphs through:
+### **4. Apache GeaFlow (Incubating)’s Implementation**
+Apache GeaFlow (Incubating) unifies stream and temporal graphs through:
- **Timestamp Assignment**
Assigns *processing time* or *event time* to all data.
- **Dynamic Updates & Historical Retention**
@@ -176,10 +176,10 @@ a_id | e1_ts | b_id | e2_ts | c_id
- **Flexible**: SQL-like syntax lowers development barriers.
- **Scalable**: Handles massive dynamic graphs via incremental computation.
-## Core Highlights of GeaFlow’s Temporal Capabilities
+## Core Highlights of Apache GeaFlow (Incubating)’s Temporal Capabilities
### 1. Time-Aware Data Processing
-Timestamps enable precision. GeaFlow supports:
+Timestamps enable precision. Apache GeaFlow (Incubating) supports:
- **5-Minute Trend Analysis**: Track real-time interaction frequency shifts.
- **24-Hour Dynamic Patterns**: Identify long-term trends (e.g., user purchase behavior).
@@ -203,7 +203,7 @@ Optimized temporal algorithms:
Dynamic data holds immense value, and GeaFlow’s temporal capabilities unlock it. Whether you’re a novice or an expert, GeaFlow empowers you to harness time-series data.
-**Download GeaFlow today and explore the power of temporal analytics!**
+**Download Apache GeaFlow (Incubating) today and explore the power of temporal analytics!**
---
diff --git a/blog/30.md b/blog/30.md
index 5e6f58ee2..ab91ec2fb 100644
--- a/blog/30.md
+++ b/blog/30.md
@@ -16,10 +16,6 @@ date: 2025-5-15
-
-
-**Figure 1: Performance Difference Between SQL Join and GQL Graph Hop Queries**
-
### 2. Data Constraints
**Efficiency Constraint**: When association levels exceed 3 hops, the time complexity of traditional JOIN operations grows exponentially. Analytical models centered around multi-table JOINs gradually lose their advantage and become a "shackle" to efficiency.
@@ -28,9 +24,6 @@ date: 2025-5-15
**Innovation Constraint**: Business analysts often abandon graph technology stacks due to the need to learn GQL (Graph Query Language). The fragmented toolchain keeps graph analytics confined to technical departments, failing to empower front-line business teams.
-
-
-**Figure 2: JOIN vs GQL Expression Examples**
### 3. Breakthrough Strategy: Core Value of the Graph Data Warehouse
@@ -62,23 +55,18 @@ The Graph Data Warehouse Schema Converter automatically transforms the ER model
**Stage 3: Graph Assembly.** All vertices are merged, and edges bound to start nodes are naturally merged. Endpoint binding is optional. For two different graph conversion schemes, a difference vector can be calculated, representing how all tables map to entity changes.
-
-
-
-
-**Figure 3: ER to Graph Schema Conversion Example Series**
Through algorithmic analysis of inter-table associations and automatic graph construction, this provides a basis for migrating data from its original storage location to the graph data warehouse. It also significantly reduces manual data modeling and DSL scripting efforts, enabling fast migration of traditional warehouse data to a graph warehouse with no manual intervention and immediate analysis readiness.
### 2. Data Pipeline: Materialized Data Interaction Capabilities
-Similar to traditional data warehouses, the graph data warehouse leverages GeaFlow engine capabilities and TuMaker’s mature business platform to provide data task orchestration capabilities — organizing multiple data processing tasks (like data extraction, transformation, and loading) in a logical sequence and executing them automatically. Key features include visual interfaces, task scheduling, event triggers, error handling, monitoring and logging, version control and rollback, and intelligent cluster resource scheduling.
+Similar to traditional data warehouses, the graph data warehouse leverages Apache GeaFlow (Incubating) engine capabilities and TuMaker’s mature business platform to provide data task orchestration capabilities — organizing multiple data processing tasks (like data extraction, transformation, and loading) in a logical sequence and executing them automatically. Key features include visual interfaces, task scheduling, event triggers, error handling, monitoring and logging, version control and rollback, and intelligent cluster resource scheduling.
With the help of the Schema Converter, a materialization plan from table storage to graph storage can be generated, building a data pipeline between traditional and graph data warehouses. Based on the table-to-graph materialization plan, the system can automatically generate data sync task orchestrations according to actual business configurations like acceleration tables, relationships, fields, and permissions. These are then scheduled via the graph warehouse platform to achieve seamless data migration. Subsequent real-time updates and incremental syncs can be completed within ten minutes.
The data pipeline integrates deeply with mainstream big data ecosystems like ODPS/Hive/Paimon. It achieves full lifecycle data management through a three-tier architecture: at the data access layer, it automatically captures table changes, generates materialization plans, and syncs incremental mappings from tables to graph entities, currently managing graph data at the 10TB scale; at the conversion engine layer, it fully automates DSL task orchestration and schedules them to clusters; at the storage optimization layer, it supports proprietary and open-source graph storage solutions like CStore/GraphDB/RocksDB, validated in trillion-edge-scale super-large business graphs. Additionally, hot data preloading maintains second-level query response times even at the TB scale, truly enabling a full-stack transition from relational to graph warehouses, with SQL running on top of graphs.
-
+
**Figure 4: Open-Source Technical Architecture Overview**
@@ -90,9 +78,6 @@ Similar to traditional data warehouses, the graph data warehouse leverages GeaFl
Compared to traditional SQL queries that may take minutes to analyze user relationships through three table joins, graph path queries can complete the same task in seconds. This engine has been validated in typical business scenarios like short video analysis, membership growth, and customer rights services. In the future, it will expand to support complex subqueries and expression operations, allowing more developers to unlock graph computing power without crossing technical barriers.
-
-
-**Figure 5: SQL AST to GQL Structure Translation Difference Example**
## 3. Technical Advantages and Application Scenarios
@@ -114,7 +99,7 @@ Compared to traditional SQL queriesthat may
## 4. Future Outlook
-As a core carrier of next-generation data infrastructure, we plan to gradually open-source core capabilities like the graph storage engine, graph computing framework engine, and SQL-GQL translation module to build a developer-driven technical ecosystem. In 2023, we first open-sourced the streaming graph computing engine GeaFlow. In Q3 2025, we will release a standardized graph data analysis platform, high-performance graph computing engine, and support community developers in building connectors for heterogeneous data sources. This open collaboration accelerates technical iteration and positions the product as the best practice platform for the ISO/IEC 39075 GQL international standard, driving SQL-GQL hybrid queries to become an industry norm.
+As a core carrier of next-generation data infrastructure, we plan to gradually open-source core capabilities like the graph storage engine, graph computing framework engine, and SQL-GQL translation module to build a developer-driven technical ecosystem. In 2023, we first open-sourced the streaming graph computing engine Apache GeaFlow (Incubating). In Q3 2025, we will release a standardized graph data analysis platform, high-performance graph computing engine, and support community developers in building connectors for heterogeneous data sources. This open collaboration accelerates technical iteration and positions the product as the best practice platform for the ISO/IEC 39075 GQL international standard, driving SQL-GQL hybrid queries to become an industry norm.
On the technical evolution front, the next-generation engine will break through dynamic streaming graph computing bottlenecks to support trillion-edge incremental updates. By integrating vectorized computing engines, it can jointly query property graphs and vector graphs to meet AIGC-era multimodal analysis needs and enable revolutionary experiences like generating graph queries directly from natural language. Industry applications are rapidly expanding, and the graph data warehouse will soon become the core engine for most enterprise relational data analysis and intelligent decision-making.
diff --git a/blog/31.md b/blog/31.md
index c8ba7bf62..cbb2663c3 100644
--- a/blog/31.md
+++ b/blog/31.md
@@ -3,11 +3,9 @@ title: "Graph4Stream: Accelerating Stream Computing with Graph-Based Approaches"
date: 2025-3-25
---
-
-
> Author: Kunyu; Reviewer: Dongshuo.
-In a previous article ["Stream4Graph: Incremental Computation on Dynamic Graphs"](https://zhuanlan.zhihu.com/p/27618053733), we introduced how introducing incremental computation into graph computing—essentially combining "graphs + streams"—allowed GeaFlow to significantly outperform Spark GraphX in terms of performance. Now, the question arises: when we introduce graph computing capabilities into stream computing—combining "streams + graphs"—how does GeaFlow compare to Flink's associative computation performance?
+In a previous article, ["Stream4Graph: Incremental Computation on Dynamic Graphs"](https://zhuanlan.zhihu.com/p/27618053733), we showed how adding incremental computation to graph computing (essentially combining "graphs + streams") allowed Apache GeaFlow (Incubating) to significantly outperform Spark GraphX. Now the question arises: when we introduce graph computing capabilities into stream computing (combining "streams + graphs"), how does GeaFlow compare to Flink in associative computation performance?
In today’s era, data is being generated at an unprecedented speed and scale, and real-time processing of massive datasets has wide applications in various fields such as anomaly detection, search recommendations, and financial transactions. As one of the core technologies for real-time data processing, **stream computing** has become increasingly important.
@@ -15,7 +13,7 @@ In today’s era, data is being generated at an unprecedented speed and scale, a
Unlike batch processing, which waits for all data to arrive before computation, stream computing partitions continuously generated data streams into micro-batches and performs incremental computations on each batch. This computational characteristic gives stream computing high throughput and low latency. Common stream computing engines include Flink and Spark Streaming, both of which process data using tabular representations. However, as stream computing applications deepen, more and more scenarios involve computing complex relationships among large datasets, leading to significant performance degradation in table-based stream engines.
-GeaFlow, an open-source stream graph computing engine developed by Ant Group's graph computing team, combines graph and stream computing to provide an efficient framework for stream graph processing, significantly improving computational performance. Below, we will introduce the limitations of traditional stream computing engines in relational computation, explain the principles behind GeaFlow's efficiency, and present performance comparisons.
+Apache GeaFlow (Incubating), an open-source stream graph computing engine, combines graph and stream computing to provide an efficient framework for stream graph processing, significantly improving computational performance. Below, we will introduce the limitations of traditional stream computing engines in relational computation, explain the principles behind GeaFlow's efficiency, and present performance comparisons.
## Stream Computing Engine: Flink
@@ -78,7 +76,7 @@ The main performance bottleneck lies in scanning RightStateView. LeftStateView a
Flink Join Operator Implementation
-## Stream Graph Computing Engine: GeaFlow
+## Stream Graph Computing Engine: Apache GeaFlow (Incubating)
### Graph Computing & Stream Graphs
@@ -90,17 +88,17 @@ Table Modeling vs. Graph Modeling
A **stream graph** is the application of graph computing to streaming scenarios. It divides the graph into historical and incremental components based on data stream updates. For example, if the first two rows have been processed and we are now handling the third row, the historical graph is built from the first two rows, and the incremental graph is formed by the third row. Together, they constitute the full graph. Applying incremental graph algorithms on stream graphs enables efficient, real-time computation.
-### GeaFlow Architecture
+### Apache GeaFlow (Incubating) Architecture
The GeaFlow engine’s computation flow consists of stream data input, distributed incremental graph computation, and incremental result output. Like traditional stream engines, real-time data is sliced into micro-batches by window. For each batch, the data is parsed into vertices and edges to form an incremental graph. This incremental graph and the historical graph (built from previous data) together form the complete stream graph. The computation framework applies incremental graph algorithms on the stream graph to yield incremental results, which are then output and added to the historical graph.

-GeaFlow Incremental Computation
+Apache GeaFlow (Incubating) Incremental Computation
The GeaFlow computation framework is a vertex-centric iterative model. It starts with vertices in the incremental graph. In each iteration, each vertex maintains its own state and performs computation based on its associated historical and incremental graph data. The result is then passed to neighboring vertices via message passing to trigger the next iteration.
-Taking k-Hop as an example, the incremental algorithm works as follows: In the first iteration, all edges in the incremental graph are identified and treated as initial incoming and outgoing paths, which are sent to their start and end vertices. In subsequent iterations, these paths are extended. Once the desired hop count is reached, the paths are sent back to the starting vertex, where they are combined into final results. Detailed implementation can be found in the open-source repository file [IncKHopAlgorithm.java](https://github.com/TuGraph-family/tugraph-analytics/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java).
+Taking k-Hop as an example, the incremental algorithm works as follows: In the first iteration, all edges in the incremental graph are identified and treated as initial incoming and outgoing paths, which are sent to their start and end vertices. In subsequent iterations, these paths are extended. Once the desired hop count is reached, the paths are sent back to the starting vertex, where they are combined into final results. Detailed implementation can be found in the open-source repository file [IncKHopAlgorithm.java](https://github.com/apache/geaflow/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java).
The diagram below illustrates the two-hop case. In the first iteration, the edge B->C creates incoming and outgoing paths, sent to B and C, respectively. In the second iteration, B receives an incoming path, adds its own incoming edges, and forms a 2-hop incoming path, which it sends to itself. Similarly, C forms a 2-hop outgoing path and sends it to B. In the final iteration, B combines the incoming and outgoing paths to produce the new paths. Unlike Flink, which must scan all historical relationships, GeaFlow's computation is proportional to the incremental paths, not the historical data.
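
The two-hop case above can be approximated on a single machine to show why the cost follows the increment rather than the history. This is a simplified sketch and not the message-passing implementation in `IncKHopAlgorithm.java`: the class `IncrementalTwoHopSketch` and its `addEdge` method are hypothetical names, edges are inserted one at a time, and self-loops are assumed not to occur.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of incremental 2-hop path enumeration; not part of GeaFlow. */
class IncrementalTwoHopSketch {
    // Historical graph: outgoing and incoming neighbors per vertex.
    private final Map<String, Set<String>> out = new HashMap<>();
    private final Map<String, Set<String>> in = new HashMap<>();

    /** Adds edge src->dst and returns only the 2-hop paths that this edge creates. */
    List<String> addEdge(String src, String dst) {
        List<String> newPaths = new ArrayList<>();
        if (out.getOrDefault(src, Collections.<String>emptySet()).contains(dst)) {
            return newPaths;   // duplicate edge: nothing new to report
        }
        // Extend backwards: every historical edge x->src yields x->src->dst.
        for (String x : in.getOrDefault(src, Collections.<String>emptySet())) {
            newPaths.add(x + "->" + src + "->" + dst);
        }
        // Extend forwards: every historical edge dst->y yields src->dst->y.
        for (String y : out.getOrDefault(dst, Collections.<String>emptySet())) {
            newPaths.add(src + "->" + dst + "->" + y);
        }
        // Only now does the new edge become part of the historical graph.
        out.computeIfAbsent(src, k -> new LinkedHashSet<>()).add(dst);
        in.computeIfAbsent(dst, k -> new LinkedHashSet<>()).add(src);
        return newPaths;
    }
}
```

With historical edges `A->B` and `C->D` already inserted, `addEdge("B", "C")` touches only the in-edges of `B` and the out-edges of `C`, so the work depends on the degree of the two endpoints rather than on the size of the accumulated graph.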
@@ -169,7 +167,7 @@ RETURN ret
;
```
-## GeaFlow Performance Test
+## Apache GeaFlow (Incubating) Performance Test
To evaluate GeaFlow’s performance in stream graph computing, we designed a comparative experiment using the k-Hop algorithm. We used the public dataset [web-Google.txt](https://snap.stanford.edu/data/web-Google.html) as input and measured the time required to complete the computation across one-hop to four-hop scenarios. The experiment ran on 16 servers, each with 8 cores and 16GB memory.
@@ -181,19 +179,19 @@ k-Hop Computation Performance Comparison
## Conclusion and Future Work
-Traditional stream engines like Flink use join operators for relationship computation, which requires scanning all historical data, resulting in poor performance in large-scale associative scenarios. GeaFlow addresses this by introducing graph computing into stream processing through a stream graph framework, significantly boosting performance with incremental graph algorithms.
+Traditional stream engines like Flink use join operators for relationship computation, which requires scanning all historical data, resulting in poor performance in large-scale associative scenarios. Apache GeaFlow (Incubating) addresses this by introducing graph computing into stream processing through a stream graph framework, significantly boosting performance with incremental graph algorithms.
-GeaFlow is now open-source. We aim to build a unified lakehouse engine for graph data to support diverse associative analytics. We are also preparing to join the Apache Software Foundation to enrich the open-source big data ecosystem. If you're interested in graph technology, we welcome you to join the community.
+Apache GeaFlow (Incubating) is now open-source. We aim to build a unified lakehouse engine for graph data to support diverse associative analytics. We are also preparing to join the Apache Software Foundation to enrich the open-source big data ecosystem. If you're interested in graph technology, we welcome you to join the community.
There are many exciting tasks to explore. You can start with these beginner-friendly issues:
-- Support incremental k-Core algorithm ([Issue 466](https://github.com/TuGraph-family/tugraph-analytics/issues/466))
-- Support incremental Minimum Spanning Tree algorithm ([Issue 465](https://github.com/TuGraph-family/tugraph-analytics/issues/465))
+- Support incremental k-Core algorithm ([Issue 466](https://github.com/apache/geaflow/issues/466))
+- Support incremental Minimum Spanning Tree algorithm ([Issue 465](https://github.com/apache/geaflow/issues/465))
- ...
## References
-1. GeaFlow Project: [https://github.com/TuGraph-family/tugraph-analytics](https://github.com/TuGraph-family/tugraph-analytics)
+1. Apache GeaFlow (Incubating) Project: [https://github.com/apache/geaflow](https://github.com/apache/geaflow)
2. web-Google Dataset: [https://snap.stanford.edu/data/web-Google.html](https://snap.stanford.edu/data/web-Google.html)
-3. GeaFlow Issues: [https://github.com/TuGraph-family/tugraph-analytics/issues](https://github.com/TuGraph-family/tugraph-analytics/issues)
-4. Incremental k-Hop Source Code: [https://github.com/TuGraph-family/tugraph-analytics/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java](https://github.com/TuGraph-family/tugraph-analytics/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java)
+3. Apache GeaFlow (Incubating) Issues: [https://github.com/apache/geaflow/issues](https://github.com/apache/geaflow/issues)
+4. Incremental k-Hop Source Code: [https://github.com/apache/geaflow/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java](https://github.com/apache/geaflow/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java)
diff --git a/blog/32.md b/blog/32.md
index 9794c4eee..cb508495e 100644
--- a/blog/32.md
+++ b/blog/32.md
@@ -1,9 +1,9 @@
---
-title: "Streaming Graph Computing Engine GeaFlow v0.6.4 Released: Supports Relational Access to Graph Data, Incremental Matching Optimizes Real-Time Processing"
+title: "Streaming Graph Computing Engine Apache GeaFlow (Incubating) v0.6.4 Released: Supports Relational Access to Graph Data, Incremental Matching Optimizes Real-Time Processing"
date: April 3, 2025
---
-**March 2025** saw the release of streaming graph computing engine GeaFlow v0.6.4. This version implements multiple significant feature updates, including:
+**March 2025** saw the release of streaming graph computing engine Apache GeaFlow (Incubating) v0.6.4. This version implements multiple significant feature updates, including:
- 🍀 Experimental support for storing GeaFlow graph data in Paimon data lake
- 🍀 Enhanced graph data warehouse capabilities: Supports relational access to graph entities
@@ -15,7 +15,7 @@ date: April 3, 2025
## ✨ New Features
-### 🍀 GeaFlow Graph Storage Extended to Support Paimon Data Lake (Experimental)
+### 🍀 Apache GeaFlow (Incubating) Graph Storage Extended to Support Paimon Data Lake (Experimental)
To enhance GeaFlow's data storage system scalability, real-time processing capabilities, and cost efficiency, this update adds support for **Apache Paimon**. As a next-generation streaming data lake storage format, Paimon differs significantly in design philosophy and features from RocksDB, previously used by GeaFlow:
@@ -28,12 +28,12 @@ In this update, GeaFlow adds support for Paimon storage (currently **experimenta
- **Current limitations**: Only supports local filesystem as Paimon backend; recoverability not yet supported; dynamic graph data storage not yet supported.
- Configure the storage path via the parameter `geaflow.store.paimon.options.warehouse` (default: `"file:///tmp/paimon/"`).
-The current GeaFlow storage architecture is shown below:
+The current Apache GeaFlow (Incubating) storage architecture is shown below:

### 🍀 Graph Data Warehouse Capability Expansion: Supports Relational Access to Graph Entities
-In traditional relational databases, multi-table JOIN queries often require complex SQL statements, hindering development efficiency and struggling with performance for ad-hoc analysis of massive interconnected data. Addressing this pain point, GeaFlow introduces innovative SQL support that automatically translates complex SQL JOIN statements into graph path queries—**no Graph Query Language (GQL) needed**. This version offers two SQL syntax features:
+In traditional relational databases, multi-table JOIN queries often require complex SQL statements, hindering development efficiency and struggling with performance for ad-hoc analysis of massive interconnected data. Addressing this pain point, Apache GeaFlow (Incubating) introduces innovative SQL support that automatically translates complex SQL JOIN statements into graph path queries—**no Graph Query Language (GQL) needed**. This version offers two SQL syntax features:
1. **Querying Vertices/Edges as Source Tables:**
- The `TableScanToGraphRule` identifies vertices/edges within SQL statements, enabling users to query graph entities like standard SQL table scans.
@@ -58,7 +58,7 @@ In traditional relational databases, multi-table JOIN queries often require comp
### 🍀 Unified Memory Manager Support
-Previously, GeaFlow lacked centralized memory management. Apart from RocksDB using off-heap memory, all memory was on-heap, leading to significant GC pressure under heavy loads. Network shuffling also involved multiple data copies, reducing efficiency.
+Previously, Apache GeaFlow (Incubating) lacked centralized memory management. Apart from RocksDB using off-heap memory, all memory was on-heap, leading to significant GC pressure under heavy loads. Network shuffling also involved multiple data copies, reducing efficiency.
The new **Unified Memory Manager** governs memory allocation, release, and monitoring across modules (shuffle, state, framework) for both on-heap and off-heap memory. Key capabilities include:
- **Unified On-heap/Off-heap Management:** Abstracts memory access via `MemoryView`, shielding users from the underlying type. Off-heap chunks are pre-allocated (default chunk size: 30% of `-Xmx`, configurable via `off.heap.memory.chunkSize.MB`) and support dynamic resizing.
diff --git a/i18n/en-US/code.json b/i18n/en-US/code.json
index 7760b2e1b..fcff1bf72 100644
--- a/i18n/en-US/code.json
+++ b/i18n/en-US/code.json
@@ -12,7 +12,7 @@
"message": "OVERVIEW"
},
"product.intro.desc": {
- "message": "Apache GeaFlow (Incubating) is a distributed unified stream-batch graph computing product that supports core capabilities including mixed table-graph processing, real-time graph computing, and interactive graph analysis. It provides high availability and one-stop cloud-native development and deployment capabilities. Based on Ant Group's self-developed trillion-scale graph computing practices, GeaFlow is currently widely applied in scenarios such as data warehouse acceleration, financial risk control, knowledge graphs, and social networks. GeaFlow has now become an incubator project of the Apache Software Foundation (ASF)."
+ "message": "Apache GeaFlow (Incubating) is a distributed unified stream-batch graph computing product that supports core capabilities including mixed table-graph processing, real-time graph computing, and interactive graph analysis. It provides high availability and one-stop cloud-native development and deployment capabilities. Based on self-developed trillion-scale graph computing practices, GeaFlow is currently widely applied in scenarios such as data warehouse acceleration, financial risk control, knowledge graphs, and social networks. GeaFlow has now become an incubator project of the Apache Software Foundation (ASF)."
},
"product.repo": {
"message": "TuGraph Family"
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/27.md b/i18n/zh-CN/docusaurus-plugin-content-blog/27.md
index 8b0d9b919..8db9e51a4 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-blog/27.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-blog/27.md
@@ -3,15 +3,13 @@ title: "Stream4Graph:动态图上的增量计算"
date: "2025-3-11"
---
-
-
> 作者:张奇
众所周知,当我们需要对数据做关联性分析的时候,一般会采用表连接(SQL join)的方式完成。但是 SQL join 时的笛卡尔积计算需要维护大量的中间结果,从而对整体的数据分析性能带来巨大影响。相比而言,基于图的方式维护数据的关联性,原本的关联性分析可以转换为图上的遍历操作,从而大幅降低数据分析的成本。
然而,随着数据规模的不断增长,以及对数据处理更强的实时性需求,如何高效地解决大规模图数据上的实时计算问题,就变得越来越紧迫。传统的计算引擎,如 Spark、Flink 对于图数据的处理已经逐渐不能满足业务日益增长的诉求,因此设计一套面向大规模图数据的实时处理引擎,将会对大数据处理技术革新带来巨大的帮助。
-蚂蚁图计算团队开源的流图计算引擎[GeaFlow](https://github.com/TuGraph-family/tugraph-analytics),结合了图处理和流处理的技术优势,实现了动态图上的增量计算能力,在高性能关联性分析的基础上,进一步提升了图计算的实时性。接下来向大家介绍图计算技术的特点,业内如何解决大规模实时图计算问题,以及 GeaFlow 在动态图上的计算性能表现。
+开源的流图计算引擎[Apache GeaFlow (Incubating)](https://github.com/apache/geaflow),结合了图处理和流处理的技术优势,实现了动态图上的增量计算能力,在高性能关联性分析的基础上,进一步提升了图计算的实时性。接下来向大家介绍图计算技术的特点,业内如何解决大规模实时图计算问题,以及 Apache GeaFlow (Incubating) 在动态图上的计算性能表现。
@@ -96,25 +94,25 @@ date: "2025-3-11"
-## 4. 动态图增量计算:GeaFlow
+## 4. 动态图增量计算:Apache GeaFlow (Incubating)
-我们知道在传统的流计算引擎中,如 Flink,其处理模型允许系统能够处理不断流入的数据事件。处理每个事件时,Flink 可以评估变化并仅针对变化的部分执行计算。这意味着在增量计算过程中,Flink 会关注最新到达的数据,而不是整个数据集。于是受到 Flink 增量计算的启发,我们自研了增量图计算系统 GeaFlow(也叫流图计算引擎),能够很好的支持增量图迭代计算。
+我们知道在传统的流计算引擎中,如 Flink,其处理模型允许系统能够处理不断流入的数据事件。处理每个事件时,Flink 可以评估变化并仅针对变化的部分执行计算。这意味着在增量计算过程中,Flink 会关注最新到达的数据,而不是整个数据集。于是受到 Flink 增量计算的启发,我们自研了增量图计算系统 Apache GeaFlow (Incubating)(也叫流图计算引擎),能够很好的支持增量图迭代计算。
-那么 GeaFlow 是如何实现增量图计算的呢?首先,实时数据通过 connector 消息源输入的 GeaFlow 中,GeaFlow 依据实时数据,生成内部的点边结构数据,并且将点边数据插入进底图中。当前窗口的实时数据涉及到的点会被激活,触发图迭代计算。
+那么 Apache GeaFlow (Incubating) 是如何实现增量图计算的呢?首先,实时数据通过 connector 消息源输入到 GeaFlow 中,GeaFlow 依据实时数据,生成内部的点边结构数据,并且将点边数据插入进底图中。当前窗口的实时数据涉及到的点会被激活,触发图迭代计算。
这里以 WCC 算法为例,对联通分量算法而言,在一个时间窗口内每条边对应的 src id 和 tar id 对应的顶点会被激活,第一次迭代需要将其 id 信息通知其邻居节点。如果邻居节点收到消息后,发现需要更新自己的信息,那么它需要继续将更新消息通知给它的邻居节点;如果说邻居节点不需要更新自己的信息,那么它就不需要通知其邻居节点,它对应的迭代终止。

-## 5. GeaFlow 架构简析
+## 5. Apache GeaFlow (Incubating) 架构简析
-GeaFlow 引擎主要由三大主要部分组成,DSL、Framework 和 State,同时向上为用户提供了 Stream API、静态图 API 和动态图 API。DSL 主要负责图查询语言 SQL+ISO/GQL 的解析和执行计划的优化,同时负责 schema 的推导,也向外部承接了多种 Connector,比如 hive、hudi、kafka、odps 等。Framework 层负责运行时的调度和容灾,shuffle 以及框架内各个组件的管理协调。State 层负责存储底层图数据和数据的持久化,同时也负责索引、下推等众多性能优化工作。
+Apache GeaFlow (Incubating) 引擎主要由三大主要部分组成,DSL、Framework 和 State,同时向上为用户提供了 Stream API、静态图 API 和动态图 API。DSL 主要负责图查询语言 SQL+ISO/GQL 的解析和执行计划的优化,同时负责 schema 的推导,也向外部承接了多种 Connector,比如 hive、hudi、kafka、odps 等。Framework 层负责运行时的调度和容灾,shuffle 以及框架内各个组件的管理协调。State 层负责存储底层图数据和数据的持久化,同时也负责索引、下推等众多性能优化工作。

-## 6. GeaFlow 性能测试
+## 6. Apache GeaFlow (Incubating) 性能测试
为了验证 GeaFlow 的增量图计算性能,我们设计了这样的实验。一批数据按照固定时间窗口实时输入到计算引擎中,我们分别用 Spark 和 GeaFlow 对全图做联通分量算法计算,比较两者计算耗时。实验在 3 台 24 核内存 128G 的机器上开展,使用的数据集是公开数据集[soc-Livejournal](https://snap.stanford.edu/data/soc-LiveJournal1.html),测试的图算法是弱联通分量算法。我们以 50w 条数据作为一个计算窗口,每输入到引擎中 50w 条数据,就触发一次图计算。
@@ -214,17 +212,17 @@ RETURN vid, component
2. GeaFlow 通过增量计算避免了全量数据的重复处理,计算效率更高,计算时间更短性能不明显下降。
3. GeaFlow 支持 SQL+GQL 混合处理语言,更适合开发复杂的图数据处理任务。
-GeaFlow 项目代码已全部开源,我们完成了部分流图引擎基础能力的构建,未来希望基于 GeaFlow 构建面向图数据的统一湖仓处理引擎,以解决多样化的大数据关联性分析诉求。同时我们也在积极筹备加入 Apache 基金会,丰富大数据开源生态,因此非常欢迎对图技术有浓厚兴趣同学加入社区共建。
+Apache GeaFlow (Incubating) 项目代码已全部开源,我们完成了部分流图引擎基础能力的构建,未来希望基于 GeaFlow 构建面向图数据的统一湖仓处理引擎,以解决多样化的大数据关联性分析诉求。同时我们也在积极筹备加入 Apache 基金会,丰富大数据开源生态,因此非常欢迎对图技术有浓厚兴趣同学加入社区共建。
社区中有诸多有趣的工作尚待完成,你可以从如下简单的「Good First Issue」开始,期待你加入同行。
-- 支持 Paimon Connector 插件,连接数据湖生态。([Issue 361](https://github.com/TuGraph-family/tugraph-analytics/issues/361))
-- 优化 GQL match 语句性能。([Issue 363](https://github.com/TuGraph-family/tugraph-analytics/issues/363))
-- 新增 ISO/GQL 语法,支持 same 谓词。([Issue 368](https://github.com/TuGraph-family/tugraph-analytics/issues/368))
+- 支持 Paimon Connector 插件,连接数据湖生态。([Issue 361](https://github.com/apache/geaflow/issues/361))
+- 优化 GQL match 语句性能。([Issue 363](https://github.com/apache/geaflow/issues/363))
+- 新增 ISO/GQL 语法,支持 same 谓词。([Issue 368](https://github.com/apache/geaflow/issues/368))
- ...
## 参考链接
-1. GeaFlow 项目地址:[https://github.com/TuGraph-family/tugraph-analytics](https://github.com/TuGraph-family/tugraph-analytics)
+1. Apache GeaFlow (Incubating) 项目地址:[https://github.com/apache/geaflow](https://github.com/apache/geaflow)
2. soc-Livejournal 数据集地址:[https://snap.stanford.edu/data/soc-LiveJournal1.html](https://snap.stanford.edu/data/soc-LiveJournal1.html)
-3. GeaFlow Issues:[https://github.com/TuGraph-family/tugraph-analytics/issues](https://github.com/TuGraph-family/tugraph-analytics/issues)
+3. Apache GeaFlow (Incubating) Issues:[https://github.com/apache/geaflow/issues](https://github.com/apache/geaflow/issues)
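Reviewer note on the WCC walkthrough in section 4 of this post: the activation rule it describes (endpoints of the edges in the current window are activated, and a vertex keeps notifying its neighbors only while its own label changes) can be sketched in a few lines. This is a single-process illustration under stated assumptions, not GeaFlow's implementation: the class `IncrementalWCC`, the method `apply_window`, and the choice of the minimum vertex id as the component label are all hypothetical, and only edge insertions are handled.

```python
from collections import defaultdict


class IncrementalWCC:
    """Insert-only incremental weakly connected components (illustrative sketch)."""

    def __init__(self):
        self.adj = defaultdict(set)  # undirected adjacency of the base graph
        self.component = {}          # vertex id -> component label (smallest id seen)

    def apply_window(self, edges):
        """Insert one window of edges, then propagate labels from activated vertices only.

        Returns the number of iterations that were needed.
        """
        active = set()
        for src, dst in edges:
            self.adj[src].add(dst)
            self.adj[dst].add(src)
            self.component.setdefault(src, src)
            self.component.setdefault(dst, dst)
            active.update((src, dst))  # only endpoints of new edges are activated

        iterations = 0
        while active:
            iterations += 1
            # Each active vertex sends its current label to its neighbors.
            inbox = defaultdict(list)
            for v in active:
                for n in self.adj[v]:
                    inbox[n].append(self.component[v])
            # A vertex stays active only if a message lowered its label;
            # otherwise its part of the iteration terminates.
            active = set()
            for v, labels in inbox.items():
                best = min(labels)
                if best < self.component[v]:
                    self.component[v] = best
                    active.add(v)
        return iterations


wcc = IncrementalWCC()
wcc.apply_window([(1, 2), (3, 4)])  # two separate components
wcc.apply_window([(2, 3)])          # the new edge merges them
```

In the second window only vertices 2 and 3 are activated, and propagation touches just the vertices whose label actually changes. That is why the work per window tracks the size of the change rather than the size of the whole graph.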
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/28.md b/i18n/zh-CN/docusaurus-plugin-content-blog/28.md
index 86401fb1c..47cf295e7 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-blog/28.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-blog/28.md
@@ -3,21 +3,19 @@ title: 流图计算之增量match原理与应用
date: 2025-6-3
---
-
-
## 问题背景
在流式计算中,数据往往不是全部一批到来,而会源源不断地进行输入和计算,在图计算/图查询领域,也存在类似的场景,图的点边不断地从数据源读取,进行构图,从而形成增量图。在增量图查询中,图随时发生着变化,在不同的图版本中,进行图查询的结果也会有所不同。对于某一次新增的点边,构成了一个新的版本的图,如果重新对全图(即当前所有点边)进行图遍历,开销较大,并且也会和历史数据有重复。由于历史的数据已经计算过一遍,理想情况下,只需要对增量所影响的部分进行计算/查询,而不需要对全图重新进行查询。
-GQL(Graph Query Language)是国际标准化组织(ISO)为标准化图查询语言所制定的一个标准,用于在图上执行查询的语言。Geaflow 是蚂蚁图计算团队开源的流图计算引擎,专注于处理动态变化的图数据,支持大规模、高并发的实时图计算场景。本文将介绍在 Geaflow 引擎中,对增量图使用 GQL 进行增量 Match 的方法,目的尽可能地只对增量的数据进行查询,避免冗余的全量计算。
+GQL(Graph Query Language)是国际标准化组织(ISO)为标准化图查询语言所制定的一个标准,用于在图上执行查询的语言。Apache GeaFlow (Incubating) 是开源的流图计算引擎,专注于处理动态变化的图数据,支持大规模、高并发的实时图计算场景。本文将介绍在 Apache GeaFlow (Incubating) 引擎中,对增量图使用 GQL 进行增量 Match 的方法,目的尽可能地只对增量的数据进行查询,避免冗余的全量计算。

## 当前问题
-Geaflow 引擎基于点中心框架(vertex center),通过迭代的方式,每一轮迭代中,每个点向其他点发送消息,并在下一轮收到消息时进行处理、分析。在 Geaflow 的框架中,GQL 的查询需要从前往后进行 Traversal 遍历走图,即从起始节点开始出发,进行扩散,依次进行点边匹配,直到匹配到所需要的查询 pattern。在动态图里场景,如果只使用当前批次新增的点边触发计算,增量的结果会有缺失,例如下面例子所示。
+Apache GeaFlow (Incubating) 引擎基于点中心框架(vertex center),通过迭代的方式,每一轮迭代中,每个点向其他点发送消息,并在下一轮收到消息时进行处理、分析。在 GeaFlow 的框架中,GQL 的查询需要从前往后进行 Traversal 遍历走图,即从起始节点开始出发,进行扩散,依次进行点边匹配,直到匹配到所需要的查询 pattern。在动态图里场景,如果只使用当前批次新增的点边触发计算,增量的结果会有缺失,例如下面例子所示。

@@ -85,7 +83,7 @@ public void compute(Object vertexId, Iterator messageIterator) {
## Demo 示例
-在 Geaflow 中,通过设置点表或边表的 windowSize 来默认实现增量逻辑,即每一批读入 windowSize 大小的点边数据,来构建增量图。
+在 Apache GeaFlow (Incubating) 中,通过设置点表或边表的 windowSize 来默认实现增量逻辑,即每一批读入 windowSize 大小的点边数据,来构建增量图。
```sql
CREATE GRAPH modern (
@@ -163,4 +161,4 @@ INSERT INTO tbl_result
## 总结和展望
-在动态图/流图的场景中,图的点边是在实时变化的,在进行图查询时,对于不同窗口数据的图,我们往往可以根据一些历史信息,只对增量的部分触发计算,来进行增量地计算,避免触发全图的遍历。Geaflow 使用了一种基于子图扩展的增量 match 方法,应用于点中心分布式图计算框架,在动态图场景下进行增量的查询,未来期望实现更多更复杂场景下的增量匹配逻辑。
+在动态图/流图的场景中,图的点边是在实时变化的,在进行图查询时,对于不同窗口数据的图,我们往往可以根据一些历史信息,只对增量的部分触发计算,来进行增量地计算,避免触发全图的遍历。Apache GeaFlow (Incubating) 使用了一种基于子图扩展的增量 match 方法,应用于点中心分布式图计算框架,在动态图场景下进行增量的查询,未来期望实现更多更复杂场景下的增量匹配逻辑。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/29.md b/i18n/zh-CN/docusaurus-plugin-content-blog/29.md
index aeb5ed770..ba44ce78a 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-blog/29.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-blog/29.md
@@ -1,5 +1,5 @@
---
-title: GeaFlow 时序能力探秘——让时间数据焕发新生!
+title: Apache GeaFlow (Incubating) 时序能力探秘——让时间数据焕发新生!
date: 2025-6-25
---
@@ -28,15 +28,15 @@ date: 2025-6-25
很多工具只支持单一类型的数据分析,无法同时处理实时流数据和历史数据。
- 为了解决上述问题,GeaFlow 创新性地提出了时序图计算的概念。作为一款专为动态图数据处理设计的分布式流图计算引擎,GeaFlow 能够高效应对动态数据带来的挑战。针对实时变化的图结构,用户可以轻松进行图遍历、图匹配和图计算等操作,从而满足复杂场景下的分析需求。通过结合时间维度与动态图处理能力,GeaFlow 为实时数据分析提供了全新的解决方案,帮助用户更精准地挖掘动态数据中的价值。
+ 为了解决上述问题,Apache GeaFlow (Incubating) 创新性地提出了时序图计算的概念。作为一款专为动态图数据处理设计的分布式流图计算引擎,GeaFlow 能够高效应对动态数据带来的挑战。针对实时变化的图结构,用户可以轻松进行图遍历、图匹配和图计算等操作,从而满足复杂场景下的分析需求。通过结合时间维度与动态图处理能力,GeaFlow 为实时数据分析提供了全新的解决方案,帮助用户更精准地挖掘动态数据中的价值。
-## 什么是 GeaFlow?
+## 什么是 Apache GeaFlow (Incubating)?
-GeaFlow 是一个强大的分布式计算平台,结合了图计算和流处理的优势,能够高效处理动态图和时序数据。它不仅支持复杂的图算法,还具备实时分析能力,适用于各种动态场景。其主要特点包括:
+Apache GeaFlow (Incubating) 是一个强大的分布式计算平台,结合了图计算和流处理的优势,能够高效处理动态图和时序数据。它不仅支持复杂的图算法,还具备实时分析能力,适用于各种动态场景。其主要特点包括:
- 分布式架构
-GeaFlow 基于分布式计算框架,能够高效处理超大规模的动态图数据(例如数十亿节点和边)。通过分区和副本机制,GeaFlow 确保了系统的高可用性和可扩展性。
+Apache GeaFlow (Incubating) 基于分布式计算框架,能够高效处理超大规模的动态图数据(例如数十亿节点和边)。通过分区和副本机制,GeaFlow 确保了系统的高可用性和可扩展性。
- 流图与时序图的无缝集成
@@ -92,7 +92,7 @@ date: 2025-6-25
通过引入时间戳,时序图使得流图能够进行更复杂的分析,例如时间窗口分析、趋势预测等。
-### **4. GeaFlow 的实现细节**
+### **4. Apache GeaFlow (Incubating) 的实现细节**
GeaFlow 通过以下技术手段实现了流图与时序图的无缝结合:
@@ -432,26 +432,26 @@ a_id | e1_ts | b_id | e2_ts | c_id
### **10. 技术优势**
-- **实时性**:GeaFlow 支持毫秒级的数据流处理,确保用户关系图始终是最新的。
+- **实时性**:Apache GeaFlow (Incubating) 支持毫秒级的数据流处理,确保用户关系图始终是最新的。
- **时间敏感性:**通过时间戳字段,精确管理好友关系的时间顺序。
- **灵活性:**SQL 驱动的开发模式,降低了开发门槛,提升了开发效率。
- **可拓展性:**支持大规模动态图的增量计算,能够轻松应对社交平台的海量用户数据。
-## GeaFlow 时序能力的核心亮点
+## Apache GeaFlow (Incubating) 时序能力的核心亮点
### **1. 时间感知的数据处理**
-每条数据都带有时间戳,能够精确记录事件发生的时间。GeaFlow 支持基于时间窗口的分析,例如:
+每条数据都带有时间戳,能够精确记录事件发生的时间。Apache GeaFlow (Incubating) 支持基于时间窗口的分析,例如:
- **最近 5 分钟的趋势变化**
用户可以通过设置时间窗口,分析最近 5 分钟内的数据变化趋势。例如,在社交网络中,分析用户互动的频率变化。****
- **过去一天的动态模式**
- GeaFlow 支持长时间跨度的分析,帮助用户发现长期趋势。例如,在电商推荐系统中,分析用户在过去一天内的购买行为。
+ Apache GeaFlow (Incubating) 支持长时间跨度的分析,帮助用户发现长期趋势。例如,在电商推荐系统中,分析用户在过去一天内的购买行为。
### **2. 动态图与时序结合**
-GeaFlow 将图结构与时间维度结合,能够捕捉图中关系的演变。例如:
+Apache GeaFlow (Incubating) 将图结构与时间维度结合,能够捕捉图中关系的演变。例如:
- **社交网络中好友关系的变化**
@@ -460,7 +460,7 @@ a_id | e1_ts | b_id | e2_ts | c_id
### **3. 实时与历史数据的无缝融合**
-GeaFlow 不仅支持实时流数据的处理,还能结合历史数据进行对比分析。这种能力特别适合需要长期趋势分析和短期实时监控的场景。例如:
+Apache GeaFlow (Incubating) 不仅支持实时流数据的处理,还能结合历史数据进行对比分析。这种能力特别适合需要长期趋势分析和短期实时监控的场景。例如:
- **物联网设备监控**
@@ -469,7 +469,7 @@ a_id | e1_ts | b_id | e2_ts | c_id
### **4. 丰富的内置算法**
-GeaFlow 提供针对时序数据优化的算法,例如:
+Apache GeaFlow (Incubating) 提供针对时序数据优化的算法,例如:
- 最短路径
- 弱联通分量
@@ -485,4 +485,4 @@ a_id | e1_ts | b_id | e2_ts | c_id
## 术语****
-**DSL: **Domain-Specific Language。融合 DSL 是 GeaFlow 提供的图表一体的数据分析语言,支持标准 SQL+ISO/GQL 进行图表分析.通过融合 DSL 可以对表数据做关系运算处理,也可以对图数据做图匹配和图算法计算,同时也支持同时图表数据的联合处理。
+**DSL: **Domain-Specific Language。融合 DSL 是 Apache GeaFlow (Incubating) 提供的图表一体的数据分析语言,支持标准 SQL+ISO/GQL 进行图表分析.通过融合 DSL 可以对表数据做关系运算处理,也可以对图数据做图匹配和图算法计算,同时也支持同时图表数据的联合处理。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/30.md b/i18n/zh-CN/docusaurus-plugin-content-blog/30.md
index 874a4b2a9..5675486d6 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-blog/30.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-blog/30.md
@@ -72,7 +72,7 @@ date: 2025-5-15
### 2. 数据通道:物化数据交互能力
-类似于传统数据仓库,图数仓基于 GeaFlow 引擎能力与 TuMaker 成熟的业务平台提供数据任务编排能力,即将多个数据处理任务(如数据抽取、转换、加载等)按照一定的逻辑顺序组织起来,自动执行的过程。提供可视化界面、任务调度机制、监听事件触发、错误处理、监控与日志、版本控制与回滚、智能调度集群资源等关键能力。
+类似于传统数据仓库,图数仓基于 Apache GeaFlow (Incubating) 引擎能力与 TuMaker 成熟的业务平台提供数据任务编排能力,即将多个数据处理任务(如数据抽取、转换、加载等)按照一定的逻辑顺序组织起来,自动执行的过程。提供可视化界面、任务调度机制、监听事件触发、错误处理、监控与日志、版本控制与回滚、智能调度集群资源等关键能力。
在 Schema 转换器的加持下,可以得到从表存储到图存储的物化方案,它构建了连接传统数仓与图数仓的数据通道。基于表转图的物化方案,可以根据业务实际配置的加速表、加速关系、字段、权限等信息,全自动生成数据同步的任务编排,再通过图数仓平台调度,实现数据迁移全程无感,后续实时更新与增量同步,同步效率可达延迟十分钟级别。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/31.md b/i18n/zh-CN/docusaurus-plugin-content-blog/31.md
index 2b7980b9e..f5d0ceb4b 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-blog/31.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-blog/31.md
@@ -3,11 +3,9 @@ title: Graph4Stream:基于图的流计算加速
date: 2025-3-25
---
-
-
> 作者:坤羽;审校:东朔。
-之前在「姊妹篇」[《Stream4Graph:动态图上的增量计算》](https://zhuanlan.zhihu.com/p/27618053733)中,向大家介绍了在图计算技术中引入增量计算能力「图+流」,GeaFlow 流图计算相比 Spark GraphX 取得了显著的性能提升。那么在流计算技术中引入图计算能力「流+图」,GeaFlow 流图计算相比 Flink 关联计算性能如何呢?
+之前在「姊妹篇」[《Stream4Graph:动态图上的增量计算》](https://zhuanlan.zhihu.com/p/27618053733)中,向大家介绍了在图计算技术中引入增量计算能力「图+流」,Apache GeaFlow (Incubating) 流图计算相比 Spark GraphX 取得了显著的性能提升。那么在流计算技术中引入图计算能力「流+图」,GeaFlow 流图计算相比 Flink 关联计算性能如何呢?
当今时代,数据正以前所未有的速度和规模产生,对海量数据进行实时处理在异常检测、搜索推荐、金融交易等各个领域都有着广泛的应用。流计算作为最主要的实时数据处理技术也变得越来越重要。
@@ -19,7 +17,7 @@ date: 2025-3-25
-蚂蚁图计算团队开源的流图计算引擎GeaFlow,将图计算与流计算相结合,提供了高效的流图处理框架,大幅提升了计算性能。下面为大家介绍传统流计算引擎在关联关系计算的局限性,GeaFlow 流图计算高效的原理以及他们的性能对比。
+开源的流图计算引擎Apache GeaFlow (Incubating),将图计算与流计算相结合,提供了高效的流图处理框架,大幅提升了计算性能。下面为大家介绍传统流计算引擎在关联关系计算的局限性,GeaFlow 流图计算高效的原理以及他们的性能对比。
@@ -92,7 +90,7 @@ ON `e`.`dst` = `v`.`vid`;
Flink Join 算子实现
-## 流图计算引擎:GeaFlow
+## 流图计算引擎:Apache GeaFlow (Incubating)
### 图计算&流图
@@ -106,9 +104,9 @@ Flink Join 算子实现
-### GeaFlow 架构
+### Apache GeaFlow (Incubating) 架构
-GeaFlow 引擎的计算流程分为流数据输入、分布式增量图计算、增量结果输出几个部分。和传统的流计算引擎一样,输入的实时数据按照窗口被切分成微批。对于当前批次的数据,先按照建模策略解析成点边构成增量图。增量图和之前数据构成的历史图一道组成完整的流图。计算框架在流图上应用增量图算法得到增量结果输出,最后把增量图添加到历史图中。
+Apache GeaFlow (Incubating) 引擎的计算流程分为流数据输入、分布式增量图计算、增量结果输出几个部分。和传统的流计算引擎一样,输入的实时数据按照窗口被切分成微批。对于当前批次的数据,先按照建模策略解析成点边构成增量图。增量图和之前数据构成的历史图一道组成完整的流图。计算框架在流图上应用增量图算法得到增量结果输出,最后把增量图添加到历史图中。

@@ -116,7 +114,7 @@ GeaFlow 引擎的计算流程分为流数据输入、分布式增量图计算、
GeaFlow 计算框架是以点为中心的迭代计算模型。他以增量图中的点作为第一轮迭代的起点。在每一轮迭代中,每个点都独立维护自身的状态,根据与每个点关联的历史图和增量图完成当前迭代轮次的计算,最后将计算结果通过消息传递给邻居点,开启下一轮迭代。
-以前文中提到的 k-Hop 为例,增量算法如下:在第一轮迭代中,我们找到增量图中的所有边,将这些边作为初始的入向路径和出向路径,分别发送到他们的起点和终点。在后续的迭代中不断扩展入向路径和出向路径。当达到求取跳数时,将出向路径和入向路径发送给起点,在起点组合成最终结果。详细代码实现在开源仓库的[IncKHopAlgorithm.java](https://github.com/TuGraph-family/tugraph-analytics/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java)文件中。
+以前文中提到的 k-Hop 为例,增量算法如下:在第一轮迭代中,我们找到增量图中的所有边,将这些边作为初始的入向路径和出向路径,分别发送到他们的起点和终点。在后续的迭代中不断扩展入向路径和出向路径。当达到求取跳数时,将出向路径和入向路径发送给起点,在起点组合成最终结果。详细代码实现在开源仓库的[IncKHopAlgorithm.java](https://github.com/apache/geaflow/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java)文件中。
下图是两跳场景的描述。在第一轮迭代,增量边 B->C 分别构建入向路径和出向路径,将他们分别发送给点 B 和点 C。在第二轮迭代,B 收到入向路径,并加上当前点的入边形成 2 跳入向路径,发送给点 B。同样点 C 也收到出向路径,加上当前的出边形成 2 跳出向路径,发送给点 B。最后一轮迭代在 B 点将收到的出向和入向路径整合成新增的路径。可以看到,和 Flink 中需要查找所有的历史关系不同,GeaFlow 采用基于流图的增量图算法,计算量和图中的增量路径成正比。
@@ -187,9 +185,9 @@ RETURN ret
-## GeaFlow 性能测试
+## Apache GeaFlow (Incubating) 性能测试
-为了验证 GeaFlow 的流图计算性能,我们以k-Hop算法为例设计了和 Flink 的对比实验。我们将指定数据作为输入源输入到计算引擎中,执行k-Hop算法,并统计所有数据完成计算的时间来比较系统的性能。我们采用公开数据集[web-Google.txt](https://snap.stanford.edu/data/web-Google.html)作为输入,实验环境为 16 台 8 核 16G 的服务器,分别比较了一跳、两跳、三跳、四跳关系计算的场景。
+为了验证 Apache GeaFlow (Incubating) 的流图计算性能,我们以k-Hop算法为例设计了和 Flink 的对比实验。我们将指定数据作为输入源输入到计算引擎中,执行k-Hop算法,并统计所有数据完成计算的时间来比较系统的性能。我们采用公开数据集[web-Google.txt](https://snap.stanford.edu/data/web-Google.html)作为输入,实验环境为 16 台 8 核 16G 的服务器,分别比较了一跳、两跳、三跳、四跳关系计算的场景。
实验结果如图所示,横坐标是分别是一跳关系、两跳关系、三跳关系、四跳关系,纵坐标是处理完所有数据的耗时,采用对数指标。可以看到在一跳、两跳场景中,Flink 的性能要好于 GeaFlow,这是因为在一跳、两跳场景中参与 join 计算的数据量比较小,join 需要遍历的左表和右表都很小,遍历本身耗时短,而且 Flink 的计算框架可以缓存 join 的历史计算结果。但是到了三跳、四跳场景时候,由于计算复杂度的上升,join 算子需要遍历的表迅速膨胀,带来计算性能的急剧下降,甚至四跳场景超过一天也无法完成计算。而 GeaFlow采用基于流图增量图算法,计算耗时只和增量路径相关,和历史的关联关系计算结果无关,所以性能明显优于 Flink。
@@ -201,17 +199,17 @@ k-Hop 计算性能对比
传统的 Flink 等流计算引擎在计算关联关系时需要用到 join 算子,join 算子需要遍历全量的历史数据,这使得他们在大数据关联计算场景中性能不佳。GeaFlow 引擎通过支持流图计算框架,将图计算引入到流计算中,采用增量图计算的方法大大提升了实时数据的处理系性能。
-目前 GeaFlow 项目代码已经开源,我们希望基于 GeaFlow 构建面向图数据的统一湖仓处理引擎,以解决多样化的大数据关联性分析诉求。同时我们也在积极筹备加入 Apache 基金会,丰富大数据开源生态,因此非常欢迎对图技术有浓厚兴趣同学加入社区共建。
+目前 Apache GeaFlow (Incubating) 项目代码已经开源,我们希望基于 GeaFlow 构建面向图数据的统一湖仓处理引擎,以解决多样化的大数据关联性分析诉求。同时我们也在积极筹备加入 Apache 基金会,丰富大数据开源生态,因此非常欢迎对图技术有浓厚兴趣同学加入社区共建。
社区中有诸多有趣的工作尚待完成,你可以从如下简单的「Good First Issue」开始,期待你加入同行。
-- 支持增量 k-Core 算法。([Issue 466](https://github.com/TuGraph-family/tugraph-analytics/issues/466))
-- 支持增量最小生成树算法。([Issue 465](https://github.com/TuGraph-family/tugraph-analytics/issues/465))
+- 支持增量 k-Core 算法。([Issue 466](https://github.com/apache/geaflow/issues/466))
+- 支持增量最小生成树算法。([Issue 465](https://github.com/apache/geaflow/issues/465))
- ...
## 参考链接
-1. GeaFlow 项目地址:[https://github.com/TuGraph-family/tugraph-analytics](https://github.com/TuGraph-family/tugraph-analytics)
+1. Apache GeaFlow (Incubating) 项目地址:[https://github.com/apache/geaflow](https://github.com/apache/geaflow)
2. web-Google 数据集地址:[https://snap.stanford.edu/data/web-Google.html](https://snap.stanford.edu/data/web-Google.html)
-3. GeaFlow Issues:[https://github.com/TuGraph-family/tugraph-analytics/issues](https://github.com/TuGraph-family/tugraph-analytics/issues)
-4. 增量 k-Hop 算法实现源码:[https://github.com/TuGraph-family/tugraph-analytics/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java](https://github.com/TuGraph-family/tugraph-analytics/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java)
+3. Apache GeaFlow (Incubating) Issues:[https://github.com/apache/geaflow/issues](https://github.com/apache/geaflow/issues)
+4. 增量 k-Hop 算法实现源码:[https://github.com/apache/geaflow/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java](https://github.com/apache/geaflow/blob/master/geaflow/geaflow-dsl/geaflow-dsl-plan/src/main/java/com/antgroup/geaflow/dsl/udf/graph/IncKHopAlgorithm.java)
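Reviewer note on the incremental k-Hop description in this post: the idea that each new edge seeds an inbound and an outbound path, which are extended hop by hop and then joined, can be illustrated with a small single-process sketch. It is not the distributed, message-passing code in `IncKHopAlgorithm.java`. The class `IncrementalKHop` and its methods are hypothetical names, paths are treated as walks (vertices may repeat), and only edge insertions are handled.

```python
from collections import defaultdict


class IncrementalKHop:
    """Enumerate only the k-hop paths created by newly inserted edges (sketch)."""

    def __init__(self, k):
        self.k = k
        self.out_edges = defaultdict(set)  # vertex -> successors
        self.in_edges = defaultdict(set)   # vertex -> predecessors

    def _extend_forward(self, start, hops):
        """All walks of exactly `hops` edges that start at `start`."""
        paths = [(start,)]
        for _ in range(hops):
            paths = [p + (n,) for p in paths for n in self.out_edges[p[-1]]]
        return paths

    def _extend_backward(self, end, hops):
        """All walks of exactly `hops` edges that end at `end`."""
        paths = [(end,)]
        for _ in range(hops):
            paths = [(n,) + p for p in paths for n in self.in_edges[p[0]]]
        return paths

    def apply_window(self, edges):
        """Insert one window of edges and return the new k-hop paths as vertex tuples."""
        # Edges already in the history graph cannot create new paths.
        new_edges = {(s, d) for s, d in edges if d not in self.out_edges[s]}
        for src, dst in new_edges:
            self.out_edges[src].add(dst)
            self.in_edges[dst].add(src)

        new_paths = set()  # a set, so a path holding several new edges is kept once
        for src, dst in new_edges:
            # Place the new edge at every possible position inside a k-hop path.
            for back in range(self.k):
                for head in self._extend_backward(src, back):
                    for tail in self._extend_forward(dst, self.k - 1 - back):
                        new_paths.add(head + tail)
        return new_paths


khop = IncrementalKHop(k=2)
khop.apply_window([("A", "B"), ("C", "D")])  # no 2-hop path exists yet
added = khop.apply_window([("B", "C")])      # new edge B->C, as in the two-hop figure
```

Every enumerated path passes through at least one new edge, so the work grows with the number of newly created paths rather than with the full history. This is the property the post contrasts with a join that has to look up all historical relationships.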
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/32.md b/i18n/zh-CN/docusaurus-plugin-content-blog/32.md
index 8cb11555e..781340831 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-blog/32.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-blog/32.md
@@ -1,9 +1,9 @@
---
-title: "流式图计算引擎 GeaFlow v0.6.4 发布,支持关系型访问图数据,增量匹配优化实时处理"
+title: "流式图计算引擎 Apache GeaFlow (Incubating) v0.6.4 发布,支持关系型访问图数据,增量匹配优化实时处理"
date: 2025-4-3
---
-2025 年 3 月发布了流式图计算引擎 GeaFlow v0.6.4,新版本实现了多个重要特性更新,包括:
+2025 年 3 月发布了流式图计算引擎 Apache GeaFlow (Incubating) v0.6.4,新版本实现了多个重要特性更新,包括:
- 🍀GeaFlow 图存储扩展支持 paimon 数据湖(实验性功能)
- 🍀图数仓能力扩展:支持对图中的实体进行关系型访问
@@ -15,7 +15,7 @@ date: 2025-4-3
## ✨ 新增功能
-### 🍀GeaFlow 图存储扩展支持 paimon 数据湖(实验性功能)
+### 🍀Apache GeaFlow (Incubating) 图存储扩展支持 paimon 数据湖(实验性功能)
为提升 GeaFlow 数据存储系统的扩展性、实时数据处理能力及成本效率,本次更新加入了对 Apache Paimon 的支持。Paimon 作为新一代流式数据湖存储格式,在设计理念、功能特性上,与 GeaFlow 之前使用的 RocksDB 存在许多差异:
@@ -29,7 +29,7 @@ date: 2025-4-3
- 当前为实验性功能,仅支持使用本地文件系统作为 paimon 的存储后端,且暂不支持 recover 能力,暂不支持动态图数据存储。
- 通过配置`geaflow.store.paimon.options.warehouse`参数来指定存储路径,默认路径为"file:///tmp/paimon/"。
-当前 GeaFlow 的存储架构图如下。
+当前 Apache GeaFlow (Incubating) 的存储架构图如下。
