Skip to content

Latest commit

 

History

History
240 lines (185 loc) · 8.76 KB

File metadata and controls

240 lines (185 loc) · 8.76 KB

AGENTS.md

This file provides guidance to AI coding assistants when working with code in this repository.

Repository Overview

This is the Apache HugeGraph-Computer repository containing two distinct graph computing systems:

  1. computer (Java/Maven): A distributed BSP/Pregel-style graph processing framework that runs on Kubernetes or YARN
  2. vermeer (Go): A high-performance in-memory graph computing platform with master-worker architecture

Both integrate with HugeGraph for graph data input/output.

Build & Test Commands

Computer (Java)

Prerequisites:

  • JDK 11 for building/running
  • JDK 8 for HDFS dependencies
  • Maven 3.5+
  • For K8s module: run mvn clean install first to generate CRD classes under computer-k8s

Build:

cd computer
mvn clean compile -Dmaven.javadoc.skip=true

Tests:

# Unit tests
mvn test -P unit-test

# Integration tests
mvn test -P integrate-test

Run single test:

# Run specific test class
mvn test -P unit-test -Dtest=ClassName

# Run specific test method
mvn test -P unit-test -Dtest=ClassName#methodName

License check:

mvn apache-rat:check

Package:

mvn clean package -DskipTests

Vermeer (Go)

Prerequisites:

  • Go 1.23+
  • curl and unzip (for downloading binary dependencies)

First-time setup:

cd vermeer
make init  # Downloads supervisord and protoc binaries, installs Go deps

Build:

make          # Build for current platform
make build-linux-amd64
make build-linux-arm64

Development build with hot-reload UI:

go build -tags=dev

Clean:

make clean      # Remove built binaries and generated assets
make clean-all  # Also remove downloaded tools

Run:

# Using binary directly
./vermeer --env=master
./vermeer --env=worker

# Using script (configure in vermeer.sh)
./vermeer.sh start master
./vermeer.sh start worker

Regenerate protobuf (if proto files changed):

go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.28.0
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@v1.2.0

# Generate (adjust protoc path for your platform)
vermeer/tools/protoc/linux64/protoc vermeer/apps/protos/*.proto --go-grpc_out=vermeer/apps/protos/. --go_out=vermeer/apps/protos/. # please note remove license header if any

Architecture

Computer (Java) - BSP/Pregel Framework

Module Structure:

  • computer-api: Public interfaces for graph processing (Computation, Vertex, Edge, Aggregator, Combiner, GraphFactory)
  • computer-core: Runtime implementation (WorkerService, MasterService, messaging, BSP coordination, managers)
  • computer-algorithm: Built-in algorithms (PageRank, LPA, WCC, SSSP, TriangleCount, etc.)
  • computer-driver: Job submission and driver-side coordination
  • computer-k8s: Kubernetes deployment integration
  • computer-yarn: YARN deployment integration
  • computer-k8s-operator: Kubernetes operator for job management
  • computer-dist: Distribution packaging
  • computer-test: Integration and unit tests

Key Design Patterns:

  1. API/Implementation Separation: Algorithms depend only on computer-api interfaces; computer-core provides runtime implementation. Algorithms are dynamically loaded via config.

  2. Manager Pattern: WorkerService composes multiple managers (MessageSendManager, MessageRecvManager, WorkerAggrManager, DataServerManager, SortManagers, SnapshotManager, etc.) with lifecycle hooks: initAll(), beforeSuperstep(), afterSuperstep(), closeAll().

  3. BSP Coordination: Explicit barrier synchronization via etcd (EtcdBspClient). Each superstep follows:

    • workerStepPrepareDonewaitMasterStepPrepareDone
    • Local compute (vertices process messages)
    • workerStepComputeDonewaitMasterStepComputeDone
    • Aggregators/snapshots
    • workerStepDonewaitMasterStepDone (master returns SuperstepStat)
  4. Computation Contract: Algorithms implement Computation<M extends Value>:

    • compute0(context, vertex): Initialize at superstep 0
    • compute(context, vertex, messages): Process messages in subsequent supersteps
    • Access to aggregators, combiners, and message sending via ComputationContext

Important Files:

  • Algorithm contract: computer/computer-api/src/main/java/org/apache/hugegraph/computer/core/worker/Computation.java
  • Runtime orchestration: computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/worker/WorkerService.java
  • BSP coordination: computer/computer-core/src/main/java/org/apache/hugegraph/computer/core/bsp/Bsp4Worker.java
  • Example algorithm: computer/computer-algorithm/src/main/java/org/apache/hugegraph/computer/algorithm/centrality/pagerank/PageRank.java

Vermeer (Go) - In-Memory Computing Engine

Directory Structure:

  • algorithms/: Go algorithm implementations (pagerank.go, sssp.go, louvain.go, etc.)
  • apps/:
    • bsp/: BSP coordination helpers
    • graphio/: HugeGraph I/O adapters (reads via gRPC to store/pd, writes via HTTP REST)
    • master/: Master scheduling, HTTP endpoints, worker management
    • compute/: Worker-side compute logic
    • protos/: Generated protobuf/gRPC definitions
    • common/: Utilities, logging, metrics
  • client/: Client libraries
  • tools/: Binary dependencies (supervisord, protoc)
  • ui/: Web UI assets

Key Patterns:

  1. Maker/Registry Pattern: Graph loaders/writers register themselves via init() (e.g., LoadMakers[LoadTypeHugegraph] = &HugegraphMaker{}). Master selects loader by type.

  2. HugeGraph Integration:

    • hugegraph.go implements HugegraphMaker, HugegraphLoader, HugegraphWriter
    • Queries PD via gRPC for partition metadata
    • Streams vertex/edge data via gRPC from store (ScanPartition)
    • Writes results back via HugeGraph HTTP REST API
  3. Master-Worker: Master schedules LoadPartition tasks to workers, manages worker lifecycle via WorkerManager/WorkerClient, exposes HTTP admin endpoints.

Important Files:

  • HugeGraph integration: vermeer/apps/graphio/hugegraph.go
  • Master scheduling: vermeer/apps/master/tasks/tasks.go
  • Worker management: vermeer/apps/master/workers/workers.go
  • HTTP endpoints: vermeer/apps/master/services/http_master.go
  • Scheduler: vermeer/apps/master/bl/scheduler_bl.go

Integration with HugeGraph

Computer (Java):

  • WorkerInputManager reads vertices/edges from HugeGraph via GraphFactory abstraction
  • Graph data is partitioned and distributed to workers via input splits

Vermeer (Go):

  • Directly queries HugeGraph PD (metadata service) for partition information
  • Uses gRPC to stream graph data from HugeGraph store
  • Writes computed results back via HugeGraph HTTP REST API (adds properties to vertices)

Development Workflow

Adding a New Algorithm (Computer):

  1. Create class in computer-algorithm implementing Computation<MessageType>
  2. Implement compute0() for initialization and compute() for message processing
  3. Use context.sendMessage() or context.sendMessageToAllEdges() for message passing
  4. Register aggregators in beforeSuperstep(), read/write in compute()
  5. Configure algorithm class name in job config

K8s-Operator Development:

  • CRD classes are auto-generated; run mvn clean install in computer-k8s-operator first
  • Generated classes appear in computer-k8s/target/generated-sources/
  • CRD generation script: computer-k8s-operator/crd-generate/Makefile

Vermeer Asset Updates:

  • Web UI assets must be regenerated after changes: cd asset && go generate
  • Or use make generate-assets from vermeer root
  • For dev mode with hot-reload: go build -tags=dev

Testing Notes

Computer:

  • Integration tests require etcd, HDFS, HugeGraph, and Kubernetes (see .github/workflows/computer-ci.yml)
  • Test environment setup scripts in computer-dist/src/assembly/travis/
  • Unit tests run in isolation without external dependencies

Vermeer:

  • Test scripts in vermeer/test/,with vermeer_test.go and vermeer_test.sh
  • Configuration files in vermeer/config/ (master.ini, worker.ini templates)

CI/CD

CI pipeline (.github/workflows/computer-ci.yml) runs:

  1. License check (Apache RAT)
  2. Setup HDFS (Hadoop 3.3.2)
  3. Setup Minikube/Kubernetes
  4. Load test data into HugeGraph
  5. Compile with Java 11
  6. Run integration tests (-P integrate-test)
  7. Run unit tests (-P unit-test)
  8. Upload coverage to Codecov

Important Notes

  • Computer K8s module: Must run mvn clean install before editing to generate CRD classes
  • Java version: Build requires JDK 11; HDFS dependencies require JDK 8
  • Vermeer binary deps: First-time builds need make init to download supervisord/protoc
  • BSP coordination: Computer uses etcd for barrier synchronization (configure via BSP_ETCD_URL)
  • Memory management: Both systems auto-manage memory by spilling to disk when needed