Add High Availability (HA) support #164

## Enhancement Proposal

### Abstract

Ella Core is easy to deploy and operate, but it currently cannot scale its user plane capacity or survive node failures. This specification outlines an approach to adding high availability and scaling to Ella Core. The recommended approach uses Raft to share persistent data between nodes and a modified PFCP protocol that lets the Ella Core leader unit instruct another unit to forward packets.

<img src="https://github.com/user-attachments/assets/96a17f36-dcbd-4159-a7da-56ce79a3aab7" width="500"/>

### State sharing and Leadership

Nodes share persistent data via the Raft consensus algorithm. `dqlite` will be used as the embedded, replicated SQL engine to back the persistent data across the cluster.
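
Because Raft only commits a write once a majority of voting nodes has acknowledged it, the cluster size determines how many node failures can be tolerated. The following sketch shows that arithmetic; the function names are illustrative and are not part of Ella Core or dqlite.

```python
def quorum(voters: int) -> int:
    """Smallest majority of a Raft cluster with `voters` voting members."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """Number of voters that can fail while a majority stays reachable."""
    return voters - quorum(voters)
```

With this arithmetic, three voters tolerate one failure and five tolerate two, while a two-node cluster tolerates none. This is why an odd cluster size of at least three is the usual choice.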

#### API changes

Raft Cluster
- PUT `api/v1/cluster`: Edit cluster configuration
  - enabled
  - n2_vip
- POST `api/v1/cluster/stepdown`: Step down from leadership

Raft peers
- POST `api/v1/cluster/peers`: Add a Raft peer
- GET `api/v1/cluster/peers`: List Raft peers
- DELETE `api/v1/cluster/peers/<peer id>`: Delete a Raft peer
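
To make the proposed endpoints more concrete, here is a minimal client-side sketch that builds (but does not send) one request per endpoint. Only the paths, methods, and the `enabled` and `n2_vip` field names come from this proposal. The base URL, the field value types, the peer request body, and the peer id are assumptions made for illustration, and authentication is omitted.

```python
import json
from urllib.request import Request

# Hypothetical API address; the proposal does not define one.
BASE_URL = "https://ella-core.example:5002/"


def cluster_request(method, path, body=None):
    """Build an HTTP request for a cluster endpoint without sending it."""
    data = json.dumps(body).encode() if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Assumed value types: a boolean flag and an IP address string.
update = cluster_request(
    "PUT", "api/v1/cluster", {"enabled": True, "n2_vip": "192.0.2.10"}
)
stepdown = cluster_request("POST", "api/v1/cluster/stepdown")

# The peer payload is hypothetical; the proposal does not specify it.
add_peer = cluster_request(
    "POST", "api/v1/cluster/peers", {"address": "192.0.2.12:7000"}
)
list_peers = cluster_request("GET", "api/v1/cluster/peers")
remove_peer = cluster_request("DELETE", "api/v1/cluster/peers/2")
```

Passing any of these objects to `urllib.request.urlopen` would send it; the sketch stops short of that because no server implements these endpoints yet.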

#### UI changes

We should add a new `Cluster` page to the UI.


#### A new cluster communication endpoint

Ella Core will expose a new network endpoint (using a dedicated cluster address/port) for inter-node communication.
All communication between cluster nodes will be secured using mutual TLS. Each node will have its own certificate and private key, and nodes will validate each other’s certificates before accepting connections.
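
As an illustration of what mutual TLS means in practice, the sketch below configures a TLS context that refuses any peer that does not present a valid certificate. It uses Python's standard `ssl` module purely to show the idea. The function names, the TLS 1.2 floor, and the use of a shared cluster CA file are assumptions, not decisions made in this proposal.

```python
import ssl


def mtls_context(server_side: bool) -> ssl.SSLContext:
    """Create a TLS context that requires a certificate from the peer."""
    protocol = ssl.PROTOCOL_TLS_SERVER if server_side else ssl.PROTOCOL_TLS_CLIENT
    ctx = ssl.SSLContext(protocol)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed minimum version
    # A server context does not ask for client certificates by default;
    # CERT_REQUIRED makes the handshake fail unless the peer presents one.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def load_identity(ctx: ssl.SSLContext, cert_file: str, key_file: str, ca_file: str) -> None:
    """Load this node's certificate and key, and the CA used to verify peers."""
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)
```

Each node would use a server-side context for its listening endpoint and a client-side context when dialing other nodes, so both directions of a connection are authenticated.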

### User Plane Selection with a modified PFCP protocol

As the number of Core nodes increases, the user plane capacity should also increase. To implement User Plane scaling, the leader unit should select which unit will handle the user plane traffic for a given session.
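
This proposal does not specify how the leader chooses a unit. One possible policy, shown here only as a sketch, is to pick the healthy node with the fewest active sessions. The `Node` type, its fields, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    healthy: bool
    active_sessions: int


def select_user_plane_node(nodes):
    """Return the healthy node with the fewest active sessions.

    Ties are broken by node id so the choice is deterministic.
    """
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy node available for the user plane")
    return min(candidates, key=lambda n: (n.active_sessions, n.node_id))
```

Other policies, such as weighting by throughput or pinning a subscriber to a node, would fit behind the same function signature.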

In 5G networks, the PFCP protocol is used between the SMF and the UPF to manage PDU session tunnels in the UPF. Here, we propose a simplified PFCP protocol over HTTPS, used between the leader and follower units. Only the "session" part of the protocol needs to be implemented, because "associations" are implied by nodes already being part of the cluster.
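
To give a feel for what session messages carried over HTTPS might look like, the sketch below encodes two session messages as JSON bodies and decodes them again. The message names mirror PFCP's session establishment and deletion requests, but the JSON envelope and the flattened fields are assumptions. Real PFCP describes forwarding with packet detection and forwarding rules, which this sketch leaves out.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SessionEstablishmentRequest:
    seid: int         # session endpoint identifier, as in PFCP
    ue_ip: str        # hypothetical flattened field
    uplink_teid: int  # hypothetical flattened field


@dataclass
class SessionDeletionRequest:
    seid: int


MESSAGE_TYPES = {
    cls.__name__: cls
    for cls in (SessionEstablishmentRequest, SessionDeletionRequest)
}


def encode(msg) -> bytes:
    """Serialize a session message into a JSON request body."""
    envelope = {"type": type(msg).__name__, "body": asdict(msg)}
    return json.dumps(envelope, sort_keys=True).encode()


def decode(raw: bytes):
    """Rebuild a session message from a JSON request body."""
    envelope = json.loads(raw)
    return MESSAGE_TYPES[envelope["type"]](**envelope["body"])
```

In this shape, the leader would send `encode(...)` as the body of an HTTPS request to the selected follower, and the follower would call `decode(...)` before programming its data path.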

## Further Information

### Load Balancing

Load balancing is an optional part of scaling. Users may place an HTTPS load balancer in front of the nodes' API services so that they always reach the leader node via the same address, even if the leader changes. An NGAP load balancer can also be used between the radios and the Ella Core units so that gNodeBs always send signaling information to the leader node.

<img src="https://github.com/user-attachments/assets/7714ef55-efb2-49d7-8916-c4844f333d02" width="500"/>

### Reference

- Raft: https://raft.github.io/
- dqlite: https://github.com/canonical/dqlite
- Vault raft configuration: https://developer.hashicorp.com/vault/docs/configuration/storage/raft
