Add Test Suite for Long Running Tests #314
Conversation
yacovm left a comment:
Made a quick pass; I think it's pretty useful overall. We can build something on top of this that generates random test cases and just runs them.
Will make another pass later.
	}
}

func (n *LongRunningInMemoryNetwork) waitUntilAllRoundsEqual() {
Can't this function return even if no blocks were committed, e.g. at round 0?
Shouldn't we perhaps pass in some kind of predicate on the round / sequence?
Hmm, I'm using it as more of a helper function for StopAndAssert, but I can see a predicate being helpful if we decide to expose this function.
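For illustration, a minimal sketch of what a predicate-based variant could look like if the helper is ever exposed. The `Round()` accessor, the polling loop, and the method name are assumptions, not the PR's actual code:

```go
// Sketch only: a predicate-driven variant of waitUntilAllRoundsEqual.
// Assumes each instance exposes a Round() accessor (hypothetical) and that
// polling with a short sleep is acceptable in the test harness.
func (n *LongRunningInMemoryNetwork) waitUntilAll(pred func(round uint64) bool) {
	for {
		allSatisfied := true
		for _, instance := range n.Instances {
			if !pred(instance.Round()) { // hypothetical accessor
				allSatisfied = false
				break
			}
		}
		if allSatisfied {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// Example: do not return at round 0 — only once every node has advanced past it.
// n.waitUntilAll(func(round uint64) bool { return round > 0 })
```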
}

// AssertHealthy checks that the WAL has at most one of each record type per round.
func (tw *TestWAL) AssertHealthy(bd simplex.BlockDeserializer, qcd simplex.QCDeserializer) {
Can't we have both a notarization and an empty notarization in the WAL, via replicating one of them?
Yeah, we can have one of each for the same round, but we should never have two records of the same type for the same round.
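The invariant under discussion can be sketched as follows. This is illustrative only: `walRecord` and its fields are stand-ins for whatever the TestWAL yields after deserialization via the simplex deserializers, and the helper name is hypothetical.

```go
// walRecord is an illustrative stand-in for a deserialized WAL entry.
type walRecord struct {
	Round uint64
	Type  string // e.g. "notarization", "emptyNotarization", "finalization"
}

type roundAndType struct {
	round      uint64
	recordType string
}

// assertAtMostOnePerType fails if any (round, record type) pair appears more
// than once; a notarization and an empty notarization for the same round are
// therefore still allowed, matching the exchange above.
func assertAtMostOnePerType(t *testing.T, records []walRecord) {
	seen := make(map[roundAndType]struct{})
	for _, r := range records {
		key := roundAndType{round: r.Round, recordType: r.Type}
		if _, ok := seen[key]; ok {
			t.Fatalf("duplicate %s record for round %d", r.Type, r.Round)
		}
		seen[key] = struct{}{}
	}
}
```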
long_running_test.go (Outdated)
func TestLongRunningReplication(t *testing.T) {
	net := testutil.NewDefaultLongRunningNetwork(t, 10)
	for _, instance := range net.Instances {
		instance.SilenceExceptKeywords("Received replication response", "Resending replication requests for missing rounds/sequences")
I don't understand why we're doing this. There is nothing in the test that I see that requires intercepting the log, so why do we care that it's printed?
Can you explain?
I'll remove it. It was there to help debug the test, same with the other comment you had below.
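For context, the idea behind a keyword filter like SilenceExceptKeywords can be sketched as below. The struct, field names, and `Log` signature are assumptions for illustration; the real helper lives on the test instance/logger in the PR.

```go
// Minimal sketch of keyword-based log silencing for a test logger.
type filteringLogger struct {
	silenced bool
	keywords []string
	inner    *log.Logger
}

// SilenceExceptKeywords drops every log line that does not contain one of the keywords.
func (l *filteringLogger) SilenceExceptKeywords(keywords ...string) {
	l.silenced = true
	l.keywords = keywords
}

func (l *filteringLogger) Log(msg string) {
	if !l.silenced {
		l.inner.Println(msg)
		return
	}
	for _, kw := range l.keywords {
		if strings.Contains(msg, kw) {
			l.inner.Println(msg)
			return
		}
	}
	// everything else is dropped while silenced
}
```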
	l *TestLogger
	t *testing.T
	BB ControlledBlockBuilder
	messageTypesSent map[string]uint64
We're only incrementing the message type counters but never reading them... why do we need this?
I accidentally left the code that prints messageTypesSent commented out. Relates to this comment (#314 (comment)); it was useful to print the message types for debugging purposes.
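A sketch of how such a counter can be used for debugging is below. `testInstance`, `recordSend`, and `messageType()` are hypothetical names standing in for the harness's instance type and a helper that names the populated field of a simplex.Message; only the `messageTypesSent` and `t` fields come from the diff above.

```go
// recordSend bumps the per-type counter every time a message is handed to the
// in-memory network (delivery itself omitted in this sketch).
func (ti *testInstance) recordSend(msg *simplex.Message) {
	ti.messageTypesSent[messageType(msg)]++
}

// logMessageCounts dumps the counters, which is handy when debugging
// replication traffic in a long-running test.
func (ti *testInstance) logMessageCounts() {
	for msgType, count := range ti.messageTypesSent {
		ti.t.Logf("sent %d %q messages", count, msgType)
	}
}
```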
	msg *simplex.Message
	from simplex.NodeID
-	}, 1000)}
+	}, 100000)}
Do the tests not work with the previous buffer size?
Some tests were getting flaky at 1000, so I bumped it.
How is that possible? We increment the time only 10 times per second 🤷♂️
}

amount := simplex.DefaultEmptyVoteRebroadcastTimeout / 5
go n.UpdateTime(100*time.Millisecond, amount)
UpdateTime iterates over the instances without taking a lock, but we may re-create an instance via RestartNodes. Isn't that unsafe from a concurrency standpoint?
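One way to remove that race would be to guard the instance slice with a mutex shared by UpdateTime and RestartNodes; a hedged sketch follows. The `lock` field, the `stop` channel, and the per-instance `AdvanceTime` method are assumptions about the harness, not the PR's actual code.

```go
// Sketch: advance time on a ticker while holding a read lock, so a concurrent
// RestartNodes (which would take the write lock) cannot swap instances mid-iteration.
func (n *LongRunningInMemoryNetwork) UpdateTime(interval, amount time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-n.stop: // hypothetical channel closed when the network shuts down
			return
		case <-ticker.C:
			n.lock.RLock() // hypothetical sync.RWMutex protecting n.Instances
			for _, instance := range n.Instances {
				instance.AdvanceTime(amount) // hypothetical per-instance clock bump
			}
			n.lock.RUnlock()
		}
	}
}
```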
}

func (n *LongRunningInMemoryNetwork) RestartNodes(nodeIndexes ...uint64) {
	for _, idx := range nodeIndexes {
Don't we need to stop the previous instance? I can't find where we're doing it.
CrashNodes calls Stop on the instances. I guess there is currently no way for the epoch to know whether we have already called Stop on it. Should the LongRunningInMemoryNetwork struct keep track of whether we already stopped an instance and check it here?
Just stop the nodes in the restart. You never restart the nodes alongside a crash, so it should be fine.
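The suggested fix can be sketched like this: stop the old instance inside RestartNodes before creating its replacement. `Stop`, `newInstance`, and the `lock` field are assumptions about the surrounding harness, not the PR's exact code.

```go
// Sketch: shut down the previous epoch before swapping in a fresh instance,
// holding the (hypothetical) write lock so UpdateTime cannot observe a
// half-replaced slice entry.
func (n *LongRunningInMemoryNetwork) RestartNodes(nodeIndexes ...uint64) {
	n.lock.Lock()
	defer n.lock.Unlock()
	for _, idx := range nodeIndexes {
		n.Instances[idx].Stop()               // stop the previous instance first
		n.Instances[idx] = n.newInstance(idx) // hypothetical: rebuild from the same config/WAL
	}
}
```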
This PR adds functionality for a different style of testing. It allows the tester to spin up tests where they control the behavior of the network rather than the behavior of individual nodes. This framework can test Simplex in ways that would otherwise have been too tedious and cumbersome to cover. For example, a new, simple test allows us to spin up a network of 10 nodes, wait for them to enter a specific round, disconnect a few of them, and then reconnect them at a later time. At the end of the test, we assert that all nodes are functioning properly.
Before, this would have required a tremendous amount of boilerplate code and careful orchestration of replication, block building, etc.
This new struct wraps InMemNetwork, so all of the previous InMemNetwork functionality can still be used in these tests; on top of that, we add an additional set of helper functions.
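To illustrate the testing style described above, here is a hedged sketch of such a test. NewDefaultLongRunningNetwork, CrashNodes, RestartNodes, and StopAndAssert appear in the PR discussion, but their exact signatures are not confirmed here; `WaitUntilRound` is a hypothetical helper used to express "wait for all nodes to reach a specific round".

```go
// Illustrative sketch only; helper names and signatures are assumptions.
func TestDisconnectAndReconnect(t *testing.T) {
	net := testutil.NewDefaultLongRunningNetwork(t, 10) // spin up a 10-node network

	net.WaitUntilRound(5)     // hypothetical: wait until every node reaches round 5
	net.CrashNodes(7, 8, 9)   // disconnect a few nodes
	net.WaitUntilRound(15)    // let the remaining nodes keep making progress
	net.RestartNodes(7, 8, 9) // reconnect them; they should catch up via replication

	net.StopAndAssert() // assert all nodes are healthy and agree on what was committed
}
```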