Description
From the Postgrex.Notifications documentation:
Note however that when the notification system is waiting for a connection, any notifications that occur during the disconnection period are not queued and cannot be recovered. Similarly, any listen command will be queued until the connection is up.
There is a race condition between starting to listen and notifications being issued "at the same time", as explained in the PostgreSQL documentation. If your application needs to keep a consistent representation of data, follow the three-step approach of first subscribing, then obtaining the current state of data, then handling the incoming notifications.
Beware that the same race condition applies to auto-reconnects. A simple way of dealing with this issue is not using the auto-reconnect feature directly, but monitoring and re-starting the Notifications process, then subscribing to channel messages over again, using the same three-step approach.
However, in the configuration we have this:
eventstore/lib/event_store/config.ex
Lines 125 to 129 in 0bf4f2e
    def postgrex_notifications_opts(config, name) do
      config
      |> session_mode_pool_config()
      |> default_postgrex_opts()
      |> Keyword.put(:auto_reconnect, true)
So essentially, some event notifications can be missed if the Postgrex.Notifications process loses its connection to the DB. Even if it reconnects automatically, notifications issued during the disconnection period are not queued, and the race condition cited above applies again on reconnect.
The Subscriptions.Subscription process is able to detect that it has missed an intermediate event, as seen here:
eventstore/lib/event_store/subscriptions/subscription_fsm.ex
Lines 142 to 146 in 0bf4f2e
    future when future > expected_event ->
      Logger.debug(describe(data) <> " received unexpected event(s), requesting catch up")
      # Missed event(s), request catch-up with any unseen events from storage
      next_state(:request_catch_up, data)
However, this logic only runs when a subsequent event is received. For quiet or not-so-busy streams, a long time can pass before a new event is emitted. During that time the application remains in a broken state, because it does not know that it has missed event notifications.
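To make the problem concrete, here is a minimal sketch of the kind of check the excerpt above performs. It is not the actual `subscription_fsm.ex` code: the module name `GapDetectionSketch`, the function `classify/2`, and the return atoms other than `:request_catch_up` are hypothetical and only there for illustration.

```elixir
defmodule GapDetectionSketch do
  # Hypothetical helper: compares an incoming event number with the last
  # event number the subscriber has processed.
  def classify(last_seen, event_number) do
    expected_event = last_seen + 1

    case event_number do
      ^expected_event -> :expected
      past when past < expected_event -> :already_seen
      # Same shape as the guard in the excerpt: a gap is only visible
      # once a later event number actually arrives.
      future when future > expected_event -> :request_catch_up
    end
  end
end

# The subscriber has processed events 1 and 2. The notification for event 3
# is lost during a reconnect, so classify/2 is never called for it.
last_seen = 2

# Only when event 4 is notified does the check run and notice the gap.
IO.inspect(GapDetectionSketch.classify(last_seen, 4))
# prints :request_catch_up
```

Between the lost notification for event 3 and the arrival of event 4, nothing invokes the check, which is the window in which a quiet stream stays stuck.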
We have experienced problems in production with subscriptions getting stuck and not processing events until they are restarted. The issue I am describing above might be one of the reasons.
To fix this, we would need to disable automatic reconnection and introduce a way to catch up, so that any missed notification can be retrieved and processed. I am not sure whether this logic can be done directly from the Notifications.Listener process in a general way, or if it has to be done from each individual subscription, depending on its last seen event.
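As a rough illustration of the approach suggested by the Postgrex documentation (monitor the notifications process, restart it, subscribe again, then read the current state), something along these lines might work. This is only a sketch under assumptions, not a proposal for the actual EventStore implementation: `ListenerMonitorSketch` and the `:start_listener`, `:catch_up`, and `:retry_ms` options are hypothetical names. In a real setup, `:start_listener` would presumably call `Postgrex.Notifications.start_link/1` with `auto_reconnect: false` and then `Postgrex.Notifications.listen/2`, and `:catch_up` would ask the subscriptions to read any unseen events from storage.

```elixir
defmodule ListenerMonitorSketch do
  use GenServer

  # Options (all names are hypothetical, for illustration only):
  #   :start_listener - zero-arity fun that starts the notifications process,
  #                     subscribes to the channel, and returns {:ok, pid}
  #   :catch_up       - zero-arity fun run after every (re)subscription
  #   :retry_ms       - delay before retrying a failed start
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # The listener is likely linked to this process (start_link), so trap
    # exits to survive its crash and rely on the monitor message instead.
    Process.flag(:trap_exit, true)

    state = %{
      start_listener: Keyword.fetch!(opts, :start_listener),
      catch_up: Keyword.fetch!(opts, :catch_up),
      retry_ms: Keyword.get(opts, :retry_ms, 1_000),
      listener: nil,
      ref: nil
    }

    {:ok, state, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state), do: {:noreply, connect(state)}

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # The listener died (for example on a lost DB connection): start over.
    {:noreply, connect(%{state | listener: nil, ref: nil})}
  end

  def handle_info(:retry, state), do: {:noreply, connect(state)}

  # Ignore everything else, including the trapped {:EXIT, pid, reason}.
  def handle_info(_msg, state), do: {:noreply, state}

  defp connect(state) do
    case state.start_listener.() do
      {:ok, pid} ->
        ref = Process.monitor(pid)
        # Step 1 (subscribe) happened inside start_listener. Step 2 is to
        # read the current state, which picks up anything missed while
        # disconnected. Step 3 is handling notifications as usual.
        state.catch_up.()
        %{state | listener: pid, ref: ref}

      {:error, _reason} ->
        Process.send_after(self(), :retry, state.retry_ms)
        state
    end
  end
end
```

Whether the catch-up is triggered from one central process like this, or by each subscription based on its own last seen event, is exactly the open question above. The sketch only shows where the hook would sit relative to the re-subscription.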