Fix: Add explicit timeouts for ElasticSearch connections #1505
Addresses socket timeout errors occurring at exactly 10 seconds for long-running ElasticSearch queries (e.g., queries with 1000+ sub-requests that take several minutes).

Root cause: After ductile PR #45 was merged, the new connection management defaults include a 10-second connection-timeout that is being reused as socket-timeout when not explicitly set. This causes intermittent failures for requests that take longer than 10 seconds.

Solution: Explicitly set timeout parameters when creating ES connections:
- socket-timeout: 600000ms (10 minutes) - allows long-running queries
- connection-timeout: 10000ms (10 seconds) - reasonable for establishing connection
- validate-after-inactivity: 5000ms (5 seconds) - prevents NoHttpResponseException

This is a temporary workaround until ctia's properties schema is updated to support these new ductile parameters (socket-timeout, connection-timeout, validate-after-inactivity) as configurable properties.

Related:
- ductile PR #45: threatgrid/ductile#45
- Symptom: Requests failing at exactly 10s with socket timeout errors
- Evidence: Some requests succeed at 16s, 24s, 28s while others fail at 10s
User-provided configuration in props should take precedence over default timeout values. This allows future flexibility if these parameters are added to the properties schema.
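The precedence rule above can be sketched with a plain map merge. This is a minimal illustration and not the actual ctia code: the names `default-es-timeouts` and `with-default-timeouts` are hypothetical, and it assumes the parameters are passed as keyword keys with values in milliseconds.

```clojure
;; Minimal sketch, not the actual ctia implementation.
;; `default-es-timeouts` and `with-default-timeouts` are hypothetical names;
;; the keyword form of the parameter names is an assumption.
(def default-es-timeouts
  {:socket-timeout            600000  ; 10 minutes, allows long-running queries
   :connection-timeout        10000   ; 10 seconds to establish a connection
   :validate-after-inactivity 5000})  ; 5 seconds

(defn with-default-timeouts
  "Returns props with the default timeouts filled in. Keys already present
  in props win, because later maps take precedence in `merge`."
  [props]
  (merge default-es-timeouts props))

;; No timeouts in props: all three defaults are added.
(with-default-timeouts {:host "localhost" :port 9200})

;; A user-provided :socket-timeout is kept; the other two defaults are added.
(with-default-timeouts {:host "localhost" :port 9200 :socket-timeout 30000})
```

Putting the defaults first in the `merge` call is what keeps the workaround compatible with a later schema change: once the properties schema exposes these parameters, configured values would override the hard-coded ones without touching this code.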
default values may be fixed in ductile.
@ereteog I don't see anything in the ductile repo since threatgrid/ductile#45 that sets socket-timeout. The issue is that, under the hood, socket-timeout is set to connection-timeout if not set explicitly. The default connection-timeout is 10000ms (10s), which is too short for socket-timeout.
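The fallback described in this comment can be modelled as follows. This is a hypothetical illustration of the reported behaviour only; it is not ductile's code, and `effective-timeouts` is an invented name.

```clojure
;; Hypothetical model of the reported behaviour, not ductile's implementation.
(defn effective-timeouts
  "Computes the timeouts that would take effect if socket-timeout falls back
  to connection-timeout, and connection-timeout defaults to 10000ms."
  [{:keys [connection-timeout socket-timeout]
    :or   {connection-timeout 10000}}]
  {:connection-timeout connection-timeout
   ;; Fallback: with no explicit socket-timeout, reuse connection-timeout.
   :socket-timeout     (or socket-timeout connection-timeout)})

;; Nothing set: socket-timeout silently becomes 10000ms.
(effective-timeouts {})

;; Explicit socket-timeout: long-running queries get the full 10 minutes.
(effective-timeouts {:socket-timeout 600000})
```

Under this model, any query that takes longer than 10 seconds to return data would hit the socket timeout unless socket-timeout is passed explicitly.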
@sayerada the defaults are currently centralized in this ns: https://github.com/threatgrid/ductile/blob/master/src/ductile/conn.clj. Any other default values can be injected there.
@ereteog You're right that the proper fix would be in ductile (adding a sensible default there). I suggest we merge this workaround first to validate that it actually fixes the issue in production. If it does, we can follow up with a proper ductile PR to centralize the default, so that other consumers benefit too. The fix here is minimal and well-documented, and the merge order ensures user config still takes precedence. LGTM 👍
Problem
After ductile PR #45 was merged, we're seeing intermittent socket timeout errors at exactly 10 seconds for long-running ElasticSearch queries.
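Evidence:
- Requests fail at exactly 10s with socket timeout errors
- Some requests succeed at 16s, 24s, 28s while others fail at 10s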
Root Cause:
The new ductile connection management defaults include a 10-second connection-timeout that appears to be reused as socket-timeout when not explicitly set. This causes failures for ES queries that take longer than 10 seconds (e.g., queries with 1000+ sub-requests that take several minutes).
Solution
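Explicitly set timeout parameters when creating ES connections:
- socket-timeout: 600000ms (10 minutes) - allows long-running queries
- connection-timeout: 10000ms (10 seconds) - reasonable for establishing connection
- validate-after-inactivity: 5000ms (5 seconds) - prevents NoHttpResponseException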
Technical Details
Modified ctia/stores/es/init.clj line 94 to pass explicit timeout configuration to ductile.conn/connect.
This is a temporary workaround until ctia's properties schema is updated to support these new ductile parameters as configurable properties (which would be the proper long-term solution).
Testing
After deployment:
Related
- ductile PR #45: threatgrid/ductile#45
Follow-up Work
For a proper long-term fix, we should:
- Update ctia/properties.clj to add schema for these parameters