Hi
We've been running Strelka on our Slurm cluster (local mode). In certain scenarios, the runs cause a massive spike in DNS queries, which strains the load balancer when many jobs run concurrently on different nodes. They're worried that this excess demand could negatively affect the whole cluster and have asked me to try to resolve it.
The image they sent me showing one example of this spike is below.

I couldn't find anything obvious in the Strelka code but I believe I found the culprit in pyFlow. The getHostName function in pyflowTaskWrapper.py seems to call socket.getfqdn() repeatedly while generating the log messages. I found another function elsewhere that caches this result but this one doesn't (maybe it's not possible here).
I patched our local installation so that it checks commonly used environment variables for the hostname first and only falls back on a socket call (gethostname() rather than getfqdn()) if none are set. I've tested this on our cluster and on Linux servers, and it resolves the issue. Just wanted to see if there was interest in adopting this or an alternate patch.
class StringBling(object) :
    def __init__(self, runid, taskStr) :
        def getHostName() :
            import os, socket
            return os.environ.get("SLURMD_NODENAME") or os.environ.get("HOSTNAME") or socket.gethostname()
            # import socket
            # return socket.gethostbyaddr(socket.gethostname())[0]
            #return socket.getfqdn()
        self.runid = runid
        self.taskStr = taskStr
        self.hostname = getHostName()
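
If caching turns out to be possible here, a variant along the following lines would resolve the name at most once per process, so repeated log messages would not trigger further lookups. This is only a rough sketch and not pyFlow code: get_cached_hostname and _cached_hostname are hypothetical names, and the environ parameter is there purely to make the function easy to test. It uses the same lookup order as the patch above.

```python
import os
import socket

# Hypothetical module-level cache; not part of pyFlow.
_cached_hostname = None


def get_cached_hostname(environ=None):
    """Return the hostname, resolving it at most once per process.

    Checks SLURMD_NODENAME and HOSTNAME first, then falls back to
    socket.gethostname(). The first non-empty result is stored and
    reused on every later call.
    """
    global _cached_hostname
    if _cached_hostname is None:
        env = os.environ if environ is None else environ
        _cached_hostname = (
            env.get("SLURMD_NODENAME")
            or env.get("HOSTNAME")
            or socket.gethostname()
        )
    return _cached_hostname
```

With something like this, StringBling.__init__ could set self.hostname = get_cached_hostname() and every instance created in the same process would reuse the first result.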