Repeated calls to socket.getfqdn() in getHostName #28

@rdmorin

Hi,
We've been running Strelka on our Slurm cluster (local mode). In certain scenarios, the runs cause a massive spike in DNS lookups, which affects the load balancer when many jobs run concurrently on different nodes. They're worried that this excess demand could negatively affect the whole cluster and have asked me to try to resolve it.

The image they sent me showing one example of this spike is below.
Image

I couldn't find anything obvious in the Strelka code, but I believe I found the culprit in pyFlow. The getHostName function in pyflowTaskWrapper.py seems to call socket.getfqdn() repeatedly while generating the log messages. I found another function elsewhere that caches this result, but this one doesn't (maybe caching isn't possible here).
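For reference, the caching approach mentioned above could look roughly like the sketch below. This is not pyFlow's code: `get_cached_hostname` and `_cached_hostname` are hypothetical names, and the only assumption is that the hostname does not change during the life of the process.

```python
import socket

# Hypothetical module-level cache; not a name that exists in pyFlow.
_cached_hostname = None


def get_cached_hostname(resolver=socket.getfqdn):
    """Resolve the hostname once per process and reuse the result afterwards."""
    global _cached_hostname
    if _cached_hostname is None:
        # Only the first call reaches the resolver (and therefore DNS).
        _cached_hostname = resolver()
    return _cached_hostname
```

With this pattern, every log message after the first would reuse the stored value instead of triggering a new lookup.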

I patched our local installation so that it first checks commonly used environment variables for the hostname and only then falls back on a socket call (socket.gethostname() in the patch below, rather than getfqdn()). I've tested this on our cluster and on Linux servers, and it resolves the issue. I just wanted to see whether there is interest in adopting this or an alternative patch.

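class StringBling(object) :
    def __init__(self, runid, taskStr) :
        def getHostName() :
            import os, socket
            # Prefer a hostname already present in the environment (read locally,
            # no network lookup); socket.gethostname() is the last resort.
            return os.environ.get("SLURMD_NODENAME") or os.environ.get("HOSTNAME") or socket.gethostname()
            # Earlier lookups, left commented out:
            # import socket
            # return socket.gethostbyaddr(socket.gethostname())[0]
            #return socket.getfqdn()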

        self.runid = runid
        self.taskStr = taskStr
        self.hostname = getHostName()
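To make the lookup order in the patch easier to test in isolation, here is a minimal standalone sketch of the same logic. `resolve_hostname`, its `env` parameter, and its `fallback` parameter are hypothetical names introduced for illustration; only the two environment variable names and `socket.gethostname()` come from the patch above.

```python
import os
import socket


def resolve_hostname(env=None, fallback=socket.gethostname):
    """Return the first non-empty hostname found in env, else call fallback()."""
    if env is None:
        env = os.environ
    # Same precedence as the patch: Slurm's node name first, then HOSTNAME.
    for name in ("SLURMD_NODENAME", "HOSTNAME"):
        value = env.get(name)
        if value:
            return value
    # Neither variable is set (or both are empty): ask the socket module.
    return fallback()
```

For example, `resolve_hostname({"HOSTNAME": "node07"})` returns `"node07"` without touching the socket module, while an empty mapping goes straight to the fallback.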
