System for allowing clients to run some code (currently Java and Python) on a possibly GPU accelerated server. The system is using docker with nvidia gpu pass-through. The server is can be run in two modes ether in master or in worker/slave mode. when running in master mode a worker is spawned as well so launching both is not necessary
For running in ether slave or master mode:
- docker and docker-compose has to be installed.
- If the worker is going to be gpu accelerated, it is going to need the nvidia doker toolkit.
- In the config folder rename the main.env.example -> main.env and the nginx_certbot.env.example -> nginx_certbot.env . The values in main.env needs to be filled in with whatever config you need. the nginx_certbot.env file is probably fine but you may want to uncomment the staging option while developing
- The system system_resources.yaml needs to be filled it tels the application how mutch resources you are ok with docker using.
If you want to run more than one worker you will need:
- some sort of network disk system is required to share the save data dir. chose whatever system you are comfortable with given that the application is kind of state based ish makes write conflicts practically not possible. I personally just use the nfs-kernel-server because it is easy and works. some notes on the shared dirs:
- When setting up the shared dir some of the sub dir's shold be generated to avoid ownership issues by creating them elswhere jus go to the shared dir and paste
mkdir -p ./{build_helpers,logs,run,save,send} && chown nobody:docker -R $PWDyou may have to sudo this. - It is important that the volume is mounted to the Remote-run-server/save_data Before start_(master/worker) is run.
- When setting up the shared dir some of the sub dir's shold be generated to avoid ownership issues by creating them elswhere jus go to the shared dir and paste
- A NIC with a local adress to allow other workers to connect to the host DB it is strongly recommended not to use the public interface if so it is very important to change the default password currently used
Download and decompress the release and If all the requirements above are met, just launch the start_master.sh to (big suprise) start the master and launch the start_worker.sh to start in worker mode. If you compile the project yourself you have to have the Remote-run-client installed in your local maven repo
- some whining about permission:
- this means that someone who where not supposed to have made dirs in the shared folder this can be fixed by going to the nfs host and running
sudo chown nobody:docker -R /path/to/nfs/data/save_data
- this means that someone who where not supposed to have made dirs in the shared folder this can be fixed by going to the nfs host and running
- The resource tags in
ConfigFiles/resource_tags.yamlcan be changed while running but some editors like vim and gedit does replace the file when saving instead of changing is thereby braking the docker binding to the file. This does not brake anything but it will make the key unchangeable until next restart of the master. tldr; use nano to change the resource keys in runtime