
Farm Tips and Tricks

Emily Josephs edited this page Dec 8, 2017 · 14 revisions

Max Number of Command Line Arguments

Commands can take only a limited number of arguments: on many Linux systems the limit works out to 131072. If bash complains that you have too many arguments ("Argument list too long"), use xargs to split the arguments into batches of at most 131072. For example, to remove a large batch of files:

$ find . -name "name_of_file.*" | xargs -n 131072 rm

This command removes every file matching the pattern name_of_file.* (* is a wildcard character). Only name_of_file is a placeholder to substitute; -name is a find option and should be left as-is.
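If you want to check the actual limit on your own system, getconf reports it. A -print0/-0 variant of the same pipeline is also safer when filenames contain spaces or newlines; the throwaway demo directory below is a stand-in so the snippet can run anywhere:

```shell
# Report the kernel's limit on the total size of a command's argument
# list (in bytes); 131072 is the traditional value, though modern Linux
# kernels usually allow much more.
getconf ARG_MAX

# Safer variant of the cleanup pipeline, demonstrated in a throwaway
# directory: -print0/-0 handle filenames containing spaces or newlines.
# name_of_file is still a placeholder pattern.
demo="$(mktemp -d)"
touch "$demo/name_of_file.1" "$demo/name_of_file.two words"
find "$demo" -name "name_of_file.*" -print0 | xargs -0 rm
```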

If you run many I/O-intensive jobs on farm, such as MCMC runs or simulations, we suggest writing your output to disk local to the machine running your job: the folder /scratch/<username>/<jobid>. Below is why this helps.

Each node should have a /scratch that you can access and write to. Have your code write locally to /scratch, and when your script finishes, have it move the final file to your home directory (or wherever it belongs), so that you use the network only once. Make sure you do indeed move/delete the file, so you don't fill up /scratch (which has under 1 TB and is for everyone's use).

This can help quite a bit, but /scratch is a 1 TB disk made of spinning rust, so the maximum is around 100 seeks per second. Unlike your /home it is not shared over the network, but I/O-intensive workloads with many random accesses can still be faster on a laptop.

SSDs generally manage around 50,000 random I/O operations per second, while a disk of spinning rust manages around 100, so a laptop with an SSD can be some 500 times faster. In most cases, however, you can get better performance from a compute node by asking for more RAM so that the data can be cached. Unfortunately, that depends on how your application is written: if it asks for the data to be synced to disk, it can't be cached.

In any case, the first step is to try /scratch; don't forget to have your job clean up after itself. We suggest /scratch/<username>/<jobid>. Use the -p option with mkdir to make yourself a directory on /scratch/, and your script won't break if that directory already exists.
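The whole workflow can be sketched as a short job script. This is a minimal sketch assuming a SLURM scheduler, where <jobid> corresponds to the SLURM_JOB_ID environment variable ("manual" is a fallback so the sketch also runs outside a job); result.txt is a made-up output name:

```shell
#!/bin/bash
# Write node-locally, then touch the network filesystem once at the end.
# SLURM sets SLURM_JOB_ID inside a job; "manual" is a fallback for
# running this sketch outside the scheduler. result.txt is a placeholder.
SCRATCH_DIR="/scratch/$(whoami)/${SLURM_JOB_ID:-manual}"
mkdir -p "$SCRATCH_DIR"   # -p: no error if the directory already exists
cd "$SCRATCH_DIR"

# ... run the I/O-heavy analysis here, writing its output locally ...
echo "stand-in for real output" > result.txt

# One network operation to move the result home, then clean up after
# yourself so /scratch doesn't fill up.
mv result.txt "$HOME/"
cd "$HOME"
rmdir "$SCRATCH_DIR"
```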

You can check on the files you're making in /scratch by ssh-ing into the node your job is running on. Do not run jobs while ssh-ed into a node. The safest way to check on files is to log into farm and use a command like this (change bigmem9 to the node your job is running on and myfile.txt to the file you're making):

$ ssh bigmem9 'cd /scratch; ls; wc -l myfile.txt'

Shared datasets in farm

Some datasets are used by more than one member of the lab. To prevent unnecessary use of storage space through duplicate copies of the same datasets, we have a shared directory, /group/jrigrp/Share/, where we create directories and put the files we want to share (e.g. /group/jrigrp/Share/MaizeHapMapV3.2.1/).

It can be useful to include some information about the files, such as:

  • description and/or links to description
  • source (where the files were downloaded from, e.g. an iPlant path or web URL)
  • publication associated with the data
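One way to record that information is a small README dropped next to the dataset. In this sketch a throwaway directory stands in for /group/jrigrp/Share/ so the snippet can run anywhere; MyDataset and the exact field names are illustrative:

```shell
# Create a dataset directory with a README describing its contents.
# share_dir is a throwaway stand-in; in practice use /group/jrigrp/Share.
share_dir="$(mktemp -d)"
mkdir -p "$share_dir/MyDataset"
cat > "$share_dir/MyDataset/README.txt" <<'EOF'
Description: short summary of the dataset, or a link to one
Source:      where the files were downloaded from (iPlant path, web URL)
Publication: citation or DOI for the paper associated with the data
EOF
cat "$share_dir/MyDataset/README.txt"
```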
