##Getting started
#####g++
If you don't have g++ 4.8 or higher installed on your machines, install it as follows:
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get update -y
sudo apt-get install -y g++-4.8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 50
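Before installing anything, it may be worth checking which g++ version is already on the machine. The following is a minimal sketch, not part of GraphSim: the helper name `version_ge` is made up for illustration, and it assumes GNU `sort` with the `-V` option, which Ubuntu provides.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A is greater than or equal to version B.
# Hypothetical helper; relies on GNU sort's -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="4.8"
if command -v g++ >/dev/null 2>&1; then
    installed="$(g++ -dumpversion)"
    if version_ge "$installed" "$required"; then
        echo "g++ $installed is new enough"
    else
        echo "g++ $installed is older than $required"
    fi
else
    echo "g++ is not installed"
fi
```

If the script reports an older version or no compiler at all, the installation commands above apply.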
#####MPICH
If MPICH 3.1 is not installed on the cluster, first make sure that password-less ssh access works between all the nodes of the cluster. Here is a guide to help you out.
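One way to confirm that password-less access is in place is to attempt a non-interactive login to every host. The sketch below is an illustration only: the function name `check_ssh` is hypothetical, and it assumes a text file with one hostname per line. It uses the OpenSSH options `BatchMode` and `ConnectTimeout`.

```shell
#!/usr/bin/env bash
# check_ssh FILE: try a non-interactive login to every host listed in FILE
# (one hostname per line) and report the hosts that cannot be reached
# without a password. Hypothetical helper, not part of GraphSim.
check_ssh() {
    local failed=0 host
    while IFS= read -r host || [ -n "$host" ]; do
        [ -z "$host" ] && continue      # skip blank lines
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from consuming the rest of the host list on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done < "$1"
    return "$failed"
}

# Usage, to be repeated on each node: check_ssh /path/to/hostfile
```

The function returns a non-zero status if any host fails, so it can be used as a gate in a setup script.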
Once password-less ssh is set up, follow these steps to install MPICH on each node of the cluster:
wget http://www.mpich.org/static/downloads/3.1.4/mpich-3.1.4.tar.gz
tar -xzf mpich-3.1.4.tar.gz
cd mpich-3.1.4/
./configure --prefix=/usr/local/mpich --disable-fortran
make
sudo make install
You can use any other path for the prefix as long as the same path is used on all nodes of the cluster. Add the following lines to the ~/.bashrc file:
export MPICH_HOME=/usr/local/mpich
export PATH=$PATH:$MPICH_HOME/bin
#####zlib
Use the following command to install the zlib compression library:
sudo apt-get install zlib1g-dev
#####Threading Building Blocks
Intel's TBB library can be installed using:
sudo apt-get install libtbb-dev
##Configuring nodes
- Edit the machines file so that it contains the names of all the hosts on the cluster. The machines file should contain the hostnames alone, not the number of CPUs present at each host.
- Replace all instances of ~/GraphSim/ in the mpirsync file with the absolute path of GraphSim on your machines.
- To ensure GraphSim can process large datasets, add the following lines to the /etc/security/limits.conf file on all the nodes (replace username with the account that runs GraphSim):
username hard nofile 64000
username soft nofile 64000
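The configuration steps above can be scripted and checked. The following is a rough sketch under stated assumptions: the function names `rewrite_mpirsync` and `check_machines` are hypothetical, the machines file is assumed to hold one hostname per line, and GNU `sed` is assumed for the `-i` option. A `host:count` line is treated as an error because that is how MPICH host files express per-host process counts.

```shell
#!/usr/bin/env bash
# rewrite_mpirsync FILE DIR: replace every "~/GraphSim/" in FILE with DIR/.
# "|" is the sed delimiter because paths contain "/"; DIR must not contain
# "|" or "&". Hypothetical helper, not part of GraphSim.
rewrite_mpirsync() {
    sed -i "s|~/GraphSim/|${2%/}/|g" "$1"
}

# check_machines FILE: fail if any line contains a colon or whitespace,
# i.e. anything other than a bare hostname.
check_machines() {
    if grep -nE ':|[[:space:]]' "$1"; then
        echo "machines file should list bare hostnames only" >&2
        return 1
    fi
}

# After editing limits.conf and logging in again, the new limits should show up here.
echo "open-file limit: soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
```

For example, `rewrite_mpirsync ./mpirsync /home/alice/GraphSim` would point the script at a checkout under a hypothetical `/home/alice` directory.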
##Understanding file formats
- The vertex file consists of a JSONArray. The index of each element corresponds to the id of the vertex. Each element in the array is a JSONObject whose key-value pairs are the attributes of the vertex.
- The edge file represents the structure of the data graph. It must be in one of the following formats:
  - Edge List
  - Adjacency List
  - Binary Edge List
  - METIS
- The query file is a JSONObject containing two key-value pairs:
  - The key "node" is mapped to a JSONArray similar to the vertex file; it represents the nodes in the query graph.
  - The key "edge" is mapped to a JSONArray where each element is a JSONObject with two key-value pairs: "source" and "target" are mapped to the source and target of an edge in the query graph.
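To make the JSON layouts concrete, the sketch below writes a tiny vertex file and a matching query file to a temporary directory. The file names, the "label" attribute, and all values are made up for illustration. It also assumes that "source" and "target" hold integer indices into the "node" array, which the description above suggests but does not state outright.

```shell
#!/usr/bin/env bash
# Write a tiny vertex file and a matching query file.
# File names, the "label" attribute and its values are hypothetical.
dir="$(mktemp -d)"

# Vertex file: element 0 describes vertex 0, element 1 describes vertex 1, ...
cat > "$dir/sample_node.json" <<'EOF'
[
  {"label": "a"},
  {"label": "b"},
  {"label": "a"}
]
EOF

# Query file: two query nodes joined by one edge.
# Assumption: "source" and "target" are indices into the "node" array.
cat > "$dir/sample_query.json" <<'EOF'
{
  "node": [
    {"label": "a"},
    {"label": "b"}
  ],
  "edge": [
    {"source": 0, "target": 1}
  ]
}
EOF

echo "wrote $dir/sample_node.json and $dir/sample_query.json"
```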
##Compiling & syncing
Use the make command in the ./GraphSim directory of the master machine (first node listed in the machines file) to compile GraphSim.
To sync all the machines on the cluster, run the ./GraphSim/mpirsync script on the master machine in the following directories:
- GraphSim/bin/src/ (only once)
- GraphSim/conf/ (only once)
- GraphSim/datasets/<dataset_name>/ (every time a dataset is added or altered)
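The sync step can be wrapped in a small loop. This is only a sketch under assumptions that the text above does not confirm: that mpirsync takes no arguments and syncs the current working directory, and that GraphSim is checked out at `$HOME/GraphSim`. The function name `sync_all` and the `DRY_RUN` variable are hypothetical. By default the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# sync_all ROOT DATASET: run mpirsync inside each directory that has to be
# synced. ROOT must be an absolute path. With DRY_RUN=1 (the default here)
# the commands are only printed. Hypothetical helper, not part of GraphSim.
sync_all() {
    local root="${1%/}" dataset="$2" dir
    for dir in bin/src conf "datasets/$dataset"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "(cd $root/$dir && $root/mpirsync)"
        else
            # Assumption: mpirsync syncs the current working directory.
            (cd "$root/$dir" && "$root/mpirsync") || return 1
        fi
    done
}

sync_all "$HOME/GraphSim" polblogs
```

Since bin/src/ and conf/ only need to be synced once, later runs could be limited to the dataset directory.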
Here is the command used to run GraphSim on the Political-blogs dataset.
mpiexec -n 1 ./bin/src/GraphSim_master file ./datasets/polblogs/polblogs_edge.txt vertexfile ./datasets/polblogs/polblogs_node.json queryfile ./datasets/polblogs/query_polblogs.json edges 20000 vertices 1500 filetype edgelist memory 800 outputfile output_polblogs.txt
The following parameters are mandatory for running GraphSim:
- file - Path to the edge file of the data graph.
- vertexfile - Path to the vertex file of the data graph.
- queryfile - Path to the query file.
- vertices - Number of vertices in the data graph.
- edges - Number of edges in the data graph.
Note: Fair approximations of the number of vertices and edges suffice, because these parameters only help the GraphSim engine predict the optimal number of machines for the job. If they are omitted, all the machines on the cluster are used.
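Since approximate counts are enough, a rounded line count can stand in for the number of edges. The sketch below assumes a text edge list with one edge per line, which is an assumption about the format rather than something documented here. The helper names `round_up` and `estimate_edges` are hypothetical.

```shell
#!/usr/bin/env bash
# round_up N STEP: round N up to the next multiple of STEP.
round_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# estimate_edges FILE: approximate edge count of a text edge list,
# assuming one edge per line (an assumption about the file format).
estimate_edges() {
    local lines
    lines="$(wc -l < "$1")"
    round_up "$lines" 1000
}

# Usage: estimate_edges ./datasets/polblogs/polblogs_edge.txt
```

For instance, `round_up 19025 1000` prints `20000`.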
The following parameters are optional and help optimize GraphSim:
- filetype - One of the 4 edge file types. This parameter is mandatory for the first run of each dataset.
- memory - Memory available (in MB) at each machine. The default is set to 2000 MB.
- outputfile - The output of GraphSim can be redirected to the specified file. In the absence of this parameter, the output is stored in <edgefile_path>_output.txt.
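With this many key/value arguments, it can help to assemble the command in a bash array so each parameter sits on its own line. The sketch below rebuilds the Political-blogs command shown earlier. The `RUN` variable is a hypothetical switch added for illustration, and the command is only printed unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Collect the key/value arguments in an array, one parameter per line.
dataset=./datasets/polblogs
args=(
    file        "$dataset/polblogs_edge.txt"
    vertexfile  "$dataset/polblogs_node.json"
    queryfile   "$dataset/query_polblogs.json"
    edges       20000
    vertices    1500
    filetype    edgelist    # mandatory on the first run of a dataset
    memory      800         # MB available at each machine
    outputfile  output_polblogs.txt
)

cmd=(mpiexec -n 1 ./bin/src/GraphSim_master "${args[@]}")

# Print the command; execute it only when RUN=1 is set (hypothetical switch).
printf '%s ' "${cmd[@]}"; echo
if [ "${RUN:-0}" = "1" ]; then
    "${cmd[@]}"
fi
```

Quoting `"${cmd[@]}"` keeps every argument intact even if a path contains spaces.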