An introduction to Cascalog.
cascalog-workshop requires the following:
- Leiningen 2: latest stable version
- Hadoop 0.20.2: tgz and md5sum
If you do not already have Leiningen 2, you can download the latest stable version from the repository on GitHub.
wget https://raw.github.com/technomancy/leiningen/stable/bin/lein
Remember to make it executable.
chmod +x lein
You may also want to add it to your PATH.
The first time you run lein, it will download its own dependencies and
bootstrap itself.
To exercise Leiningen, try running lein marg in the
cascalog-workshop project directory. This ought to run the
lein-marginalia plugin to
generate browseable documentation under the docs/ directory.
If you do not already have a working installation of Hadoop, you can follow these instructions for a minimal setup.
-
Download the tgz archive for Hadoop version 0.20.2 and verify the md5 checksum.
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2{.tar.gz,.tar.gz.md5} md5sum -c hadoop-0.20.2.tar.gz.md5 -
Extract the archive and enter the destination directory.
tar xfvz hadoop-0.20.2.tar.gz cd hadoop-0.20.2 -
Run the Hadoop executable script. This ought to print usage information and exit.
bin/hadoop -
Populate a folder with some input data for an example Hadoop job.
mkdir /tmp/input cp bin/*.sh /tmp/input/ -
Run a job in non-distributed mode. This job reads the contents of the
/tmp/input/directory as input, greps for'hadoop', and writes output files to/tmp/output.bin/hadoop jar hadoop-0.20.2-examples.jar grep /tmp/input /tmp/output 'hadoop' -
View the job output.
cat /tmp/output/part-*
If you can run this example job, then your Hadoop setup is probably ready for this workshop.
Sam Ritchie generously granted permission to use material from his cascalog-koans project. Thanks Sam!
Copyright © 2013 Steve M. Kim
Distributed under the Eclipse Public License, the same as Clojure.