I took SEIS 736, Big Data Architecture, in Spring 2016 with Dr. Brad Rubin. The course explored big data technologies with an emphasis on Apache Hadoop, an open-source, Java-based framework for storing and processing extremely large data sets in a distributed computing environment. We also explored topics such as information retrieval and computer security, as well as technologies such as Apache Spark, Apache Hive, and MapReduce.
To complete the weekly assignments and the project, students had to install a Cloudera-configured virtual machine running the CentOS 6.3 Linux distribution on their own laptops. Apache Hadoop was installed on the VM as a single-node “pseudo” cluster. We used the University of Saint Thomas’s larger, multi-node cluster to run jobs that were too large for our VMs.
The final project required that we use Hadoop or a similar technology to analyze a large dataset of our choosing. Dr. Rubin suggested that we choose one of several online, open-source datasets to complete the project. My research is based on data from the MovieLens web site (http://movielens.org), collected by University of Minnesota researchers from GroupLens Research.
I conducted my research primarily using Java and Apache Hadoop, supplementing my findings with a little SQL and Apache Hive. The code I wrote to conduct my research can be found in the source folder. The "research_and_analysis" file explains the source code in detail as part of its overall analysis of the film dataset. The code is best explored in tandem with the research paper.