Kent State University 
CS 4/5/6/79995: Advanced Computing Platforms for Data Processing 
Spring 2012

Instructor: Ruoming Jin

Tuesdays and Thursdays 5:30PM-6:45PM; Rm. SMH - 00202
Office Hours: Mondays and Wednesdays 4:30PM-5:30PM

TA: Lin Liu (lliu AT kent cs email domain)

---------------------------------------------------------

Prerequisites

CS 4/56101 Algorithms
CS 33001 Data Structures

CS 4/53005 Database Design

CS 6/73015 Data Mining Techniques
Or Consent of the Instructor

Course Overview

In the past few years, we have witnessed the emergence of several computing platforms, ranging from Cloud, to Multi-Core, to Mobile environments. They are quickly shaping the computing landscape. A major objective of these new computing platforms is to help manage and process the massive amounts of data that have become ubiquitous. For instance, organizations from government to business routinely collect terabytes of data on a daily basis, and gaining non-trivial insights and knowledge from that data is the underlying force supporting these organizations. From financial institutions and insurance companies, to IT giants such as Google, Amazon, and Yahoo!, to social networking stars like Facebook and Twitter, data and information processing plays a central role in all of them. However, utilizing the emerging computing platforms to handle large-scale data is a non-trivial problem: it requires not only skills in algorithm design and databases, but also a good understanding of parallelism and computing environments (system capabilities in processing, disk, memory, network, etc.).


This course will introduce state-of-the-art computing platforms, with a focus on how to utilize them in processing (managing and analyzing) massive datasets. Specifically, we will discuss the MapReduce (Hadoop) framework, which provides the most accessible and practical means of computing in the Cloud. We will also introduce emerging distributed databases and services, such as HBase, Hypertable, and Amazon’s Simple Storage Service (S3). We will also cover Pig Latin and Hive for large-scale data analysis. Finally, we will use several key data processing tasks, including simple statistics, data aggregation, join processing, frequent pattern mining, data clustering, information retrieval, PageRank, and massive graph analytics, as case studies for large-scale data processing.

Reference Textbooks

Hadoop: The Definitive Guide, Tom White, O’Reilly

Hadoop In Action, Chuck Lam, Manning

Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)

 

Slides

1.      Intro to Cloud Computing

2.      Big Data and Basic Data Analysis

3.      Intro to Hadoop/MapReduce/HDFS (Chapter 2 in Data-Intensive Text Processing with MapReduce)

4.      Running Hadoop and Hadoop on VM

5.      MapReduce Algorithm Design (Chapter 3 in Data-Intensive Text Processing with MapReduce)

6.      Information Retrieval (Chapter 4)

7.      EC2_Hadoop Tutorial

8.      Graph Processing with MapReduce

9.      Database Management with MapReduce

10.  Data Mining and Machine Learning with MapReduce

11.  Hadoop Practice (by Casey Stella, Technical Lead from Explorys)

12.  NoSQL (CAP, HBase, Cassandra, Neo4j) + Hive/Pig

Homework

1. HW1 (Install and Run Hadoop)

2. HW2

3. HW3

Paper List