Advanced Computing Platforms for Data Processing

Kent State University
CS 4/5/6/79995: Advanced Computing Platforms for Data Processing
Spring 2012

Instructor: Ruoming Jin

Tuesdays and Thursdays 5:30PM-6:45PM; Rm. SMH - 00202
Office Hours: Mondays and Wednesdays 4:30PM-5:30PM

TA: Lin Liu (lliu AT kent cs email domain)

---------------------------------------------------------

Prerequisites

CS 4/56101 Algorithms
CS 33001 Data Structures

CS 4/53005 Database Design

CS 6/73015 Data Mining Techniques
Or Consent of the Instructor

Course Overview

In the past few years, we have witnessed the emergence of several computing platforms, ranging from Cloud to Multi-Core to Mobile environments. They are quickly shaping the computing landscape. A major objective of these new computing platforms is to assist in managing and processing the massive amounts of data that have become ubiquitous. For instance, organizations from government to business routinely collect terabytes of data on a daily basis, and gaining non-trivial insights and knowledge from the data is the underlying force supporting these organizations. From financial institutions and insurance companies, to IT giants such as Google, Amazon, and Yahoo!, to social networking stars like Facebook and Twitter, data and information processing takes a central role in all of them. However, utilizing the emerging computing platforms to handle large-scale data is a non-trivial problem: it not only needs skills in algorithm design and databases, but also requires a good understanding of parallelism and computing environments (system capabilities in processor, disk, memory, network, etc.).

This course will introduce state-of-the-art computing platforms with a focus on how to utilize them in processing (managing and analyzing) massive datasets. Specifically, we will discuss the MapReduce (Hadoop) framework, which provides the most accessible and practical means of computing in the Cloud. We will also introduce emerging distributed databases and services, such as HBase, Hypertable, and Amazon’s Simple Storage Service (S3). We will also cover Pig Latin and Hive for large-scale data analysis. Finally, we will use several key data processing tasks, including simple statistics, data aggregation, join processing, frequent pattern mining, data clustering, information retrieval, PageRank, and massive graph analytics, as case studies for large-scale data processing.