Thursdays 17:30-18:45; Rm. SMH - 00202
Office Hours: Mondays and Wednesdays 16:30-17:30
TA: Lin Liu (lliu AT kent cs email domain)
CS 4/56101 Algorithms
CS 33001 Data Structures
CS 4/53005 Database Design
CS 6/73015 Data Mining Techniques
Or Consent of the Instructor
In the past few years, we have witnessed the emergence of several computing platforms, ranging from cloud to multi-core to mobile environments, and they are quickly shaping the computing landscape. A major objective of these new platforms is to assist in managing and processing the massive amounts of data that have become ubiquitous. For instance, organizations from government to business routinely collect terabytes of data on a daily basis, and gaining non-trivial insights and knowledge from that data is a driving force behind these organizations. From financial institutions and insurance companies, to IT giants such as Google, Amazon, and Yahoo!, to social networking stars like Facebook and Twitter, data and information processing plays a central role in all of them. However, utilizing the emerging computing platforms to handle large-scale data is a non-trivial problem: it requires not only skills in algorithm design and databases, but also a good understanding of parallelism and computing environments (system capabilities in processor, disk, memory, network, etc.).
This course will introduce state-of-the-art computing platforms, with a focus on how to utilize them in processing (managing and analyzing) massive datasets. Specifically, we will discuss the MapReduce (Hadoop) framework, which provides the most accessible and practical means of computing in the cloud. We will also introduce emerging distributed databases and services, such as HBase, Hypertable, and Amazon's Simple Storage Service (S3). We will also cover Pig Latin and Hive for large-scale data analysis. Finally, we will use several key data processing tasks, including simple statistics, data aggregation, join processing, frequent pattern mining, data clustering, information retrieval, PageRank, and massive graph analytics, as case studies for large-scale data processing.
Hadoop: The Definitive Guide, Tom White, O'Reilly
Hadoop in Action, Chuck Lam, Manning
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer (www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)
3. Intro to Hadoop/MapReduce/HDFS (Chapter 2 in Data-Intensive Text Processing with MapReduce)
5. MapReduce Algorithm Design (Chapter 3 in Data-Intensive Text Processing with MapReduce)
11. Hadoop Practice (by Casey Stella, Technical Lead from Explorys)