Tuesdays and Thursdays 17:30-18:45; Rm. SMH - 00202
Office Hours: Mondays and Wednesdays 16:30-17:30
TA: Lin Liu (lliu AT kent cs email domain)
---------------------------------------------------------
CS 4/56101 Algorithms
CS 33001 Data Structures
CS 4/53005 Database Design
CS 6/73015 Data Mining Techniques
Or Consent of the Instructor
In the past few years, we have witnessed the emergence of several computing
platforms, ranging from cloud to multi-core to mobile environments. They are
quickly shaping the computing landscape. A major objective of these new
computing platforms is to assist in managing and processing the massive amounts
of data that have become ubiquitous. For instance, organizations from
government to business routinely collect terabytes of data on a daily basis,
and gaining non-trivial insights and knowledge from the data is the underlying
force supporting these organizations. From financial institutions and insurance
companies, to IT giants such as Google, Amazon, and Yahoo!, to social
networking stars like Facebook and Twitter, data and information processing
takes a central role in all of them. However, utilizing the emerging computing
platforms to handle large-scale data is a non-trivial problem: it requires not
only skills in algorithm design and databases, but also a good understanding of
parallelism and computing environments (system capabilities in processor, disk,
memory, network, etc.).
This course will introduce state-of-the-art computing platforms, with a focus
on how to utilize them in processing (managing and analyzing) massive datasets.
Specifically, we will discuss the MapReduce (Hadoop) framework, which provides
the most accessible and practical means of computing in the cloud. We will also
introduce emerging distributed databases and services, such as HBase,
Hypertable, and Amazon’s Simple Storage Service (S3). We will also cover Pig
Latin and Hive for large-scale data analysis. Finally, we will use several key
data processing tasks, including simple statistics, data aggregation, join
processing, frequent pattern mining, data clustering, information retrieval,
PageRank, and massive graph analytics, as case studies for large-scale data
processing.
Hadoop: The Definitive Guide, Tom White, O’Reilly
Hadoop In Action, Chuck Lam, Manning
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer
(www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf)
2. Big Data and Basic Data Analysis
3. Intro to Hadoop/MapReduce/HDFS (Chapter 2 in Data-Intensive Text Processing with MapReduce)
4. Running Hadoop and Hadoop on VM
5. MapReduce Algorithm Design (Chapter 3 in Data-Intensive Text Processing with MapReduce)
6. Information Retrieval (Chapter 4)
8. Graph Processing with MapReduce
9. Database Management with MapReduce
10. Data Mining and Machine Learning with MapReduce
11. Hadoop Practice (by Casey Stella, Technical Lead from Explorys)
12. NoSQL (CAP, HBase, Cassandra, Neo4j) + Hive/Pig