I. Frequent Pattern Mining
In many situations, such as in a
data warehouse, the user usually has a view of multiple datasets collected
from different data sources or from different time points. In such scenarios,
comparing the patterns from different datasets and understanding their
relationships can be an extremely important part of the KDD process. For
example, consider mining the data warehouse for a nation-wide store, which has
three branches, in
We refer to the above queries, which involve multiple datasets, as CMT (Complex Mining Tasks) queries. In this work, we address the new functionality and new optimizations required for the CMT queries, specifically focusing on frequent itemset mining. We make the following contributions:
For details, please look at our paper,
Currently, we are developing a system to evaluate and optimize multiple CMT queries. Particularly, we will study how a cache can help to speedup the query evaluation. The framework of our system is as follows.

In recent years, database and data mining communities focus on a new model of data processing, where data arrives in the form of continuous streams. Because it is not feasible to store all data, it is quite challenging to perform the traditional data mining operations in a streaming environment. Our current and proposed research focuses on many of the challenges associated with mining streaming data. Particularly, we are driven by 1) the need to have low memory requirements, as stream mining is often carried out in small and hand held devices that do not have much memory, 2) some variants of the mining that may be desirable in a streaming environment, and 3) the need for having high accuracy and confidence in the results generated.
In this work, we take a new approach for mining frequent itemsets in data streams and make two major contributions. First, we propose a one pass algorithm for frequent itemset mining, which has deterministic bounds on the accuracy, and does not require any out-of-core summary structure. Second, because our one pass algorithm does not produce any false negatives, it can be easily extended to a two pass accurate algorithm. Our two pass algorithm is very memory efficient, and allows mining of datasets with large number of distinct items and/or very low support levels.
Our detailed experimental evaluation on synthetic and
real datasets shows the following. First, our one pass algorithm is
very accurate in practice. Second, our algorithm requires significantly lower
memory than Manku and Motwani’s one pass algorithm
and the multi-pass apriori algorithm. Our two pass
algorithm outperforms apriori and FP-tree when the
number of distinct items is large and/or support levels are very low. In
other cases, it is quite competitive, with possible exception of cases where
the average length of frequent itemsets is quite high.
For details, please look at our paper,
(3) New Data Structures for Maintaining Frequent Itemsets
When mining frequent itemsets on data streams, or caching the
results of mining operators in CMT queries, there are several common
operations.
1) Insertion of a set of new
itemsets with their frequency;
2) Deletion of the set of
itemsets less than certain threshold;
3) Query about the set of
itemsets larger than or equal to a certain threshold;
4) Query about the existence
or frequency of some itemsets;
5) Traversal of all itemsets.
To support these operations, efficient data structures are desired. However,
the traditional data structures, such as Prefix Tree and Hash Tree, can not
meet all these requirements.
In this work, we propose two new data structures, Dynamic Hash Tree
(DHT) and TreeHash, to support all these operations efficiently. The basic idea of these data structure is to
combine the hash table and tree structure in an efficient way to reduce both
memory and operation costs. In DHT, each node associates with a hash table,
which can grow or shrink according to the ratio between the children nodes and
current size of the hash table. As an alternative, TreeHash stores a prefix
tree using hash tables. It not only inherits the benefits of a hash table for
its easy deletion, but also compacts like a prefix tree. In TreeHash, a hashing
function maps each node in the prefix tree to a hash bucket. For each itemset
in a hash bucket, the following information is stored: 1) the length of the
itemset, 2) hash address of its immediate prefix itemset, which is used for
identifying this itemset, 3) a serial number to this itemset within this hash
bucket, and 4) a count of the number of occurrences of the itemset.
The detailed experiment to study the efficacy of our new data structures is on-going.
II. Efficient and Exact K-Means Clustering on Very Large Datasets
Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset.
In this work, we propose a new algorithm FEKM which typically
requires only one or a small number of passes on the entire dataset, and
provably produces the same cluster centers as reported by the original k-means
algorithm. The algorithm uses sampling to create initial cluster centers, and
then takes one or more passes over the entire dataset to adjust these cluster
centers. We provide theoretical analysis to show that the cluster centers thus
reported are the same as the ones computed by the original k-means algorithm.
Experimental results from a number of real and synthetic datasets show speedup
between a factor of 2 and 4.5, as compared to k-means.
The work is presented in our paper:
In more recent work, we study how to compute the exact k-means
clustering in a distributed environment efficiently.
III. Efficient and Effective Decision Tree Construction on Streaming Data
Decision
tree construction is a well studied problem in data mining. Recently,
there has been much interest in mining streaming data. Domingos
and Hulten have proposed a one-pass algorithm for decision tree construction. Their
work uses Hoeffding inequality to achieve a probabilistic bound on the accuracy
of the tree constructed.
In this work, we revisit this problem. We make the following three contributions: 1) We propose a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We develop a new algorithm which attempts to find, and mark as such, the exact split points by using only a single pass on data. Our experimental results on applying this algorithm to the top 5 levels of the tree show that an average of 79% of the non-leaf nodes could be determined to be split exactly. 3) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.
Overall, the three new techniques introduced here significantly
improve the efficiency and effectiveness of decision tree construction on
streaming data.
This work is presented in the following papers.