I. Frequent Pattern Mining

(1) Mining Multiple Datasets 

In many situations, such as in a data warehouse, the user usually has a view of multiple datasets collected from different data sources or from different time points. In such scenarios, comparing the patterns from different datasets and understanding their relationships can be an extremely important part of the KDD process. For example, consider mining the data warehouse for a nation-wide store, which has three branches, in New Jersey, New York, and California, respectively. Each of them maintains a database with last one week’s retail transactions. To understand how the geographical factors impact shopping patterns, queries of the following type are likely to be asked:

  • Q1: Find the itemsets that are frequent with support level 0.1% in any of the stores.
  • Q2: Find the itemsets that are frequent with support level 0.1% in each store. 
  • Q3: Find the itemsets that are frequent with support level 0.05% in both the stores on east coast, but are very infrequent (support less than 0.01%) in the west coast store.

 

We refer to the above queries, which involve multiple datasets, as CMT (Complex Mining Tasks) queries.   In this work, we address the new functionality and new optimizations required for the CMT queries, specifically focusing on frequent itemset mining. We make the following contributions:  

  • We propose an SQL-based mechanism for querying frequent itemsets across multiple datasets.
  • We establish an algebra for such queries.
  • We develop a number of transformations on this algebra, and introduce several new operators for this purpose.
  • We propose several heuristic algorithms for finding the efficient query plan.
  • We evaluate our algorithms on real and synthetic datasets, and show a performance gain up to an order of magnitude.

 

For details, please look at our paper,

 

Currently, we are developing a system to evaluate and optimize multiple CMT queries. Particularly, we will study how a cache can help to speedup the query evaluation. The framework of our system is as follows.

 

 

 

(2) Mining Data Streams

 In recent years, database and data mining communities focus on a new model of data processing, where data arrives in the form of continuous streams. Because it is not feasible to store all data, it is quite challenging to perform the traditional data mining operations in a streaming environment. Our current and proposed research focuses on many of the challenges associated with mining streaming data. Particularly, we are driven by 1) the need to have low memory requirements, as stream mining is often carried out in small and hand held devices that do not have much memory, 2) some variants of the mining that may be desirable in a streaming environment, and 3) the need for having high accuracy and confidence in the results generated.

 

In this work, we take a new approach for mining frequent itemsets in data streams and make two major contributions. First, we propose a one pass algorithm for frequent itemset mining, which has deterministic bounds on the accuracy, and does not require any out-of-core summary structure. Second, because our one pass algorithm does not produce any false negatives, it can be easily extended to a two pass accurate algorithm. Our two pass algorithm is very memory efficient, and allows mining of datasets with large number of distinct items and/or very low support levels.

 

Our detailed experimental evaluation on synthetic and real datasets shows the following. First, our one pass algorithm is very accurate in practice. Second, our algorithm requires significantly lower memory than Manku and Motwani’s one pass algorithm and the multi-pass apriori algorithm. Our two pass algorithm outperforms apriori and FP-tree when the number of distinct items is large and/or support levels are very low.  In other cases, it is quite competitive, with possible exception of cases where the average length of frequent itemsets is quite high.

 

For details, please look at our paper,

 

 

(3) New Data Structures for Maintaining Frequent Itemsets 

When mining frequent itemsets on data streams, or caching the results of mining operators in CMT queries, there are several common operations.
            1) Insertion of a set of new itemsets with their frequency;
            2) Deletion of the set of itemsets less than certain threshold;
            3) Query about the set of itemsets larger than or equal to a certain threshold;
            4) Query about the existence or frequency of some itemsets;
            5) Traversal of all itemsets.
To support these operations, efficient data structures are desired. However, the traditional data structures, such as Prefix Tree and Hash Tree, can not meet all these requirements.
 

In this work, we propose two new data structures, Dynamic Hash Tree (DHT) and TreeHash, to support all these operations efficiently.  The basic idea of these data structure is to combine the hash table and tree structure in an efficient way to reduce both memory and operation costs. In DHT, each node associates with a hash table, which can grow or shrink according to the ratio between the children nodes and current size of the hash table. As an alternative, TreeHash stores a prefix tree using hash tables. It not only inherits the benefits of a hash table for its easy deletion, but also compacts like a prefix tree. In TreeHash, a hashing function maps each node in the prefix tree to a hash bucket. For each itemset in a hash bucket, the following information is stored: 1) the length of the itemset, 2) hash address of its immediate prefix itemset, which is used for identifying this itemset, 3) a serial number to this itemset within this hash bucket, and 4) a count of the number of occurrences of the itemset.

 

The detailed experiment to study the efficacy of our new data structures is on-going.

 

II. Efficient and Exact K-Means Clustering on Very Large Datasets

Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. 

 

In this work, we propose a new algorithm FEKM which typically requires only one or a small number of passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared to k-means.

 

The work is presented in our paper:

 

In more recent work, we study how to compute the exact k-means clustering in a distributed environment efficiently.

 

III. Efficient and Effective Decision Tree Construction on Streaming Data

Decision tree construction is a well studied problem in data mining.  Recently, there has been much interest in mining streaming data. Domingos and Hulten have proposed a one-pass algorithm for decision tree construction. Their work uses Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed. 

 

In this work, we revisit this problem. We make the following three contributions:  1) We propose a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We develop a new algorithm which attempts to find, and mark as such, the exact split points by using only a single pass on data. Our experimental results on applying this algorithm to the top 5 levels of the tree show that an average of 79% of the non-leaf nodes could be determined to be split exactly.  3) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.              

 

Overall, the three new techniques introduced here significantly improve the efficiency and effectiveness of decision tree construction on streaming data.

 

This work is presented in the following papers.

  • Ruoming Jin and Gagan Agrawal, Efficient Decision Tree Construction on Streaming Data,  in the Proc. of 9th International Conference on Knowledge Discovery and Data Mining (SIGKDD), August 2003.
  • Ruoming Jin, Anjan Goswami, and Gagan Agrawal, Effective Decision Tree Construction on Streaming Data, submitted for publication.