CS 69191: Masters Seminar
CS 89191: Doctoral Seminar

Spring 2009


Doctoral Student Presentation:
Feature Selection Techniques for Enhancing Text Categorization

Mohammed Al-Refai


This research aims to enhance text categorization by increasing the accuracy of the results (classification correctness), reducing the time required for classification, and reducing the dataset size (saving memory). To achieve these goals, the classification process was divided into two steps. In the first step, a technique called feature selection was used to select a set of features from the dataset that reflect its semantics and meaning, while removing features that are redundant or meaningless. This was done by introducing three feature selection methods: stemming, light stemming, and word clustering. In the second step, text classification was applied using the selected features only; a k-nearest neighbors classifier was used to categorize the text documents. The three introduced feature selection techniques are briefly defined as follows: 1-Stemming: reduces words to their stems. 2-Light stemming: does not produce the exact linguistic stem; rather, it removes the most frequent suffixes and prefixes. 3-Word clustering: clusters words based on the semantic relations between them. There are many ideas for future work, such as applying feature selection methods to web mining and developing a statistical feature selection method for enhancing text categorization and information retrieval.
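
The two-step process described above (light stemming as feature selection, followed by k-nearest neighbors classification) can be illustrated with a minimal sketch. It is not the presenter's implementation: the affix lists, the cosine similarity measure, the toy training set, and the names light_stem, features, cosine, knn_classify, and TRAINING are all assumptions chosen for illustration, and English affixes are used only because the abstract does not name a target language.

```python
import math
from collections import Counter

# Hypothetical affix lists, chosen only for illustration (assumption).
PREFIXES = ("un", "re")
SUFFIXES = ("ing", "ed", "es", "s")


def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix.

    An affix is removed only if at least min_len characters remain,
    so the result is not guaranteed to be the true linguistic stem.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def features(text):
    """Step 1: map a document to counts of its light-stemmed words."""
    return Counter(light_stem(w) for w in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(text, training, k=3):
    """Step 2: majority vote among the k most similar training documents."""
    query = features(text)
    ranked = sorted(
        training, key=lambda ex: cosine(query, features(ex[0])), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled documents, invented for this example.
TRAINING = [
    ("players scored goals", "sports"),
    ("the team played a match", "sports"),
    ("voters elected the president", "politics"),
    ("parliament passed new laws", "politics"),
]

print(light_stem("playing"))
print(knn_classify("the player scored a goal", TRAINING, k=3))
```

In this sketch, light stemming maps "players", "scored", and "goals" to the same features as "player", "scored", and "goal", so the query shares more features with the sports documents than it would with raw words. A full stemmer or a word clustering step could be substituted for light_stem without changing the classification step.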