Recent advances in theoretical and computational methods for characterizing chemical, physicochemical and biological phenomena have made large volumes of information (data matrices) available for analysis. This abundance is not entirely advantageous, however, because it usually produces a high-dimensional (i.e. “small sample, many features”) space, which degrades the performance of regression and classification algorithms. Moreover, exhaustively examining the entire feature (variable) space in search of the subsets that best describe a given phenomenon is computationally expensive, and such exploration may select features that aggravate overfitting. It is therefore important to develop procedures that filter out noisy, redundant or highly correlated variables without compromising learning performance. Dimensionality reduction usually improves the quality of models, especially their predictive power, and also permits greater computational efficiency. To this end, the IMMAN (acronym for Information theory-based CheMoMetrics ANalysis) software was conceived as a free computational tool for supervised and unsupervised feature selection based on information-theoretic parameters. Its main functionalities are:
- To obtain the best low-dimensional representation of data matrices, using unsupervised feature selection filters.
- To select the sets of features that best correlate with classification labels, using supervised feature selection tools.
- To pre-process data, including tasks such as dataset partitioning, missing-value processing and basic descriptive statistical analysis.
- To perform graphical analysis of the results for better comprehension and interpretation.
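
The difference between the unsupervised and supervised filters listed above can be illustrated with a minimal Python sketch. It is not IMMAN's implementation: the text does not name the specific information-theoretic parameters used, so Shannon entropy (unsupervised) and information gain with respect to the class labels (supervised) are assumed here as representative choices. The equal-width binning, the function names and the toy data are all hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Map numeric values to equal-width bin indices 0 .. bins-1 (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # a constant feature falls into a single bin
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of a sequence of discrete symbols."""
    n = len(symbols)
    return sum((c / n) * math.log2(n / c) for c in Counter(symbols).values())


def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * shannon_entropy(g) for g in groups.values())
    return shannon_entropy(labels) - conditional


def rank_unsupervised(columns, bins=4):
    """Rank features by the entropy of their binned values, highest first."""
    scores = {name: shannon_entropy(discretize(col, bins))
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rank_supervised(columns, labels, bins=4):
    """Rank features by information gain about the labels, highest first."""
    scores = {name: information_gain(discretize(col, bins), labels)
              for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data matrix: three features, eight samples, two classes.
columns = {
    "x_const": [1.0] * 8,                                   # carries no information
    "x_spread": [0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 7.0],   # varied, unrelated to class
    "x_class": [0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.9, 0.8],    # separates the two classes
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print("unsupervised ranking:", rank_unsupervised(columns))
print("supervised ranking:  ", rank_supervised(columns, labels))
```

In this toy example the unsupervised filter favours `x_spread`, the feature with the most evenly distributed values, whereas the supervised filter favours `x_class`, the only feature that carries information about the labels. A feature can therefore be highly informative in the unsupervised sense without being useful for classification.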