Abstract: Appropriate feature engineering, the process of combining and transforming raw features, can significantly improve the performance of a supervised learning task. Because the number of possible engineered features grows combinatorially, successful feature engineering in a corporate environment very often requires domain knowledge. However, as data becomes more complex and the number of data sources grows, applying human insight to the feature engineering process becomes difficult.
In this presentation, I introduce a novel, fast, and hopefully elegant algorithm in which the complexity of feature engineering depends only on the number of features relevant to modelling the target variable, rather than on all the features in a dataset. When the number of relevant features is much smaller than the total number of features, the cost of feature engineering becomes almost independent of the size of the dataset's feature set. The method draws on insights from previous work linking machine learning and information theory. I then present the results of tests of the algorithm, which show that the method is effective in creating a feature that improves the performance of both classification and regression algorithms when the engineered feature is included alongside the original features. Reaching statistically valid conclusions requires testing the algorithm on a large number of appropriate datasets. I therefore also introduce a method for generating synthetic datasets with desirable characteristics on which to test machine learning algorithms.
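To make the idea concrete, here is a minimal sketch and not the algorithm presented in the talk. It assumes binary features, a synthetic target that is the noisy AND of two chosen columns, and empirical mutual information as the relevance score. All function names, the 0.05-bit threshold, and the choice of pairwise products as candidate features are hypothetical and used for illustration only. Note that the screening pass in this sketch is still linear in the total number of features; only the combination step is restricted to the relevant ones.

```python
import math
import random
from collections import Counter
from itertools import combinations


def make_dataset(n_samples, n_features, relevant, noise=0.05, seed=0):
    """Synthetic data: binary features; target = AND of `relevant` columns, with label noise."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = []
    for row in X:
        label = int(all(row[j] for j in relevant))
        if rng.random() < noise:  # flip the label with probability `noise`
            label = 1 - label
        y.append(label)
    return X, y


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def column(X, j):
    return [row[j] for row in X]


def select_relevant(X, y, threshold=0.05):
    """Keep the columns whose mutual information with the target exceeds the threshold."""
    return [
        j for j in range(len(X[0]))
        if mutual_information(column(X, j), y) > threshold
    ]


def engineer_feature(X, y, relevant):
    """Search products of pairs of relevant columns only.

    Returns (pair, values, score) for the best pair, or None if fewer
    than two relevant columns were found.
    """
    best = None
    for a, b in combinations(relevant, 2):
        values = [row[a] * row[b] for row in X]
        score = mutual_information(values, y)
        if best is None or score > best[2]:
            best = ((a, b), values, score)
    return best


X, y = make_dataset(n_samples=2000, n_features=20, relevant=(3, 7))
relevant = select_relevant(X, y)
pair, values, score = engineer_feature(X, y, relevant)
print("relevant columns:", relevant)
print("best pair:", pair, "MI with target (bits):", round(score, 3))
```

With 20 features there are 190 possible pairs, but the search above only evaluates pairs among the columns that pass the relevance screen, which is the kind of saving the abstract describes when few features are relevant.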
At this stage, the work is a proof of concept. Future work would include generalising the method and implementing it as Python packages for use by the data science community.