Selecting the Best Features for Your Machine Learning Model: A Comprehensive Review of Three Popular Methods
Correlation-based Feature Selection (CFS): CFS is a filter-based feature selection method that selects features based on their correlation with the class variable and their intercorrelation. CFS computes a subset of features that maximizes the correlation with the class while minimizing the intercorrelation between features. It does this by computing the correlation coefficient between each feature and the class and between pairs of features. The correlation coefficient measures the strength of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). CFS selects features with high correlation with the class and low intercorrelation, achieved through a search process that evaluates the subset of features.
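To make this concrete, one way to sketch the trade-off CFS makes is as a "merit" score for a candidate subset. The function below is only an illustration of the idea under simplifying assumptions: it treats the features as numeric and uses plain Pearson correlation as the association measure, whereas full CFS implementations typically use a symmetrical-uncertainty measure and a heuristic search over subsets.

import numpy as np

def cfs_merit(X_subset, y):
    # Merit of a candidate subset: high average feature-class correlation,
    # low average feature-feature intercorrelation.
    k = X_subset.shape[1]
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, i], y)[0, 1]) for i in range(k)])
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
                        for i in range(k) for j in range(i + 1, k)])
    else:
        r_ff = 0.0
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

A forward search would then grow the subset one feature at a time, keeping an addition only if it raises this merit.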
In addition to its ability to select relevant features and minimize redundancy, CFS has advantages over other feature selection methods. CFS is computationally efficient, as it does not require training a model on the entire dataset and can handle continuous and discrete data. Moreover, CFS can handle datasets with missing values, using a subset of available features to calculate the correlation between features and the class variable.
However, CFS has some limitations as well. First, CFS assumes a linear relationship between features and the class variable, which may not hold in some datasets. Second, CFS may select features that are highly correlated with each other but do not provide unique information about the class variable, which can lead to overfitting and reduce the generalization performance of the model. Third, CFS may only be suitable for datasets with a relatively small number of features, as it requires computing the correlation coefficient for each pair of features, which can be computationally expensive.
To sum up, CFS is a valuable feature selection method that can help improve machine learning models' performance by selecting the most relevant features and minimizing redundancy. However, its limitations should be carefully considered before applying it to a specific dataset, and it should be compared to other feature selection methods to ensure the best performance.
The CFS algorithm
1. Calculate the entropy of the target class variable using the formula: H(C) = -sum(p_i * log2(p_i)), where p_i is the proportion of instances in class i.
2. For each feature F_i, calculate the conditional entropy H(C|F_i) = sum_v p(v) * H(C|F_i = v) = -sum_v p(v) * sum_j p(j|v) * log2(p(j|v)), where p(v) is the proportion of instances for which F_i takes the value v and p(j|v) is the proportion of those instances that belong to class j.
3. Calculate the information gain of each feature using the formula: IG(F_i) = H(C) - H(C|F_i).
4. Rank the features based on their information gain in descending order.
5. Select the top-k features with the highest information gain as the subset of features for training the model.
The sample code looks like this:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# Load the feature matrix X and target vector y (load_dataset is a placeholder)
X, y = load_dataset()
# Number of features to keep
k = 10
# Score each feature and keep the k highest-scoring ones
selector = SelectKBest(score_func=mutual_info_classif, k=k)
X_new = selector.fit_transform(X, y)
# Train a model on the reduced feature set (train_model is a placeholder)
model = train_model(X_new, y)
In this example, we first load the dataset X and the target variable y. Then, we create a SelectKBest object and specify mutual_info_classif as the score function to use for selecting the features. We set k=10 to select the top 10 features with the highest information gain. We then fit the selector on the dataset using fit_transform and obtain the new dataset X_new with only the selected features.
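If you also need to know which columns were kept, the fitted selector exposes this through get_support, which is part of SelectKBest's standard API:

selected_mask = selector.get_support()                 # boolean mask over the original columns
selected_indices = selector.get_support(indices=True)  # or the column indices directly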
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Pearson correlation: Pearson correlation is a commonly used statistical method for measuring the linear relationship between two variables. In the context of feature selection, it is used to identify the features most strongly correlated with the target variable (i.e., the class variable). The method assumes a linear relationship between the variables and that they follow a normal distribution; it applies to continuous variables and is sensitive to outliers. The Pearson correlation coefficient, denoted by r, ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation. A higher absolute value of r indicates a stronger linear relationship between the two variables.
To use Pearson correlation for feature selection, we first calculate the correlation coefficient between each feature and the target variable. This can be done using the pearsonr function from the scipy.stats module in Python. The pearsonr function returns two values: the correlation coefficient r and a p-value that measures the probability that the observed correlation is due to chance. In feature selection, we typically focus on the absolute value of r, since we are interested in the strength of the correlation rather than its direction.
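For reference, pearsonr takes two equal-length arrays and returns the coefficient together with its p-value; a minimal call looks like this (the toy arrays are made up purely to show the interface):

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
z = [2, 4, 6, 8, 10]         # an exact linear function of x
r, p_value = pearsonr(x, z)  # r is 1.0 here; the p-value estimates how likely such a correlation is by chance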
After calculating the correlation coefficients for all features, we can rank the features based on their absolute correlation coefficients and select the top-k features with the highest values. This can be done using the SelectKBest class from the sklearn.feature_selection module in Python, with a small wrapper around pearsonr as the scoring function and k set to the desired number of features.
It’s important to note that Pearson correlation assumes a linear relationship between the variables and is sensitive to outliers. Therefore, it may not be suitable for datasets with non-linear relationships or extreme values. Additionally, Pearson correlation only measures the strength of the linear relationship between two variables and may miss important non-linear or higher-order relationships.
Algorithm for using Pearson correlation for feature selection:
1. Calculate the Pearson correlation coefficient between each feature and the target variable using the pearsonr function from the scipy.stats module.
2. Calculate the absolute value of the correlation coefficient for each feature.
3. Rank the features based on their absolute correlation coefficients in descending order.
4. Select the top-k features with the highest correlation coefficients using the SelectKBest class from the sklearn.feature_selection module, with k set to the desired number of features.
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
import numpy as np

# Load the feature matrix X and target vector y (load_dataset is a placeholder)
X, y = load_dataset()
# Number of features to keep
k = 10

# Custom score function: absolute Pearson correlation of each column of X with y
def pearson_score(X, y):
    scores = []
    for i in range(X.shape[1]):
        score, _ = pearsonr(X[:, i], y)
        scores.append(abs(score))
    return np.array(scores)

# Keep the k features with the highest absolute correlation
selector = SelectKBest(score_func=pearson_score, k=k)
X_new = selector.fit_transform(X, y)
# Train a model on the reduced feature set (train_model is a placeholder)
model = train_model(X_new, y)
In this example, we first load the dataset X and the target variable y. Then, we define a custom pearson_score function that calculates the absolute Pearson correlation coefficient between each feature in X and the target variable y. We create a SelectKBest object and specify pearson_score as the score function to use for selecting the features. We set k=10 to select the top 10 features with the highest absolute correlation coefficients. We then fit the selector on the dataset using fit_transform and obtain the new dataset X_new with only the selected features. Finally, we train a model on the selected features X_new and the target variable y using the train_model function (which is not shown in this example); the specific model to use depends on your task.
Correlation-based Feature Selection (CFS) and Pearson correlation are both feature selection methods used in machine learning. However, there are some key differences between the two.
CFS is a filter-based method that selects features based on their correlation with the class variable and their intercorrelation. CFS computes a subset of features that maximizes the correlation with the class while minimizing the intercorrelation between features. On the other hand, Pearson correlation is a statistical method that measures the linear relationship between two variables. It calculates the correlation coefficient between each feature and the class variable, where a higher value indicates a stronger correlation.
While both methods use correlation as a metric for feature selection, CFS also evaluates the intercorrelation between features, whereas Pearson correlation only evaluates the correlation between a feature and the class variable. CFS can therefore detect and remove redundant features, which is not possible with a per-feature Pearson ranking alone. On the other hand, Pearson correlation works naturally with continuous variables (though it is sensitive to outliers), while CFS is better suited to discrete variables.
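A small synthetic example (invented purely for illustration) makes this difference concrete: a near-duplicate feature looks just as good as the original under a per-feature Pearson ranking, whereas a CFS-style criterion would notice that the two are almost perfectly intercorrelated.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
target = rng.normal(size=200)
f1 = target + rng.normal(scale=0.1, size=200)   # strongly related to the target
f2 = f1 + rng.normal(scale=0.01, size=200)      # nearly a copy of f1, so it is redundant
f3 = target + rng.normal(scale=0.5, size=200)   # weaker, but carries independent information

# A per-feature Pearson ranking scores each feature in isolation, so the
# redundant copy f2 scores almost as high as f1 and would be picked before f3.
for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
    print(name, abs(pearsonr(f, target)[0]))

# CFS would also look at corr(f1, f2), which is close to 1 here, and would
# penalize a subset that keeps both, preferring {f1, f3} instead.
print("corr(f1, f2):", abs(pearsonr(f1, f2)[0]))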
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Information Gain (IG): IG is an entropy-based feature selection method that selects features with the highest information gain. It measures the information a feature provides about the class variable by computing the difference in entropy before and after splitting the data based on the feature. A feature with high information gain reduces the entropy of the class variable and helps to classify the data accurately. IG is commonly used for discrete variables and can be applied to continuous variables by discretizing them.
In more detail, Information Gain (IG) is a feature selection method that uses the concept of entropy to quantify the amount of information gained by splitting data on a given feature. Entropy is a measure of the impurity of a dataset, and the goal of IG is to select the feature that maximizes the reduction in entropy when the dataset is split based on that feature.
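For intuition, a binary dataset split 50/50 between the two classes has entropy H = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit, while a 90/10 split has H = -(0.9*log2(0.9) + 0.1*log2(0.1)) ≈ 0.47 bits; the second dataset is already much purer, so there is less entropy left for a feature to remove.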
The algorithm for Information Gain feature selection is as follows (a short from-scratch sketch appears after the list):
1. Compute the entropy of the original dataset based on the class variable.
2. For each feature, compute the entropy of the dataset after splitting it based on that feature.
3. Calculate the information gain of each feature as the difference between the entropy of the original dataset and the entropy of the dataset after splitting it based on that feature.
4. Select the feature with the highest information gain as the next feature to add to the subset of selected features.
5. Repeat steps 2–4 until the desired number of features is selected or until all features have been evaluated.
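The steps above can be written directly in a few lines. The sketch below assumes the feature and the class labels are discrete values stored in NumPy arrays; it mirrors the formulas rather than replacing scikit-learn's implementation.

import numpy as np

def entropy(labels):
    # H(C) = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # IG(F) = H(C) - H(C|F): entropy before the split minus the
    # weighted entropy of the groups created by splitting on the feature
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[feature == v])
                      for v, w in zip(values, weights))
    return entropy(labels) - conditional

Ranking then amounts to computing information_gain for every column and keeping the highest-scoring features, exactly as in steps 4 and 5.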
IG is commonly used for discrete variables but can be applied to continuous variables by discretizing them into bins. However, the quality of the discretization can affect the performance of the feature selection algorithm.
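As one way to handle that, continuous columns can be binned before scoring; the snippet below uses scikit-learn's KBinsDiscretizer, and the number of bins and the binning strategy are assumptions you would want to tune for your data.

from sklearn.preprocessing import KBinsDiscretizer

# Bin each continuous feature into 5 equal-frequency bins before computing IG
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_binned = discretizer.fit_transform(X)

Alternatively, scikit-learn's mutual information estimator can be used as an information-gain-style score directly, as in the following example.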
from sklearn.feature_selection import mutual_info_classif

# Compute the information gain (mutual information) of each feature with respect to y
ig_scores = mutual_info_classif(X, y)
# Sort feature indices by score, highest first
sorted_indices = ig_scores.argsort()[::-1]
# Keep the top-k features
k = 10
selected_features = sorted_indices[:k]
X_selected = X[:, selected_features]
In this code, we first import the mutual_info_classif function from scikit-learn's feature_selection module, which computes the mutual information between each feature and the target variable. We apply this function to the feature matrix X and the target variable y to get an array of information gain scores for each feature. We then sort the feature indices by score in descending order, keep the indices of the top 10 features, and use them to build the reduced feature matrix X_selected.
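As in the earlier examples, the reduced matrix can then be passed to whatever model you are training (train_model is the same placeholder used above):

model = train_model(X_selected, y)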