Synthetic Minority Over-sampling Technique

Mehmet Akif Cifci
3 min read · Feb 24, 2023

SMOTE stands for Synthetic Minority Over-sampling Technique. It is a data augmentation technique commonly used in machine learning to deal with imbalanced datasets, that is, datasets in which the classes to be predicted are not represented equally. For example, a binary classification problem in which 95% of the samples belong to class A and only 5% belong to class B is imbalanced. In such cases, the training data contains few examples of the minority class (class B), which makes it difficult for a machine learning algorithm to learn to classify that class correctly. SMOTE addresses this by generating synthetic samples of the minority class, interpolating new data points between existing minority class samples.

A dataset is imbalanced when the classes to be predicted are represented unequally, that is, when one or more classes have far fewer samples than the others. In a binary classification problem, a dataset in which 90% of the samples belong to class A and only 10% belong to class B is imbalanced. The same holds for multi-class classification: the dataset is considered imbalanced if one or more classes have a much smaller number of samples than the rest.
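How imbalanced a label column is can be checked before any resampling. The following is a minimal sketch using only the standard library; the function names `class_distribution` and `imbalance_ratio` and the label list are hypothetical and chosen to mirror the 90% / 10% example above.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, share of all samples)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}


def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (1.0 = balanced)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 90 samples of class A, 10 samples of class B
labels = ["A"] * 90 + ["B"] * 10

print(class_distribution(labels))
print(imbalance_ratio(labels))
```

The same functions work for multi-class labels, where the ratio compares the largest class with the smallest one.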

Imbalanced datasets are common in many real-world applications, such as fraud detection, disease diagnosis, and anomaly detection. They are challenging because a model trained on them tends to be biased toward the majority class: it may achieve high accuracy on the majority class while performing poorly on the minority class. SMOTE is one way to address this issue. The algorithm randomly selects a minority class sample, finds its k nearest neighbors within the minority class, picks one of those neighbors at random, and creates a synthetic sample at a random point on the line segment between the two. This is repeated until enough synthetic samples have been generated. The user typically specifies how many synthetic samples to generate.
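The interpolation step can be sketched in a few lines of NumPy. This is an illustrative sketch of the idea described above, not the imbalanced-learn implementation; the function name `smote_sketch`, its parameters, and the sample points are assumptions made for the example, and the neighbor search is a simple brute-force distance computation.

```python
import numpy as np


def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    X_min: array of shape (n_minority, n_features), minority class only.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more neighbors than n - 1

    # Brute-force pairwise Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(n)                      # random minority sample
        neighbor = neighbors[base, rng.integers(k)]  # one of its k neighbors
        gap = rng.random()                          # position on the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic


# Hypothetical minority class samples with two features
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
new_points = smote_sketch(X_min, n_synthetic=10, k=3)
print(new_points.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class rather than being exact copies of existing rows.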

The result of applying SMOTE is a more balanced, augmented dataset that gives the machine learning algorithm more examples of the minority class to learn from, which can improve its performance on that class.

Here is an example in Python that uses the imbalanced-learn library to apply SMOTE to an imbalanced dataset:

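import os

import pandas as pd
from imblearn.over_sampling import SMOTE

# Load the dataset (adjust this path to where your copy of the CSV is stored)
dataset = '/reditcard.csv'
df = pd.read_csv(dataset)

# Separate the features (X) from the target variable (y)
X = df.drop('Class', axis=1)
y = df['Class']

# Oversample the minority class; random_state makes the run reproducible
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Number of samples before SMOTE:", len(X))
print("Number of samples after SMOTE:", len(X_resampled))

# Class counts show whether the classes are actually balanced
print("Class counts before SMOTE:", pd.Series(y).value_counts().to_dict())
print("Class counts after SMOTE:", pd.Series(y_resampled).value_counts().to_dict())

# Save the resampled dataset to a new CSV file in the same directory as the original
dirname = os.path.dirname(dataset)
resampled_df = pd.concat([pd.DataFrame(X_resampled), pd.DataFrame(y_resampled)], axis=1)
resampled_df.to_csv(os.path.join(dirname, 'resampled_dataset.csv'), index=False)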

The code applies the SMOTE algorithm to an imbalanced dataset using the imbalanced-learn library in Python. The dataset used in the example is a credit card fraud dataset, which is a common use case for SMOTE as the fraud class is often underrepresented.

Here are the steps:

1. Load the dataset from a CSV file using pandas.
2. Separate the features (X) and target variable (y).
3. Create an instance of the SMOTE class.
4. Call the fit_resample() method on the SMOTE instance, passing the features (X) and target variable (y) as arguments. This generates synthetic samples of the minority class (fraud) and balances the dataset.
5. Print the number of samples before and after SMOTE. The difference is the number of synthetic minority samples that were added.
6. Save the resampled dataset to a new CSV file, using pandas to write it and os.path.join() to build the path from the directory and file name.

Written by Mehmet Akif Cifci

Mehmet Akif Cifci holds the position of associate professor in the field of computer science in Austria.
