Decision trees
Decision trees are a prominent machine learning algorithm for classification and regression tasks. A decision tree is a tree-like model in which each internal node represents a decision or test on a feature, each branch reflects an outcome of that test, and each leaf represents the final prediction.
The decision tree algorithm divides the input data recursively into smaller subsets based on the values of the features. At each step, it selects the feature that gives the best split of the data according to an evaluation measure such as information gain or Gini impurity. The splitting process continues until a stopping criterion is met, such as when all data points in a subset belong to the same class or when a maximum depth or number of nodes is reached.
To make a prediction on new data, we follow the route from the root to a leaf node based on the values of the input features. The predicted class or value is the one associated with that leaf node.
Decision trees have several benefits: they can handle both categorical and continuous input features, they are easy to interpret, and they can capture non-linear relationships between the input features and the target variable. Nevertheless, if not adequately regularized, they are prone to overfitting the training data and may be less accurate than more sophisticated models on some tasks.
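Before building one from scratch, it is worth noting that libraries such as scikit-learn already implement decision trees. The short sketch below is only illustrative (the Iris dataset and the max_depth value are my own choices, not requirements); it fits a tree and prints the learned rules, which shows the interpretability mentioned above.

# Illustrative scikit-learn decision tree; dataset and hyperparameters are arbitrary choices
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Print the learned decision rules as indented if/else text
print(export_text(clf, feature_names=list(iris.feature_names)))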
Let's build a decision tree from scratch. To keep things simple, we will build one for a binary classification problem with only two input features.
Our goal is to build a decision tree that can predict the label (0 or 1) of a new data point based on its values for feature 1 and feature 2.
The first step in building a decision tree is to choose the best feature to split the data on. We can do this by calculating the information gain or Gini impurity for each feature. For simplicity, let’s use the Gini impurity as our evaluation metric.
The Gini impurity measures the probability of misclassifying a randomly chosen data point from the dataset, given the current node in the decision tree. A Gini impurity of 0 means that all data points in the node belong to the same class, while a Gini impurity of 0.5 means that the data points are evenly split between the two classes.
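As a quick worked example (the label counts here are my own, chosen only to illustrate the two extremes):

# Gini impurity of a node whose labels are evenly split, e.g. [0, 0, 1, 1]
print(1 - (0.5 ** 2 + 0.5 ** 2))  # 0.5, the worst case for two classes

# Gini impurity of a pure node, e.g. [1, 1, 1, 1]
print(1 - 1.0 ** 2)               # 0.0, all points belong to one class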
Let’s calculate the Gini impurity for each feature:
Since both features have the same Gini impurity, we can choose either feature to split the data on. Let’s choose feature 1.
We will split the data on feature 1 and create two child nodes. The left child node will contain the data points where feature 1 is 0, and the right child node will contain the data points where feature 1 is 1.
The left child node contains only two data points, and they have different labels. We cannot split this node further, so we create a leaf node that predicts the most common label in the node; since the two labels are tied, we default to 0.
The right child node contains two data points, one with label 0 and one with label 1. We can split this node further using feature 2.
We will start by defining a class Node to represent a node in the decision tree:
class Node:
    def __init__(self, data, depth):
        self.data = data
        self.depth = depth
        self.left = None
        self.right = None
        self.feature = None
        self.threshold = None
        self.label = None
The data parameter represents the subset of the training data that belongs to this node. The depth parameter represents the depth of the node in the decision tree.
We will also define a function gini_impurity to calculate the Gini impurity for a given subset of the training data:
import numpy as np

def gini_impurity(data):
    labels, counts = np.unique(data[:, -1], return_counts=True)
    probabilities = counts / len(data)
    return 1 - np.sum(probabilities ** 2)
The labels variable contains the unique labels in the data, and the counts variable contains the number of data points with each label. We calculate the probability of each label and use these probabilities to compute the Gini impurity.
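Note that the function expects the class label in the last column of the data array. A quick check on two hand-made subsets (arrays of my own, purely for illustration):

# Each row is [feature 1, feature 2, label]
pure = np.array([[1.0, 0.0, 0],
                 [2.0, 1.0, 0]])   # both points have label 0
mixed = np.array([[1.0, 0.0, 0],
                  [2.0, 1.0, 1]])  # labels split evenly between 0 and 1

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5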
Next, we will define a function best_split to find the best feature and threshold to split the data:
def best_split(data):
    best_feature = None
    best_threshold = None
    best_gain = 0
    for feature in range(data.shape[1] - 1):
        values = np.unique(data[:, feature])
        for threshold in values:
            left_data = data[data[:, feature] < threshold]
            right_data = data[data[:, feature] >= threshold]
            if len(left_data) == 0 or len(right_data) == 0:
                continue
            gain = gini_impurity(data) - (len(left_data) / len(data) * gini_impurity(left_data)
                                          + len(right_data) / len(data) * gini_impurity(right_data))
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
                best_threshold = threshold
    return best_feature, best_threshold
The values variable contains the unique values of the current feature in the data. We loop through each value as a candidate threshold and split the data into two subsets based on whether the feature value is less than, or greater than or equal to, the threshold. We calculate the information gain as the Gini impurity of the parent node minus the weighted Gini impurities of the left and right child nodes, and we select the feature and threshold that maximize this gain.
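Here is a quick check of best_split on a small hand-made dataset (the array below is an illustration of mine, with the label in the last column):

# Tiny dataset: two features plus a label column
toy = np.array([[1.0, 5.0, 0],
                [2.0, 6.0, 0],
                [7.0, 1.0, 1],
                [8.0, 2.0, 1]])

feature, threshold = best_split(toy)
print(feature, threshold)  # 0 7.0 -> splitting feature 0 at 7.0 separates the classes perfectly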
Finally, we will define a function build_tree to build the decision tree recursively:
def build_tree(data, max_depth, depth):
    node = Node(data, depth)
    if depth == max_depth or len(np.unique(data[:, -1])) == 1:
        # Leaf: predict the most common label (cast to int for bincount)
        node.label = np.bincount(data[:, -1].astype(int)).argmax()
        return node
    feature, threshold = best_split(data)
    if feature is None:
        node.label = np.bincount(data[:, -1].astype(int)).argmax()
        return node
    node.feature = feature
    node.threshold = threshold
    left_data = data[data[:, feature] < threshold]
    right_data = data[data[:, feature] >= threshold]
    node.left = build_tree(left_data, max_depth, depth + 1)
    node.right = build_tree(right_data, max_depth, depth + 1)
    return node
The max_depth parameter controls the maximum depth of the decision tree. If the current depth equals the maximum depth, if all data points in the node belong to the same class, or if no valid split can be found, we create a leaf node that predicts the most common label in the node.
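Since the functions above only build the tree, a small traversal helper is needed to actually make predictions. The predict_one sketch below is my own addition (its name and interface are assumptions, not part of the original code); it simply follows the root-to-leaf procedure described earlier:

def predict_one(node, x):
    # Follow the learned thresholds from the root down to a leaf
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label

# Example on a tiny two-feature dataset (last column is the label)
data = np.array([[1.0, 5.0, 0], [2.0, 6.0, 0], [7.0, 1.0, 1], [8.0, 2.0, 1]])
tree = build_tree(data, max_depth=2, depth=0)
print(predict_one(tree, np.array([1.5, 5.5])))  # 0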
The complete code is as follows:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define a class for the nodes of the decision tree
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
# Define a class for the decision tree
class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=100, n_feats=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.root = None

    # Define a function to calculate the Gini index
    def gini(self, y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return 1 - np.sum(p ** 2)

    # Define a function to calculate the information gain
    def information_gain(self, y, y1, y2):
        p = len(y1) / len(y)
        return self.gini(y) - p * self.gini(y1) - (1 - p) * self.gini(y2)

    # Define a function to find the best split
    def best_split(self, X, y):
        best_gain = -1
        best_feature = None
        best_threshold = None
        for feature in range(X.shape[1]):
            thresholds, values = zip(*sorted(zip(X[:, feature], y)))
            for i in range(1, len(thresholds)):
                if thresholds[i] != thresholds[i - 1]:
                    y1, y2 = values[:i], values[i:]
                    gain = self.information_gain(y, y1, y2)
                    if gain > best_gain:
                        best_gain, best_feature, best_threshold = gain, feature, thresholds[i]
        return best_gain, best_feature, best_threshold
    # Define a function to build the tree
    def build_tree(self, X, y, depth=0):
        largest_class = np.bincount(y).argmax()
        node = Node(value=largest_class)
        gain, feature, threshold = self.best_split(X, y)
        # Stop when no useful split exists or a stopping criterion is reached
        if feature is None or gain <= 0 or depth >= self.max_depth or len(y) < self.min_samples_split:
            return node
        indices_left = X[:, feature] < threshold
        X_left, y_left = X[indices_left], y[indices_left]
        X_right, y_right = X[~indices_left], y[~indices_left]
        node.feature = feature
        node.threshold = threshold
        node.left = self.build_tree(X_left, y_left, depth + 1)
        node.right = self.build_tree(X_right, y_right, depth + 1)
        return node
    # Define a function to fit the decision tree
    def fit(self, X, y):
        self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
        self.root = self.build_tree(X, y)
    # Define a function to predict the labels by walking the tree for each sample
    def predict(self, X):
        return np.array([self._traverse(x, self.root) for x in X])

    # Follow the learned thresholds from the root down to a leaf
    def _traverse(self, x, node):
        while node.left is not None:
            node = node.left if x[node.feature] < node.threshold else node.right
        return node.value
# Define a function to calculate the accuracy
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

# Define a function to train the decision tree
def train(X_train, y_train, X_test, y_test, max_depth):
    model = DecisionTree(max_depth=max_depth)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy(y_test, y_pred)
    return acc

# Train the decision tree
acc = train(X_train, y_train, X_test, y_test, max_depth=10)
print('Accuracy:', acc)
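To inspect what the model has learned, and to illustrate the interpretability discussed at the start of this section, a small recursive printer can be added. The print_tree helper below is a sketch of mine (not part of the listing above) and assumes the Node and DecisionTree classes defined there:

# Hypothetical helper: pretty-print the learned tree as nested if/else rules
def print_tree(node, indent=''):
    if node.left is None and node.right is None:
        print(f'{indent}predict class {node.value}')
        return
    print(f'{indent}if feature[{node.feature}] < {node.threshold:.2f}:')
    print_tree(node.left, indent + '    ')
    print(f'{indent}else:')
    print_tree(node.right, indent + '    ')

model = DecisionTree(max_depth=3)
model.fit(X_train, y_train)
print_tree(model.root)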