Decision trees
Decision trees are a prominent machine learning algorithm for classification and regression tasks. A decision tree is a tree-like model in which each internal node represents a decision or test on a feature, each branch reflects an outcome of that test, and each leaf represents the final prediction.
The decision tree algorithm divides the input data recursively into smaller subsets based on the values of the features. At each step, it selects the feature that gives the best split of the data according to an evaluation measure such as information gain or Gini impurity. The splitting process continues until a stopping criterion is met, such as when all data points in a subset belong to the same class or when a maximum depth or number of nodes is reached.
To make a prediction on new data, we follow the route from the root to a leaf node based on the values of the input features. The predicted class or value is the one associated with that leaf node.
Decision trees have several benefits: they can handle both categorical and continuous input features, they are easy to interpret, and they can capture non-linear relationships between the input features and the target variable. Nevertheless, if not adequately regularized, they are prone to overfitting the training data and may be less accurate than more sophisticated models on some tasks.
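Before building one from scratch, it is worth noting that libraries such as scikit-learn already implement decision trees. The short sketch below is only illustrative (the Iris dataset and the max_depth value are my own choices, not requirements); it fits a tree and prints the learned rules, which shows the interpretability mentioned above.

# Illustrative scikit-learn decision tree; dataset and hyperparameters are arbitrary choices
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Print the learned decision rules as indented if/else text
print(export_text(clf, feature_names=list(iris.feature_names)))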
Let's build a decision tree from scratch. To keep things simple, we will build one for a binary classification problem with only two input features.
Our goal is to build a decision tree that can predict the label (0 or 1) of a new data point based on its values for feature 1 and feature 2.
The first step in building a decision tree is to choose the best feature to split the data on. We can do this by calculating the information gain or Gini impurity for each feature. For simplicity, let’s use the Gini impurity as our evaluation metric.
The Gini impurity measures the probability of misclassifying a randomly chosen data point from the dataset, given the current node in the decision tree. A Gini impurity of 0 means that all data points in the node belong to the same class, while a Gini impurity of 0.5 means that the data points are evenly split between the two classes.
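As a quick worked example (the label counts here are my own, chosen only to illustrate the two extremes):

# Gini impurity of a node whose labels are evenly split, e.g. [0, 0, 1, 1]
print(1 - (0.5 ** 2 + 0.5 ** 2))  # 0.5, the worst case for two classes

# Gini impurity of a pure node, e.g. [1, 1, 1, 1]
print(1 - 1.0 ** 2)               # 0.0, all points belong to one class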
Let’s calculate the Gini impurity for each feature:
Since both features have the same Gini impurity, we can choose either feature to split the data on. Let’s choose feature 1.
We will split the data on feature 1 and create two child nodes. The left child node will contain the data points where feature 1 is 0, and the right child node will contain the data points where feature 1 is 1.
The left child node contains only two data points, and they have different labels. We cannot split this node further, so we create a leaf node that predicts the most common label in the node; since the two labels are tied, we default to 0.
The right child node contains two data points, one with label 0 and one with label 1. We can split this node further using feature 2.
We will start by defining a class Node to represent a node in the decision tree:
class Node:
    def __init__(self, data, depth):
        self.data = data
        self.depth = depth
        self.left = None
        self.right = None
        self.feature = None
        self.threshold = None
        self.label = None
The data parameter represents the subset of the training data that belongs to this node. The depth parameter represents the depth of the node in the decision tree.
We will also define a function gini_impurity to calculate the Gini impurity for a given subset of the training data:
import numpy as np

def gini_impurity(data):
    labels, counts = np.unique(data[:, -1], return_counts=True)
    probabilities = counts / len(data)
    return 1 - np.sum(probabilities ** 2)
The labels variable contains the unique labels in the data, and the counts variable contains the number of data points with each label. We calculate the probability of each label and use these probabilities to compute the Gini impurity.
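Note that the function expects the class label in the last column of the data array. A quick check on two hand-made subsets (arrays of my own, purely for illustration):

# Each row is [feature 1, feature 2, label]
pure = np.array([[1.0, 0.0, 0],
                 [2.0, 1.0, 0]])   # both points have label 0
mixed = np.array([[1.0, 0.0, 0],
                  [2.0, 1.0, 1]])  # labels split evenly between 0 and 1

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5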
Next, we will define a function best_split to find the best feature and threshold to split the data:
def best_split(data):
    best_feature = None
    best_threshold = None
    best_gain = 0
    for feature in range(data.shape[1] - 1):
        values = np.unique(data[:, feature])
        for threshold in values:
            left_data = data[data[:, feature] < threshold]
            right_data = data[data[:, feature] >= threshold]
            if len(left_data) == 0 or len(right_data) == 0:
                continue
            gain = gini_impurity(data) - (len(left_data) / len(data) * gini_impurity(left_data)
                                          + len(right_data) / len(data) * gini_impurity(right_data))
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
                best_threshold = threshold
    return best_feature, best_threshold
The values variable contains the unique values of the current feature in the data. We loop through each value as a candidate threshold and split the data into two subsets based on whether the feature value is less than, or greater than or equal to, the threshold. We calculate the information gain as the Gini impurity of the parent node minus the weighted Gini impurities of the left and right child nodes, and we select the feature and threshold that maximize this gain.
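Here is a quick check of best_split on a small hand-made dataset (the array below is an illustration of mine, with the label in the last column):

# Tiny dataset: two features plus a label column
toy = np.array([[1.0, 5.0, 0],
                [2.0, 6.0, 0],
                [7.0, 1.0, 1],
                [8.0, 2.0, 1]])

feature, threshold = best_split(toy)
print(feature, threshold)  # 0 7.0 -> splitting feature 0 at 7.0 separates the classes perfectly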
Finally, we will define a function build_tree to build the decision tree recursively:
def build_tree(data, max_depth, depth):
    node = Node(data, depth)
    if depth == max_depth or len(np.unique(data[:, -1])) == 1:
        # Leaf: predict the most common label (cast to int for bincount)
        node.label = np.bincount(data[:, -1].astype(int)).argmax()
        return node
    feature, threshold = best_split(data)
    if feature is None:
        node.label = np.bincount(data[:, -1].astype(int)).argmax()
        return node
    node.feature = feature
    node.threshold = threshold
    left_data = data[data[:, feature] < threshold]
    right_data = data[data[:, feature] >= threshold]
    node.left = build_tree(left_data, max_depth, depth + 1)
    node.right = build_tree(right_data, max_depth, depth + 1)
    return node
The max_depth parameter controls the maximum depth of the decision tree. If the current depth equals the maximum depth, if all data points in the node belong to the same class, or if no valid split can be found, we create a leaf node that predicts the most common label in the node.
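Since the functions above only build the tree, a small traversal helper is needed to actually make predictions. The predict_one sketch below is my own addition (its name and interface are assumptions, not part of the original code); it simply follows the root-to-leaf procedure described earlier:

def predict_one(node, x):
    # Follow the learned thresholds from the root down to a leaf
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label

# Example on a tiny two-feature dataset (last column is the label)
data = np.array([[1.0, 5.0, 0], [2.0, 6.0, 0], [7.0, 1.0, 1], [8.0, 2.0, 1]])
tree = build_tree(data, max_depth=2, depth=0)
print(predict_one(tree, np.array([1.5, 5.5])))  # 0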
The complete code is as follows:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define a class for the nodes of the decision tree
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
# Define a class for the decision tree
class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=100, n_feats=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.root = None

    # Define a function to calculate the Gini index
    def gini(self, y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return 1 - np.sum(p ** 2)

    # Define a function to calculate the information gain
    def information_gain(self, y, y1, y2):
        p = len(y1) / len(y)
        return self.gini(y) - p * self.gini(y1) - (1 - p) * self.gini(y2)

    # Define a function to find the best split
    def best_split(self, X, y):
        best_gain = -1
        best_feature = None
        best_threshold = None
        for feature in range(X.shape[1]):
            thresholds, values = zip(*sorted(zip(X[:, feature], y)))
            for i in range(1, len(thresholds)):
                if thresholds[i] != thresholds[i - 1]:
                    y1, y2 = values[:i], values[i:]
                    gain = self.information_gain(y, y1, y2)
                    if gain > best_gain:
                        best_gain, best_feature, best_threshold = gain, feature, thresholds[i]
        return best_gain, best_feature, best_threshold
    # Define a function to build the tree
    def build_tree(self, X, y, depth=0):
        largest_class = np.bincount(y).argmax()
        node = Node(value=largest_class)
        gain, feature, threshold = self.best_split(X, y)
        # Stop when no useful split exists or a stopping criterion is reached
        if feature is None or gain <= 0 or depth >= self.max_depth or len(y) < self.min_samples_split:
            return node
        indices_left = X[:, feature] < threshold
        X_left, y_left = X[indices_left], y[indices_left]
        X_right, y_right = X[~indices_left], y[~indices_left]
        node.feature = feature
        node.threshold = threshold
        node.left = self.build_tree(X_left, y_left, depth + 1)
        node.right = self.build_tree(X_right, y_right, depth + 1)
        return node
    # Define a function to fit the decision tree
    def fit(self, X, y):
        self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
        self.root = self.build_tree(X, y)
    # Define a function to predict the labels by walking the tree for each sample
    def predict(self, X):
        return np.array([self._traverse(x, self.root) for x in X])

    # Follow the learned thresholds from the root down to a leaf
    def _traverse(self, x, node):
        while node.left is not None:
            node = node.left if x[node.feature] < node.threshold else node.right
        return node.value
# Define a function to calculate the accuracy
def accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

# Define a function to train the decision tree
def train(X_train, y_train, X_test, y_test, max_depth):
    model = DecisionTree(max_depth=max_depth)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy(y_test, y_pred)
    return acc

# Train the decision tree
acc = train(X_train, y_train, X_test, y_test, max_depth=10)
print('Accuracy:', acc)
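To inspect what the model has learned, and to illustrate the interpretability discussed at the start of this section, a small recursive printer can be added. The print_tree helper below is a sketch of mine (not part of the listing above) and assumes the Node and DecisionTree classes defined there:

# Hypothetical helper: pretty-print the learned tree as nested if/else rules
def print_tree(node, indent=''):
    if node.left is None and node.right is None:
        print(f'{indent}predict class {node.value}')
        return
    print(f'{indent}if feature[{node.feature}] < {node.threshold:.2f}:')
    print_tree(node.left, indent + '    ')
    print(f'{indent}else:')
    print_tree(node.right, indent + '    ')

model = DecisionTree(max_depth=3)
model.fit(X_train, y_train)
print_tree(model.root)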