START-UP GUIDE TO MACHINE LEARNING
Machine learning combines traditional mathematics with modern computing power to discover patterns in a data set. It aims to build an algorithm that exploits these patterns to perform a defined task. In supervised machine learning, the goal may be to develop a model that identifies which category or class a set of inputs belongs to, or that predicts a continuous value such as the price of a house.
1. Features
In machine learning, the inputs a model learns from are called features. Features are a set of attributes that describe a data point. The “Boston housing prices” dataset is a well-known machine learning practice problem. In it, various features help determine a home’s value, including its age, average number of rooms, and property tax assessments.
Figure 1. Boston housing prices (https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTk1J7fEsqqms8OxpqtBqo69MaJY579T8y-VQ&usqp=CAU)
For a machine learning model to succeed at its task, at least some of these features must be statistically correlated with the house price.
2. Feature selection and engineering
One crucial stage in building machine learning models is optimization. Training the model on the best available features is one way to help it perform well.
It is not always useful to include every feature. Some features may have no meaningful statistical link to the variable we want to predict, while others may be highly correlated with each other and so add little new information. Both kinds of feature introduce noise into the training phase and can harm model performance. The process of picking the best features to include in the training phase is known as ‘feature selection.’
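As a rough illustration of feature selection, the sketch below keeps features that correlate with the target and skips features that nearly duplicate one already kept. The helper names, the thresholds (0.3 and 0.95), and the toy data are all assumptions made for this example, and correlation filtering is only one of several possible selection techniques.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)

def select_features(features, target, min_relevance=0.3, max_redundancy=0.95):
    """Keep features linked to the target but not near-duplicates of each other.

    Both thresholds are illustrative defaults, not recommended values.
    """
    # Consider the features most strongly correlated with the target first.
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    selected = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_relevance:
            continue  # weak statistical link to the target
        if any(
            abs(pearson(features[name], features[kept])) > max_redundancy
            for kept in selected
        ):
            continue  # nearly duplicates a feature that is already kept
        selected.append(name)
    return selected

# Invented toy data: "rooms_doubled" is redundant, "noise" is barely related.
features = {
    "rooms": [4, 5, 6, 7, 8, 6],
    "rooms_doubled": [8, 10, 12, 14, 16, 12],
    "noise": [3, 1, 3, 1, 2, 2],
}
prices = [20, 24, 31, 35, 41, 29]

selected = select_features(features, prices)
print(selected)
```

On this toy data only one of the two room-based columns survives, and the weakly related column is dropped.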
Similarly, features in their raw form may not provide enough useful information to train a performant model. Some, such as date/time-based features, cannot be used in their raw form at all. Because a machine learning model cannot directly use a date or timestamp, we must first derive useful features from it. We may use parts of a date in integer form, such as the month, day, or week number, or compute the difference between two dates, to create patterns the algorithm can make sense of. This practice is known as feature engineering.
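The date example above can be sketched with Python's standard datetime module. The function name, the feature names, and the two dates are hypothetical and chosen only for illustration.

```python
from datetime import date

def date_features(sale_date, built_date):
    """Turn two raw dates into integer features a model can consume."""
    return {
        "sale_month": sale_date.month,
        "sale_day": sale_date.day,
        "sale_week": sale_date.isocalendar()[1],  # ISO week number
        "age_days": (sale_date - built_date).days,  # difference between dates
    }

# Hypothetical house: built in March 2001, sold exactly twenty years later.
example = date_features(date(2021, 3, 15), date(2001, 3, 15))
print(example)
```

Each value in the resulting dictionary is a plain integer, which is a form most algorithms can work with directly.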
3. Labels
Supervised machine learning requires labeled data, meaning data in which each set of features has a matching label. These labels may represent either a discrete value (such as “cat” or “dog”) or a continuous value, as in the Boston housing price data set, where the label is the price. When creating machine learning models, the features are commonly referred to as X and the label as y.
Figure 2. Labels (https://cdn.thenewstack.io/media/2017/04/5aa7a227-ml-3-fl.png)
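To make the X and y convention concrete, here is a minimal sketch that splits a few rows into features and labels. The numbers are invented for illustration and are not taken from the actual Boston housing data.

```python
# Each row: (age in years, average rooms, property tax, price).
# All values are invented for this example.
rows = [
    (65.0, 6.5, 296.0, 24.0),
    (78.0, 6.4, 242.0, 21.6),
    (45.0, 7.1, 222.0, 34.7),
]

X = [list(row[:-1]) for row in rows]  # features: everything except the last column
y = [row[-1] for row in rows]         # label: the price

print(X[0], y[0])
```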
4. Training
In supervised machine learning, algorithms need labeled data to ‘learn’ patterns. If learning succeeds, the resulting model can reliably predict labels for new, unlabeled data.
In the machine learning process, this phase of learning is known as the training phase. After this phase, you have a model that can be used to predict labels or values for new unlabeled data. The training phase is sometimes referred to as fitting a model.
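As a small sketch of what fitting means, the class below fits a straight line to one feature using ordinary least squares and then predicts a value for an unseen input. The class name and the data are assumptions for this example; the data is deliberately exactly linear so the result is easy to check, which real data rarely is.

```python
class SimpleLinearRegression:
    """One-feature linear model fitted with ordinary least squares."""

    def fit(self, xs, ys):
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, xs):
        return [self.intercept + self.slope * x for x in xs]

# Invented training data that follows price = 5 * rooms - 5 exactly.
rooms = [4, 5, 6, 7, 8]
prices = [15, 20, 25, 30, 35]

model = SimpleLinearRegression().fit(rooms, prices)  # the training phase
predicted = model.predict([6.5])                     # a new, unlabeled input
print(predicted)
```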
5. Tuning
Feature selection, described above, is one optimization process. Tuning is another: it involves optimizing an algorithm’s parameters to find the most effective combination for your particular data set.
All machine learning models have parameters that can be configured in various ways. A random forest model, for example, has many configurable parameters, one of which is the number of trees in the forest, set with n_estimators. Typically, the more trees, the better the outcomes, but beyond a certain point the benefit of adding more trees diminishes (and this point changes depending on the data set). Finding the appropriate number of trees for your data set is one way to tune a random forest algorithm.
Each algorithm has many configurable parameters, and each parameter has a potentially enormous number of possible values. Fortunately, there are automated approaches to finding a good combination of these parameters, known as hyperparameter optimization.
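One simple automated approach is a grid search, which tries every combination of candidate values and keeps the best-scoring one. To stay dependency-free, the sketch below swaps in a tiny k-nearest-neighbors regressor rather than a random forest, so the parameter k stands in for the number of trees. All function names and data are assumptions made for this example.

```python
from itertools import product

def knn_predict(train, x, k):
    """Average the labels of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(label for _, label in nearest) / k

def mean_abs_error(train, valid, k):
    """Mean absolute error of k-NN predictions on held-out points."""
    errors = [abs(knn_predict(train, x, k) - y) for x, y in valid]
    return sum(errors) / len(errors)

def grid_search(param_grid, score):
    """Score every parameter combination; lower scores are better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Invented data: (feature, label) pairs, with labels roughly 2 * feature.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
         (5, 10.1), (6, 12.0), (7, 13.8), (8, 16.1)]
valid = [(2.5, 5.0), (4.5, 9.0), (6.5, 13.0)]

candidate_ks = [1, 2, 3, 5]
best_params, best_score = grid_search(
    {"k": candidate_ks},
    lambda k: mean_abs_error(train, valid, k),
)
print(best_params, best_score)
```

A grid search becomes expensive as the number of parameters and candidate values grows, which is one reason other search strategies exist.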
6. Validation
Once a model is constructed, it must be tested to see whether it is up to the task at hand. In our example data, we will want to examine how accurately the model can predict the price of a house. The best performance measure depends on the problem at hand, and it must be chosen before any work begins.
We usually divide the dataset we’re working with into two parts before beginning a machine learning project. One part is used for training the model, and the other is reserved for the testing phase. In machine learning, this testing step is often referred to as validation. We use the model to make predictions on the reserved test data set and compute the chosen performance metric to determine how effectively the model performs the given task.
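The split-then-score routine can be sketched as follows. The helper names, the 25% test fraction, and the data are assumptions for this example, and mean absolute error is just one possible metric. The "model" here is a deliberately naive baseline that always predicts the mean training price, not a real learning algorithm.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and reserve a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the label's own units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Invented (rooms, price) rows.
rows = [(4, 18.0), (5, 21.5), (5, 23.0), (6, 26.0),
        (6, 27.5), (7, 31.0), (7, 33.5), (8, 38.0)]

train, test = train_test_split(rows)

# Naive baseline: predict the mean training price for every test house.
mean_price = sum(price for _, price in train) / len(train)
predictions = [mean_price for _ in test]

mae = mean_absolute_error([price for _, price in test], predictions)
print(len(train), len(test), mae)
```

Because the test rows are never seen while computing the baseline, the error measured on them gives a fairer picture of performance on new data.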