Machine Learning : Workflow

This post gives a brief introduction to a typical workflow for machine learning models and to the most widely used R packages, before later posts dive into the details of each model.



Machine Learning : Suggested Workflow



Given a problem to be solved, all machine learning (ML) models use the same input but produce different outputs. It is therefore useful to understand a workflow that is common to ML models. There is no single workflow but a variety of them, so we introduce one of them here.


Sample Splitting


Constructing an ML model starts with sample splitting. The most commonly used technique is K-fold cross validation with random shuffling. For time-series or panel data, K-fold cross validation without random shuffling is used to preserve the temporal sequence (future data cannot be used as a predictor of past data). This method is called K-fold forward chaining cross validation, or forward chaining for short. The two cross validation schemes are illustrated in the following figures.
[Figure: cross validation in machine learning]
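
As a rough illustration of the two splitting schemes, the following base R sketch builds the row indices for each fold. The helper names `kfold_shuffled()` and `forward_chaining()` are hypothetical (they are not from any package), and the forward chaining shown here is one common variant that assumes the rows are already sorted by time.

```r
# Sketch only: the helper names below are hypothetical, not from any package.

# K-fold cross validation with random shuffling.
# Returns a list of K folds, each with train and validation row indices.
kfold_shuffled <- function(n, k, seed = 1) {
  set.seed(seed)                       # make the shuffle reproducible
  shuffled <- sample(n)                # random permutation of 1..n
  fold_id  <- rep_len(seq_len(k), n)   # assign folds 1..k round-robin
  lapply(seq_len(k), function(i) {
    list(train      = sort(shuffled[fold_id != i]),
         validation = sort(shuffled[fold_id == i]))
  })
}

# Forward chaining: rows are assumed to be sorted by time.
# The data are cut into k + 1 consecutive blocks; fold i trains on
# blocks 1..i and validates on block i + 1, so no future row is used
# to predict the past.
forward_chaining <- function(n, k) {
  block_id <- sort(rep_len(seq_len(k + 1), n))  # consecutive blocks
  lapply(seq_len(k), function(i) {
    list(train      = which(block_id <= i),
         validation = which(block_id == i + 1))
  })
}

cv_random <- kfold_shuffled(n = 20, k = 5)
cv_time   <- forward_chaining(n = 20, k = 4)
```

In `cv_random` every row appears in exactly one validation set, while in `cv_time` the training window grows fold by fold and always ends before the validation block begins.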

Workflow of Machine Learning


Although there are many alternatives for each step, most ML models have the following workflow in common.
[Figure: Workflow of Machine Learning]


In the above workflow, the feature selection (variable selection) stage can be carried out independently by using LASSO. This means that candidate ML models use the same input if feature selection is done independently, whereas each ML model has similar but not exactly the same explanatory variables if feature selection is done jointly within the cross validation process.

We prefer the former approach because independent feature selection reduces the computational burden and we expect little difference in the feature selection results among candidate ML models. Of course, if a research topic centers on the difference in feature importance among candidate ML models, the latter approach is more appropriate.
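
Independent feature selection with LASSO might look like the sketch below. It assumes that the `glmnet` package is installed and that the response is binary; `select_features_lasso()` and the simulated data are hypothetical and serve only as an illustration. The choice of `lambda.1se` is likewise an assumption, not a recommendation from this post.

```r
library(glmnet)

# Sketch only: select_features_lasso() is a hypothetical helper.
# x: numeric matrix of candidate features, y: 0/1 response vector.
select_features_lasso <- function(x, y, nfolds = 5) {
  # alpha = 1 is the LASSO penalty; lambda is chosen by cross validation
  cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = nfolds)
  beta   <- as.matrix(coef(cv_fit, s = "lambda.1se"))
  keep   <- rownames(beta)[beta[, 1] != 0]   # nonzero coefficients
  setdiff(keep, "(Intercept)")
}

# Simulated data: only x1 and x2 drive the response.
set.seed(42)
n <- 200
x <- matrix(rnorm(n * 10), n, 10,
            dimnames = list(NULL, paste0("x", 1:10)))
y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2]))

selected   <- select_features_lasso(x, y)
x_selected <- x[, selected, drop = FALSE]   # same input for every candidate model
```

Because the selection runs once, outside the cross validation loop, every candidate ML model is then trained on the same `x_selected`.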


Hyperparameters and R packages


R provides many ML packages, which are updated irregularly. For the selected ML models we use representative, time-tested and widely used R packages, as summarized below.
[Table: hyperparameters of Machine Learning R packages]
Here, the selected ML models are Logistic Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), Gradient Boosting (GBoost) and Extreme Gradient Boosting (XGBoost). The numerical values for the hyperparameters of each ML model are presented as examples and are not absolute.
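
To show how candidate hyperparameter values are typically organized, here is a minimal base R sketch of a grid search. It borrows the argument names `mtry` and `ntree` from the `randomForest` package as an assumption; the values are illustrative only, and `tune_grid()` and `dummy_score()` are hypothetical stand-ins for a real cross validated score.

```r
# Illustrative grid: every combination of the candidate values.
rf_grid <- expand.grid(
  mtry  = c(2, 4, 6),     # variables tried at each split
  ntree = c(100, 500)     # number of trees
)

# Hypothetical helper: evaluate each row of the grid with score_fun
# and return the row with the highest score.
tune_grid <- function(grid, score_fun) {
  scores <- vapply(seq_len(nrow(grid)),
                   function(i) score_fun(grid[i, , drop = FALSE]),
                   numeric(1))
  grid[which.max(scores), , drop = FALSE]
}

# Stand-in score for demonstration; in practice this would be the
# cross validated performance of the model fitted with these values.
dummy_score <- function(p) -abs(p$mtry - 4) - abs(p$ntree - 500) / 1000

best <- tune_grid(rf_grid, dummy_score)
```

Replacing `dummy_score()` with a function that fits the model on each training fold and averages the validation metric turns this into an ordinary grid search.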


Concluding Remarks


Based on this workflow, we are going to investigate each ML model and implement it step by step with R ML packages in a series of upcoming posts. \(\blacksquare\)

