cross validation

In machine learning (ML), cross validation is a method for estimating how well an ML model performs on unseen data, i.e. data that was not used to train it. The available dataset is split into multiple subsets, often called folds. One subset is held out as the validation subset, while the remaining subsets are combined into the training subset on which the model is fitted.
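The splitting step can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper names `k_fold_indices` and `train_validation_splits` are hypothetical, the data is addressed by integer index, and the folds are built by shuffling once and dealing indices out in turn.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split the indices 0..n_samples-1 into k folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_validation_splits(n_samples, k, seed=0):
    """Yield one (train_indices, validation_indices) pair per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        # All folds except fold i are merged into the training subset.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(train_validation_splits(n_samples=10, k=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))
```

Each sample lands in the validation subset exactly once and in the training subset in every other iteration.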

The process is repeated over multiple iterations, each time with a different subset held out as the validation subset. The model's generalization performance is then estimated as the average of the validation results from all iterations. Cross validation is useful for detecting overfitting and for checking that an ML model generalizes well to new data, since the estimate does not depend on a single, possibly lucky, split. Common variants include k-fold cross validation, leave-one-out cross validation (k-fold with one sample per fold), stratified cross validation (folds that preserve the class proportions) and holdout validation (a single train/validation split).
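As a rough sketch of the repeat-and-average procedure, the example below fits a one-dimensional least-squares line on each training subset and averages the validation error over the folds. The synthetic data, the function names (`fit_line`, `mse`, `cross_val_mse`) and the choice of mean squared error as the metric are assumptions made for illustration only. Setting k equal to the number of samples gives leave-one-out cross validation.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on one-dimensional data."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given samples."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in zip(xs, ys))


def cross_val_mse(xs, ys, k, seed=0):
    """Return (average validation MSE, list of per-fold validation MSEs)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        # Fit on the training subset only, then score on the held-out fold.
        model = fit_line([xs[t] for t in train], [ys[t] for t in train])
        scores.append(mse(model, [xs[v] for v in val], [ys[v] for v in val]))
    return mean(scores), scores


# Synthetic data (an assumption for this sketch): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

avg_5fold, fold_scores = cross_val_mse(xs, ys, k=5)
avg_loo, loo_scores = cross_val_mse(xs, ys, k=len(xs))  # leave-one-out
print("5-fold average MSE:", avg_5fold)
print("leave-one-out average MSE:", avg_loo)
```

The single number reported at the end is the average over folds; the per-fold scores are worth inspecting too, because a large spread suggests the estimate is sensitive to how the data was split.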

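Cross validation is commonly used to tune hyperparameters, such as the strength of L1 or L2 regularization: each candidate value is evaluated with cross validation and the one with the best average validation result is selected. Regularization itself adds a penalty term to the cost function of an ML model; cross validation does not minimize the cost function but helps choose how strong that penalty should be. Splitting a dataset into batches for gradient descent optimization looks superficially similar, but it serves a different purpose: batches are used to compute parameter updates during training, not to evaluate the model on held-out data. A good analysis of cross validation best practices in machine learning is provided in the following article: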
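To illustrate the use of cross validation for choosing a regularization strength, here is a minimal sketch that selects an L2 penalty for a single-weight model y = w * x. The data, the candidate values and the function names (`fit_ridge_1d`, `cv_error`) are all assumptions made for this example; a real project would typically rely on an established ML library instead.

```python
import random
from statistics import mean


def fit_ridge_1d(xs, ys, lam):
    """Minimise sum((y - w * x) ** 2) + lam * w ** 2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_error(xs, ys, lam, k=5, seed=0):
    """Average validation MSE over k folds for one value of lam."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        w = fit_ridge_1d([xs[t] for t in train], [ys[t] for t in train], lam)
        errors.append(mean((w * xs[v] - ys[v]) ** 2 for v in val))
    return mean(errors)


# Synthetic data (an assumption for this sketch): y = 3x plus Gaussian noise.
rng = random.Random(7)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Hypothetical candidate penalties; the same folds are reused for each one
# so that the comparison between candidates is fair.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
cv_errors = {lam: cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)
print("selected L2 strength:", best_lam)
```

Once a value has been selected, the model is usually refitted on the full training data with that value before being used.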
