stratified cross validation

Stratified cross validation is a model validation technique in which an ML dataset is split into k subsets (folds): k-1 folds are used for training and the remaining one (1) is used for testing. This process is repeated k times, so that each fold serves as the test fold exactly once. Stratified cross validation uses stratified sampling to split the dataset in such a way that each class (e.g. the two classes of a binary classification problem) has approximately the same proportional representation in both the training and the test subsets in each iteration. For example, assuming k=5, in a dataset of 100 samples with two classes (each sample is classified as either "car" or "airplane"), if we have 80 cars and 20 airplanes, then in each iteration stratified sampling places 80% of the cars (64 cars) and 80% of the airplanes (16 airplanes) in the training subset. The remaining 20% of the cars (16 cars) and 20% of the airplanes (4 airplanes) form the test subset. This way, the 80/20 ratio of cars to airplanes is preserved in both the training and the test subsets. Stratified cross validation is commonly used in machine learning classification problems, particularly when the classes are imbalanced, because a plain random split can leave a fold with too few samples of a minority class.
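The splitting procedure can be illustrated with a minimal sketch in plain Python. The function name stratified_k_fold and the round-robin assignment are assumptions made for this illustration, not a reference implementation. The sketch uses the 80 cars / 20 airplanes dataset with k=5, so that each test fold holds 20% of the samples. It does not shuffle the data, which a practical implementation would normally do first.

```python
from collections import Counter, defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) pairs for k stratified folds.

    Hypothetical helper for illustration only: no shuffling, and fold
    sizes may differ slightly when a class size is not divisible by k.
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's indices round-robin into the k folds so that
    # every fold receives roughly the same share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    # Each fold is the test fold once; the other k-1 folds form the training set.
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            index for j, fold in enumerate(folds) if j != i for index in fold
        )
        yield train, test


# 80 cars and 20 airplanes, split into k=5 folds.
labels = ["car"] * 80 + ["airplane"] * 20
splits = list(stratified_k_fold(labels, k=5))

for train, test in splits:
    train_counts = Counter(labels[i] for i in train)
    test_counts = Counter(labels[i] for i in test)
    print(dict(train_counts), dict(test_counts))
```

In each of the five iterations the training subset contains 64 cars and 16 airplanes and the test subset contains 16 cars and 4 airplanes. Libraries such as scikit-learn offer a ready-made splitter (StratifiedKFold) that would usually be preferred over a hand-written one.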
