Gradient descent

Gradient descent is an optimization method for minimizing a machine learning model's cost function, most familiarly in linear regression models. In the gradient descent method, the (internal) parameters of the ML model are tuned over several training iterations by taking gradual "steps" down the slope of the cost function, moving toward a minimum error value. The size of each "step" down the slope is controlled by a hyperparameter called the learning rate.
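The update rule described above can be sketched in Python/NumPy for a simple linear model. The function name and defaults here are illustrative, not from the source; the cost being minimized is assumed to be mean squared error.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iters=1000):
    """Minimize the mean squared error of a linear model y ~ X @ w + b."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y
        # Gradients of the MSE cost with respect to w and b
        grad_w = (2 / n_samples) * (X.T @ error)
        grad_b = (2 / n_samples) * error.sum()
        # Each "step" down the slope is scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

On noiseless data generated by y = 3x + 1, the returned parameters approach w = 3 and b = 1 as the iterations proceed.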

The following variants are commonly used under the gradient descent method.

  • Batch gradient descent (BGD). BGD uses the entire training dataset to calculate the gradient at each step. It is not recommended for large datasets, because every step requires a full pass over the data, making training slow.
  • Stochastic gradient descent (SGD). SGD selects a single data example at random and computes its gradient at every step. Each step is much cheaper than in BGD, though the gradient estimates are noisier.
  • Stochastic average gradient (SAG). SAG is similar to SGD, except that SAG retains a "memory" of the previously computed gradients. This enables it to converge faster than SGD. However, it only supports L2 regularization and may present performance issues on datasets with a large number of examples.
  • Mini-batch gradient descent (MBGD). MBGD strikes a balance between BGD and SGD. It selects a small group of examples at random and steps in the direction of the average gradient over the mini-batch. This generally yields smoother, more stable convergence than SGD while remaining cheaper per step than BGD.
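The mini-batch variant above can be sketched as follows; this is a minimal illustration (names and defaults are assumptions, not from the source). Note that setting the batch size to 1 recovers SGD, and setting it to the dataset size recovers BGD.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=2, learning_rate=0.1, n_epochs=200, seed=0):
    """Mini-batch gradient descent for a linear model y ~ X @ w + b.

    batch_size=1 behaves like SGD; batch_size=len(X) behaves like BGD.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # shuffle examples each epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            error = Xb @ w + b - yb
            # Step in the direction of the average gradient over the mini-batch
            w -= learning_rate * (2 / len(idx)) * (Xb.T @ error)
            b -= learning_rate * (2 / len(idx)) * error.sum()
    return w, b
```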

Comparing gradient descent with regularization

Gradient descent and its variants are internal optimization methods for regression algorithms, whereas regularization is a technique for tackling the overfitting problem by adding a penalty term, scaled by a hyperparameter, to the ML algorithm's cost function. Regularization is primarily used to overcome overfitting, whereas gradient descent can be used in all sorts of regression scenarios. Regularization methods (L1, L2) are generally used together with gradient descent optimization, so gradient descent and regularization are not mutually exclusive concepts.
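How the two combine can be sketched with an L2 (ridge) penalty added to the MSE cost from the gradient descent example; the function name and the choice to leave the bias unpenalized are assumptions for illustration.

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=0.1, learning_rate=0.1, n_iters=1000):
    """Gradient descent on an L2-regularized cost:
    cost = MSE + alpha * ||w||^2, where alpha is the penalty hyperparameter."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y
        # The L2 penalty contributes 2 * alpha * w to the weight gradient;
        # the bias term is conventionally left unpenalized
        grad_w = (2 / n_samples) * (X.T @ error) + 2 * alpha * w
        grad_b = (2 / n_samples) * error.sum()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

With alpha = 0 this reduces to plain gradient descent; a positive alpha shrinks the learned weights toward zero, which is the mechanism by which L2 regularization combats overfitting.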

Related Cloud terms