
We study gradient descent with backtracking line search (GD-BLS) for solving the noisy optimization problem $\theta_\star := \operatorname{argmin}_{\theta \in \mathbb{R}^d} \mathbb{E}[f(\theta, Z)]$, imposing that the objective function $F(\theta) := \mathbb{E}[f(\theta, Z)]$ is strictly convex but not necessarily $L$-smooth. Assuming that $\mathbb{E}[\|\nabla_\theta f(\theta_\star, Z)\|^2] < \infty$, we first prove that sample average approximation based on GD-BLS allows us to estimate $\theta_\star$ with an error of size $\mathcal{O}_\mathbb{P}(B^{-0.25})$, where $B$ is the available computational budget. We then show that we can improve upon this rate by stopping the optimization process earlier, when the gradient of the objective function is sufficiently close to zero, and using the residual computational budget to optimize, again with GD-BLS, a finer approximation of $F$. By iteratively applying this strategy $J$ times, we establish that we can estimate $\theta_\star$ with an error of size $\mathcal{O}_\mathbb{P}(B^{-\frac{1}{2}(1-\delta^J)})$, where $\delta \in (1/2, 1)$ is a user-specified parameter. More generally, we show that if $\mathbb{E}[\|\nabla_\theta f(\theta_\star, Z)\|^{1+\alpha}] < \infty$ for some known $\alpha \in (0, 1]$, then this approach, which can be seen as a retrospective approximation algorithm with a fixed computational budget, allows us to learn $\theta_\star$ with an error of size $\mathcal{O}_\mathbb{P}(B^{-\frac{\alpha}{1+\alpha}(1-\delta^J)})$, where $\delta \in (2\alpha/(1+3\alpha), 1)$ is a tuning parameter.
Beyond knowing $\alpha$, achieving the aforementioned convergence rates does not require tuning the algorithms' parameters to the specific functions $F$ and $f$ at hand, and we exhibit a simple noisy optimization problem for which stochastic gradient descent is not guaranteed to converge while the algorithms discussed in this work are.
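To illustrate the building block used throughout, the following is a minimal sketch of sample average approximation combined with gradient descent and Armijo backtracking line search. The function names, line-search constants (`c`, `rho`), stopping tolerance, and toy objective $f(\theta, z) = (\theta - z)^2/2$ are illustrative assumptions, not the paper's exact algorithm or parameter choices.

```python
import numpy as np

def gd_backtracking(F, grad_F, theta0, max_iters=200, c=1e-4, rho=0.5, tol=1e-8):
    """Gradient descent with Armijo backtracking line search (illustrative sketch).

    Stops early once the gradient norm falls below `tol`, mirroring the idea of
    halting when the gradient is sufficiently close to zero.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_F(theta)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        # Shrink the step size until the Armijo sufficient-decrease condition holds.
        while F(theta - t * g) > F(theta) - c * t * np.dot(g, g):
            t *= rho
        theta = theta - t * g
    return theta

# Sample average approximation: replace F(theta) = E[f(theta, Z)] by an
# empirical average over n draws of Z. Here f(theta, z) = (theta - z)^2 / 2,
# so the minimizer of F is E[Z] and the SAA minimizer is the sample mean.
rng = np.random.default_rng(0)
Z = rng.normal(size=500)
F_hat = lambda th: 0.5 * np.mean((th[0] - Z) ** 2)
grad_hat = lambda th: np.array([np.mean(th[0] - Z)])

theta_hat = gd_backtracking(F_hat, grad_hat, theta0=np.array([5.0]))
```

In the retrospective scheme described above, one would rerun this inner solver on successively finer empirical approximations of $F$ (larger samples of $Z$), warm-starting each run at the previous estimate, until the budget $B$ is exhausted.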

Original publication

DOI

10.1090/mcom/4103

Type

Journal article

Journal

Mathematics of Computation

Publisher

American Mathematical Society (AMS)

Publication Date

09/06/2025