Upper Confidence Bound


The Upper Confidence Bound (UCB) family of algorithms, used in machine learning and statistics, addresses the multi-armed bandit problem and the exploration–exploitation trade-off it embodies. UCB methods select actions by computing an optimistic upper confidence bound on each action's expected reward, balancing the exploration of poorly understood options against the exploitation of options that have performed well so far. Confidence-bound methods for stochastic bandits originated in the research of Tze Leung Lai and Herbert Robbins in 1985, with the first UCB algorithm developed by Lai in 1987. UCB and its variants have become standard methods in reinforcement learning, online advertising, recommender systems, clinical trials, and Monte Carlo tree search.

Background

The multi-armed bandit problem models a scenario where an agent chooses repeatedly among K options ("arms"), each yielding stochastic rewards, with the goal of maximizing the sum of collected rewards over time. The main challenge is the exploration–exploitation trade-off: the agent must explore lesser-tried arms to learn their rewards, yet exploit the best-known arm to maximize payoff. Traditional ε-greedy or softmax strategies use randomness to force exploration; UCB algorithms instead use statistical confidence bounds to guide exploration more efficiently.

The UCB1 algorithm

[Figure: behaviour of a UCB algorithm on a bandit run]

UCB1 is a widely used bounded-reward variant of UCB introduced by Auer, Cesa-Bianchi and Fischer (2002). It maintains for each arm i:

  • the empirical mean reward μ̂i,
  • the count ni of times arm i has been played.

At round t, it selects the arm maximizing:

$\mathrm{UCB1}_i(t) = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}$

Arms with ni = 0 are initially played once. The bonus term $\sqrt{2 \ln t / n_i}$ shrinks as ni grows, ensuring exploration of less-tried arms and exploitation of high-mean arms.

Pseudocode

for each arm i do
    n[i] ← 0; Q[i] ← 0
for t from 1 to T do
    for each arm i do
        if n[i] = 0 then
            index[i] ← ∞                              // force one initial play of each arm
        else
            index[i] ← Q[i] + sqrt((2 · ln t) / n[i])
    select arm a ← argmax over i of index[i]
    observe reward r from arm a
    n[a] ← n[a] + 1
    Q[a] ← Q[a] + (r − Q[a]) / n[a]                   // incremental update of the empirical mean
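
For concreteness, the same loop in a minimal, self-contained Python sketch; the pull function standing in for the unknown reward process is an illustrative placeholder, not part of the algorithm:

import math
import random

def ucb1(pull, K, T):
    """Run UCB1 for T rounds on K arms; pull(a) must return a reward in [0, 1]."""
    n = [0] * K      # play counts
    Q = [0.0] * K    # empirical mean rewards
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # play each arm once first
        else:
            a = max(range(K), key=lambda i: Q[i] + math.sqrt(2 * math.log(t) / n[i]))
        r = pull(a)
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]  # incremental mean update
    return Q, n

# Example: three Bernoulli arms with unknown means.
means = [0.3, 0.5, 0.7]
Q, n = ucb1(lambda a: 1.0 if random.random() < means[a] else 0.0, K=3, T=10000)
print(Q, n)  # the play counts n should concentrate on the best arm (index 2)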

Theoretical properties

Auer et al. proved that UCB1 achieves logarithmic regret: after n rounds, the expected regret R(n) satisfies

$R(n) = O\Bigl(\sum_{i:\Delta_i>0} \frac{\ln n}{\Delta_i}\Bigr),$

where Δi is the gap between the optimal arm’s mean and arm i’s mean. Thus the average regret per round tends to 0 as n → ∞, and UCB1 is near-optimal with respect to the Lai–Robbins lower bound.
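
Concretely, the finite-time bound stated by Auer et al. for rewards in [0, 1] is

$R(n) \le 8 \sum_{i:\Delta_i>0} \frac{\ln n}{\Delta_i} + \Bigl(1 + \frac{\pi^2}{3}\Bigr) \sum_{j=1}^{K} \Delta_j.$

For example, with two arms and a gap Δ = 0.1, the bound after n = 10,000 rounds is roughly 8 · ln(10000)/0.1 ≈ 737, plus a constant term of about 0.4.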

UCB2

Introduced in the same paper as UCB1, UCB2 divides plays into epochs controlled by a parameter α, reducing the constant in the regret bound at the cost of more complex scheduling.
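
A sketch of the epoch machinery, under the reading of Auer et al. (2002) that epoch lengths grow as τ(r) = ⌈(1+α)^r⌉ and that a selected arm with epoch counter r is played τ(r+1) − τ(r) times in a row before r is incremented; the function names are illustrative:

import math

def tau(r, alpha):
    # Epoch length schedule: tau(r) = ceil((1 + alpha)^r).
    return math.ceil((1 + alpha) ** r)

def ucb2_bonus(t, r, alpha):
    # Exploration bonus for an arm whose epoch counter is r after t total plays,
    # as stated in Auer et al. (2002); verify against the paper before relying on it.
    return math.sqrt((1 + alpha) * math.log(math.e * t / tau(r, alpha)) / (2 * tau(r, alpha)))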

UCB1-Tuned

Incorporates empirical variance Vi to tighten the bonus: $\hat\mu_i + \sqrt{\frac{\ln t}{n_i}\min\{1/4,\,V_i\}}.$ This often outperforms UCB1 in practice but lacks a simple regret proof.
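
In the original formulation, Vi itself contains an exploration term. A sketch of the per-arm index under that formulation, with illustrative argument names:

import math

def ucb1_tuned_index(mean, sq_mean, n, t):
    # mean: empirical mean reward; sq_mean: empirical mean of squared rewards;
    # n: plays of this arm; t: current round.
    # Variance estimate padded with its own exploration term (Auer et al. 2002).
    V = sq_mean - mean ** 2 + math.sqrt(2 * math.log(t) / n)
    # 1/4 is the maximum possible variance of a [0, 1]-valued reward.
    return mean + math.sqrt((math.log(t) / n) * min(0.25, V))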

KL-UCB

Replaces Hoeffding’s bound with a Kullback–Leibler divergence condition, yielding asymptotically optimal regret (constant = 1) for Bernoulli rewards.
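
For Bernoulli rewards the KL-UCB index has no closed form and is typically found by bisection. A sketch assuming the standard formulation of Garivier and Cappé (2011), with illustrative names:

import math

def kl_bernoulli(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, t, c=3.0, iters=50):
    # Largest q >= mean with n * kl(mean, q) <= ln t + c * ln ln t.
    # c = 3 appears in the analysis; c = 0 is common in practice.
    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo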

Bayesian UCB (Bayes-UCB)

Computes the (1−δ)-quantile of a Bayesian posterior (e.g. Beta for Bernoulli) as the index. Proven asymptotically optimal under certain priors.
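
A sketch for Bernoulli rewards with a Beta(1, 1) prior; the quantile schedule 1 − 1/t used here is a simplification of the 1 − 1/(t (ln n)^c) schedule analyzed by Kaufmann et al. (2012):

from scipy.stats import beta

def bayes_ucb_index(successes, failures, t):
    # Posterior after a Beta(1, 1) prior is Beta(1 + successes, 1 + failures);
    # the index is a high posterior quantile that rises slowly with t.
    return beta.ppf(1 - 1.0 / t, 1 + successes, 1 + failures)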

Contextual UCB (e.g., LinUCB)

Extends UCB to contextual bandits by estimating a linear reward model and confidence ellipsoids in parameter space.
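
A sketch of the per-arm index in the disjoint linear model of Li et al. (2010), with illustrative names:

import numpy as np

def linucb_index(x, A, b, alpha=1.0):
    # x: context vector for this arm; A: identity plus sum of x x^T over past
    # plays of the arm; b: sum of reward * x over those plays; alpha: confidence width.
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b  # ridge-regression estimate of the arm's weight vector
    return float(x @ theta + alpha * np.sqrt(x @ A_inv @ x))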

Applications

  • Online advertising & A/B testing: instead of sticking to a fixed traffic split, a UCB-based allocator gradually sends more users toward better-performing variants, which can improve conversion rates over time.
  • Monte Carlo Tree Search: in UCT, UCB1 is applied at each tree node to decide which branch to explore next, a key ingredient in game-playing systems for Go.
  • Adaptive clinical trials: patients tend to be assigned more often to treatments that are showing better results so far, often leading to improved outcomes compared to pure random assignment.
  • Recommender systems: helps in choosing personalized content while still handling uncertainty in user preferences.
  • Robotics & control: supports efficient exploration when the system is dealing with unknown or changing dynamics.

See also

  • Multi-armed bandit
  • Reinforcement learning
  • Monte Carlo tree search