Neural Contextual Bandits with UCB-based Exploration (NeuralUCB)

NeuralUCB leverages the representational power of deep neural networks, using a neural network-based random feature mapping to construct an upper confidence bound (UCB) on the reward, which guides efficient exploration.
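
Conceptually, the agent scores each arm by its predicted reward plus an uncertainty bonus and pulls the highest-scoring arm. The snippet below is only an illustrative sketch of this selection rule, not the library's internal implementation; predicted_rewards and uncertainties are hypothetical values.

import numpy as np

# Illustrative UCB selection rule (not AgileRL internals)
predicted_rewards = np.array([0.2, 0.5, 0.4])  # network's reward estimate per arm (hypothetical)
uncertainties = np.array([0.3, 0.1, 0.2])      # confidence width per arm (hypothetical)
gamma = 1.0                                    # exploration scaling factor

ucb_scores = predicted_rewards + gamma * uncertainties
action = int(np.argmax(ucb_scores))            # pull the arm with the highest upper confidence bound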

NeuralUCB is a contextual multi-armed bandit algorithm, meaning it is suited to problems with a single timestep: at each step the agent observes a context, selects an arm, and receives a reward, with no transition to a next state.

Example

from agilerl.algorithms.neural_ucb_bandit import NeuralUCB
from agilerl.components.replay_buffer import ReplayBuffer
from agilerl.wrappers.learning import BanditEnv
from ucimlrepo import fetch_ucirepo

# Fetch data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/
iris = fetch_ucirepo(id=53)
features = iris.data.features
targets = iris.data.targets

# Create environment
env = BanditEnv(features, targets)
context_dim = env.context_dim
action_dim = env.arms

field_names = ["context", "reward"]
memory = ReplayBuffer(action_dim=action_dim, memory_size=10000, field_names=field_names)

agent = NeuralUCB(state_dim=context_dim, action_dim=action_dim)

context = env.reset()  # Reset environment at start of episode
for _ in range(500):
    # Get next action from agent
    action = agent.getAction(context)
    next_context, reward = env.step(action)  # Act in environment

    # Save experience to replay buffer
    memory.save2memory(context[action], reward)

    # Learn according to learning frequency
    if memory.counter % agent.learn_step == 0 and len(memory) >= agent.batch_size:
        experiences = memory.sample(agent.batch_size) # Sample replay buffer
        agent.learn(experiences)    # Learn according to agent's RL algorithm

    context = next_context
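
One simple way to monitor training is to track cumulative regret alongside the loop above. The lines below are a sketch that assumes a classification-style BanditEnv in which the optimal arm always pays a reward of 1, so per-step regret is 1 - reward; the regret list is a hypothetical addition, not part of the API.

# Initialise before the training loop above (hypothetical addition)
regret = [0]

# Inside the loop, after env.step(action), assuming the optimal arm pays a reward of 1
regret.append(regret[-1] + 1 - reward)  # running total of missed reward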

To configure the network architecture, pass a dict to the NeuralUCB net_config field. For an MLP, this can be as simple as:

NET_CONFIG = {
    'arch': 'mlp',            # Network architecture
    'hidden_size': [32, 32],  # Network hidden size
}

Or for a CNN:

NET_CONFIG = {
    'arch': 'cnn',             # Network architecture
    'hidden_size': [128],      # Network hidden size
    'channel_size': [32, 32],  # CNN channel size
    'kernel_size': [8, 4],     # CNN kernel size
    'stride_size': [4, 2],     # CNN stride size
    'normalize': True,         # Normalize image from range [0, 255] to [0, 1]
}
agent = NeuralUCB(state_dim=state_dim, action_dim=action_dim, net_config=NET_CONFIG)   # Create NeuralUCB agent
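
When using the CNN architecture, state_dim should describe the shape of the image context. As a rough sketch, the channel-first [C, H, W] shape and arm count below are assumptions for illustration, not values taken from the library:

# Hypothetical image-context dimensions; channel-first ordering assumed
state_dim = [3, 32, 32]   # [C, H, W]
action_dim = 4            # number of arms

agent = NeuralUCB(state_dim=state_dim, action_dim=action_dim, net_config=NET_CONFIG)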

Saving and loading agents

To save an agent, use the saveCheckpoint method:

from agilerl.algorithms.neural_ucb_bandit import NeuralUCB

agent = NeuralUCB(state_dim=state_dim, action_dim=action_dim)   # Create NeuralUCB agent

checkpoint_path = "path/to/checkpoint"
agent.saveCheckpoint(checkpoint_path)

To load a saved agent, use the load method:

from agilerl.algorithms.neural_ucb_bandit import NeuralUCB

checkpoint_path = "path/to/checkpoint"
agent = NeuralUCB.load(checkpoint_path)
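
A loaded agent behaves exactly like a newly constructed one, so it can be used immediately for inference or further training; for example (context here stands in for a new observation):

# Select an arm with the restored agent (context is a placeholder observation)
action = agent.getAction(context)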

Parameters

class agilerl.algorithms.neural_ucb_bandit.NeuralUCB(state_dim, action_dim, index=0, net_config={'arch': 'mlp', 'hidden_size': [128]}, gamma=1.0, lamb=1.0, reg=0.000625, batch_size=64, lr=0.001, learn_step=1, mut=None, actor_network=None, device='cpu', accelerator=None, wrap=True)

The NeuralUCB algorithm class. NeuralUCB paper: https://arxiv.org/abs/1911.04462

Parameters:
  • state_dim (list[int]) – State observation (context) dimension

  • action_dim (int) – Action dimension (number of arms)

  • index (int, optional) – Index to keep track of object instance during tournament selection and mutation, defaults to 0

  • net_config (dict, optional) – Network configuration, defaults to mlp with hidden size [128]

  • gamma (float, optional) – Positive scaling factor for the exploration (confidence bound) term, defaults to 1.0

  • lamb (float, optional) – Regularization parameter lambda, defaults to 1.0

  • reg (float, optional) – Loss regularization parameter, defaults to 0.000625

  • batch_size (int, optional) – Size of batched sample from replay buffer for learning, defaults to 64

  • lr (float, optional) – Learning rate for optimizer, defaults to 0.001

  • learn_step (int, optional) – Learning frequency, defaults to 1

  • mut (str, optional) – Most recent mutation to agent, defaults to None

  • actor_network (nn.Module, optional) – Custom actor network, defaults to None

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

  • wrap (bool, optional) – Wrap models for distributed training upon creation, defaults to True

clone(index=None, wrap=True)

Returns cloned agent identical to self.

Parameters:

index (int, optional) – Index to keep track of agent for tournament selection and mutation, defaults to None
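
As a brief illustration, a clone might be produced during evolutionary hyperparameter optimisation; the surrounding tournament logic is assumed and not shown:

# Create an identical copy of the agent, e.g. when propagating an elite during tournament selection
elite_copy = agent.clone(index=1)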

getAction(state, action_mask=None)

Returns the next action to take in the environment.

Parameters:
  • state (numpy.ndarray[float]) – State observation, or multiple observations in a batch

  • action_mask (numpy.ndarray, optional) – Mask of legal actions 1=legal 0=illegal, defaults to None
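
For example, an arm can be excluded from selection by passing a binary mask; the three-armed mask below is purely illustrative:

import numpy as np

# Mask out the third arm so it cannot be selected (illustrative values)
action_mask = np.array([1, 1, 0])
action = agent.getAction(context, action_mask=action_mask)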

learn(experiences)

Updates agent network parameters to learn from experiences.

Parameters:

experiences – Batched contexts and rewards, in that order.

classmethod load(path, device='cpu', accelerator=None)

Creates agent with properties and network weights loaded from path.

Parameters:
  • path (string) – Location to load checkpoint from

  • device (str, optional) – Device for accelerated computing, ‘cpu’ or ‘cuda’, defaults to ‘cpu’

  • accelerator (accelerate.Accelerator(), optional) – Accelerator for distributed computing, defaults to None

loadCheckpoint(path)

Loads saved agent properties and network weights from checkpoint.

Parameters:

path (string) – Location to load checkpoint from

saveCheckpoint(path)

Saves a checkpoint of agent properties and network weights to path.

Parameters:

path (string) – Location to save checkpoint at

test(env, swap_channels=False, max_steps=100, loop=1)

Returns mean test score of agent in environment.

Parameters:
  • env (Gym-style environment) – The environment to be tested in

  • swap_channels (bool, optional) – Swap image channels dimension from last to first [H, W, C] -> [C, H, W], defaults to False

  • max_steps (int, optional) – Maximum number of testing steps, defaults to 100

  • loop (int, optional) – Number of testing loops/episodes to complete. The returned score is the mean over these tests. Defaults to 1
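
A typical evaluation call might look like the following; the argument values are illustrative and env is assumed to be a compatible Gym-style bandit environment:

# Evaluate the trained agent over 3 episodes and report the mean score
mean_score = agent.test(env, max_steps=100, loop=3)
print(f"Mean test score: {mean_score}")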