Brains

Base

policy_arena.brains.base

Brain ABC — the decision-making interface shared by all paradigms.

Brain

Bases: ABC

Base interface for agent controllers.

Same interface regardless of paradigm (rule-based, RL, LLM). Observation/action types are game-specific — each game's brains narrow the types in their own signatures.

name (abstract property)

Human-readable identifier for this brain/strategy.

decide(observation) (abstract method)

Choose an action given the current observation.

Source code in src/policy_arena/brains/base.py
@abstractmethod
def decide(self, observation: Any) -> Any:
    """Choose an action given the current observation."""

decide_batch(observations)

Decide for multiple opponents at once.

Default: calls decide() individually. LLM brains override this to make a single LLM call for all opponents.

Source code in src/policy_arena/brains/base.py
def decide_batch(self, observations: list[Any]) -> list[Any]:
    """Decide for multiple opponents at once.

    Default: calls decide() individually. LLM brains override this
    to make a single LLM call for all opponents.
    """
    return [self.decide(obs) for obs in observations]
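
As an illustration of the intended override, an LLM-backed brain might collapse all decisions into one request. The query_llm_batch helper and the _format_prompt method below are hypothetical placeholders for illustration, not part of the library:

def decide_batch(self, observations: list[Any]) -> list[Any]:
    # One batched request covering every opponent instead of len(observations) calls.
    prompts = [self._format_prompt(obs) for obs in observations]
    return query_llm_batch(prompts)  # placeholder for a real batched LLM call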

update(result) (abstract method)

Learn from the outcome of the last round.

Source code in src/policy_arena/brains/base.py
@abstractmethod
def update(self, result: Any) -> None:
    """Learn from the outcome of the last round."""

update_round_summary(summary)

Receive a consolidated round summary. Override in subclasses.

Source code in src/policy_arena/brains/base.py
def update_round_summary(self, summary: str) -> None:  # noqa: B027
    """Receive a consolidated round summary. Override in subclasses."""

reset() (abstract method)

Reset internal state for a new game.

Source code in src/policy_arena/brains/base.py
@abstractmethod
def reset(self) -> None:
    """Reset internal state for a new game."""

Rule-Based

policy_arena.brains.rule_based

AlwaysCooperate

Bases: Brain

Always plays COOPERATE regardless of history.

AlwaysDefect

Bases: Brain

Always plays DEFECT regardless of history.

Pavlov

Bases: Brain

Win-Stay, Lose-Shift.

Cooperates on the first round. Thereafter it repeats its last action if that action earned one of the two higher payoffs (CC=3 or DC=5) and switches if it earned one of the lower payoffs (CD=0 or DD=1). Equivalently: cooperate if both players played the same action last round, defect if they differed.
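
A sketch of the rule in isolation, assuming the previous actions are available as plain values; the COOPERATE/DEFECT labels and the None-on-first-round convention are illustrative assumptions:

def pavlov_rule(my_last_action, opponent_last_action):
    if my_last_action is None:
        return "COOPERATE"  # no history yet: open with cooperation
    # Win-Stay, Lose-Shift collapses to: same actions -> cooperate, different -> defect.
    return "COOPERATE" if my_last_action == opponent_last_action else "DEFECT"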

RandomBrain(cooperation_probability=0.5, seed=None)

Bases: Brain

Cooperates with a configurable probability, otherwise defects.

Uses an internal RNG seeded at construction for reproducibility.

Source code in src/policy_arena/brains/rule_based/random_brain.py
def __init__(self, cooperation_probability: float = 0.5, seed: int | None = None):
    self._p_cooperate = cooperation_probability
    self._rng = stdlib_random.Random(seed)
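
The decision step itself is not reproduced above; a plausible sketch, assuming game-specific COOPERATE/DEFECT actions (the names are assumptions for illustration):

def decide(self, observation: Any) -> Any:
    # One draw per decision; the same seed reproduces the same sequence of choices.
    return COOPERATE if self._rng.random() < self._p_cooperate else DEFECT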

TitForTat

Bases: Brain

Cooperates on the first round, then copies the opponent's last action.
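
A sketch of the strategy as a pure function, with the opponent's previous action passed in as None on the first round (an assumed convention):

def tit_for_tat(opponent_last_action):
    # Open with cooperation, then mirror whatever the opponent did last round.
    return "COOPERATE" if opponent_last_action is None else opponent_last_action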

Reinforcement Learning

policy_arena.brains.rl

BanditBrain(action_space, reward_extractor=None, epsilon=0.1, epsilon_decay=1.0, epsilon_min=0.01, seed=None)

Bases: Brain

Epsilon-greedy multi-armed bandit.

Parameters:

action_space: list of actions the brain can choose from. (required)
reward_extractor: callable that extracts a float reward from a round result. If None, looks for a .payoff attribute. (default: None)
epsilon: exploration probability. (default: 0.1)
epsilon_decay: multiply epsilon by this factor after each update. (default: 1.0)
epsilon_min: floor for epsilon decay. (default: 0.01)
seed: RNG seed for reproducibility. (default: None)
Source code in src/policy_arena/brains/rl/bandit.py
def __init__(
    self,
    action_space: Sequence[Any],
    reward_extractor: Callable[[Any], float] | None = None,
    epsilon: float = 0.1,
    epsilon_decay: float = 1.0,
    epsilon_min: float = 0.01,
    seed: int | None = None,
):
    self._action_space = list(action_space)
    self._reward_extractor = reward_extractor or self._default_reward_extractor
    self._epsilon = epsilon
    self._epsilon_decay = epsilon_decay
    self._epsilon_min = epsilon_min
    self._rng = stdlib_random.Random(seed)

    # Running average reward per action
    self._totals: dict[Any, float] = {a: 0.0 for a in self._action_space}
    self._counts: dict[Any, int] = {a: 0 for a in self._action_space}

    self._pending_actions: list[Any] = []
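
A hedged usage sketch; the COOPERATE/DEFECT action labels and the result object carrying a .payoff attribute are assumptions consistent with the defaults described above:

from policy_arena.brains.rl import BanditBrain

brain = BanditBrain(
    action_space=["COOPERATE", "DEFECT"],
    reward_extractor=lambda result: result.payoff,  # same behavior as the default
    epsilon=0.2,          # explore 20% of the time initially
    epsilon_decay=0.99,   # anneal exploration after each update
    seed=7,               # reproducible action choices
)

action = brain.decide(observation=None)  # the bandit keeps only per-action running averages
# after the round resolves: brain.update(result)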

BestResponseBrain(action_space, payoff_func, opponent_action_extractor=None, action_space_opponent=None)

Bases: Brain

Empirical best response to observed opponent behavior.

Parameters:

action_space: list of valid actions. (required)
payoff_func: callable(my_action, opponent_action) -> float payoff. Used to compute the expected payoff against the empirical distribution. (required)
opponent_action_extractor: callable(result) -> opponent action. Extracts the opponent's action from a round result. If None, looks for a .opponent_action attribute. (default: None)
action_space_opponent: the opponent's action space. If None, uses the same as action_space. (default: None)
Source code in src/policy_arena/brains/rl/best_response.py
def __init__(
    self,
    action_space: Sequence[Any],
    payoff_func: Callable[[Any, Any], float],
    opponent_action_extractor: Callable[[Any], Any] | None = None,
    action_space_opponent: Sequence[Any] | None = None,
):
    self._action_space = list(action_space)
    self._payoff_func = payoff_func
    self._opponent_extractor = (
        opponent_action_extractor or self._default_opponent_extractor
    )
    self._action_space_opponent = list(action_space_opponent or action_space)

    self._opponent_counts: Counter = Counter()
    self._total_observations: int = 0
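
The selection step reduces to maximizing the expected payoff under the empirical opponent distribution; a minimal sketch of that computation, using the fields initialized above (the method name _expected_payoff is an assumption):

def _expected_payoff(self, my_action: Any) -> float:
    # Mean payoff of my_action against observed opponent action frequencies.
    if self._total_observations == 0:
        return 0.0
    return sum(
        self._payoff_func(my_action, opp) * count
        for opp, count in self._opponent_counts.items()
    ) / self._total_observations

# The empirical best response is then:
# best_action = max(self._action_space, key=self._expected_payoff)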

QLearningBrain(action_space, state_encoder=None, reward_extractor=None, learning_rate=0.1, discount=0.95, epsilon=0.1, epsilon_decay=1.0, epsilon_min=0.01, seed=None)

Bases: Brain

Tabular Q-learning with epsilon-greedy exploration.

Parameters:

action_space: list of actions the brain can choose from. (required)
state_encoder: callable that maps an observation to a hashable state key. If None, a default encoder is used that returns the round_number clamped to 0 (first round) or 1 (subsequent). (default: None)
reward_extractor: callable that extracts a float reward from a round result. If None, looks for a .payoff attribute. (default: None)
learning_rate: Q-value update step size. (default: 0.1)
discount: future reward discount factor. (default: 0.95)
epsilon: exploration probability (epsilon-greedy). (default: 0.1)
epsilon_decay: multiply epsilon by this factor after each update. (default: 1.0)
epsilon_min: floor for epsilon decay. (default: 0.01)
seed: RNG seed for reproducibility. (default: None)
Source code in src/policy_arena/brains/rl/q_learning.py
def __init__(
    self,
    action_space: Sequence[Any],
    state_encoder: Callable[[Any], Hashable] | None = None,
    reward_extractor: Callable[[Any], float] | None = None,
    learning_rate: float = 0.1,
    discount: float = 0.95,
    epsilon: float = 0.1,
    epsilon_decay: float = 1.0,
    epsilon_min: float = 0.01,
    seed: int | None = None,
):
    self._action_space = list(action_space)
    self._state_encoder = state_encoder or self._default_state_encoder
    self._reward_extractor = reward_extractor or self._default_reward_extractor
    self._lr = learning_rate
    self._discount = discount
    self._epsilon = epsilon
    self._epsilon_decay = epsilon_decay
    self._epsilon_min = epsilon_min
    self._rng = stdlib_random.Random(seed)

    self._q: dict[Hashable, dict[Any, float]] = defaultdict(
        lambda: {a: 0.0 for a in self._action_space}
    )

    self._pending: list[tuple[Hashable, Any]] = []
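
The package's own update method is not reproduced above; the standard tabular Q-learning rule it would need to apply, written against the fields initialized in __init__, is sketched below (the method name and body are illustrative, not the library's implementation):

def _apply_q_update(self, state: Hashable, action: Any, reward: float, next_state: Hashable) -> None:
    # Q(s, a) <- Q(s, a) + lr * (reward + discount * max_a' Q(s', a') - Q(s, a))
    best_next = max(self._q[next_state].values())
    td_target = reward + self._discount * best_next
    self._q[state][action] += self._lr * (td_target - self._q[state][action])
    # Decay exploration toward the floor after each learning step.
    self._epsilon = max(self._epsilon * self._epsilon_decay, self._epsilon_min)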