Brains

Base

policy_arena.brains.base

Brain ABC — the decision-making interface shared by all paradigms.

Brain

Bases: ABC

Base interface for agent controllers.

Same interface regardless of paradigm (rule-based, RL, LLM). Observation/action types are game-specific — each game's brains narrow the types in their own signatures.

name (abstract property)

Human-readable identifier for this brain/strategy.

decide(observation) (abstract method)

Choose an action given the current observation.

Source code in src/policy_arena/brains/base.py
@abstractmethod
def decide(self, observation: Any) -> Any:
    """Choose an action given the current observation."""

decide_batch(observations)

Decide for multiple opponents at once.

Default: calls decide() individually. LLM brains override this to make a single LLM call for all opponents.

Source code in src/policy_arena/brains/base.py
def decide_batch(self, observations: list[Any]) -> list[Any]:
    """Decide for multiple opponents at once.

    Default: calls decide() individually. LLM brains override this
    to make a single LLM call for all opponents.
    """
    return [self.decide(obs) for obs in observations]
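
As an illustration of the intended override, an LLM-backed brain might collapse all decisions into one request. The query_llm_batch helper and the _format_prompt method below are hypothetical placeholders for illustration, not part of the library:

def decide_batch(self, observations: list[Any]) -> list[Any]:
    # One batched request covering every opponent instead of len(observations) calls.
    prompts = [self._format_prompt(obs) for obs in observations]
    return query_llm_batch(prompts)  # placeholder for a real batched LLM call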

update(result) (abstract method)

Learn from the outcome of the last round.

Source code in src/policy_arena/brains/base.py
@abstractmethod
def update(self, result: Any) -> None:
    """Learn from the outcome of the last round."""

update_round_summary(summary)

Receive a consolidated round summary. Override in subclasses.

Source code in src/policy_arena/brains/base.py
def update_round_summary(self, summary: str) -> None:  # noqa: B027
    """Receive a consolidated round summary. Override in subclasses."""

reset() (abstract method)

Reset internal state for a new game.

Source code in src/policy_arena/brains/base.py
@abstractmethod
def reset(self) -> None:
    """Reset internal state for a new game."""

Rule-Based

policy_arena.brains.rule_based

AlwaysCooperate

Bases: Brain

Always plays COOPERATE regardless of history.

AlwaysDefect

Bases: Brain

Always plays DEFECT regardless of history.

Pavlov

Bases: Brain

Win-Stay, Lose-Shift.

Cooperates on the first round. Thereafter it repeats its last action if that action earned one of the two higher payoffs (CC=3 or DC=5) and switches if it earned one of the lower payoffs (CD=0 or DD=1). Equivalently: cooperate if both players played the same action last round, defect if they differed.
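
A sketch of the rule in isolation, assuming the previous actions are available as plain values; the COOPERATE/DEFECT labels and the None-on-first-round convention are illustrative assumptions:

def pavlov_rule(my_last_action, opponent_last_action):
    if my_last_action is None:
        return "COOPERATE"  # no history yet: open with cooperation
    # Win-Stay, Lose-Shift collapses to: same actions -> cooperate, different -> defect.
    return "COOPERATE" if my_last_action == opponent_last_action else "DEFECT"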

RandomBrain(cooperation_probability=0.5, seed=None)

Bases: Brain

Cooperates with a configurable probability, otherwise defects.

Uses an internal RNG seeded at construction for reproducibility.

Source code in src/policy_arena/brains/rule_based/random_brain.py
def __init__(self, cooperation_probability: float = 0.5, seed: int | None = None):
    self._p_cooperate = cooperation_probability
    self._rng = stdlib_random.Random(seed)
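
The decision step itself is not reproduced above; a plausible sketch, assuming game-specific COOPERATE/DEFECT actions (the names are assumptions for illustration):

def decide(self, observation: Any) -> Any:
    # One draw per decision; the same seed reproduces the same sequence of choices.
    return COOPERATE if self._rng.random() < self._p_cooperate else DEFECT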

TitForTat

Bases: Brain

Cooperates on the first round, then copies the opponent's last action.
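
A sketch of the strategy as a pure function, with the opponent's previous action passed in as None on the first round (an assumed convention):

def tit_for_tat(opponent_last_action):
    # Open with cooperation, then mirror whatever the opponent did last round.
    return "COOPERATE" if opponent_last_action is None else opponent_last_action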

Reinforcement Learning

policy_arena.brains.rl

BanditBrain(action_space, reward_extractor=None, epsilon=0.1, epsilon_decay=1.0, epsilon_min=0.01, seed=None)

Bases: Brain

Epsilon-greedy multi-armed bandit.

Parameters:

action_space: list of actions the brain can choose from. (required)
reward_extractor: callable that extracts a float reward from a round result. If None, looks for a .payoff attribute. (default: None)
epsilon: exploration probability. (default: 0.1)
epsilon_decay: multiply epsilon by this factor after each update. (default: 1.0)
epsilon_min: floor for epsilon decay. (default: 0.01)
seed: RNG seed for reproducibility. (default: None)
Source code in src/policy_arena/brains/rl/bandit.py
def __init__(
    self,
    action_space: Sequence[Any],
    reward_extractor: Callable[[Any], float] | None = None,
    epsilon: float = 0.1,
    epsilon_decay: float = 1.0,
    epsilon_min: float = 0.01,
    seed: int | None = None,
):
    self._action_space = list(action_space)
    self._reward_extractor = reward_extractor or self._default_reward_extractor
    self._epsilon = epsilon
    self._epsilon_decay = epsilon_decay
    self._epsilon_min = epsilon_min
    self._rng = stdlib_random.Random(seed)

    # Running average reward per action
    self._totals: dict[Any, float] = {a: 0.0 for a in self._action_space}
    self._counts: dict[Any, int] = {a: 0 for a in self._action_space}

    self._pending_actions: list[Any] = []
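
A hedged usage sketch; the COOPERATE/DEFECT action labels and the result object carrying a .payoff attribute are assumptions consistent with the defaults described above:

from policy_arena.brains.rl import BanditBrain

brain = BanditBrain(
    action_space=["COOPERATE", "DEFECT"],
    reward_extractor=lambda result: result.payoff,  # same behavior as the default
    epsilon=0.2,          # explore 20% of the time initially
    epsilon_decay=0.99,   # anneal exploration after each update
    seed=7,               # reproducible action choices
)

action = brain.decide(observation=None)  # the bandit keeps only per-action running averages
# after the round resolves: brain.update(result)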

BestResponseBrain(action_space, payoff_func, opponent_action_extractor=None, action_space_opponent=None)

Bases: Brain

Empirical best response to observed opponent behavior.

Parameters:

action_space: list of valid actions. (required)
payoff_func: callable(my_action, opponent_action) -> float payoff. Used to compute the expected payoff against the empirical distribution. (required)
opponent_action_extractor: callable(result) -> opponent action. Extracts the opponent's action from a round result. If None, looks for a .opponent_action attribute. (default: None)
action_space_opponent: the opponent's action space. If None, uses the same as action_space. (default: None)
Source code in src/policy_arena/brains/rl/best_response.py
def __init__(
    self,
    action_space: Sequence[Any],
    payoff_func: Callable[[Any, Any], float],
    opponent_action_extractor: Callable[[Any], Any] | None = None,
    action_space_opponent: Sequence[Any] | None = None,
):
    self._action_space = list(action_space)
    self._payoff_func = payoff_func
    self._opponent_extractor = (
        opponent_action_extractor or self._default_opponent_extractor
    )
    self._action_space_opponent = list(action_space_opponent or action_space)

    self._opponent_counts: Counter = Counter()
    self._total_observations: int = 0
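
The selection step reduces to maximizing the expected payoff under the empirical opponent distribution; a minimal sketch of that computation, using the fields initialized above (the method name _expected_payoff is an assumption):

def _expected_payoff(self, my_action: Any) -> float:
    # Mean payoff of my_action against observed opponent action frequencies.
    if self._total_observations == 0:
        return 0.0
    return sum(
        self._payoff_func(my_action, opp) * count
        for opp, count in self._opponent_counts.items()
    ) / self._total_observations

# The empirical best response is then:
# best_action = max(self._action_space, key=self._expected_payoff)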

QLearningBrain(action_space, state_encoder=None, reward_extractor=None, learning_rate=0.1, discount=0.95, epsilon=0.1, epsilon_decay=1.0, epsilon_min=0.01, seed=None)

Bases: Brain

Tabular Q-learning with epsilon-greedy exploration.

Parameters:

action_space: list of actions the brain can choose from. (required)
state_encoder: callable that maps an observation to a hashable state key. If None, a default encoder is used that returns the round_number clamped to 0 (first round) or 1 (subsequent). (default: None)
reward_extractor: callable that extracts a float reward from a round result. If None, looks for a .payoff attribute. (default: None)
learning_rate: Q-value update step size. (default: 0.1)
discount: future reward discount factor. (default: 0.95)
epsilon: exploration probability (epsilon-greedy). (default: 0.1)
epsilon_decay: multiply epsilon by this factor after each update. (default: 1.0)
epsilon_min: floor for epsilon decay. (default: 0.01)
seed: RNG seed for reproducibility. (default: None)
Source code in src/policy_arena/brains/rl/q_learning.py
def __init__(
    self,
    action_space: Sequence[Any],
    state_encoder: Callable[[Any], Hashable] | None = None,
    reward_extractor: Callable[[Any], float] | None = None,
    learning_rate: float = 0.1,
    discount: float = 0.95,
    epsilon: float = 0.1,
    epsilon_decay: float = 1.0,
    epsilon_min: float = 0.01,
    seed: int | None = None,
):
    self._action_space = list(action_space)
    self._state_encoder = state_encoder or self._default_state_encoder
    self._reward_extractor = reward_extractor or self._default_reward_extractor
    self._lr = learning_rate
    self._discount = discount
    self._epsilon = epsilon
    self._epsilon_decay = epsilon_decay
    self._epsilon_min = epsilon_min
    self._rng = stdlib_random.Random(seed)

    self._q: dict[Hashable, dict[Any, float]] = defaultdict(
        lambda: {a: 0.0 for a in self._action_space}
    )

    self._pending: list[tuple[Hashable, Any]] = []
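
The package's own update method is not reproduced above; the standard tabular Q-learning rule it would need to apply, written against the fields initialized in __init__, is sketched below (the method name and body are illustrative, not the library's implementation):

def _apply_q_update(self, state: Hashable, action: Any, reward: float, next_state: Hashable) -> None:
    # Q(s, a) <- Q(s, a) + lr * (reward + discount * max_a' Q(s', a') - Q(s, a))
    best_next = max(self._q[next_state].values())
    td_target = reward + self._discount * best_next
    self._q[state][action] += self._lr * (td_target - self._q[state][action])
    # Decay exploration toward the floor after each learning step.
    self._epsilon = max(self._epsilon * self._epsilon_decay, self._epsilon_min)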