# Brains

## Base

### `policy_arena.brains.base`
Brain ABC — the decision-making interface shared by all paradigms.
#### `Brain`
Bases: ABC
Base interface for agent controllers.
The interface is the same regardless of paradigm (rule-based, RL, LLM). Observation and action types are game-specific; each game's brains narrow the types in their own signatures. A minimal subclass sketch follows.
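To make the contract concrete, here is a minimal sketch of a subclass satisfying the four abstract members (`name`, `decide`, `update`, `reset`). The `"C"` action label is an assumption for illustration; real games define their own observation and action types.

```python
from policy_arena.brains.base import Brain  # the module documented here

class AlwaysCooperateSketch(Brain):
    """Illustrative only: ignores everything and always cooperates."""

    @property
    def name(self) -> str:
        return "always-cooperate-sketch"

    def decide(self, observation):
        return "C"  # "C" is an assumed, game-specific action label

    def update(self, result):
        pass  # a stateless brain has nothing to learn

    def reset(self):
        pass  # and no state to reset between games
```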
##### `name` *(abstractmethod, property)*
Human-readable identifier for this brain/strategy.
##### `decide(observation)` *(abstractmethod)*
Choose an action given the current observation.
Source code in src/policy_arena/brains/base.py
##### `decide_batch(observations)`
Decide for multiple opponents at once.
Default: calls `decide()` once per observation. LLM brains override this to make a single LLM call for all opponents. The sketch below shows the default behavior.
Source code in src/policy_arena/brains/base.py
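The documented default amounts to the following sketch, assuming one action is returned per observation, in order:

```python
def decide_batch(self, observations):
    # Default: one independent decide() call per opponent. LLM brains
    # replace this with a single batched model call instead.
    return [self.decide(obs) for obs in observations]
```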
##### `update(result)` *(abstractmethod)*
Learn from the outcome of the last round.
Source code in src/policy_arena/brains/base.py
##### `update_round_summary(summary)`
Receive a consolidated round summary. Override in subclasses.
Source code in src/policy_arena/brains/base.py
##### `reset()` *(abstractmethod)*
Reset internal state for a new game.
Source code in src/policy_arena/brains/base.py
## Rule-Based
### `policy_arena.brains.rule_based`
#### `Pavlov`
Bases: Brain
Win-Stay, Lose-Shift.
Cooperates on the first round. Thereafter it repeats its last action if that action earned one of the two higher payoffs (CC=3 or DC=5) and switches if it earned one of the two lower ones (CD=0 or DD=1). Equivalently: cooperate if both players played the same action last round, defect if they differed. A sketch of the rule follows.
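Below is a minimal sketch of the rule, not the class's actual source: the `"C"`/`"D"` action labels and the `my_action`/`opponent_action` result fields are assumptions for illustration.

```python
class PavlovSketch:
    """Win-Stay, Lose-Shift, reduced to its decision rule (illustrative)."""

    def __init__(self):
        self._last_me = None   # my action last round
        self._last_opp = None  # opponent's action last round

    def decide(self, observation):
        if self._last_me is None:
            return "C"  # cooperate on the first round
        # Win-Stay, Lose-Shift is equivalent to: cooperate iff both
        # players played the same action last round.
        return "C" if self._last_me == self._last_opp else "D"

    def update(self, result):
        self._last_me = result.my_action         # assumed field name
        self._last_opp = result.opponent_action  # assumed field name

    def reset(self):
        self._last_me = self._last_opp = None
```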
#### `RandomBrain(cooperation_probability=0.5, seed=None)`
Bases: Brain
Cooperates with a configurable probability, otherwise defects.
Uses an internal RNG seeded at construction for reproducibility.
Source code in src/policy_arena/brains/rule_based/random_brain.py
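A hypothetical usage example. The import assumes `RandomBrain` is re-exported from `policy_arena.brains.rule_based`, and the `None` observation assumes this brain ignores its input; both are unverified details.

```python
from policy_arena.brains.rule_based import RandomBrain  # assumed re-export

# Cooperate 70% of the time; the seed makes runs reproducible.
brain = RandomBrain(cooperation_probability=0.7, seed=42)
action = brain.decide(None)  # assumed: the observation is not consulted
```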
## Reinforcement Learning
### `policy_arena.brains.rl`
#### `BanditBrain(action_space, reward_extractor=None, epsilon=0.1, epsilon_decay=1.0, epsilon_min=0.01, seed=None)`
Bases: Brain
Epsilon-greedy multi-armed bandit.
Parameters:

| Name | Description | Default |
|---|---|---|
| `action_space` | List of actions the brain can choose from. | *required* |
| `reward_extractor` | Callable that extracts a float reward from a round result. If `None`, looks for a `.payoff` attribute. | `None` |
| `epsilon` | Exploration probability. | `0.1` |
| `epsilon_decay` | Multiply epsilon by this factor after each update. | `1.0` |
| `epsilon_min` | Floor for epsilon decay. | `0.01` |
| `seed` | RNG seed for reproducibility. | `None` |
Source code in src/policy_arena/brains/rl/bandit.py
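Reduced to a sketch, the choose/observe/update cycle an epsilon-greedy bandit runs looks like this; the `q`/`counts` dictionaries and the incremental-mean update are standard bandit machinery, not necessarily the class's actual internals.

```python
import random

def bandit_step(q, counts, actions, epsilon, reward_of, rng=random):
    """One choose/observe/update cycle of an epsilon-greedy bandit."""
    if rng.random() < epsilon:
        action = rng.choice(actions)               # explore
    else:
        action = max(actions, key=lambda a: q[a])  # exploit best mean so far
    reward = reward_of(action)
    counts[action] += 1
    # Incremental mean of observed rewards: Q <- Q + (r - Q) / n
    q[action] += (reward - q[action]) / counts[action]
    return action
```

Here `q` maps each action to its running mean reward and `counts` to the number of times it has been chosen, both starting at zero. Per the parameters above, epsilon is then multiplied by `epsilon_decay` after each update, never dropping below `epsilon_min`.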
#### `BestResponseBrain(action_space, payoff_func, opponent_action_extractor=None, action_space_opponent=None)`
Bases: Brain
Empirical best response to observed opponent behavior.
Parameters:

| Name | Description | Default |
|---|---|---|
| `action_space` | List of valid actions. | *required* |
| `payoff_func` | Callable `(my_action, opponent_action) -> float` payoff, used to compute expected payoff against the empirical distribution. | *required* |
| `opponent_action_extractor` | Callable `(result) -> opponent action` that extracts the opponent's action from a round result. If `None`, looks for an `.opponent_action` attribute. | `None` |
| `action_space_opponent` | Opponent's action space. If `None`, uses the same as `action_space`. | `None` |
Source code in src/policy_arena/brains/rl/best_response.py
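The decision rule itself, as a sketch: track how often the opponent has played each action, then choose the action maximizing expected payoff under that empirical distribution. Variable names are illustrative.

```python
from collections import Counter

def empirical_best_response(action_space, payoff_func, opponent_counts):
    """Pick argmax_a of E[payoff_func(a, o)] under observed o-frequencies."""
    total = sum(opponent_counts.values())
    if total == 0:
        return action_space[0]  # no observations yet; some tie-break is needed

    def expected_payoff(a):
        return sum(payoff_func(a, o) * n / total
                   for o, n in opponent_counts.items())

    return max(action_space, key=expected_payoff)

# e.g. opponent_counts = Counter({"C": 7, "D": 3}) after ten observed rounds
```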
#### `QLearningBrain(action_space, state_encoder=None, reward_extractor=None, learning_rate=0.1, discount=0.95, epsilon=0.1, epsilon_decay=1.0, epsilon_min=0.01, seed=None)`
Bases: Brain
Tabular Q-learning with epsilon-greedy exploration.
Parameters:

| Name | Description | Default |
|---|---|---|
| `action_space` | List of actions the brain can choose from. | *required* |
| `state_encoder` | Callable that maps an observation to a hashable state key. If `None`, a default encoder is used that returns the `round_number` clamped to 0 (first round) or 1 (subsequent rounds). | `None` |
| `reward_extractor` | Callable that extracts a float reward from a round result. If `None`, looks for a `.payoff` attribute. | `None` |
| `learning_rate` | Q-value update step size. | `0.1` |
| `discount` | Future-reward discount factor. | `0.95` |
| `epsilon` | Exploration probability (epsilon-greedy). | `0.1` |
| `epsilon_decay` | Multiply epsilon by this factor after each update. | `1.0` |
| `epsilon_min` | Floor for epsilon decay. | `0.01` |
| `seed` | RNG seed for reproducibility. | `None` |
Source code in src/policy_arena/brains/rl/q_learning.py
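The backup performed after each round is the standard tabular Q-learning rule, Q(s, a) ← Q(s, a) + lr · (r + discount · max over a' of Q(s', a') − Q(s, a)), sketched below with illustrative names and the documented defaults.

```python
from collections import defaultdict

q = defaultdict(float)  # Q-table keyed by (state, action), default 0.0

def q_update(state, action, reward, next_state, action_space,
             learning_rate=0.1, discount=0.95):
    """One tabular Q-learning backup (illustrative, not the class's code)."""
    best_next = max(q[(next_state, a)] for a in action_space)
    td_error = reward + discount * best_next - q[(state, action)]
    q[(state, action)] += learning_rate * td_error
```

With the default `state_encoder`, the state is just 0 on the first round and 1 afterwards, so the Q-table holds at most two states.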