KI2 – MDP / POMDP Kunstmatige Intelligentie / RuG
Decision Processes
The agent:
perceives the environment (St) flawlessly,
chooses an action (a),
which alters the state of the world (St+1).
Finite state machine (figure): three behaviours, A1 "wander around a bit", A2 "follow the object", A3 "keep your distance", with transitions triggered by the signals "no signals", "see BALL" and "see obstacle".
Stochastic Decision Processes
The agent:
perceives the environment (St) flawlessly,
chooses an action (a) according to P(a|S),
which alters the state of the world (St+1) according to P(St+1|St,a).
Markov Decision Processes
The agent:
perceives the environment (St) flawlessly,
chooses an action (a) according to P(a|S),
which alters the state of the world (St+1) according to P(St+1|St,a).
If there are no longer-term dependencies: a 1st-order Markov process.
Assumptions
The observation of St is noise-free; all required information is observable.
Actions a are selected according to the probability P(a|S) (random generator).
The consequences of a (the next state St+1) occur stochastically with probability P(St+1|St,a).
A policy (figure: a grid world with terminal rewards +1 and −1 and a START state).
MDP
States, Actions, Transitions between states.
P(ai|sk), the "policy": which action a the agent decides to take, given the possible circumstances s.
Policy π: argmax_ai P(ai|sk)
How can an agent learn this? Cost minimization.
Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911).
Reinforcements R(a,S).
Structure of the world T = P(St+1|St,a).
Reinforcements
Given a history of States, Actions and the resulting Reinforcements, an agent can learn to estimate the value of an Action.
How? The sum of the reinforcements R? Their average? An exponential weighting?
The first step determines all later ones (learning from the past); an immediate reward is more useful (reasoning about the future); impatience & mortality.
Assigning Utility (Value) to Sequences
Discounted Rewards: V(s0, s1, s2, ...) = R(s0) + γR(s1) + γ²R(s2) + ..., where 0 < γ ≤ 1, R is the reinforcement value, s refers to a state, and γ is the discount factor.
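As a small illustration (not from the slides), a minimal Python sketch of a discounted return; the reward values and γ = 0.9 below are made-up numbers:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards R(s_t) weighted by gamma**t (0 < gamma <= 1)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example with arbitrary rewards: R(s0)=1, R(s1)=0, R(s2)=2
print(discounted_return([1, 0, 2], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```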
Assigning Utility to States
Can we say V(s) = R(s)? NO!
"The utility of a state is the expected utility of all the states that will follow it, when policy π is followed."
Transition probability T(s,a,s').
Assigning Utility to States
Can we say V(s) = R(s)? Vπ(s) is specific to each policy π:
Vπ(s) = E( Σt γ^t R(st) | π, s0 = s )
V(s) = Vπ*(s)
Bellman equation: V(s) = R(s) + γ max_a Σ_s' T(s,a,s') V(s')
If we solve the function V(s) for each state, we will have solved the optimal π* for the given MDP.
Value Iteration Algorithm
We have to solve |S| simultaneous Bellman equations. We can't solve them directly, so we use an iterative approach:
1. Begin with arbitrary initial values V0
2. For each s, calculate V(s) from R(s) and V0
3. Use these new utility values to update V0
4. Repeat steps 2-3 until V0 converges
This equilibrium is a unique solution! (see R&N, page 621, for the proof)
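A minimal value-iteration sketch in Python following the steps above; the toy rewards R, transition model T and γ are invented for illustration and are not the grid world from the slides:

```python
import numpy as np

def value_iteration(R, T, gamma=0.9, eps=1e-6):
    """R: reward per state, shape (S,). T: transition probabilities, shape (A, S, S).
    Iterates V(s) = R(s) + gamma * max_a sum_s' T(s,a,s') V(s') until convergence."""
    V = np.zeros(len(R))                       # step 1: arbitrary initial values V0
    while True:
        Q = R[None, :] + gamma * T @ V         # shape (A, S): value of each action in each state
        V_new = Q.max(axis=0)                  # step 2: new V(s) from R(s) and the old V
        if np.max(np.abs(V_new - V)) < eps:    # step 4: stop when V has converged
            return V_new
        V = V_new                              # step 3: update V0

# Tiny made-up MDP: 2 states, 2 actions
R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],       # action 0
              [[0.5, 0.5], [0.1, 0.9]]])      # action 1
print(value_iteration(R, T))
```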
Search space T: S × A × S. Explicit enumeration of all combinations is often not feasible (cf. chess, Go). Chunking within T. Problem: what if S is real-valued?
MDP → POMDP
MDP: the world may be stochastic and Markovian, but:
the observation of that world itself is reliable; no assumptions about it need to be made.
Most 'real' problems involve: noise in the observation itself, and incompleteness of information.
MDP → POMDP
Most 'real' problems involve: noise in the observation itself, and incompleteness of information.
In these cases the agent must be able to develop a system of "Beliefs" on the basis of series of partial observations.
Partially Observable Markov Decision Processes (POMDPs)
A POMDP has:
States S
Actions A
Probabilistic transitions
Immediate rewards on actions
A discount factor
+ Observations Z
+ Observation probabilities (reliabilities)
+ An initial belief b0
A POMDP example: The Tiger Problem
The Tiger Problem Description: 2 states: Tiger_Left, Tiger_Right 3 actions: Listen, Open_Left, Open_Right 2 observations: Hear_Left, Hear_Right
The Tiger Problem
Rewards:
-1 for the Listen action
-100 for Open(x) when the tiger is behind door x
+10 for Open(x) when the tiger is not behind door x
The Tiger Problem
Furthermore:
The Listen action does not change the state.
The Open(x) action reveals what is behind door x (a priori the tiger is behind either door with 50% chance) and resets the trial.
The Listen action gives the correct information 85% of the time:
p(hearleft | Listen, tigerleft) = 0.85
p(hearright | Listen, tigerleft) = 0.15
The Tiger Problem
Question: what policy gives the highest return in rewards?
Actions depend on beliefs! If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5 × (-100 + 10) = -45.
Beliefs are updated with observations (which may be wrong).
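A small sketch (Python) of the expected immediate reward of each action under a belief over Tiger_Left/Tiger_Right, using the reward numbers from the slides; the function name and layout are just illustrative:

```python
def expected_rewards(b_left):
    """Expected immediate reward of each action, given b_left = P(Tiger_Left)."""
    b_right = 1.0 - b_left
    return {
        "Listen":     -1.0,
        "Open_Left":  b_left * -100 + b_right * 10,   # -100 if the tiger is there, +10 otherwise
        "Open_Right": b_right * -100 + b_left * 10,
    }

print(expected_rewards(0.5))    # both Open actions give 0.5 * (-100 + 10) = -45; Listen gives -1
print(expected_rewards(0.05))   # tiger almost surely right: Open_Left now yields +4.5
```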
The Tiger Problem, horizon t=1
Optimal policy:
Belief(Tiger=left)   Action
[0.00, 0.10]         Open(left)
[0.10, 0.90]         Listen
[0.90, 1.00]         Open(right)
The Tiger Problem, horizon t=2
Optimal policy:
Belief(Tiger=left)   Action
[0.00, 0.10]         Listen
[0.10, 0.90]         Listen
[0.90, 1.00]         Listen
(With two steps left it always pays to listen first.)
The Tiger Problem, horizon t=Inf
Optimal policy: listen a few times, then choose a door; next trial.
listen1: Tiger=left (p=0.85), listen2: Tiger=left (p=0.96), listen3: ... (binomial)
Good news: the optimal policy can be learned if actions are followed by rewards!
The Tiger Problem, belief updates on "Listen"
With b_t = P(Tiger=left) and p = P(correct observation | Listen) = 0.85, hearing "left" gives:
b_{t+1} = p · b_t / ( p · b_t + (1 − p) · (1 − b_t) )
Example:
initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698), listen3: Tiger=left (p=0.9945), listen4: ...
(Note: underlying binomial distribution)
The Tiger Problem, belief updates on "Listen"
Same update: b_{t+1} = p · b_t / ( p · b_t + (1 − p) · (1 − b_t) ), with p and 1 − p swapped when "right" is heard.
Example 2, noise in the observation:
initial: Tiger=left (p=0.5000), listen1: Tiger=left (p=0.8500), listen2: Tiger=left (p=0.9698),
listen3: hear right, Tiger=left (p=0.8500), the belief drops...
listen4: Tiger=left (p=0.9698), and recovers
listen5: ...
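A sketch of this Bayesian update that reproduces the numbers above (0.5 → 0.85 → 0.9698 → ..., including the drop and recovery after one noisy observation), with p_correct = 0.85 as in the slides:

```python
def update_belief(b_left, heard_left, p_correct=0.85):
    """Posterior P(Tiger=left) after one Listen observation (Bayes' rule)."""
    p_obs_given_left  = p_correct if heard_left else 1 - p_correct
    p_obs_given_right = 1 - p_correct if heard_left else p_correct
    num = p_obs_given_left * b_left
    return num / (num + p_obs_given_right * (1 - b_left))

b = 0.5
for heard_left in [True, True, False, True]:   # the third observation is the noisy one
    b = update_belief(b, heard_left)
    print(round(b, 4))                         # 0.85, 0.9698, 0.85, 0.9698
```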
Solving a POMDP
To solve a POMDP is to find, for every possible action/observation history, the action that maximizes the expected discounted reward E( Σt γ^t R(st) ).
The belief state Instead of maintaining the complete action/observation history, we maintain a belief state b. The belief is a probability distribution over the states. Dim(b) = |S|-1
The belief space Here is a representation of the belief space when we have two states (s0,s1)
The belief space Here is a representation of the belief space when we have three states (s0,s1,s2)
The belief space Here is a representation of the belief space when we have four states (s0,s1,s2,s3)
The belief space The belief space is continuous but we only visit a countable number of belief points.
The Bayesian update
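The update meant here is the standard POMDP belief update, i.e., how the belief b is revised after taking action a and observing z, written in the notation of these slides (transition probabilities T, observations z; η is a normalizing constant):

```latex
b'(s') \;=\; \eta \, P(z \mid s', a) \sum_{s} T(s, a, s')\, b(s),
\qquad \eta \;=\; \frac{1}{P(z \mid b, a)}
```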
Value Function in POMDPs
We will compute the value function over the belief space. Hard: the belief space is continuous!
But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex. We can represent any finite-horizon solution by a finite set of alpha-vectors.
V(b) = max_α [ Σ_s α(s) b(s) ]
Alpha-Vectors
Alpha-vectors are a set of hyperplanes which define the value function over the belief space. At each belief point the value function is equal to the hyperplane with the highest value.
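A minimal Python sketch of V(b) = max_α Σ_s α(s) b(s): the value of a belief point is the largest dot product with any of the alpha-vectors (the vectors below are made up):

```python
import numpy as np

def value_of_belief(b, alpha_vectors):
    """V(b) = max over the alpha-vectors of the dot product alpha . b."""
    return max(float(np.dot(alpha, b)) for alpha in alpha_vectors)

# Two made-up alpha-vectors (hyperplane coefficients) for a 2-state problem
alphas = [np.array([0.5, 1.0]), np.array([1.2, 0.2])]
print(value_of_belief(np.array([0.25, 0.75]), alphas))   # max(0.875, 0.45) = 0.875
```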
Belief Transform
Assumption: finite actions, finite observations.
Next belief state = T(cbf, a, z), where cbf: current belief state, a: action, z: observation.
There is a finite number of possible next belief states.
The transitions of this new continuous-space CO-MDP are easily derived from the transition and observation probabilities of the POMDP (remember: no formulas here).
What this means is that we are now back to solving a CO-MDP, and we can use the value iteration (VI) algorithm.
PO-MDP into continuous CO-MDP
The process is Markovian: the next belief state depends only on the current belief state, the current action, and the observation.
The discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Assume we start with a particular belief state b, take action a1 and receive observation z1 after taking that action. Then our next belief state is fully determined.
Problem: using VI in a continuous state space.
There is no nice tabular representation as before. In CO-MDP value iteration we simply maintain a table with one entry per state; the value of each state is stored in the table and we have a nice finite representation of the value function. Since we now have a continuous space, the value function is some arbitrary function over belief space.
PWLC
Restriction on the form of the solutions to the continuous-space CO-MDP: the finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
The value at any given belief state is found by plugging the belief state into the hyperplane's equation. If we represent the hyperplane as a vector (i.e., the equation coefficients) and each belief state as a vector (the probability of each state), then the value of a belief point is simply the dot product of the two vectors.
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
Steps in Value Iteration (VI)
Represent the value function for each horizon as a set of vectors. This overcomes the problem of representing a value function over a continuous space: since each horizon's value function is PWLC, it can be represented as a set of vectors (the coefficients of the hyperplanes).
To evaluate a belief state, find the vector that has the largest dot product with the belief state.
PO-MDP Value Iteration Example
Assumption: two states, two actions, three observations.
Example: horizon length is 1. Immediate rewards over (s1, s2): a1 gives [1, 0], a2 gives [0, 1.5].
For b = [0.25, 0.75]: V(a1,b) = 0.25×1 + 0.75×0 = 0.25; V(a2,b) = 0.25×0 + 0.75×1.5 = 1.125.
Since we are interested in choosing the best action, we would choose whichever action gives the highest value, which depends on the particular belief state. So we actually have a PWLC value function for the horizon 1 value function simply by considering the immediate rewards that come directly from the model.
The figure also shows the partition of belief space that this value function imposes; here is where the colours start to have some meaning. The blue region is all the belief states where action a1 is the best strategy to use, and the green region is the belief states where action a2 is the best strategy.
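A short sketch reproducing the horizon-1 computation: with the immediate-reward vectors [1, 0] for a1 and [0, 1.5] for a2, the best action, and hence the partition of the belief space, follows from the larger dot product:

```python
import numpy as np

rewards = {"a1": np.array([1.0, 0.0]),    # immediate reward in (s1, s2)
           "a2": np.array([0.0, 1.5])}

def horizon1(b):
    """Value of each action at belief b, plus the best (highest-valued) action."""
    values = {a: float(r @ b) for a, r in rewards.items()}
    best = max(values, key=values.get)
    return values, best

print(horizon1(np.array([0.25, 0.75])))   # {'a1': 0.25, 'a2': 1.125}, best action a2
print(horizon1(np.array([0.80, 0.20])))   # a1 wins for beliefs with enough weight on s1
```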
PO-MDP Value Iteration Example
The value of a belief state for horizon length 2, given b, a1, z1: the immediate reward plus the value of the next action.
Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
Define T as the function that transforms the belief state for a given belief state, action and observation (the formulas are hiding in here).
Note that from looking at where b' lies, we can immediately determine the best action to take after action a1: the belief state b' lies in the green region, which means that if we have a horizon length of 2 and are forced to take action a1 first, then the best thing we can do afterwards is action a2.
PO-MDP Value Iteration Example
Find the value for all belief points given this fixed action and observation. The transformed value function is also PWLC: in other words, if we feed the transformed belief point (namely b') into the value function, we obtain this new function.
We will use S() to represent the transformed value function for a particular action and observation. The very nice part is that the transformed value function is also PWLC, and the nicer part is that this is always the case.
PO-MDP Value Iteration Example
How to compute the value of a belief state given only the action?
The horizon-2 value of the belief state, given:
values for each observation: z1: 0.8, z2: 0.7, z3: 1.2
P(z1 | b,a1) = 0.6; P(z2 | b,a1) = 0.25; P(z3 | b,a1) = 0.15
0.6×0.8 + 0.25×0.7 + 0.15×1.2 = 0.835
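The same weighted sum in a few lines of Python (numbers as on the slide):

```python
p_z     = {"z1": 0.60, "z2": 0.25, "z3": 0.15}   # P(z | b, a1)
value_z = {"z1": 0.80, "z2": 0.70, "z3": 1.20}   # value achievable after seeing z

# Expected value over observations: 0.6*0.8 + 0.25*0.7 + 0.15*1.2 = 0.835
print(round(sum(p_z[z] * value_z[z] for z in p_z), 3))
```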
Transformed Value Functions
Each of these transformed functions partitions the belief space differently. The best next action to perform depends upon the initial belief state and the observation. Because each transformed function is weighted by its observation probability, its line segments are scaled (tilted) accordingly.
Best Value For Belief States
The value of every single belief point is the sum of: the immediate reward, and the line segments from the S() functions for each observation's future strategy.
Since adding lines gives lines, the result is linear (superposition): adding lines such as ax + by and a1x + b1y again yields a line, so we again get piecewise-linear segments.
Best Strategy for any Belief Points All the useful future strategies are easy to pick out:
Value Function and Partition For the specific action a1, the value function and corresponding partitions: Note that each one of these line segments represents a particular two action strategy. The first action is a1 for all of these segments and the second action depends upon the observation. Thus we have solved our second problem; we now know how to find the value of a belief state for a fixed action. In fact, as before, we have actually shown how to find this value for every belief state.
Value Function and Partition For the specific action a2, the value function and corresponding partitions:
Which Action to Choose?
Put the value functions for each action together to see where each action gives the highest value.
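A sketch of this step, assuming each action already comes with its own set of alpha-vectors (the vectors below are made up): evaluate each action's value function at the belief point and take the action with the maximum value.

```python
import numpy as np

# Hypothetical per-action alpha-vector sets (two states)
alpha_sets = {"a1": [np.array([2.0, 0.5]), np.array([1.0, 1.0])],
              "a2": [np.array([0.2, 2.2])]}

def best_action(b):
    """Value of each action = max dot product over its alpha-vectors; pick the argmax."""
    values = {a: max(float(v @ b) for v in vecs) for a, vecs in alpha_sets.items()}
    return max(values, key=values.get), values

print(best_action(np.array([0.9, 0.1])))   # a1 is best when the belief favours s1
print(best_action(np.array([0.1, 0.9])))   # a2 is best when the belief favours s2
```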
Compact Horizon 2 Value Function
POMDP Model Control dynamics for a POMDP
Active Learning
In an Active Learning problem the learner has the ability to influence its training data: the learner asks for the query that is most useful given its current knowledge.
Methods to find the most useful query have been shown by Cohn et al. (1995).
Active Learning (Cohn et al., 1995)
Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner. They showed how this could be done exactly for a mixture-of-Gaussians model and for locally weighted regression.
Active Perception
Automatic gesture recognition: not full-image pattern recognition but gaze-based image analysis. Fixations save computing time in image processing, but require computing time for POMDP action selection (pan/tilt/zoom of the camera).