2 Decision Processes
Agent:
- Perceives environment (St) flawlessly
- Chooses action (a)
- Which alters the state of the world (St+1)
3 Finite State Machine
(Diagram: three behaviors with signal-triggered transitions)
- A1: wander around (while no signals)
- A2: follow object (on "see BALL")
- A3: keep distance (on "see obstacle")
4 Stochastic Decision Processes
Agent:
- Perceives environment (St) flawlessly
- Chooses action (a) according to P(a|S)
- Which alters the state of the world (St+1) according to P(St+1|St,a)
5 Markov Decision Processes
Agent:
- Perceives environment (St) flawlessly
- Chooses action (a) according to P(a|S)
- Which alters the state of the world (St+1) according to P(St+1|St,a)
If there are no longer-term dependencies: 1st-order Markov process
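One agent-environment cycle of such a process can be sketched in a few lines. The two states, two actions, and all probabilities below are invented for illustration:

```python
import random

# Hypothetical 2-state, 2-action process; all probabilities are invented.
P_action = {            # P(a|s): stochastic policy
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.6, "move": 0.4},
}
P_trans = {             # P(s'|s,a): stochastic transition model
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point rounding

def step(s, rng):
    """One cycle: choose a ~ P(a|s), then the world moves to s' ~ P(s'|s,a)."""
    a = sample(P_action[s], rng)
    s_next = sample(P_trans[(s, a)], rng)
    return a, s_next
```

Because the next state depends only on (St, a), repeated calls to `step` generate a 1st-order Markov chain over states.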
6 Assumptions
- The observation of St is noise-free; all required information is observable
- Actions a are selected with probability P(a|S) (random generator)
- The consequences of a in St+1 occur stochastically with probability P(St+1|St,a)
9 MDP
- States
- Actions
- Transitions between states
- P(ai|sk) "policy": which action one decides on, given the possible circumstances s
10 Policy
π = argmax_ai P(ai|sk)
How can an agent learn this? Cost minimization:
- Reward/punishment from the environment as a consequence of behavior/actions (Thorndike, 1911)
- Reinforcements R(a,S)
- Structure of the world T = P(St+1|St,a)
11 Reinforcements
Given a history of States, Actions, and the resulting Reinforcements, an agent can learn to estimate the value of an Action.
How? The sum of reinforcements R? The average? Exponential weighting:
- the first step determines all later ones (learning from the past)
- immediate reward is more useful (computing over the future)
- impatience & mortality
12 Assigning Utility (Value) to Sequences
Discounted rewards:
V(s0, s1, s2, …) = R(s0) + γR(s1) + γ²R(s2) + …, where 0 < γ ≤ 1
where R is the reinforcement value, s refers to a state, and γ is the discount factor
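The discounted sum above can be computed directly; the reward sequence and γ below are arbitrary example values:

```python
def discounted_return(rewards, gamma):
    """V(s0, s1, ...) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three rewards of 1.0 with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
v = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With γ = 1 this reduces to the plain sum; smaller γ encodes the "impatience & mortality" preference for immediate reward.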
13 Assigning Utility to States
Can we say V(s) = R(s)? NO!
"The utility of a state is the expected utility of all states that will follow it, when policy π is used."
Transition probability T(s,a,s')
14 Assigning Utility to States
Can we say V(s) = R(s)? No: Vπ(s) is specific to each policy π:
Vπ(s) = E(Σt γ^t R(st) | π, s0 = s)
V(s) = Vπ*(s)
V(s) = R(s) + γ max_a Σ_s' T(s,a,s') V(s')   (Bellman equation)
If we solve V(s) for each state, we will have solved the optimal π* for the given MDP
15 Value Iteration Algorithm
We have to solve |S| simultaneous Bellman equations. We can't solve them directly, so we use an iterative approach:
1. Begin with arbitrary initial values V0
2. For each s, calculate V(s) from R(s) and V0
3. Use these new utility values to update V0
4. Repeat steps 2-3 until the values converge
This equilibrium is a unique solution! (see R&N, p. 621, for proof)
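The four steps above can be sketched directly; the tiny two-state model at the bottom is invented for illustration:

```python
def value_iteration(states, actions, R, T, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') * V(s')."""
    V = {s: 0.0 for s in states}          # step 1: arbitrary initial values
    while True:
        V_new = {}
        for s in states:                  # step 2: Bellman backup per state
            V_new[s] = R[s] + gamma * max(
                sum(T[(s, a, s2)] * V[s2] for s2 in states)
                for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new                         # step 3: update the values
        if delta < tol:                   # step 4: repeat until convergence
            return V

# Invented model: "go" keeps you in s1 (reward 1), "stay" bounces you back.
states, actions = ["s0", "s1"], ["go", "stay"]
R = {"s0": 0.0, "s1": 1.0}
T = {("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
     ("s0", "stay", "s0"): 1.0, ("s0", "stay", "s1"): 0.0,
     ("s1", "go", "s0"): 0.0, ("s1", "go", "s1"): 1.0,
     ("s1", "stay", "s0"): 1.0, ("s1", "stay", "s1"): 0.0}
V = value_iteration(states, actions, R, T)
```

For this model the fixed point is V(s1) = 1 + 0.9·V(s1) = 10 and V(s0) = 0.9·V(s1) = 9, which the iteration approaches geometrically (the γ-contraction is what makes the equilibrium unique).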
16 Search Space
T: S × A × S
Explicit enumeration of the combinations is often not feasible (cf. chess, Go)
Chunking within T
Problem: if S is real-valued
17 MDP → POMDP
MDP: the world is stochastic and Markovian, but the observation of that world itself is reliable; no assumptions about it need to be made.
Most 'real' problems involve:
- noise in the observation itself
- incompleteness of information
18 MDP → POMDP
Most 'real' problems involve:
- noise in the observation itself
- incompleteness of information
In these cases the agent must be able to develop a system of "Beliefs" based on series of partial observations.
21 The Tiger Problem
Description:
- 2 states: Tiger_Left, Tiger_Right
- 3 actions: Listen, Open_Left, Open_Right
- 2 observations: Hear_Left, Hear_Right
22 The Tiger Problem
Rewards:
- -1 for the Listen action
- -100 for Open(x) in the Tiger-at-x state
- +10 for Open(x) in the Tiger-not-at-x state
23 The Tiger Problem
Furthermore:
- The Listen action does not change the state
- The Open(x) action reveals the tiger behind door x with 50% chance, and resets the trial
- The Listen action gives the correct information 85% of the time:
  p(hear_left | Listen, tiger_left) = 0.85
  p(hear_right | Listen, tiger_left) = 0.15
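The model on the last three slides can be written down directly (state, action, and observation names follow the slides):

```python
STATES = ["tiger_left", "tiger_right"]
ACTIONS = ["listen", "open_left", "open_right"]
OBSERVATIONS = ["hear_left", "hear_right"]

# R(s, a): -1 for listening, -100 for opening the tiger's door, +10 otherwise.
def reward(state, action):
    if action == "listen":
        return -1
    if (state, action) in [("tiger_left", "open_left"),
                           ("tiger_right", "open_right")]:
        return -100
    return 10

# P(z | Listen, s): listening reports the correct side 85% of the time.
def obs_prob(z, state):
    correct = (state == "tiger_left") == (z == "hear_left")
    return 0.85 if correct else 0.15
```

Note that with a 50/50 belief, opening either door has expected reward 0.5·(-100) + 0.5·(+10) = -45, which is the computation used on the "Question" slide below.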
24 The Tiger Problem
Question: what policy gives the highest return in rewards?
Actions depend on beliefs!
If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5 × (-100) + 0.5 × (+10) = -45
Beliefs are updated with observations (which may be wrong)
27 The Tiger Problem, horizon t = ∞
Optimal policy: listen a few times, choose a door, go to the next trial.
listen 1: Tiger_left (p = 0.85), listen 2: Tiger_left (p ≈ 0.97), listen 3: … (binomial)
Good news: the optimal policy can be learned if actions are followed by rewards!
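The listening sequence can be verified with Bayes' rule; b is the probability that the tiger is left, and two consistent observations give 0.85²/(0.85² + 0.15²) ≈ 0.97:

```python
def update_belief(b_left, heard_left, p_correct=0.85):
    """Bayes update of P(tiger_left) after one Listen observation."""
    if heard_left:
        num = p_correct * b_left
        den = p_correct * b_left + (1 - p_correct) * (1 - b_left)
    else:
        num = (1 - p_correct) * b_left
        den = (1 - p_correct) * b_left + p_correct * (1 - b_left)
    return num / den

b = 0.5
b = update_belief(b, heard_left=True)   # 0.85 after one consistent observation
b = update_belief(b, heard_left=True)   # ~0.97 after two
```

A contradictory observation pulls the belief back toward 0.5, which is why the optimal policy listens several times before committing to a door.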
37 Value Function in POMDPs
We will compute the value function over the belief space.
Hard: the belief space is continuous!
But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
We can represent any finite-horizon solution by a finite set of alpha-vectors:
V(b) = max_α Σ_s α(s) b(s)
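The expression V(b) = max_α Σ_s α(s)b(s) is just a maximum over dot products; the two alpha-vectors below are invented examples over a two-state belief space:

```python
def value(belief, alpha_vectors):
    """V(b): the best alpha-vector's dot product with the belief vector."""
    return max(sum(a * p for a, p in zip(alpha, belief))
               for alpha in alpha_vectors)

alphas = [[0.0, 1.5], [1.0, 0.0]]   # hypothetical hyperplanes over 2 states
v = value([0.25, 0.75], alphas)     # max(1.125, 0.25) = 1.125
```

Each alpha-vector is one hyperplane; taking the max over finitely many hyperplanes is exactly what makes the value function piecewise-linear and convex.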
38 Alpha-Vectors
A set of hyperplanes that define the value function over belief space. At each belief point, the value function equals the hyperplane with the highest value.
39 Belief Transform
Assumptions: finite actions, finite observations.
Next belief state = T(cbf, a, z), where cbf is the current belief state, a the action, and z the observation.
There is a finite number of possible next belief states.
The transitions of this new continuous-space CO-MDP are easily derived from the transition and observation probabilities of the POMDP (remember: no formulas here). This means we are now back to solving a CO-MDP, and we can use the value iteration (VI) algorithm.
40 PO-MDP into continuous CO-MDP
The process is Markovian: the next belief state depends only on the current belief state, the current action, and the observation.
The discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.
Assume we start with a particular belief state b, take action a1, and receive observation z1 after taking that action. Then our next belief state is fully determined.
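A sketch of the belief transform that the slides keep implicit ("the formulas are hiding in here"), assuming tabular transition probabilities T(s,a,s') and observation probabilities O(z|s',a):

```python
def belief_transform(b, a, z, states, T, O):
    """b'(s') is proportional to O(z|s',a) * sum_s T(s'|s,a) * b(s)."""
    unnorm = {
        s2: O[(a, s2, z)] * sum(T[(s, a, s2)] * b[s] for s in states)
        for s2 in states
    }
    norm = sum(unnorm.values())          # = P(z | b, a)
    return {s2: v / norm for s2, v in unnorm.items()}

# Tiger example: Listen leaves the hidden state unchanged.
states = ["left", "right"]
T = {(s, "listen", s2): 1.0 if s == s2 else 0.0
     for s in states for s2 in states}
O = {("listen", "left", "hear_left"): 0.85,
     ("listen", "left", "hear_right"): 0.15,
     ("listen", "right", "hear_left"): 0.15,
     ("listen", "right", "hear_right"): 0.85}
b1 = belief_transform({"left": 0.5, "right": 0.5}, "listen", "hear_left",
                      states, T, O)     # {"left": 0.85, "right": 0.15}
```

Because (b, a, z) fully determines b', the belief itself is a Markovian state, which is exactly the CO-MDP conversion described above.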
41 Problem: Using VI in a Continuous State Space
There is no nice tabular representation as before. In CO-MDP value iteration we simply maintain a table with one entry per state: the value of each state is stored in the table, and we have a nice finite representation of the value function. Since we now have a continuous space, the value function is some arbitrary function over belief space.
42 PWLC
Restriction on the form of the solutions to the continuous-space CO-MDP: the finite-horizon value function is piecewise-linear and convex (PWLC) for every horizon length.
The value at any given belief state is found by plugging the belief state into the hyperplane equation. If we represent the hyperplane as a vector (the equation's coefficients) and each belief state as a vector (the probability of each state), then the value of a belief point is simply the dot product of the two vectors.
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
43 Steps in Value Iteration (VI)
- Represent the value function for each horizon as a set of vectors. Since each horizon's value function is PWLC, this solves the problem of representing a value function over a continuous space: the vectors are the coefficients of the hyperplanes.
- To evaluate a belief state, find the vector that has the largest dot product with it.
44 PO-MDP Value Iteration Example
Assumptions: two states, two actions, three observations. Example: horizon length is 1.
Since we are interested in choosing the best action, we choose whichever action gives the highest value, which depends on the particular belief state. So we have a PWLC value function for horizon 1 simply by considering the immediate rewards that come directly from the model. The figure also shows the partition of belief space that this value function imposes; here the colors start to have meaning: the blue region contains the belief states where action a1 is the best strategy, and the green region those where a2 is best.
For b = [0.25, 0.75] over (s1, s2):
V(a1,b) = 0.25 × 1 + 0.75 × 0 = 0.25
V(a2,b) = 0.25 × 0 + 0.75 × 1.5 = 1.125
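Reading the immediate-reward vectors off the slide's arithmetic (a1 → [1, 0], a2 → [0, 1.5] over states [s1, s2]), the horizon-1 choice is a dot-product comparison; a minimal sketch:

```python
def horizon1_value(b, reward_vectors):
    """Pick the action whose immediate-reward vector maximizes r . b."""
    values = {a: sum(r * p for r, p in zip(vec, b))
              for a, vec in reward_vectors.items()}
    best = max(values, key=values.get)
    return best, values

# Reward vectors inferred from the slide's arithmetic.
rewards = {"a1": [1.0, 0.0], "a2": [0.0, 1.5]}
best, values = horizon1_value([0.25, 0.75], rewards)
# best == "a2"; values == {"a1": 0.25, "a2": 1.125}
```

Sweeping b from [1, 0] to [0, 1] traces out the two line segments of the horizon-1 value function and the blue/green partition where each action wins.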
45 PO-MDP Value Iteration Example
The value of a belief state for horizon length 2, given b, a1, z1: the immediate reward plus the value of the next action.
Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
Define T as the function that transforms the belief state for a given belief state, action, and observation (the formulas are hiding in here). Note that from looking at where b' lies, we can immediately determine the best action to take after a1: b' lies in the green region, which means that with a horizon of 2, if we are forced to take action a1 first, the best thing to do afterwards is action a2.
46 PO-MDP Value Iteration Example
Find the value for all belief points given this fixed action and observation.
That is, if we feed the transformed belief point (namely b') into the value function, we derive this new function.
We will use S() to represent the transformed value function for a particular action and observation. The very nice part is that the transformed value function is also PWLC, and the nicer part is that this is always the case.
47 PO-MDP Value Iteration Example
How do we compute the value of a belief state given only the action?
The horizon-2 value of the belief state, given:
- values for each observation: z1: 0.7, z2: 0.8, z3: 1.2
- P(z1|b,a1) = 0.6; P(z2|b,a1) = 0.25; P(z3|b,a1) = 0.15
0.6 × 0.7 + 0.25 × 0.8 + 0.15 × 1.2 = 0.80
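Recomputing the weighted sum from the per-observation values and observation probabilities stated on the slide:

```python
obs_values = {"z1": 0.7, "z2": 0.8, "z3": 1.2}     # S(b, a1, z) per observation
obs_probs = {"z1": 0.6, "z2": 0.25, "z3": 0.15}    # P(z | b, a1)

def expected_value(values, probs):
    """E[V] = sum_z P(z|b,a) * S(b,a,z): average the per-observation values."""
    return sum(probs[z] * values[z] for z in values)

ev = expected_value(obs_values, obs_probs)   # 0.6*0.7 + 0.25*0.8 + 0.15*1.2
```

Since the agent cannot know in advance which observation it will receive, the action's value is this probability-weighted average over all observations.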
48 Transformed Value Functions
Each of these transformed functions partitions the belief space differently.
The best next action to perform depends on the initial belief state and the observation.
The segments are skewed like this because they are multiplied by the observation probabilities.
49 Best Value For Belief States
The value of every single belief point is the sum of:
- the immediate reward
- the line segments from the S() functions for each observation's future strategy
Since adding lines gives you lines, the result is linear: by superposition, ax + by + a1x + b1y is again a line, so the sum is again a set of piecewise-linear segments.
50 Best Strategy for Any Belief Point
All the useful future strategies are easy to pick out:
51 Value Function and Partition
For the specific action a1, the value function and corresponding partitions:
Each of these line segments represents a particular two-action strategy: the first action is a1 for all segments, and the second action depends on the observation. Thus we have solved our second problem: we now know how to find the value of a belief state for a fixed action; in fact, as before, we have shown how to find this value for every belief state.
52 Value Function and Partition
For the specific action a2, the value function and corresponding partitions:
53 Which Action to Choose?
Put the value functions for each action together to see where each action gives the highest value.
56 Active Learning
In an active learning problem, the learner has the ability to influence its training data: it asks for the data that is most useful given its current knowledge.
Methods to find the most useful query were shown by Cohn et al. (1995).
57 Active Learning (Cohn et al., 1995)
Their method, used for function-approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
They showed how this can be done exactly:
- for a mixture-of-Gaussians model
- for locally weighted regression
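Cohn et al.'s exact variance computations are specific to those two model classes; as a simplified illustration of the same idea, one can estimate predictive variance with an ensemble and query where it is largest. Everything below (the candidate pool, the toy linear ensemble) is an invented sketch, not their method:

```python
def query_by_variance(candidates, ensemble):
    """Pick the candidate input where the ensemble's predictions disagree most."""
    def variance(x):
        preds = [model(x) for model in ensemble]
        mean = sum(preds) / len(preds)
        return sum((p - mean) ** 2 for p in preds) / len(preds)
    return max(candidates, key=variance)

# Toy ensemble: linear models that agree near x=0 and diverge for large x.
ensemble = [lambda x, s=s: s * x for s in (0.8, 1.0, 1.2)]
best = query_by_variance([0.0, 1.0, 5.0], ensemble)
```

Here the disagreement grows with |x|, so the learner asks for a label at the input where its current knowledge is weakest, which is the essence of variance-minimizing query selection.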
58 Active Perception
Automatic gesture recognition:
- not full-image pattern recognition
- gaze-based image analysis: fixations
- saves computing time in image processing
- requires computing time for POMDP action selection (pan/tilt/zoom of the camera)