1 KI2 – MDP / POMDP Kunstmatige Intelligentie / RuG

2 Decision Processes
– The agent perceives the environment (S_t) flawlessly
– Chooses an action (a)
– The action alters the state of the world (S_{t+1})

3 Example finite state machine (diagram): actions A1: idle around, A2: follow object, A3: keep distance; transitions triggered by "no signals", "see obstacle", "see BALL".

4 Stochastic Decision Processes
– The agent perceives the environment (S_t) flawlessly
– Chooses an action (a) according to P(a|S)
– The action alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t, a)

5 Markov Decision Processes
– The agent perceives the environment (S_t) flawlessly
– Chooses an action (a) according to P(a|S)
– The action alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t, a)
– If there are no longer-term dependencies: a 1st-order Markov process

6 Assumptions
– The observation of S_t is noise-free; all required information is observable
– Actions a are selected with probability P(a|S) (random generator)
– The consequences of a (the next state S_{t+1}) occur stochastically with probability P(S_{t+1}|S_t, a)

7 A policy (grid-world figure: START state and a +1 goal state)

8 A policy (grid-world figure: START state and a +1 goal state)

9 MDP
– States
– Actions
– Transitions between states
– P(a_i|s_k), the "policy": which action a one decides on, given the possible circumstances s

10 Policy π
– "argmax_{a_i} P(a_i|s_k)"
– How can an agent learn this?
– Cost minimization
– Reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911)
  → Reinforcements R(a,S)
  → Structure of the world T = P(S_{t+1}|S_t)

11 Reinforcements
– Given a history of states, actions and the resulting reinforcements, an agent can learn to estimate the value of an action.
– How: the sum of the reinforcements R? The average?
– Exponential weighting:
  – the first step determines everything that follows (learning from the past)
  – immediate reward is more useful (reasoning about the future)
  – impatience & mortality

12 Assigning Utility (Value) to Sequences
Discounted rewards: V(s_0, s_1, s_2, ...) = R(s_0) + γ R(s_1) + γ² R(s_2) + ..., where 0 < γ ≤ 1, R is the reinforcement value, s refers to a state, and γ is the discount factor.
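
For illustration, a discounted return could be computed like this (a minimal sketch; the function name and the example rewards are made up, gamma plays the role of the discount factor γ):

```python
def discounted_return(rewards, gamma=0.9):
    """V(s0, s1, s2, ...) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative sequence of reinforcements: 1, 0, 10 with gamma = 0.9
print(discounted_return([1, 0, 10]))   # 1 + 0.9*0 + 0.81*10 ≈ 9.1
```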

13 Assigning Utility to States
– Can we say V(s) = R(s)? NO!
– "The utility of a state is the expected utility of all states that will follow it, when policy π is used."
– Transition probability T(s,a,s')

14 Assigning Utility to States
– Can we say V(s) = R(s)?
– V^π(s) is specific to each policy π
– V^π(s) = E( Σ_t γ^t R(s_t) | π, s_0 = s )
– V(s) = V^{π*}(s)
– Bellman equation: V(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') V(s')
If we solve V(s) for every state, we have solved the optimal policy π* for the given MDP.

15 Value Iteration Algorithm
– We have to solve |S| simultaneous Bellman equations
– We can't solve them directly, so we use an iterative approach:
1. Begin with arbitrary initial values V_0
2. For each s, calculate V(s) from R(s) and V_0
3. Use these new utility values to update V_0
4. Repeat steps 2-3 until V_0 converges
This equilibrium is a unique solution! (See R&N, p. 621, for the proof.)
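
A minimal value-iteration sketch along these lines, assuming a small tabular MDP with per-state rewards R, transition probabilities T[s, a, s'] and discount gamma (array names and the convergence test are illustrative, not from the slides):

```python
import numpy as np

def value_iteration(R, T, gamma=0.9, eps=1e-6):
    """R: reward per state, shape (S,).  T: transition probabilities, shape (S, A, S),
    T[s, a, s2] = P(s2 | s, a).  Iterates the Bellman backup
    V(s) = R(s) + gamma * max_a sum_s2 T(s, a, s2) * V(s2) until convergence."""
    V = np.zeros(len(R))                      # step 1: arbitrary initial values V0
    while True:
        Q = R[:, None] + gamma * (T @ V)      # step 2: backup for every state and action
        V_new = Q.max(axis=1)                 # greedy over actions
        if np.max(np.abs(V_new - V)) < eps:   # step 4: stop when V has converged
            return V_new
        V = V_new                             # step 3: use the new values as V0
```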

16 Search space
– T: S × A × S
– Explicit enumeration of all combinations is often not feasible (cf. chess, Go)
– Chunking within T
– Problem: what if S is real-valued?

17 MDP → POMDP
– MDP: the world is stochastic and Markovian, but the observation of that world is itself reliable; no further assumptions need to be made.
– Most 'real' problems involve:
  – noise in the observation itself
  – incomplete information

18 MDP → POMDP
– Most 'real' problems involve:
  – noise in the observation itself
  – incomplete information
– In these cases the agent must be able to develop a set of "beliefs" on the basis of series of partial observations.

19 Partially Observable Markov Decision Processes (POMDPs)
A POMDP has:
– States S
– Actions A
– Probabilistic transitions
– Immediate rewards on actions
– A discount factor
– + Observations Z
– + Observation probabilities (reliabilities)
– + An initial belief b0

20 A POMDP example: The Tiger Problem

21 The Tiger Problem
Description:
– 2 states: Tiger_Left, Tiger_Right
– 3 actions: Listen, Open_Left, Open_Right
– 2 observations: Hear_Left, Hear_Right

22 The Tiger Problem
Rewards are:
– -1 for the Listen action
– -100 for Open(x) in the Tiger-at-x state
– +10 for Open(x) in the Tiger-not-at-x state

23 The Tiger Problem
Furthermore:
– The Listen action does not change the state
– The Open(x) action reveals the tiger behind door x with 50% chance, and resets the trial
– The Listen action gives the correct information 85% of the time:
  p(hear_left | Listen, tiger_left) = 0.85
  p(hear_right | Listen, tiger_left) = 0.15
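
As a side note, the Tiger model above could be written down as plain data, e.g. (a sketch; the dictionary layout and identifier names are illustrative, the numbers are taken from the slides):

```python
# Tiger POMDP (sketch): states, actions, observations as plain strings
STATES = ["tiger_left", "tiger_right"]
ACTIONS = ["listen", "open_left", "open_right"]
OBSERVATIONS = ["hear_left", "hear_right"]

# Immediate rewards R(s, a): -1 for listening, -100 for opening the tiger door,
# +10 for opening the other door
REWARDS = {
    ("tiger_left", "listen"): -1,       ("tiger_right", "listen"): -1,
    ("tiger_left", "open_left"): -100,  ("tiger_right", "open_left"): 10,
    ("tiger_left", "open_right"): 10,   ("tiger_right", "open_right"): -100,
}

# Observation probabilities p(z | listen, s): correct 85% of the time
OBS_PROB = {
    ("hear_left", "tiger_left"): 0.85,  ("hear_right", "tiger_left"): 0.15,
    ("hear_left", "tiger_right"): 0.15, ("hear_right", "tiger_right"): 0.85,
}
```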

24 The Tiger Problem
– Question: which policy gives the highest return in rewards?
– Actions depend on beliefs!
– If the belief is 50/50 L/R and a door is opened, the expected reward is R = 0.5 · (-100) + 0.5 · (+10) = -45
– Beliefs are updated with observations (which may be wrong)

25 The Tiger Problem, horizon t=1
Optimal policy (belief that Tiger=left → action):
– [0.00, 0.10] → Open(left)
– [0.10, 0.90] → Listen
– [0.90, 1.00] → Open(right)

26 The Tiger Problem, horizon t=2
Optimal policy (belief that Tiger=left → action):
– [0.00, 0.10] → Listen
– [0.10, 0.90] → Listen
– [0.90, 1.00] → Listen

27 The Tiger Problem, horizon t=Inf
Optimal policy:
– listen a few times
– choose a door
– next trial
Example: listen1: Tiger left (p=0.85), listen2: Tiger left (p=0.96), listen3: ... (binomial)
Good news: the optimal policy can be learned if actions are followed by rewards!

28 The Tiger Problem, belief updates on "Listen"
Bayes update of the belief P_t = P(Tiger_left) after observing hear_left:
P_{t+1} = 0.85 · P_t / ( 0.85 · P_t + 0.15 · (1 − P_t) ),
where 0.85 = p(hear_left | Listen, Tiger_left) and 0.15 = p(hear_left | Listen, Tiger_right).
Example:
– initial: Tiger_left (p=0.5000), listen1: Tiger_left (p=0.8500), listen2: Tiger_left (p=0.9698), listen3: Tiger_left (p=0.9945), listen4: ... (note: underlying binomial distribution)

29 The Tiger Problem, belief updates on "Listen"
The same Bayes update as on the previous slide; hearing the tiger on the other side lowers the belief again.
Example 2, noise in the observation:
– initial: Tiger_left (p=0.5000), listen1: Tiger_left (p=0.8500), listen2: Tiger_left (p=0.9698), listen3: Tiger_right (p=0.8500), the belief drops... listen4: Tiger_left (p=0.9698), and it recovers, listen5: ...
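
A small sketch of this belief update in code (the function name is illustrative; the 0.85/0.15 likelihoods are the ones from the slides); it reproduces the numbers of example 2:

```python
def update_belief(b_left, heard_left, p_correct=0.85):
    """Bayes update of P(Tiger_left) after one Listen observation (sketch)."""
    p_z_left = p_correct if heard_left else 1 - p_correct    # p(z | Tiger_left)
    p_z_right = 1 - p_z_left                                  # p(z | Tiger_right)
    return p_z_left * b_left / (p_z_left * b_left + p_z_right * (1 - b_left))

b = 0.5                                        # initial belief P(Tiger_left)
for heard_left in (True, True, False, True):   # example 2: the third observation is noisy
    b = update_belief(b, heard_left)
    print(round(b, 4))                         # 0.85, 0.9698, 0.85, 0.9698
```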

30 Solving a POMDP
To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward.

31 The belief state
– Instead of maintaining the complete action/observation history, we maintain a belief state b.
– The belief is a probability distribution over the states. Dim(b) = |S| - 1

32 The belief space
Here is a representation of the belief space when we have two states (s0, s1).

33 The belief space
Here is a representation of the belief space when we have three states (s0, s1, s2).

34 The belief space
Here is a representation of the belief space when we have four states (s0, s1, s2, s3).

35 The belief space
The belief space is continuous, but we only visit a countable number of belief points.

36 The Bayesian update
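
The slide's equation itself is not in the transcript. What follows is a sketch of the standard POMDP belief update it refers to, b'(s') ∝ O(z | s', a) Σ_s T(s' | s, a) b(s), with illustrative array names:

```python
import numpy as np

def bayes_update(b, a, z, T, O):
    """Sketch of the standard POMDP belief update (reconstructed, not copied from the slide):
    b'(s') is proportional to O[a][s', z] * sum_s T[a][s, s'] * b(s), then normalised.
    T[a] has shape (S, S) with T[a][s, s'] = P(s' | s, a);
    O[a] has shape (S, Z) with O[a][s', z] = P(z | s', a)."""
    predicted = b @ T[a]                   # prediction step: sum over current states s
    unnormalised = O[a][:, z] * predicted  # correction step: weight by observation likelihood
    return unnormalised / unnormalised.sum()
```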

37 Value Function in POMDPs
We will compute the value function over the belief space.
– Hard: the belief space is continuous!
– But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
– We can represent any finite-horizon solution by a finite set of alpha-vectors.
– V(b) = max_α [ Σ_s α(s) b(s) ]

38 Alpha-Vectors
They are a set of hyperplanes that define the value function over the belief space. At each belief point the value function equals the hyperplane with the highest value.

39 Belief Transform
Assumptions:
– Finite action set
– Finite observation set
– Next belief state = T(cbf, a, z), where cbf is the current belief state, a the action, z the observation
– Hence there is a finite number of possible next belief states

40 PO-MDP into continuous CO-MDP
– The process is Markovian; the next belief state depends only on:
  – the current belief state
  – the current action
  – the observation
– A discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.

41 Problem
– Using VI in a continuous state space.
– No nice tabular representation as before.

42 PWLC
Restrictions on the form of the solutions to the continuous-space CO-MDP:
– The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
– The value of a belief point is simply the dot product of two vectors (the belief and an alpha-vector).
GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

43 Steps in Value Iteration (VI)
– Represent the value function for each horizon as a set of vectors.
  – This overcomes the problem of representing a value function over a continuous space.
– Find the vector that has the largest dot product with the belief state.

44 PO-MDP Value Iteration Example
Assumptions:
– Two states (s1, s2)
– Two actions (a1, a2)
– Three observations
Example: horizon length is 1, belief b = [0.25, 0.75].
Immediate rewards: a1: R(s1) = 1, R(s2) = 0; a2: R(s1) = 0, R(s2) = 1.5.
V(a1, b) = 0.25 × 1 + 0.75 × 0 = 0.25
V(a2, b) = 0.25 × 0 + 0.75 × 1.5 = 1.125
The belief space is partitioned into a region where a1 is the best and a region where a2 is the best; at b = [0.25, 0.75], a2 is the best.
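
A quick check of this horizon-1 computation in code (for horizon 1 the immediate-reward vectors act as the alpha-vectors; the numbers are the ones on the slide):

```python
import numpy as np

b = np.array([0.25, 0.75])                  # belief over (s1, s2)
alpha = {"a1": np.array([1.0, 0.0]),        # immediate rewards (R(s1), R(s2)) for a1
         "a2": np.array([0.0, 1.5])}        # immediate rewards (R(s1), R(s2)) for a2

values = {a: float(v @ b) for a, v in alpha.items()}
print(values)                               # {'a1': 0.25, 'a2': 1.125}
print(max(values, key=values.get))          # a2 is the best action for this belief
```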

45 PO-MDP Value Iteration Example
The value of a belief state for horizon length 2, given b, a1, z1:
– the value of the immediate action plus the value of the next action.
– Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.

46 PO-MDP Value Iteration Example
– Find the value for all belief points given this fixed action and observation.
– The transformed value function is also PWLC.

47 PO-MDP Value Iteration Example
How do we compute the value of a belief state given only the action?
The horizon-2 value of the belief state, given:
– values for each observation: z1: 0.7, z2: 0.8, z3: 1.2
– P(z1 | b, a1) = 0.6; P(z2 | b, a1) = 0.25; P(z3 | b, a1) = 0.15
– 0.6 × 0.7 + 0.25 × 0.8 + 0.15 × 1.2 = 0.8
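
The same expectation over observations, as a tiny sketch (values and probabilities as on the slide):

```python
# Horizon-2 value of the belief state for action a1: the expectation over observations
obs_value = {"z1": 0.7, "z2": 0.8, "z3": 1.2}     # best achievable value after each z
obs_prob = {"z1": 0.6, "z2": 0.25, "z3": 0.15}    # P(z | b, a1)

value_a1 = sum(obs_prob[z] * obs_value[z] for z in obs_value)
print(round(value_a1, 2))                          # 0.8
```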

48 Transformed Value Functions
– Each of these transformed functions partitions the belief space differently.
– The best next action to perform depends upon the initial belief state and observation.

49 Best Value For Belief States
The value of every single belief point is the sum of:
– the immediate reward, and
– the line segments from the S() functions for each observation's future strategy.
Since adding lines gives you lines, the result is linear.

50 Best Strategy for any Belief Point
All the useful future strategies are easy to pick out (figure).

51 Value Function and Partition
For the specific action a1, the value function and corresponding partitions (figure).

52 Value Function and Partition
For the specific action a2, the value function and corresponding partitions (figure).

53 Which Action to Choose?
Put the value functions for each action together to see where each action gives the highest value.

54 Compact Horizon 2 Value Function

55 POMDP Model
Control dynamics for a POMDP (figure).

56 Active Learning
– In an active learning problem, the learner has the ability to influence its training data.
– The learner asks for the query that is most useful given its current knowledge.
– Methods to find the most useful query have been shown by Cohn et al. (1995).

57 Active Learning (Cohn et al. 1995)
– Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
– They showed how this could be done exactly:
  – for a mixture of Gaussians model
  – for locally weighted regression

58 Active Perception
Automatic gesture recognition:
– not full-image pattern recognition
– gaze-based image analysis: fixations
– saves computing time in image processing
– requires computing time for POMDP action selection (pan/tilt/zoom of the camera)

