KI2 – MDP / POMDP Kunstmatige Intelligentie / RuG.

Transcript of the presentation:

KI2 – MDP / POMDP Kunstmatige Intelligentie / RuG

Decision Processes
The agent perceives the environment (S_t) flawlessly, chooses an action (a), which alters the state of the world (S_{t+1}).

Finite state machine (figure): behaviour states A1 "wander around a bit", A2 "follow object", A3 "keep distance", with transitions triggered by the signals "no signals", "see BALL" and "see obstacle".

Stochastic Decision Processes
The agent perceives the environment (S_t) flawlessly, chooses an action (a) according to P(a|S), which alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t,a).

Markov Decision Processes
The agent perceives the environment (S_t) flawlessly, chooses an action (a) according to P(a|S), which alters the state of the world (S_{t+1}) according to P(S_{t+1}|S_t,a). → If there are no longer-term dependencies: a 1st-order Markov process.

Assumptions
The observation of S_t is noise-free; all required information is observable. Actions a are selected with probability P(a|S) (random generator). The consequences of a in (S_{t+1}) occur stochastically with probability P(S_{t+1}|S_t,a).

A policy (figure: grid world with reward states +1 and -1 and a START cell)

A policy (figure: grid world with reward states +1 and -1 and a START cell)

MDP
States, Actions, Transitions between states. P(a_i|s_k), the "policy": which actions a one decides on, given the possible circumstances s.

Policy π: "argmax_{a_i} P(a_i|s_k)". How can an agent learn this? Cost minimization: reward/punishment from the environment as a consequence of behaviour/actions (Thorndike, 1911).
→ Reinforcements R(a,S)
→ Structure of the world T = P(S_{t+1}|S_t)

Reinforcements
Given a history of States, Actions and the resulting Reinforcements, an agent can learn to estimate the value of an Action. How? The sum of reinforcements R? The average? → Exponential weighting:
- the first step determines all later ones (learning from the past)
- an immediate reward is more useful (reasoning about the future)
- impatience & mortality

Assigning Utility (Value) to Sequences
Discounted rewards: V(s0, s1, s2, …) = R(s0) + γ R(s1) + γ² R(s2) + …, where 0 < γ ≤ 1, R is the reinforcement value, s refers to a state, and γ is the discount factor.
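As a quick illustration, here is a minimal sketch (the function and the numbers are made up for illustration, not from the slides) that computes the discounted value of a reward sequence:

```python
def discounted_value(rewards, gamma=0.9):
    """V = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ... for a sequence of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of reward 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_value([1, 1, 1]))
```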

Assigning Utility to States
Can we say V(s) = R(s)? NO: "the utility of a state is the expected utility of all states that will follow it, when policy π is followed." Transition probability T(s,a,s').

Assigning Utility to States
Can we say V(s) = R(s)? No: V^π(s) is specific to each policy π,
V^π(s) = E( Σ_t γ^t R(s_t) | π, s_0 = s ),
and V(s) = V^{π*}(s). The Bellman equation:
V(s) = R(s) + γ max_a Σ_{s'} T(s,a,s') V(s')
If we solve V(s) for each state, we have solved for the optimal policy π* of the given MDP.

Value Iteration Algorithm
We have to solve |S| simultaneous Bellman equations. They cannot be solved directly, so we use an iterative approach:
1. Begin with arbitrary initial values V0.
2. For each s, calculate V(s) from R(s) and V0.
3. Use these new utility values to update V0.
4. Repeat steps 2-3 until V0 converges.
This equilibrium is a unique solution! (See R&N, page 621, for the proof.)
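A minimal value-iteration sketch in Python following the Bellman update above (the array layout, discount factor and convergence threshold are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def value_iteration(R, T, gamma=0.9, eps=1e-6):
    """R[s]: reward of state s; T[s, a, s2]: transition probability s -> s2 under a.
    Repeats V(s) <- R(s) + gamma * max_a sum_s2 T[s, a, s2] * V(s2) until convergence."""
    n_states = R.shape[0]
    V = np.zeros(n_states)                      # step 1: arbitrary initial values V0
    while True:
        Q = R[:, None] + gamma * (T @ V)        # step 2: one-step lookahead, Q[s, a]
        V_new = Q.max(axis=1)                   # step 3: new utility values
        if np.max(np.abs(V_new - V)) < eps:     # step 4: repeat until convergence
            return V_new, Q.argmax(axis=1)      # values and the greedy policy
        V = V_new
```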

Search space
T: S × A × S. Explicit enumeration of all combinations is often not feasible (cf. chess, Go). Chunking within T. Problem: what if S is real-valued?

MDP → POMDP
MDP: the world may be stochastic and Markovian, but the observation of that world itself is reliable; no assumptions need to be made about it. Most 'real' problems involve: noise in the observation itself, and incompleteness of information.

MDP → POMDP
Most 'real' problems involve: noise in the observation itself, and incompleteness of information. In these cases the agent must be able to develop a system of "beliefs" based on series of partial observations.

Partially Observable Markov Decision Processes (POMDPs)
A POMDP has:
- States S
- Actions A
- Probabilistic transitions
- Immediate rewards on actions
- A discount factor
+ Observations Z
+ Observation probabilities (reliabilities)
+ An initial belief b0

A POMDP example: The Tiger Problem

The Tiger Problem
Description:
- 2 states: Tiger_Left, Tiger_Right
- 3 actions: Listen, Open_Left, Open_Right
- 2 observations: Hear_Left, Hear_Right

The Tiger Problem
Rewards:
- -1 for the Listen action
- -100 for Open(x) in the Tiger-at-x state
- +10 for Open(x) in the Tiger-not-at-x state

The Tiger Problem
Furthermore:
- The Listen action does not change the state.
- The Open(x) action reveals the tiger behind door x with 50% chance, and resets the trial.
- The Listen action gives the correct information 85% of the time:
  p(hear_left | Listen, tiger_left) = 0.85
  p(hear_right | Listen, tiger_left) = 0.15
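This specification can be written down directly; a minimal Python sketch (identifier names are illustrative, and the 50/50 reset after opening a door is left implicit):

```python
STATES = ["tiger_left", "tiger_right"]
ACTIONS = ["listen", "open_left", "open_right"]
OBSERVATIONS = ["hear_left", "hear_right"]

def reward(state, action):
    """-1 for listening, -100 for opening the tiger's door, +10 for the other door."""
    if action == "listen":
        return -1
    tiger_door = "open_left" if state == "tiger_left" else "open_right"
    return -100 if action == tiger_door else 10

def p_obs(obs, state):
    """Listening reports the correct side 85% of the time; Listen does not change the state."""
    correct = (obs == "hear_left") == (state == "tiger_left")
    return 0.85 if correct else 0.15
```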

The Tiger Problem
Question: what policy gives the highest return in rewards? Actions depend on beliefs! If the belief is 50/50 L/R, the expected reward of opening a door is R = 0.5 * (-100 + 10) = -45. Beliefs are updated with observations (which may be wrong).

The Tiger Problem, horizon t=1
Optimal policy:
Belief Tiger=left   Action
[0.00, 0.10]        Open(left)
[0.10, 0.90]        Listen
[0.90, 1.00]        Open(right)
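The same horizon-1 policy as a small lookup function, a sketch using the thresholds from the table above (the function name is made up):

```python
def tiger_policy_h1(b_left):
    """Optimal horizon-1 action, given the belief b_left = P(tiger behind the left door)."""
    if b_left <= 0.10:
        return "open_left"      # tiger is almost certainly behind the right door
    if b_left >= 0.90:
        return "open_right"     # tiger is almost certainly behind the left door
    return "listen"
```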

The Tiger Problem, horizon t=2
Optimal policy:
Belief Tiger=left   Action
[0.00, 0.10]        Listen
[0.10, 0.90]        Listen
[0.90, 1.00]        Listen

The Tiger Problem, horizon t=Inf
Optimal policy: listen a few times, choose a door, next trial. listen1: Tiger_left (p=0.85), listen2: Tiger_left (p=0.96), listen3: ... (binomial). Good news: the optimal policy can be learned if actions are followed by rewards!

The Tiger Problem, belief updates on "Listen"
Bayes update of the belief b_t = P(Tiger_left) after hearing observation z:
b_{t+1} = P(z | Tiger_left) * b_t / ( P(z | Tiger_left) * b_t + P(z | Tiger_right) * (1 - b_t) )
with P(hear_left | Tiger_left) = 0.85 and P(hear_right | Tiger_left) = 0.15.
Example: initial (p=0.5000), listen1: Tiger_left (p=0.8500), listen2: Tiger_left (p=0.9698), listen3: Tiger_left (p=0.9945), listen4: ... (Note: underlying binomial distribution.)

The Tiger Problem, belief updates on "Listen"
Using the same Bayes update, a contradictory observation pulls the belief back down.
Example 2, noise in the observation: initial (p=0.5000), listen1: Tiger_left (p=0.8500), listen2: Tiger_left (p=0.9698), listen3: Tiger_right (p=0.8500), the belief drops... listen4: Tiger_left (p=0.9698), and it recovers. listen5: ...
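A minimal sketch of this update in Python (names are illustrative), reproducing the sequence above including the drop and the recovery:

```python
def update_belief(b_left, heard_left, p_correct=0.85):
    """Bayes update of P(tiger_left) after one Listen observation."""
    p_z_left = p_correct if heard_left else 1 - p_correct       # P(z | tiger_left)
    p_z_right = 1 - p_correct if heard_left else p_correct      # P(z | tiger_right)
    return p_z_left * b_left / (p_z_left * b_left + p_z_right * (1 - b_left))

b = 0.5
for heard_left in (True, True, False, True):
    b = update_belief(b, heard_left)
    print(round(b, 4))    # 0.85, 0.9698, 0.85 (belief drops), 0.9698 (recovers)
```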

Solving a POMDP
To solve a POMDP is to find, for every action/observation history, the action that maximizes the expected discounted reward E[ Σ_t γ^t R(s_t, a_t) ].

The belief state Instead of maintaining the complete action/observation history, we maintain a belief state b. The belief is a probability distribution over the states. Dim(b) = |S|-1

The belief space Here is a representation of the belief space when we have two states (s0,s1)

The belief space Here is a representation of the belief state when we have three states (s0,s1,s2)

The belief space Here is a representation of the belief state when we have four states (s0,s1,s2,s3)

The belief space The belief space is continuous but we only visit a countable number of belief points.

The Bayesian update
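As a sketch under the standard POMDP definitions (the array names T for the transition model and O for the observation model are assumptions), the Bayesian belief update b'(s') ∝ O(s',a,z) * Σ_s T(s,a,s') * b(s), normalized by P(z | a, b), looks like this:

```python
import numpy as np

def bayes_update(b, a, z, T, O):
    """b: belief over states; T[s, a, s2]: transition probs; O[s2, a, z]: observation probs.
    Returns b2(s2) = O[s2, a, z] * sum_s T[s, a, s2] * b[s], normalized by P(z | a, b)."""
    predicted = T[:, a, :].T @ b            # sum_s T[s, a, s2] * b[s]
    unnormalized = O[:, a, z] * predicted
    return unnormalized / unnormalized.sum()
```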

Value Function in POMDPs
We will compute the value function over the belief space. Hard: the belief space is continuous! But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex. We can represent any finite-horizon solution by a finite set of alpha-vectors:
V(b) = max_α [ Σ_s α(s) b(s) ]
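A minimal sketch of evaluating V(b) from a set of alpha-vectors (the vectors and belief below are made-up illustrations):

```python
import numpy as np

def value(b, alphas):
    """V(b) = max over alpha-vectors of the dot product alpha . b."""
    return max(float(alpha @ b) for alpha in alphas)

alphas = [np.array([2.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
b = np.array([0.4, 0.6])
print(value(b, alphas))   # max(0.8, 0.6, 1.0) = 1.0
```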

Alpha-Vectors
They are a set of hyperplanes which define the value function over the belief space. At each belief point the value function is equal to the hyperplane with the highest value.

Belief Transform
Assumptions: finite actions, finite observations. Next belief state = T(cbf, a, z), where cbf is the current belief state, a the action, and z the observation; so there is a finite number of possible next belief states. The transitions of this new continuous-space CO-MDP are easily derived from the transition and observation probabilities of the POMDP (remember: no formulas here). This means we are now back to solving a CO-MDP, and we can use the value iteration (VI) algorithm.

PO-MDP into continuous CO-MDP
The process is Markovian: the next belief state depends only on the current belief state, the current action, and the observation. The discrete PO-MDP problem can be converted into a continuous-space CO-MDP problem where the continuous space is the belief space. Assume we start with a particular belief state b, take action a1, and receive observation z1 after taking that action: our next belief state is then fully determined.

Problem: using VI in a continuous state space
There is no nice tabular representation as before. In CO-MDP value iteration we simply maintain a table with one entry per state; the value of each state is stored in the table and we have a nice finite representation of the value function. Since we now have a continuous space, the value function is some arbitrary function over the belief space.

PWLC
Restriction on the form of the solutions to the continuous-space CO-MDP: the finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length. The value at any given belief state is found by plugging the belief state into the hyperplane's equation: if we represent the hyperplane as a vector (i.e., the equation coefficients) and each belief state as a vector (the probability of each state), then the value of a belief point is simply the dot product of the two vectors. GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

Steps in Value Iteration (VI)
Represent the value function for each horizon as a set of vectors; this overcomes the problem of representing a value function over a continuous space. Since each horizon's value function is PWLC, we solve this by representing the value function as a set of vectors (the coefficients of the hyperplanes) and, for a given belief state, finding the vector that has the largest dot product with it.

PO-MDP Value Iteration Example
Assumptions: two states, two actions, three observations. Example: horizon length 1. Since we are interested in choosing the best action, we choose whichever action gives the highest value, which depends on the particular belief state. So we already have a PWLC value function for horizon 1, simply by considering the immediate rewards that come directly from the model. The figure also shows the partition of the belief space that this value function imposes; this is where the colours start to have meaning: the blue region contains all the belief states where action a1 is the best strategy, and the green region the belief states where a2 is the best strategy. With immediate rewards of 1 and 0 for a1 in states s1 and s2, and 0 and 1.5 for a2, the belief b = [0.25 0.75] gives:
V(a1,b) = 0.25 x 1 + 0.75 x 0 = 0.25
V(a2,b) = 0.25 x 0 + 0.75 x 1.5 = 1.125
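The same check in a few lines of Python (the reward vectors are read off from the dot products on the slide; names are illustrative):

```python
import numpy as np

b = np.array([0.25, 0.75])      # belief over (s1, s2)
r_a1 = np.array([1.0, 0.0])     # immediate rewards of a1 in s1 and s2
r_a2 = np.array([0.0, 1.5])     # immediate rewards of a2 in s1 and s2
print(b @ r_a1, b @ r_a2)       # 0.25 1.125 -> a2 is the best action at this belief
```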

PO-MDP Value Iteration Example
The value of a belief state for horizon length 2, given b, a1, z1: the immediate reward of the action plus the value of the next action. Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1. Define T as the function that transforms the belief state for a given belief state, action and observation (the formulas are hiding in here). Note that from looking at where b' is, we can immediately determine the best action to take after action a1: the belief state b' lies in the green region, which means that if we have a horizon length of 2 and are forced to take action a1 first, the best thing we can do afterwards is action a2.

PO-MDP Value Iteration Example
Find the value for all belief points given this fixed action and observation. That is, if we feed the transformed belief point (namely b') into the value function, we derive this new function. We will use S() to represent the transformed value function for a particular action and observation. The very nice part is that the transformed value function is also PWLC, and the nicer part is that this is always the case.

PO-MDP Value Iteration Example
How do we compute the value of a belief state given only the action? The horizon-2 value of the belief state is the value for each observation weighted by the probability of that observation. With values for each observation z1: 0.7, z2: 0.8, z3: 1.2 and P(z1|b,a1) = 0.6, P(z2|b,a1) = 0.25, P(z3|b,a1) = 0.15:
0.6 x 0.8 + 0.25 x 0.7 + 0.15 x 1.2 = 0.835

Transformed Value Functions
Each of these transformed functions partitions the belief space differently. The best next action to perform depends on the initial belief state and the observation. The functions are skewed like this because they are multiplied by the observation probabilities.

Best Value for Belief States
The value of every single belief point is the sum of: the immediate reward, and the line segments from the S() functions for each observation's future strategy. Since adding lines gives you lines, it is linear (superposition: if we add lines we again get a line, ax + by + a1x + b1y, i.e., again piecewise-linear segments).

Best Strategy for Any Belief Point
All the useful future strategies are easy to pick out:

Value Function and Partition
For the specific action a1, the value function and corresponding partitions: note that each of these line segments represents a particular two-action strategy. The first action is a1 for all of these segments, and the second action depends on the observation. Thus we have solved our second problem; we now know how to find the value of a belief state for a fixed action. In fact, as before, we have actually shown how to find this value for every belief state.

Value Function and Partition For the specific action a2, the value function and corresponding partitions:

Which Action to Choose?
Put the value functions for each action together to see where each action gives the highest value.

Compact Horizon 2 Value Function

POMDP Model Control dynamics for a POMDP

Active Learning
In an active learning problem the learner has the ability to influence its training data: the learner asks for what is most useful given its current knowledge. Methods to find the most useful query have been shown by Cohn et al. (1995).

Active Learning (Cohn et al., 1995)
Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner. They showed how this could be done exactly for a mixture-of-Gaussians model and for locally weighted regression.

Active Perception
Automatic gesture recognition: not full-image pattern recognition but gaze-based image analysis. Fixations save computing time in image processing, but require computing time for POMDP action selection (pan/tilt/zoom of the camera).