%
% Dear author,
%
% All the comments below are available on our WEB sites:
% in the US: http://www.ams.sunysb.edu/~feinberg/MDP
% in Israel: http://www.ee.technion.ac.il/~adam/MDP
% under "information for authors".
%
% Below please find a LaTeX file with the basic definitions and notation
% for the MDP book. This is a second, revised version. It does not contain
% examples that are irrelevant to the book. However, it does have more
% background material that you can refer to in your paper.
%
% This file is for LaTeX2e (the new LaTeX). If you have an older version,
% please use the LaTeX2.09 file (called notat2.tex).
% If you run LaTeX on this file, you should produce 8 pages, including
% examples and notation.
%
% In the file we show how to use the "index" command, both for
% a standard entry, as well as for a sub-entry. This is not intended to be
% complete: just to show how to use it. Note that you will NOT SEE
% an index when you run LaTeX. If you want to see it:
% 1. Uncomment (remove the "%") before the "\printindex" command
% 2. Run LaTeX as usual
% 3. Now give the command makeindex "filename": here "filename" is the name
%    of the file you are editing, WITHOUT the ".tex" extension.
% 4. Run LaTeX one more time. You will now see an index as the last page.
% If the program reports errors related to the "index" or "makeindex" options,
% you can comment out lines 3,4 and uncomment line 5. In this case you can
% still enter indexing commands, but the program will not process them. The
% advantage is that we will be able to use them in our final editing.
%
% Feel free to contact Adam with any questions.
%
% Best wishes,
% The editors.
%
% Notation file, version of November 29, 1999.
\documentclass[11pt]{report}
%
\def\PR{\mathop{\rm I\kern -0.20em P}\nolimits}  % Probability
\def\E{\mathop{\rm I\kern -0.20em E}\nolimits}   % Expectations
\def\N{\mathop{\rm I\kern -0.20em N}\nolimits}   % Nonnegative integers
\def\R{\mathop{\rm I\kern -0.20em R}\nolimits}   % Blackboard R - Reals
\def\F{\mathop{\rm I\kern -0.20em F}\nolimits}   % Functions/stationary policies
\def\A{{A}}                                      % Action space
\def\X{{X}}                                      % State space
\def\cA{{\cal A}}                                % action sigma-field
\def\cX{{\cal X}}                                % state sigma-field
%
% Definitions that work with or without the packages
%
\def\PS{\Pi}                                     % Policy space
%
% Theorem environments etc. This controls the numbering of theorems, lemmas, etc.
%
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{definition}[theorem]{Definition}
%
% How to start and end a proof:
%
\newcommand{\Pf}{\paragraph{{\bf Proof.}}}       % Starting a proof
\newcommand{\blot}{\hfill{\vrule height .9ex width .8ex depth -.1ex }}
\newcommand{\EndPf}{\hfill $\blot$ \medskip}     % End of proof symbol
%
% New definitions by Adam:
\newcommand\bydef{{:=}}
% \newcommand\bydef{\stackrel{{def}}{=}}
%
% Information about the authors
%
\author{Eugene Feinberg\\ SUNY Stony Brook\\ Stony Brook etc.\\
\and
Adam Shwartz\\ Electrical Engineering\\ Technion Etc.}
%
\title{Definitions and notation}
%
\begin{document}
%
\maketitle
%
\chapter{Introduction}\label{ch:intro}
%
\begin{abstract}
In this chapter we describe the basic structure of a Markov Decision
Process. We introduce the notation and then give a quick tour of this
volume. At this point, we use this as information for authors, with
examples.
\end{abstract}

\section{Introduction}
%
In this section we shall introduce the volume, its purpose, etc.,
including a heuristic description of what an MDP is (in contrast to the
formal presentation of section~\ref{C1sec:notation}).
%
\section{Notation}\label{C1sec:notation}
%
Let $\N=\{0,1,\ldots\}$ and let $\R^n$ denote the $n$-dimensional Euclidean
space, with $\R=\R^1$.
%
A Markov Decision Process\index{Markov Decision Process} (MDP) is defined
through the following objects: a state
space\index{state space}\index{space!state} $\X$; an action
space\index{action space}\index{space!action}\index{action!space} $\A$;
sets $\A(x)$ of available
actions\index{action!available}\index{space!available actions} at states
$x\in \X$; transition probabilities,\index{transition probability} denoted
by $p(Y | x,a)$; reward functions $ r(x,a)$, denoting the one-step
reward\index{reward}\index{reward!one step} received when action $a$ is
used in state $x$.

The above objects have the following meaning. There is a stochastic system
with a state space $\X$. When the system is at state $x\in\X$, a
decision-maker selects an action $a$ from the set of actions $\A(x)$
available at state $x$. After an action $a$ is selected, the system moves
to the next state according to the probability distribution $p(\cdot|x,a)$
and the decision-maker collects a one-step reward $r(x,a)$. The selection
of an action $a$ may depend on the current state of the system, the current
time, and the available information about the history of the system. At
each step, the decision-maker may select a particular action or, more
generally, a probability distribution on the set of available actions
$\A(x)$. Decisions of the first type are called nonrandomized and decisions
of the second type are called randomized.

\paragraph{Discrete MDPs.} An MDP is called finite if the state and action
sets are finite. We say that a set is discrete if it is finite or
countable. An MDP is called discrete if the state and action sets are
discrete. A significant part of research and applications related to MDPs
deals with discrete MDPs. For discrete MDPs, we do not need additional
measurability assumptions on the major objects introduced above. Readers
who are not familiar with measure theory can still read most of the papers
of this volume, since most of the papers deal with discrete MDPs; for the
other papers, the results may be restricted to discrete state and action
sets.

For a discrete state space $\X$ we denote the transition probabilities by
$p(y | x,a)$ or $p_{xy}(a)$, and use (in addition to $x,y$) also the
letters $i,j,k$ etc.\ to denote states. Unless mentioned otherwise, we
always assume that $p(\X | x,a)=1$. The time parameter is $t,s$ or
$n\in\N$, and a trajectory\index{trajectory} is a sequence
$x_0a_0x_1a_1\ldots\ $. The set of all trajectories is
$H_\infty=(\X \times \A)^\infty $. A trajectory of length $n$ is called a
history\index{history} and is denoted by
$h_n = x_0 a_0 \ldots x_{n-1} a_{n-1} x_n $. Let
$H_n = \X \times (\A \times \X)^{n}$ be the space of
histories\index{history} up to epoch $n\in\N$.

A nonrandomized policy\index{policy} is a sequence of mappings $\phi_n$,
$n\in \N$, from $H_n$ to $\A$ such that
$\phi_n (x_0 a_0 \ldots x_{n-1} a_{n-1} x_n) \in \A(x_n)$. If for each $n$
this mapping depends only on $x_n$, then the policy $\phi$ is called
Markov. In other words, a Markov policy $\phi$ is defined by mappings
$\phi_n:\ \X \to \A$ such that $\phi_n (x) \in \A (x)$ for all $x \in \X$,
$n=0,1,\ldots\ $. A Markov policy $\phi$ is called stationary if the
$\phi_n$ do not depend on $n$. A stationary policy is therefore defined by
a single mapping $\phi:\ \X \to \A$ such that $\phi (x) \in \A (x)$ for all
$x \in \X$.
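As a simple illustration of the above objects (the numbers are chosen
arbitrarily and serve only to fix ideas), consider a two-state replacement
model. Let $\X=\{0,1\}$, where $0$ means ``working'' and $1$ means
``broken'', and let $\A=\{c,r\}$, where $c$ means ``continue'' and $r$
means ``repair''. Put $\A(0)=\{c\}$ and $\A(1)=\{c,r\}$. The transition
probabilities and one-step rewards are, for instance,
\[
p(1|0,c)=0.2,\quad p(0|0,c)=0.8,\quad p(1|1,c)=1,\quad p(0|1,r)=1,
\]
\[
r(0,c)=1,\qquad r(1,c)=0,\qquad r(1,r)=-2 .
\]
A working machine earns one unit per step and breaks down with probability
$0.2$; a broken machine earns nothing unless it is repaired at a cost of
two units, after which it works again. The mapping $\phi(0)=c$,
$\phi(1)=r$ is a stationary policy, while a decision-maker who repairs only
at even epochs uses a Markov, nonstationary policy.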
%
We denote by $\PS$, $\PS^M,$ and $\PS^S$ the sets of
all\index{policy space}\index{space!policy} nonrandomized, Markov, and
stationary policies respectively. We observe that
$\PS^S \subseteq \PS^M \subseteq \PS$.

As mentioned above, by selecting actions randomly it is possible to expand
the set of policies. A randomized policy\index{policy!randomized} $\pi$ is
a sequence of transition probabilities $\pi_n ( a_n | h_n)$ from $H_n$ to
$\A$, $n\in\N$, such that
$\pi_n ( \A (x_n)| x_0 a_0 \ldots x_{n-1} a_{n-1} x_n)=1$. A policy $\pi$
is called randomized Markov if
$\pi_n (a_n | x_0 a_0 \ldots x_{n-1} a_{n-1} x_n) = \pi_n (a_n | x_n) $.
If $\pi_m (\cdot | x ) = \pi_n (\cdot | x)$ for all $m,n\in\N$, then the
randomized Markov policy $\pi$ is called randomized stationary. A
randomized stationary policy $\pi$ is thus defined by a transition
probability $\pi$ from $\X$ to $\A$ such that $\pi(\A (x) | x) = 1$ for all
$x \in \X$.
%
We denote by $\PS^R$, $\PS^{RM}$, $\PS^{RS}$ the sets of all randomized,
randomized Markov, and randomized stationary policies respectively. We have
that $\PS^{RS} \subseteq \PS^{RM} \subseteq \PS^{R}$, and in addition
$\PS^S \subseteq \PS^{RS} $, $\PS^M \subseteq \PS^{RM}$, and
$\PS \subseteq \PS^R$.

Note that, while we try to be consistent with the above definitions, there
is no standard terminology for policies: in particular, there is no general
agreement as to whether ``stationary'' implies nonrandomized or, more
generally, whether the ``default'' should be randomized (the more general
case) or nonrandomized. The following additional terms are sometimes also
used: a pure\index{pure policy} policy means a nonrandomized policy; a
deterministic\index{deterministic policy} policy means a (nonrandomized)
stationary policy.

The stochastic process evolves as follows. If at time $n$ the process is in
state $x$, having followed the history $h_n $, then an action is chosen
(perhaps randomly) according to the policy $\pi $. If action $a$ was
selected, then at time $ n+1 $ the process will be in state $y$ with
probability $ p ( y | x , a ) $. Given an initial state $x$ and a policy
$\pi$, the ``evolution rule'' described above defines all finite-dimensional
distributions of $ x_0, a_0,\ldots,x_n $, $n\in\N$. Kolmogorov's extension
theorem guarantees that any initial state $x$ and any policy $\pi$ define a
stochastic sequence $x_0a_0x_1a_1\ldots\ $. We denote by $\PR_x^\pi$ and
$\E_x^\pi$ respectively the probabilities and expectations related to this
stochastic sequence; $\PR_x^\pi \{x_0=x\}=1$.

Any stationary policy $\phi$ defines a Markov chain with transition
probabilities $p_{xy}(\phi)=p(y| x,\phi(x))$ on the state space $\X$. A
randomized stationary policy $\pi$ also defines a Markov chain with the
state space $\X.$ In the latter case, the transition probabilities are
$ p_{xy}(\pi)=\sum_{a\in\A(x)} \pi (a|x)\,p(y| x,a)$. We denote by $P(\pi)$
the transition matrix with elements $\{p_{xy}(\pi)\}$. The limiting matrix
\begin{equation}
Q(\pi)=\lim_{N\to\infty} {\frac 1 N} \sum_{n=0}^{N-1} P^n(\pi)
\end{equation}
always exists and is stochastic if $\X$ is finite; see
Chung~\cite[Section 1.6]{chung}.
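For instance (a standard two-state illustration, with transition
probabilities chosen only for concreteness), suppose a randomized
stationary policy $\pi$ on $\X=\{1,2\}$ yields
\[
P(\pi)=\left( \begin{array}{cc} 1-p & p \\ q & 1-q \end{array}\right),
\qquad 0<p\le 1,\ 0<q\le 1 .
\]
The chain is then irreducible, its unique stationary distribution is
$\left(\frac{q}{p+q},\frac{p}{p+q}\right)$, and the Ces\`aro limit above is
\[
Q(\pi)=\frac{1}{p+q}\left( \begin{array}{cc} q & p \\ q & p \end{array}\right),
\]
a stochastic matrix whose rows coincide with the stationary distribution.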
%
Let $f$ be a terminal reward\index{reward!terminal} function and let
$\beta$ be a discount factor. We denote by $v_N(x,\pi,\beta,f)$ the
expected total reward\index{reward!expected total} over the first $N$
steps, $N\in\N$:
%
\begin{equation}\label{C1e:DefFinRew}
v_N(x,\pi,\beta,f)=\E_x^\pi \left [ \sum_{n=0}^{N-1} \beta^n
r(x_n,a_n)+\beta^N f(x_N)\right ] ,
\end{equation}
whenever this expectation is well-defined. If $\beta\in [0,1)$ then we deal
with the expected total discounted reward.\index{reward!discounted} If
$\beta=1$, we deal with the expected total undiscounted reward or simply
the total reward. If the discount factor $\beta \in [0,1] $ is fixed, we
usually write $ v ( x , \pi ) $ instead of $ v ( x , \pi , \beta ) $. The
expected total reward over an infinite horizon is
\begin{equation}\label{C1e:DefDiscCost}
v ( x , \pi ) = v ( x , \pi , \beta ) = v_\infty ( x , \pi , \beta , 0 ) \, .
\end{equation}
%
Conditions for the total reward $ v ( x , \pi , 1 ) $ to be well-defined
are usually stronger than the conditions that ensure that the total
discounted rewards $ v ( x , \pi , \beta ) $, $ 0 \le \beta < 1 $, are
well-defined. The expected reward per unit time is
\begin{equation}\label{C1e:DefAvRew}
w ( x , \pi ) = \liminf_{N \to \infty } {\frac 1 N} v_N ( x , \pi , 1 , 0 ).
\end{equation}
If a performance measure $g(x,\pi)$ is defined for all policies $\pi$, we
denote
\begin{equation}\label{C1e:maxValue}
G ( x ) = \sup_{\pi \in \PS^R } g ( x , \pi ) .
\end{equation}
In terms of the performance measures defined above, this yields the values
\begin{equation} \label{C1e:DefFinValue}
V_N(x,\beta,f) \bydef \sup_{\pi \in \PS^R } v_N(x,\pi,\beta,f) \, ,
\end{equation}
\begin{equation} \label{C1e:DefDiscValue}
V(x) = V(x,\beta) \bydef \sup_{\pi \in \PS^R } v ( x , \pi , \beta ) \, ,
\end{equation}
\begin{equation} \label{C1e:DefAvValue}
W(x) \bydef \sup_{\pi \in \PS^R } w ( x , \pi ) \, .
\end{equation}
For $ \epsilon \ge 0 $, a policy $\pi$ is called $ \epsilon $-optimal for
criterion $g$ if $ g ( x , \pi ) \ge G (x) - \epsilon $ for all $x \in \X$.
A $0$-optimal policy is called optimal.

For a function $f$ on $\X$, we consider the reward operators
\begin{equation} \label{C1e:TransOper}
P^a f(x) \bydef \E [ f(x_1) \mid x_0 = x, a_0 = a ] \, ,
\end{equation}
\begin{equation} \label{C1e:DiscOper}
T^a_\beta f (x) \bydef r (x,a) + \beta P^a f (x) \, ,
\end{equation}
and the optimality operators
\begin{equation} \label{C1e:TransOptOper}
P f (x) \bydef \sup_{a \in \A(x)} P^a f (x) \, ,
\end{equation}
\begin{equation} \label{C1e:DiscOptOper}
T_\beta f(x) \bydef \sup_{a \in \A(x)} T^a_\beta f(x) \, .
\end{equation}
The finite horizon Optimality
Equation\index{optimality equation}\index{optimality equation!finite horizon}
is
\begin{equation} \label{C1e:FinHorOptEq}
V_{N+1} (x) = T_\beta V_N(x), \ \ \ \ \ \ \ \ \ N=0,1,\ldots \, .
\end{equation}
The discounted reward Optimality
Equation\index{optimality equation}\index{optimality equation!discounted reward}
is
\begin{equation} \label{C1e:DiscOptEq}
V (x) = T_\beta V(x) \, .
\end{equation}
An action $a\in \A(x)$ is called conserving at state $x$ for the
$(N+1)$-step problem if $T^a_\beta V_N(x)= T_\beta V_N(x).$ An action
$a\in \A(x)$ is called conserving at state $x$ for the total discounted
reward if $T^a_\beta V(x)= T_\beta V(x).$ When $ \beta = 1 $ we denote
$ T^a = T^a_1 $ and $ T = T_1 $. In particular,
\begin{equation} \label{C1e:TotOptEq}
V (x) = TV(x)
\end{equation}
is the Optimality Equation for expected total undiscounted rewards. For
total reward criteria, value functions usually satisfy the optimality
equation. In addition, the sets of conserving $n$-step actions,
$n=1,\ldots,N+1$, form the sets of optimal actions for $(N+1)$-step
problems. Under some additional conditions, the sets of conserving actions
form the sets of optimal actions for infinite horizon problems. We shall
consider these results in appropriate chapters.
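As a minimal illustration of (\ref{C1e:DiscOptEq}) (the data are chosen
only for concreteness), let $\X=\{x\}$ consist of a single state, let
$\A(x)=\{a_1,a_2\}$ with $r(x,a_1)=2$, $r(x,a_2)=1$, $p(x|x,a_i)=1$, and
let $\beta\in[0,1)$. Since every policy keeps the process in $x$, the
stationary policy choosing $a_2$ earns
$v(x,\pi,\beta)=\sum_{n\ge 0}\beta^n = 1/(1-\beta)$, while choosing $a_1$
earns $2/(1-\beta)$. The optimality equation
$V(x)=\max\{2+\beta V(x),\,1+\beta V(x)\}$ has the solution
$V(x)=2/(1-\beta)$, the action $a_1$ is the unique conserving action, and
the stationary policy $\phi(x)=a_1$ is optimal.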
The average reward Optimality
Equations\index{optimality equation!average reward} are
\begin{equation} \label{C1e:AvOptEq1}
W (x) = P W (x)
\end{equation}
and
\begin{equation} \label{C1e:AvOptEq2}
W (x) + h (x) = \sup_{a\in \A^\prime (x)} T^a h (x) \, ,
\end{equation}
where
\begin{equation}
\A^\prime (x) = \left \{ a \in \A (x) : P^a W (x) = P W (x) \right \} \, .
\end{equation}
Equation~(\ref{C1e:AvOptEq1}) is called the First Optimality Equation and
\index{optimality equation!first}equation~(\ref{C1e:AvOptEq2}) is called
the Second Optimality Equation.\index{optimality equation!second} Note that
if $W(x) = W $, a constant, then the First Optimality Equation holds and
$ \A^\prime (x) = \A (x) $. In this case, the Second Optimality Equation
transforms into
\begin{equation} \label{C1e:OptEq2}
W + h(x)= T h(x) ,
\end{equation}
which is often referred to simply as the Optimality Equation for average
rewards.

We allow the starting point $x$ to be defined by an initial probability
distribution $\mu$. In this case, we keep the above notation and
definitions, but we replace the initial state $x$ with the initial
distribution $\mu$. For example, we use $\PR_\mu^\pi$, $\E_\mu^\pi$,
$v(\mu,\pi)$, $V(\mu),$ $w(\mu,\pi),$ and $W(\mu)$. We remark that,
generally speaking, optimality and $\epsilon$-optimality with respect to
all initial distributions are stronger than optimality and
$\epsilon$-optimality with respect to all initial states, as defined above.
However, in many natural cases these definitions are equivalent. For
example, this equivalence holds for total reward criteria.
%
A more general problem arises when there are multiple objectives. Suppose
there are $(K+1)$ reward functions $r_k(x,a), $ $k=0,\ldots,K$. For finite
horizon problems, terminal rewards may also depend on $k$. In this case, we
index by $k=0,\ldots,K$ all functions that describe rewards. For example,
we use the notation $w_k(x,\pi)$, $f_k(x)$, and $W_k(x)$.

For problems with multiple criteria, it is usually natural to fix an
initial state $x$. It is also possible to fix an initial distribution
$\mu$, with our convention that all definitions remain the same but we
write $\mu$ instead of $x$. So, for simplicity, we define optimal policies
when the initial state $x$ (not a distribution) is fixed. If the
performance\index{performance} of a policy $\pi$ is evaluated by $(K+1)$
criteria $g_k (x,\pi)$, then one goal may be to optimize criterion $g_0$
subject to constraints on $g_1,\ldots,g_K.$ Let $C_k,$ $k=1,\ldots,K,$ be
given numbers. We say that a policy $\pi$ is
feasible\index{feasible}\index{policy!feasible} if
\begin{equation} \label{C1e:DefFeasible}
g_k (x , \pi ) \ge C_k , \qquad k = 1 , \ldots , K \, .
\end{equation}
A policy $\pi$ is called
optimal\index{optimal}\index{constrained optimization!optimal} for a
constrained optimization problem if it is feasible and
\begin{equation} \label{C1e:DefOptConstr}
g_0 (x , \pi) \ge g_0 (x , \sigma) \quad \mbox{for any feasible policy $\sigma$} .
\end{equation}
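The following standard single-state example (with illustrative numbers)
shows why randomized policies matter in constrained problems. Let
$\X=\{x\}$, $\A(x)=\{a_1,a_2\}$, $p(x|x,a_i)=1$, and take $K=1$ with
$r_0(x,a_1)=1$, $r_0(x,a_2)=0$, $r_1(x,a_1)=0$, $r_1(x,a_2)=1$. Consider
the average rewards $g_k=w_k$ and the constraint $w_1(x,\pi)\ge C_1$ for
some $C_1\in(0,1)$. A randomized stationary policy $\pi$ that selects $a_1$
with probability $q$ has $w_0(x,\pi)=q$ and $w_1(x,\pi)=1-q$, so the best
feasible choice is $q=1-C_1$, which yields $w_0(x,\pi)=1-C_1$. The two
nonrandomized stationary policies give $(w_0,w_1)=(1,0)$ and $(0,1)$; the
first is infeasible and the second earns only $w_0=0$. Thus no
nonrandomized stationary policy is optimal for this constrained problem,
while the randomized stationary policy with $q=1-C_1$ is.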
%
\paragraph{Nondiscrete MDPs: general constructions.}
When the state space $\X$ or the action space $\A$ is not discrete, the
natural assumption is that they are measurable spaces endowed with
$\sigma$-fields $\cX$ and $\cA$ respectively. When $\X$ or $\A$ is
discrete, the corresponding $\sigma$-field is the set of all subsets of the
corresponding set. It is also natural to assume that the sets
$\A(x) \in \cA$ of feasible actions are measurable, for all states
$x\in\X$. Of course, this assumption always holds when $\A$ is discrete.
Unless we specify otherwise, we always consider the Borel $\sigma$-field
${\cal B}(\R )$ on $\R $: this is the minimal $\sigma$-field containing all
intervals. For nondiscrete MDPs, we also assume that $r$ is a measurable
function on $(\X\times\A, \cX\times\cA )$ and $p(Y|x,a)$ is a regular
transition probability from $(\X\times\A, \cX\times\cA )$ to $(\X, \cX )$.
Recall that, given two measurable spaces $(E_1,{\cal E}_1)$ and
$ (E_2,{\cal E}_2)$, we call $p$ a regular transition probability from
$E_1$ to $E_2$ if the following two conditions hold: (i) $ p(\cdot|e_1) $
is a probability measure on $(E_2,{\cal E}_2)$ for any $e_1\in E_1$, and
(ii) the function $p(B | \cdot )$ is measurable on $E_1$ for any
$B\in {\cal E}_2$.

In order to define policies in the general situation, we consider
$\sigma$-fields ${\cal H}_n = \cX \times (\cA \times \cX)^{n}$ on the sets
of histories $H_n = \X \times (\A \times \X)^{n}$. Nonrandomized and
randomized policies are defined in a way similar to discrete MDPs, with
standard and natural additional measurability conditions: (a) nonrandomized
policies are defined by mappings $\phi_n$ from $H_n$ to $\A$ which are
measurable on $(H_n,{\cal H}_n)$, and (b) Markov and stationary policies
are defined by measurable mappings from $(\X,\cX)$ to $(\A,\cA)$.
Similarly, for randomized policies, the $\pi_n$ are regular transition
probabilities from $(H_n,{\cal H}_n)$ to $(\A,\cA)$ and, for randomized
Markov and randomized stationary policies, they are regular transition
probabilities from $(\X,\cX)$ to $(\A,\cA)$.

Let ${\cal H}_\infty=(\cX\times\cA)^\infty.$ The Ionescu Tulcea theorem
(Neveu~\cite[Section 5.1]{ne}) implies that any initial state $x$ and any
policy $\pi$ define a unique probability measure on
$(H_\infty,{\cal H}_\infty)$. We denote this measure by $\PR_x^\pi$;
sometimes it is called the ``strategic'' measure. We denote by $\E_x^\pi$
expectations with respect to this measure. We also notice that the Ionescu
Tulcea theorem implies that $\PR_x^\pi$ is a regular transition probability
from $(\X,\cX)$ to $(H_\infty,{\cal H}_\infty)$, and this implies that the
functions $v_N(x,\pi,\beta,f)$ and $v(x,\pi,\beta)$ are measurable in $x$
for any policy $\pi$ (the terminal function $f$ is also assumed to be
measurable). We remark that we use the Ionescu Tulcea theorem instead of
the better-known Kolmogorov extension theorem primarily because the latter
requires additional assumptions on the structure of the state space (it has
to be Borel), while the Ionescu Tulcea theorem requires no such structural
assumptions.

At the intuitive level, randomized decisions are more general than
nonrandomized ones; this means that any nonrandomized policy belongs to the
class of randomized policies. In addition, in order to avoid trivial
situations, an MDP has to have at least one policy. In order to guarantee
these two intuitive properties, we always assume the following two mild
conditions: (i) all one-point sets $\{a\}$ are elements of $\cA,$
$a\in\A$; (ii) there is at least one measurable function $\phi$ from $\X$
to $\A$ such that $\phi(x)\in\A(x)$ for all $x\in \X$. The first assumption
always holds for models with discrete state and action spaces. The second
assumption always holds for models with discrete state spaces.
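As an illustration of these conditions (a standard linear model with
Gaussian noise, included only as an example), let $\X=\A=\R$ with
$\cX=\cA={\cal B}(\R)$, $\A(x)=\R$ for all $x$, and let the next state be
$x_{n+1}=x_n+a_n+\xi_n$, where the $\xi_n$ are independent normal random
variables with mean $0$ and variance $\sigma^2$. Then
\[
p(Y|x,a)=\int_Y \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\left\{-\frac{(y-x-a)^2}{2\sigma^2}\right\}\, dy ,
\qquad Y\in{\cal B}(\R),
\]
is a regular transition probability from $(\X\times\A,\cX\times\cA)$ to
$(\X,\cX)$: it is a probability measure in $Y$ for each fixed $(x,a)$ and,
for each fixed $Y$, a continuous (hence measurable) function of $(x,a)$. A
reward such as $r(x,a)=-x^2-a^2$ is continuous and therefore measurable on
$\X\times\A$, and the identity mapping $\phi(x)=x$ (or any other Borel
mapping with $\phi(x)\in\A(x)$) shows that condition (ii) above holds.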
For a measure $ \nu $ and a measurable function $f$ we use the equivalent
notations
\begin{equation}
\nu ( f ) \bydef \int f ( \alpha ) \, d\nu (\alpha ) \bydef f ( \nu ) \, .
\end{equation}
If we denote $\pi_x(\cdot)=\pi(\cdot|x)$ for a randomized stationary policy
$\pi$ then, similarly to discrete MDPs, this policy defines a Markov chain
with transition probabilities $p(dy|x,\pi_x).$ If $\X$ is discrete, this
chain has transition matrix $P(\pi)$ with elements $p_{xy}(\pi_x)$.

Thus, an MDP, policies, and objective functions can be defined under very
general conditions. However, very little can be done if one tries to
analyze MDPs with arbitrary measurable state spaces. The first complication
is that the value functions $V$ may not be measurable, even for one-step
models. The second complication is that an important step in the analysis
of MDPs is to construct an equivalent randomized Markov policy for an
arbitrary policy; see Chapter ???. This can be done by constructing regular
transition probabilities $\PR_x^\pi(da_n|x_n)$, which may not exist for
general state and action spaces. These two complications do not exist if
the state space is countable. They can be resolved if $\X$ and $\A$ are
Borel spaces. In addition, at the current state of our knowledge, there is
no clear motivation or practical need to consider MDPs with arbitrary
measurable state spaces. For example, MDPs with Borel state spaces have
applications to statistics, control of models with incomplete information,
and inventory management, but we are not aware of possible applications of
MDPs with state spaces of cardinality higher than the continuum.

\paragraph{Discrete state MDPs.} In this case, the state space $\X$ is
discrete and the action space is a measurable space $(\A,\cA)$ such that
all one-point sets are measurable. The sets of feasible actions $\A(x)$ are
also elements of $\cA.$ Since $\X$ is discrete, measurability in $x$ is
automatic, and the reward functions $r(x,a)$ and transition probabilities
$p(y|x,a)$ need only be measurable in $a$. All constructions described for
discrete and general MDPs go through with $\cX$ being the $\sigma$-field of
all subsets of $\X.$

\paragraph{Classical Borel MDPs.} Though we do not follow any particular
text, all definitions, constructions, and statements related to Borel
spaces that we mention in this chapter can be found in Bertsekas and
Shreve~\cite[Chapter 7]{bs}; see also Dynkin and Yushkevich~\cite{dy} and
Kechris~\cite{ke}. Two measurable spaces $(E_1, {\cal E}_1)$ and
$(E_2, {\cal E}_2)$ are called isomorphic if there is a one-to-one
measurable mapping $f$ of $(E_1, {\cal E}_1)$ onto $(E_2, {\cal E}_2)$ such
that $f^{-1}$ is measurable.
%
A Polish space is a complete separable metric space. Unless we specify
otherwise, we always consider the Borel $\sigma$-field ${\cal B}(E)$ on a
metric space $E$; ${\cal B}(E)$ is the minimal $\sigma$-field containing
all open subsets of $E$.
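For concreteness, we recall some standard examples of Polish spaces: $\R^n$
with the Euclidean metric, the interval $[0,1]$, any countable set with the
discrete metric (in particular $\N$), the space $\{0,1\}^\infty$ of binary
sequences with a product metric, and the space $C[0,1]$ of continuous
functions on $[0,1]$ with the supremum metric. In each case ${\cal B}(E)$
is generated by the open balls; for $\R$ this recovers the $\sigma$-field
${\cal B}(\R)$ generated by the intervals, as above.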
A measurable space $(E, {\cal E})$ is called Borel if it is isomorphic to a
Polish space endowed with its Borel $\sigma$-field. Of course, any Borel
subset $E^\prime$ of a Polish space, endowed with the $\sigma$-field
consisting of the intersections of $E^\prime$ with Borel subsets of the
original space, is itself a Borel space. All Borel spaces are either
finite, countable, or of the cardinality of the continuum, and two Borel
spaces of the same cardinality are isomorphic. Therefore, all uncountable
Borel spaces have the cardinality of the continuum; they are isomorphic to
each other and, in particular, to $(\R ,{\cal B}(\R ))$ and
$([0,1],{\cal B}([0,1]))$.

The assumptions for Borel MDPs are: (i) $\X$ and $\A$ are Borel spaces and
$\cX$ and $\cA$ are the corresponding Borel $\sigma$-fields; (ii) the graph
$${\rm Gr}\ \A(x)=\{(x,a)|\ x\in\X, a\in\A(x)\}$$ is a measurable subset of
$\X\times\A$ and there exists at least one measurable mapping $\phi$ of
$\X$ into $\A$ such that $\phi(x)\in\A(x)$ for all $x\in\X$; (iii) the
reward functions $r(x,a)$ are measurable on $\X\times\A$ and the transition
probabilities $p(\cdot|x,a)$ are regular transition probabilities from
$\X\times\A$ to $\X.$

Conditions (i) and (iii) are similar to the corresponding assumptions for
general models. The measurability of the graph in (ii) implies that the
sets $\A(x)$ are measurable. The existence of a measurable mapping (often
called a ``selector'') implies that $\A(x)\ne \emptyset$ for all $x$. We
remark that it is possible that the graph is Borel and all sets $\A(x)$ are
non-empty, but there is no Borel mapping $\phi$ with $\phi(x)\in\A(x)$ for
all $x$. Therefore, the second assumption in (ii) is essential for the
existence of at least one policy.

As was discussed above, the first real complication is that, even for
one-step problems, the values $V$ may not be Borel measurable functions on
$\X$. However, conditions (i)--(iii) imply that these functions are
universally measurable for finite and infinite-horizon problems, and
therefore the optimality operators can be defined. Here we explain the
concepts of universally measurable sets and functions. Let $(E,{\cal E})$
be a Borel space. For a given probability measure $p$ on $(E,{\cal E})$,
define the $\sigma$-field ${\cal E}_p$ which is the completion of
${\cal E}$ with respect to the measure $p$. That is, ${\cal E}_p$ is the
minimal $\sigma$-field that contains ${\cal E}$ and all subsets $F$ of $E$
such that $F\subset F^\prime$ for some $F^\prime\in {\cal E}$ with
$p(F^\prime)=0$. For example, if $(E,{\cal E})=([0,1],{\cal B}([0,1]))$
then we can consider the Lebesgue measure $m$ defined by $m([a,b])=|b-a|;$
then ${\cal E}_m$ is the so-called Lebesgue $\sigma$-field. Let
${\bf P}(E)$ be the set of all probability measures on $E$. Then the
intersection of all $\sigma$-fields ${\cal E}_p$,
${\cal U}(E)=\cap_{p\in {\bf P}(E)} {\cal E}_p$, is a $\sigma$-field, and
it is called the universal $\sigma$-field. This $\sigma$-field is also
called the $\sigma$-field of universally measurable sets, and its elements
are called universally measurable subsets of $E$. A universally measurable
function on $E$ is a measurable mapping from $(E,{\cal U}(E))$ to
$(\R ,{\cal B}(\R ))$, where ${\cal U}(E)$ is the universal $\sigma$-field
on $E$. Of course, any Borel set and any Borel function are universally
measurable.

Thus, optimality equations can be defined for Borel MDPs. However, there is
another complication for Borel models, which is annoying mostly for
aesthetic reasons: $\epsilon$-optimal policies may not exist for small
positive $\epsilon$, even for one-step Borel MDPs with bounded reward
functions. The example constructed by David Blackwell is based on the
observation that the value function is universally measurable but may not
be Borel, whereas for any policy the expected one-step reward is a Borel
function of the initial state. Moreover, it is possible to show that for
the Borel MDP described above, for any initial measure $p$ on $\X$ and for
any $\epsilon>0$, there exists a policy which is $p$-a.s.\
$\epsilon$-optimal. Such policies are called $(p,\epsilon)$-optimal.
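To fix ideas, here is a minimal (and purely illustrative) example
satisfying conditions (i)--(iii): let $\X=\A=[0,1]$ with their Borel
$\sigma$-fields, $\A(x)=[0,x]$, $p(\cdot|x,a)=\delta_a(\cdot)$ (the unit
mass at $a$), and $r(x,a)=a$. The graph
${\rm Gr}\ \A(x)=\{(x,a):0\le a\le x\le 1\}$ is closed, hence Borel; the
constant mapping $\phi(x)=0$ is a Borel selector; $r$ is continuous, hence
Borel; and $p(B|x,a)={\bf 1}\{a\in B\}$ is Borel in $(x,a)$ for every Borel
set $B$. In this example the value of the one-step problem,
$V_1(x,\beta,0)=\sup_{a\le x}a=x$, happens to be Borel; Blackwell's example
shows that for more complicated Borel graphs ${\rm Gr}\ \A(x)$ this need
not be the case.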
\paragraph{Universally measurable Borel MDPs.} If we expand the set of
policies and consider universally measurable policies, then
$\epsilon$-optimal policies exist and the concept of
$(p,\epsilon)$-optimality is not needed. Moreover, once the set of policies
is expanded in this way, the results and their proofs hold under
assumptions which are broader than (ii) and (iii). Before we give formal
definitions, we explain the concept of analytic sets.

Let $f$ be a measurable mapping of a Borel space $(E_1,{\cal E}_1)$ into a
Borel space $(E,{\cal E})$. If $F\in {\cal E}$ then, by definition,
$f^{-1}(F)\in {\cal E}_1$. However, it is possible that
$f(F)\notin {\cal E}$ for some Borel set $F\in {\cal E}_1$. A subset $F$ of
a Borel space $(E,{\cal E})$ is called analytic if there exist a Borel
space $(E_1,{\cal E}_1)$ and a measurable mapping $f$ of $E_1$ to $E$ such
that $F=f(F_1)$ for some $F_1\in{\cal E}_1$. Since one can select $E_1=E$
and $f(e)=e$, every Borel set is analytic. It is also possible to show that
any analytic set is universally measurable. One can also consider the
$\sigma$-field of analytically measurable sets, which is the smallest
$\sigma$-field containing all analytic subsets of $E$. We remark that the
Borel and universally measurable $\sigma$-fields consist respectively of
Borel and universally measurable sets. The situation is different for
analytic sets and the $\sigma$-field of analytically measurable sets: the
complement of an analytic set may not be analytic, and therefore the
$\sigma$-field of analytically measurable sets contains sets which are not
analytic. We remark that there are many equivalent definitions of analytic
sets. For example, for Polish spaces they can be defined as continuous
images of Borel sets or even as projections of Borel sets. If
$(E,{\cal E})$ and $(E_1,{\cal E}_1)$ are two Borel spaces (Borel sets with
Borel $\sigma$-fields), then a mapping $f:E\to E_1$ is called universally
(analytically) measurable if $f^{-1}(B)$ belongs to the $\sigma$-field of
universally (analytically) measurable subsets of $E$ for every
$B\in{\cal E}_1.$

The assumptions for universally measurable MDPs are: (a) the state and
action spaces $(\X,\cX)$ and $(\A,\cA)$ are Borel spaces; (b)
${\rm Gr}\ \A(x)$ is an analytic subset of $\X\times\A$ and all sets
$\A(x)$ are not empty; (c) the reward function $r(x,a)$ is an upper
analytic function on $\X\times\A$, that is, for any real number $c$, the
set $\{r\ge c\}$ is an analytic subset of $\X\times\A;$ (d) the transition
function $p(\cdot |x,a)$ is a regular transition probability from
$(\X\times\A,\cX\times\cA)$ to $(\X,\cX)$.

Assumptions (a) and (d) coincide with the corresponding assumptions for
Borel MDPs. According to the Jankov--von Neumann theorem, assumption (b)
implies that there is an analytically measurable mapping $\phi$ from $\X$
to $\A$ such that $\phi(x)\in\A(x)$ for all $x\in\X.$ Of course, any
analytically measurable mapping is universally measurable. Assumption (c)
is more general than the assumption that $r(x,a)$ is Borel. This generality
is unimportant; it is kept in the literature just because the same proofs
hold for upper analytic and Borel reward functions.
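The relevance of analytic sets to assumption (c) and to value functions can
be seen from the following standard observation (stated here only as an
illustration). If $r$ is Borel (or upper analytic) on $\X\times\A$ and
$V(x)=\sup_{a\in\A(x)}r(x,a)$ is the value of a one-step problem with zero
terminal reward, then for every real $c$
\[
\{x : V(x)>c\}= {\rm proj}_{\X}
\left( \{(x,a) : r(x,a)>c\}\cap {\rm Gr}\ \A(x) \right) ,
\]
a projection of an analytic set; such a projection is again analytic but,
in general, not Borel. Hence every set $\{V>c\}$, and therefore every set
$\{V\ge c\}=\cap_{n}\{V>c-1/n\}$, is analytic, so $V$ is upper analytic
and, in particular, universally measurable, even when it fails to be Borel.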
The last important difference between Borel and universally measurable
MDPs is that, for the latter, policies are universally measurable.
Nonrandomized policies are universally measurable mappings $\phi_n$ of
$H_n$ to $\A$ such that $\phi_n(h_n)\in \A(x_n)$ for any
$h_n=x_0a_0\ldots x_n\in H_n.$ Markov (and stationary) policies are defined
by universally measurable mappings $\phi_n$ of $\X$ to $\A$ such that
$\phi_n(x)\in\A(x)$ (respectively, by a single such mapping $\phi$ with
$\phi(x)\in\A(x)$) for all $x\in\X.$ Randomized, randomized Markov, and
randomized stationary policies are regular transition probabilities defined
in the same way as for general models, but the sets $H_n$ and $\X$ are
endowed with the $\sigma$-fields of universally measurable subsets, which
play the role of the $\sigma$-field ${\cal E}_1$ in the definition of
regular transition probabilities given above. Condition (b) implies that
there exists at least one policy. There are other versions of universally
measurable MDPs; for example, one can consider analytically measurable
policies; see Bertsekas and Shreve \cite{bs} for details. The important
feature is that all definitions and notation given for discrete MDPs also
apply to universally measurable MDPs.

\section{What's in this volume}

To be written.
%
\begin{thebibliography}{999}
%
\bibitem{bs} D.~P.~Bertsekas and S.~E.~Shreve, ``Stochastic Optimal
Control: The Discrete-Time Case,'' Academic Press, New York, 1978
(republished by Athena Scientific, 1997).
%
\bibitem{chung} K.~L.~Chung, ``Markov Chains with Stationary Transition
Probabilities,'' Springer-Verlag, Berlin, 1960.
%
\bibitem{dy} E.~B.~Dynkin and A.~A.~Yushkevich, ``Controlled Markov
Processes,'' Springer-Verlag, New York, 1979 (translation of the 1975
Russian edition).
%
\bibitem{ke} A.~S.~Kechris, ``Classical Descriptive Set Theory,''
Springer-Verlag, New York, 1995.
%
\bibitem{ne} J.~Neveu, ``Mathematical Foundations of the Calculus of
Probability,'' Holden-Day, San Francisco, 1965.
\end{thebibliography}
%
\end{document}
%
% END OF FILE