
On the practical side of things, the best-known tools for quantitative verification are PRISM [KNP11] and MRMC [KZH+11]. PRISM (Probabilistic Symbolic Model checker) started out as a model checker for probabilistic logics for Markov decision processes and Markov chains, and has grown over the years to encompass reward-based properties and stochastic games. In contrast to other tools, PRISM concentrates on symbolic encoding via binary decision diagrams, which sometimes allow highly compressed storage and thereby faster algorithms. PRISM reads models from a custom format, which allows easy creation of new models. MRMC (Markov Reward Model Checker) is a tool based on explicit storage and somewhat orthogonal in features to PRISM.

Uppaal [BDL+06] is a tool for model checking of timed systems, i.e., systems whose state space mixes continuous and discrete components. Quasy [Cha11] supports finding strategies for games with a single mean-payoff objective or with lexicographically ordered mean-payoff objectives. In addition, it finds strategies for mean-payoff objectives on MDPs and for mean-payoff parity objectives on unichain MDPs (see Chapter 2 for a definition of unichain MDPs). As input it accepts games in graph form.

On the quantitative synthesis side, Acacia+ [BBFR13] recently gained the ability to combine qualitative and quantitative specifications in a way that allows the synthesis of controllers fulfilling an LTL formula and optimizing a mean payoff.

2 Efficient Systems in Probabilistic Environments

In which we study the true meaning of efficiency and add to the large body of work on Markov decision processes.

Summary

This chapter develops the verification and synthesis method used for efficient systems in probabilistic environments. We begin with a definition of efficiency, that is, how to obtain optimal productivity with minimal effort. To this end, we call a system efficient if it optimizes the ratio between the cost and the effort applied.

On this basis, we study Markov decision processes with the ratio as objective function. We then prove that deterministic memoryless strategies are sufficient, and present three algorithms for finding optimal strategies. One of them is based on linear programming and another on linear fractional programming. These three algorithms are then evaluated on a series of examples. Finally, we select the most efficient of these algorithms and reimplement it on top of binary decision diagrams, so that it scales to systems with millions of states.

2.1 Introduction

In this chapter we show how to automatically synthesize a system that has an “efficient” average-case behavior in a given environment. The efficiency of a system is a natural question to ask; it has also been considered by others, e.g., Yue et al. [YBK10] used simulation to analyze energy efficiency in a MAC (Media Access Control) protocol. The Oxford Dictionary defines the adjective efficient as follows.

Definition 2.1 (Efficient (Oxford Dictionary)) Efficient (adjective): (of a system or machine) achieving maximum productivity with minimum wasted effort.

We analogously define efficiency as the ratio between a given cost model and a given reward model. To further motivate this choice, consider the following example: assume we want to implement an automatic gear-shifting unit (ACTS) that optimizes its behavior for a given driver profile. The goal of our implementation is to optimize the fuel consumption per kilometer (l/km), a commonly used unit to advertise efficiency. In order to be most efficient, our system has to maximize the speed (given in km/h) while minimizing the fuel consumption (measured in liters per hour, i.e., l/h) for the given driver profile. If we take the ratio between the fuel consumption (the “cost”) and the speed (the “reward”), we obtain l/km, the desired measure.
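The unit arithmetic in this example can be checked with a short sketch. The function name and the sample readings are illustrative assumptions, not measurements from this work; each sample is taken to cover one hour of driving.

```python
from fractions import Fraction

def efficiency_l_per_km(samples):
    """Ratio of accumulated cost to accumulated reward over a finite run.

    `samples` is a list of (fuel_l_per_h, speed_km_per_h) pairs, one per
    hour of driving (hypothetical data, equal-length time steps).
    """
    total_fuel = sum(Fraction(fuel) for fuel, _ in samples)    # liters
    total_dist = sum(Fraction(speed) for _, speed in samples)  # kilometers
    if total_dist == 0:
        raise ValueError("no distance travelled; the ratio is undefined")
    return total_fuel / total_dist

# Three hypothetical one-hour readings: (l/h, km/h).
run = [(6, 100), (4, 50), (5, 100)]
ratio = efficiency_l_per_km(run)  # (6 + 4 + 5) l / (100 + 50 + 100) km = 3/50 l/km
```

Dividing the accumulated quantities, rather than averaging the per-hour ratios, is what yields liters per kilometer for the run as a whole.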

Given an efficiency measure, we ask for a system with an optimal average-case behavior. The average-case behavior with respect to a quantitative specification is the expected value of the specification over all possible behaviors of the system in a given probabilistic environment [CHJS10]. We describe the probabilistic environment using Markov Decision Processes (MDPs), which are a more general model than the one considered in [CHJS10]. MDPs allow us to describe environments that react to the behavior of the system (like the driver profile).
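To make the setting concrete, the following minimal sketch encodes a toy MDP whose transitions depend on the chosen action, fixes one action per state, and evaluates the long-run cost-to-reward ratio of the resulting Markov chain. All state names, actions, and numbers are hypothetical. The sketch also assumes that the induced chain is irreducible and aperiodic and that the ratio value of a run is the limit of accumulated cost over accumulated reward; the formal definitions given later in the chapter may differ in detail.

```python
# Hypothetical toy MDP: mdp[state][action] = (cost, reward, successor distribution).
mdp = {
    "flat": {
        "high_gear": (4.0, 80.0, {"flat": 0.8, "hill": 0.2}),
        "low_gear":  (6.0, 60.0, {"flat": 0.8, "hill": 0.2}),
    },
    "hill": {
        "high_gear": (9.0, 40.0, {"flat": 0.4, "hill": 0.6}),
        "low_gear":  (7.0, 50.0, {"flat": 0.6, "hill": 0.4}),
    },
}

def stationary(chain, steps=1000):
    """Approximate the stationary distribution by repeated one-step updates.

    Assumes the chain is irreducible and aperiodic, so the iteration converges.
    """
    dist = {s: 1.0 / len(chain) for s in chain}
    for _ in range(steps):
        nxt = {s: 0.0 for s in chain}
        for s, p in dist.items():
            for t, q in chain[s].items():
                nxt[t] += p * q
        dist = nxt
    return dist

def long_run_ratio(mdp, strategy):
    """Long-run average cost divided by long-run average reward."""
    chain = {s: mdp[s][strategy[s]][2] for s in mdp}  # induced Markov chain
    pi = stationary(chain)
    avg_cost = sum(pi[s] * mdp[s][strategy[s]][0] for s in mdp)
    avg_reward = sum(pi[s] * mdp[s][strategy[s]][1] for s in mdp)
    return avg_cost / avg_reward

# One action per state (a memoryless deterministic choice).
strategy = {"flat": "high_gear", "hill": "low_gear"}
ratio = long_run_ratio(mdp, strategy)  # stationary distribution (0.75, 0.25), ratio 19/290
```

Comparing `long_run_ratio` across all four action assignments of this toy model is a brute-force way to find the best one; it is meant only as an illustration and is not one of the algorithms studied in this chapter.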

Related Work

Related work can be divided into two categories: (1) work using MDPs for quantitative synthesis and (2) work on MDP reward structures.

From the first category we first consider [CHJS10]. We generalize this work in two directions: (i) we consider ratio objectives, a generalization of average-reward objectives, and (ii) we introduce a more general environment model based on MDPs that allows the environment to change its behavior based on actions the system has taken. In the same category there is the work of Parr and Russell [PR97], who use MDPs with weights to present partially specified machines in reinforcement learning. Our approach differs from theirs in two ways: we allow the user to provide the environment, the specification, and the objective function separately, and we consider the expected ratio reward instead of the expected discounted total reward, which allows us to ask for efficient systems. Finally, in [WBB+10], Wimmer et al. introduce a semi-symbolic policy algorithm for MDPs with the average objective, while we present a semi-symbolic policy algorithm for MDPs with the ratio objective, subsuming the former.

Semi-MDPs [Put94] fall into the second category. Unlike work based on Semi-MDPs, we allow a reward of value 0. Furthermore, we provide an efficient policy iteration algorithm that works on our Ratio-MDPs as well as on Semi-MDPs. Approaches using the discounted reward payoff (cf. [Put94]) are also related but focus on immediate rewards instead of long-run rewards. Similarly related is the work of Cyrus Derman [Der62], who considered the payoff function obtained by dividing the expected costs by the expected rewards. Note that this objective function and ours are in general not the same; as we argue later, we believe that ours is more natural. Closest to our work is that of de Alfaro [dA97], who also allows rewards with value 0 and defines the expected payoff over all runs that visit a reward with value greater than zero infinitely often. In our framework the payoff is defined for all runs. De Alfaro also provides a linear programming solution, which can be used to find the ratio value in an End-Component (see Section 6). We provide two alternative solutions for End-Components, including an efficient policy iteration algorithm. Finally, we are the first to implement and compare these algorithms and to use them to synthesize efficient controllers.
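That the expected ratio and the ratio of expectations differ in general can be seen on a toy distribution over runs. The two runs below and their probabilities are hypothetical; each run is summarized by a fixed long-run average cost and reward, which is a simplifying assumption made only for illustration.

```python
from fractions import Fraction

# Hypothetical runs: (probability, average cost, average reward).
runs = [
    (Fraction(1, 2), Fraction(1), Fraction(1)),
    (Fraction(1, 2), Fraction(1), Fraction(3)),
]

# Expectation of the per-run ratio cost / reward.
expected_ratio = sum(p * c / r for p, c, r in runs)

# Expected cost divided by expected reward.
expected_cost = sum(p * c for p, c, _ in runs)
expected_reward = sum(p * r for p, _, r in runs)
ratio_of_expectations = expected_cost / expected_reward

# expected_ratio is 2/3, ratio_of_expectations is 1/2.
```

The two values coincide when every run has the same ratio, which is why the difference only shows up once runs with different long-run behavior have positive probability.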
