FreshPatents.com Logo FreshPatents.com icons
Monitor Keywords Patent Organizer File a Provisional Patent Browse Inventors Browse Industry Browse Agents

1

views for this patent on FreshPatents.com
updated 05/17/13


Inventor Store

    Free Services  

  • MONITOR KEYWORDS
  • Enter keywords & we'll notify you when a new patent matches your request (weekly update).

  • ORGANIZER
  • Save & organize patents so you can view them later.

  • RSS rss
  • Create custom RSS feeds. Track keywords without receiving email.

  • ARCHIVE
  • View the last few months of your Keyword emails.

  • COMPANY PATENTS
  • Patents sorted by company.

Automated learning of failure recovery policies   

pdficondownload pdfimage preview


Abstract: Described is automated learning of failure recovery policies based upon existing information regarding previous policies and actions. A learning mechanism automatically constructs a new policy for controlling a recovery process, based upon collected observable interactions of an existing policy with the process. In one aspect, the learning mechanism builds a partially observable Markov decision process (POMDP) model, and computes the new policy base upon the learned model. The new policy may perform automatic fault recovery, e.g., on a machine in a datacenter corresponding to the controlled process. ...

Agent: Microsoft Corporation - Redmond, WA, US
Inventors: Christopher A. Meek, Guy Shani
USPTO Applicaton #: #20110214006 - Class: 714 2 (USPTO) - 09/01/11 - Class 714 
Related Terms: Automated   Base   Information   Learning   Policy   Process   Recovery   
view organizer monitor keywords


The Patent Description & Claims data below is from USPTO Patent Application 20110214006, Automated learning of failure recovery policies.

pdficondownload pdf

BACKGROUND

Recovery from computer failures due to software or hardware problems is a significant part of managing large data centers. Sometimes the failures may be automatically fixed through actions such as rebooting or re-imaging of the computer.

In large environments, it is prohibitively expensive to have a technician decide on a repair action for each observed problem. As a result, the data centers often employ recovery systems that use some automatic repair policy or controller to choose appropriate repair actions. Typically the repair policy/controller are manually defined and created by a human expert. More particularly, the expert creates policies that map the state of the system to recovery actions by specifying a set of rules or conditions under which an action is to be taken.

However, such policies/controllers are almost never optimal, in that even though they often fix an error, the error may actually be fixed in a faster (and thus correspondingly less expensive) way. As such, these policies may result in longer failure periods (system downtime) than needed.

SUMMARY

This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.

Briefly, various aspects of the subject matter described herein are directed towards a technology by which a learning mechanism automatically constructs a new policy that automatically controls a process based upon collected observable interactions of an existing policy with the process. In one aspect, the learning mechanism builds a model of the process, including effects of possible actions, and computes the new policy base upon the learned model. The new policy may perform automatic fault recovery, e.g., on a machine in a datacenter corresponding to the controlled process.

In one implementation, the model comprises a partially observable Markov decision process (POMDP). An expectation maximization algorithm (e.g., an adapted Baum-Welch algorithm) learns the POMDP model. The policy is then computed using a point-based value iteration algorithm, such as executed by a cost-based indefinite-horizon formalization.

Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:

FIG. 1 is a block diagram representing example components/and environment for automated learning of failure recovery policies, and their application.

FIG. 2 is a flow diagram representing example steps taken with respect to automated learning of failure recovery policies.

FIG. 3 shows an illustrative example of a computing environment into which various aspects of the present invention may be incorporated.

DETAILED DESCRIPTION

Various aspects of the technology described herein are generally directed towards a learning mechanism that uses the logged experience of an existing, imperfect recovery policy (used interchangeably with “controller” herein) for automatically learning a new, improved policy. To this end, the technology uses previous state sequences of interactions between the existing recovery policy, containing the observations (e.g., error messages) that the existing policy obtained, and the actions that the existing recovery policy executed. In one implementation, from these sequences a hidden state model is learned, which provides the new, improved policy for failure recovery. For example, the model may comprise a partially observable Markov decision process (POMDP) for generating recovery policies that automatically improves existing recovery policies.

It should be understood that any of the examples herein are non-limiting. Indeed, while data center failure recovery is exemplified herein, many other real world applications, such as assembly lines, medical diagnosis systems, and failure detection and recovery systems, are also controlled by hand-made controllers, and thus may benefit from the technology described herein. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and automated recovery in general.

FIG. 1 shows a general example in which a process 102 (which can also be considered a system) is controlled by a repair policy 104. The illustrated process 102 is representative of any of a plurality of such processes/systems, such as in a large datacenter in which there is often a domain on the order of hundreds or thousands of such machines. Note that in many such domains, such machines are identical and independent, e.g., computers in service farms such as answering independent queries to a search engine may share the same configuration and execute independent programs. It is therefore unlikely, if not impossible, for errors to propagate from one machine to another.

As is known, these systems/processes may be automatically controlled by policies, in which a controlled system/process may be influenced by external actions. FIG. 1 shows the repair policy 104 deciding which repair actions 106 are to be taken to attempt to guide the process 102 towards a desired outcome.

Typically, the repair policy 104 (running on a computer system or the like) receives some observations about the state 108 of the process 102, such as from one or more watchdogs 109, comprising one or more other computers that probe the system 102, (or possibly from the process 102 itself). Given these observations, the repair policy 104 decides on what it determines is the most appropriate action. In other words, the policy 104 maps observations or states of the process 102 into appropriate actions. The observable interactions of the policy can be described as a sequence of observations and actions.

Note that the state 108 may include error messages that are received before, during or after a repair action is attempted, e.g., the repair policy may attempt a repair action based on one error message, and that repair action may result in another error message being generated, such as if the system soon fails again. More particularly, the repair policy 104 typically receives failure messages 108 about the process 102 via the one or more watchdog 109. Messages from watchdogs are often aggregated into a small set of notifications, such as “Software Error” or “Hardware Error.”

As described herein, a model of the environment may be created and used to create a revised policy that is approximately optimal given the model. In one implementation, a POMDP is used by the learning mechanism for modeling such a process. A POMDP captures the (hidden) states of the process, the stochastic transitions between states given actions, the utilities of states and the cost of actions, and the probabilities of observations given states and actions. POMDPs are further described by Michael L. Littman and Nishkam Ravi in “An instance-based state representation for network repair,” in Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI, pages 287-292, (2004) and R. D. Smallwood and E. J. Sondik, in “The optimal control of partially observable markov decision processes over a finite horizon,” Operations Research, 21:1071-1098, 1973.

However, specifying the parameters of such a model in a way that provides good results is often no easier than specifying a policy directly. Thus, the technology described herein is generally directed towards automatically learning such models. The learning process is based in part on evaluating parameters, such as the probability that a failure will be fixed given an action.

To this end, as represented in FIG. 1, logs 110 of the recovery interactions (collected in a known manner) for an existing imperfect policy 112 are used in POMDP model learning (by a suitable algorithm) 114. These logs 110 contain sequences of action observations that the policy 14 received as input, typically error messages 108 from the process 102 or information about the outcome of a repair action 106. These logs 110 may be used to process the input messages as noisy observations over the true failures of the system in order to learn the model parameters using standard Hidden Markov Model (HMM) learning techniques, such as expectation maximization (EM). One example of EM is the forward-backward (Baum-Welch) algorithm, described by Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss in “A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains,” The Annals of Mathematical Statistics, 41(1):164-171, 1970).

Given the learned POMDP model 115/parameters, an (approximately) optimal policy 118 for the learned model 115 may be computed, as represented in FIG. 1 via block 116. More particularly, with the learned POMDP model parameters, a policy 118 that optimizes repair costs may be computed, and then applied to the system as needed thereafter.

In one implementation, an approximate policy 118 may be rapidly computed via a point-based value iteration algorithm, (such as described by Joelle Pineau, Geoffrey Gordon, and Sebastian Thrun, in “Point-based value iteration: An anytime algorithm for pomdps”; In International Joint Conference on Artificial Intelligence (IJCAI), pages 1025-1032, August 2003). As this policy 118 is learned from the interactions logged from the existing policy 112, it is not guaranteed to be optimal for that system, yet it typically improves upon the previous policy. The improvement may be verified before the new policy 118 is implemented, as described below.

Turning to additional details, repair actions may succeed or fail stochastically, and often provide an escalating behavior. Actions may be labeled using increasing levels, where problems fixed by an action at level i, are also fixed by any action of level j>i. Probabilistically, this means that if j>i then pr(healthy|aj,e)>pr(healthy|ai,e) for any error a Action costs are also typically escalating, where lower level actions that fix minor problems are relatively cheap, while higher level actions are more expensive. In many real world systems this escalation can be considered roughly exponential. For example, restarting a service takes five seconds, rebooting a machine takes approximately ten minutes, while re-imaging the machine takes about two hours.

Another stochastic feature is inexact failure detection. A watchdog may report an error for a machine that is fully operational, or report a “healthy” status for a machine that experiences a failure.

In view of the escalating nature of actions and costs, one choice for a policy is an escalation policy. Such policies choose a starting entry level based on the first observation, and execute an action at that level. In many cases, due to the non-deterministic success of repair actions, each action is tried several times. After the controller decides that the action at the current level cannot fix the problem, the controller escalates to the next (more costly) action level, and so on.

FIG. 2 summarizes some example steps of automated policy learning, such as to learn many of these variable features based on actual observations. For example, given an observation, the entry level for that observation and the number of retries of each action before an escalation occurs may be optimized, via the learning algorithm 114 that uses the logs 110 of the controller execution collected by system administrators. Steps 202 and 204 of FIG. 2 represent executing the existing policy and collecting the log, respectively, used in learning.

More particularly, the learning algorithm takes as input a log L of repair sessions, in which each repair session comprises a sequence I=o0, a1, o1, . . . , onl, starting with an error notification, followed by a set of repair actions and observations until the problem is fixed. In some cases, sessions end with the machine declared as “dead,” but in practice a technician is called for these machines, repairing or replacing them. Therefore, it can be assumed that the sessions end successfully in the healthy state.

To learn the policies from the system logs, various alternatives for computing the recovery policy may be used, as represented in FIG. 2 by step 206 (likely not an actual step, but a determination made beforehand for a domain). One alternative is to begin with a simple, model-free, history-based policy computation, as represented via step 208. Another is a method that learns the POMDP model parameters (step 210), and then uses the POMDP to compute a policy (step 212).

With respect to model-free learning (of Q-values) at step 208, an optimal policy can be expressed as a mapping from action-observation histories to actions. More particularly, histories are directly observable, allowing use of the standard Q function terminology, where Q(h,a) is the expected cost of executing action a with history h and continuing the session until it terminates. This approach is known as model-free, because, for example, the parameters of a POMDP are never learned; note that histories are directly observable, and do not require any assumption about the unobserved state space.

The system log L is used to compute Q:

Cost  ( l i ) = ∑ j = i + 1  l   C  ( a j ) ( 1 ) Q  ( h , a ) = ∑ l ∈ L  δ  ( h + a , l )  Cost  ( l  h  ) ∑ l ∈ L  δ  ( h + a , l ) ( 2 )

where li is a suffix of l starting at action ai, C(a) is the cost of action a, h+a is the history h with the action a appended at its end, and δ(h,l)=1 if h is a prefix of l and 0 otherwise. The Q function is the average cost until repair of executing the action a in history h, under the policy that generated L. Learning a Q function is much faster than learning the POMDP parameters, requiring only a single pass over the training sequences in the system log.

Given the learned Q function, the following policy may be defined:

π Q  ( h ) = min a  Q  ( h , a

Download full PDF for full patent description/claims.




You can also Monitor Keywords and Search for tracking patents relating to this Automated learning of failure recovery policies patent application.

Patent Applications in related categories:

20130124908 - Systems and methods for automatic replacement and repair of communications network devices - Systems and methods for automatic repair, replacement, and/or configuration of various network devices within a communications network are disclosed. The system may receive indication of a failed network device and automatically perform diagnostic on the network device to determine any problems associated with the hardware and/or software components within the ...


###
monitor keywords



Keyword Monitor How KEYWORD MONITOR works... a FREE service from FreshPatents
1. Sign up (takes 30 seconds). 2. Fill in the keywords to be monitored.
3. Each week you receive an email with patent applications related to your keywords.  
Start now! - Receive info on patent apps like Automated learning of failure recovery policies or other areas of interest.
###


Previous Patent Application:
Optimized placement of virtual machines in a network environment
Next Patent Application:
Creation of highly available pseudo-clone standby servers for rapid failover provisioning
Industry Class:
Error detection/correction and fault detection/recovery

###

FreshPatents.com Support - Terms & Conditions
Thank you for viewing the Automated learning of failure recovery policies patent info.
- - - AAPL - Apple, BA - Boeing, GOOG - Google, IBM, JBL - Jabil, KO - Coca Cola, MOT - Motorla

Results in 0.97696 seconds


Other interesting Freshpatents.com categories:
Qualcomm , Schering-Plough , Schlumberger , Texas Instruments , g2