{"title": "Automated Aircraft Recovery via Reinforcement Learning: Initial Experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 1022, "page_last": 1028, "abstract": "", "full_text": "Automated Aircraft Recovery \nvia Reinforcement Learning: \n\nInitial Experiments \n\nJeffrey F. Monaco \n\nBarron Associates, Inc. \n\nJordan Building \n\n1160 Pepsi Place, Suite 300 \nCharlottesville VA 22901 \n\nmonaco@bainet.com \n\nDavid G. Ward \n\nBarron Associates, Inc. \n\nJordan Building \n\n1160 Pepsi Place, Suite 300 \nCharlottesville VA 22901 \n\nward@bainet.com \n\nAndrew G. Barto \n\nDepartment of Computer Science \n\nUniversity of Massachusetts \n\nAmherst MA 01003 \nbarto@cs.umass.edu \n\nAbstract \n\nInitial experiments described here were directed toward using reinforce(cid:173)\nment learning (RL) to develop an automated recovery system (ARS) for \nhigh-agility aircraft. An ARS is an outer-loop flight-control system de(cid:173)\nsigned to bring an aircraft from a range of out-of-control states to straight(cid:173)\nand-level flight in minimum time while satisfying physical and phys(cid:173)\niological constraints. Here we report on results for a simple version \nof the problem involving only single-axis (pitch) simulated recoveries. \nThrough simulated control experience using a medium-fidelity aircraft \nsimulation, the RL system approximates an optimal policy for pitch-stick \ninputs to produce minimum-time transitions to straight-and-Ievel flight in \nunconstrained cases while avoiding ground-strike. The RL system was \nalso able to adhere to a pilot-station acceleration constraint while execut(cid:173)\ning simulated recoveries. \n\n\fAutomated Aircraft Recovery via Reinforcement Learning \n\n1023 \n\n1 INTRODUCTION \n\nAn emerging use of reinforcement learning (RL) is to approximate optimal policies for \nlarge-scale control problems through extensive simulated control experience. 
Described here are initial experiments directed toward the development of an automated recovery system (ARS) for high-agility aircraft. An ARS is an outer-loop flight control system designed to bring the aircraft from a range of initial states to straight, level, and non-inverted flight in minimum time while satisfying constraints such as maintaining altitude and accelerations within acceptable limits. Here we describe the problem and present initial results involving only single-axis (pitch) recoveries. Through extensive simulated control experience using a medium-fidelity simulation of an F-16, the RL system approximated an optimal policy for longitudinal-stick inputs to produce near-minimum-time transitions to straight and level flight in unconstrained cases, as well as while meeting a pilot-station acceleration constraint. \n\n2 AIRCRAFT MODEL \n\nThe aircraft was modeled as a dynamical system with state vector x = {q, α, p, r, β, V_T}, where q = body-axes pitch rate, α = angle of attack, p = body-axes roll rate, r = body-axes yaw rate, β = angle of sideslip, V_T = total airspeed, and control vector δ = {δ_se, δ_ae, δ_af, δ_rud} of effector and pseudo-effector displacements. The controls are defined as: δ_se = symmetric elevon, δ_ae = asymmetric elevon, δ_af = asymmetric flap, and δ_rud = rudder. (A pseudo-effector is a mathematically convenient combination of real effectors that, e.g., contributes to motion in a limited number of axes.) The following additional descriptive variables were used in the RL problem formulation: h = altitude, ḣ = vertical component of velocity, Θ = pitch attitude, N_z = pilot-station normal acceleration. For the initial pitch-axis experiment described here, five discrete actions were available to the learning agent in each state; these were longitudinal-stick commands selected from {-6, -3, 0, +3, +6} lbf. 
The command chosen by the learning agent was converted into a desired normal-acceleration command through the standard F-16 longitudinal-stick command gradient with software breakout. This gradient maps pounds-of-force inputs into desired acceleration responses. We then produce an approximate relationship between normal acceleration and body-axes pitch rate to yield a pitch-rate flying-qualities model. Given this model, an inner-loop linear-quadratic (LQ) tracking control algorithm determined the actuator commands to result in optimal model-following of the desired pitch-rate response. \n\nThe aircraft model consisted of complete translational and rotational dynamics, including nonlinear terms owing to inertial cross-coupling and orientation-dependent gravitational effects. These were obtained from a modified linear F-16 model with dynamics of the form \n\nẋ = Ax + Bδ + b + N \n\nwhere A and B were the F-16 aero-inertial parameters (stability derivatives) and effector sensitivities (control derivatives). These stability and control derivatives and the bias vector, b, were obtained from linearizations of a high-fidelity nonlinear, six-degree-of-freedom model. Nonlinearities owing to inertial cross-coupling and orientation-dependent gravitational effects were accounted for through the term N, which depended nonlinearly on the state. Nonlinear actuator dynamics were modeled via the incorporation of F-16 effector-rate and effector-position limits. See Ward et al. (1996) for additional details. \n\n3 PROBLEM FORMULATION \n\nThe RL problem was to approximate a minimum-time control policy capable of bringing the aircraft from a range of initial states to straight, level, and non-inverted flight, while satisfying given constraints, e.g., maintaining the normal acceleration at the pilot station within an acceptable range. 
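As a concrete illustration, the modified linear model ẋ = Ax + Bδ + b + N above can be propagated numerically, e.g., with a forward-Euler step at the 0.02 sec. system rate. The matrices and the nonlinear term in this sketch are illustrative placeholders, not actual F-16 stability and control derivatives.

```python
import numpy as np

DT = 0.02  # system rate used in the ARS (sec.)

def euler_step(x, u, A, B, b, N):
    """Advance x_dot = A x + B u + b + N(x) by one forward-Euler step.

    A, B, b play the role of the linearized stability/control
    derivatives and bias; N(x) stands in for the nonlinear
    cross-coupling and gravity terms.  All values below are
    illustrative placeholders, not F-16 data.
    """
    x_dot = A @ x + B @ u + b + N(x)
    return x + DT * x_dot

# Toy 2-state, 1-effector system (placeholder numbers).
A = np.array([[0.0, 1.0], [-0.5, -0.2]])
B = np.array([[0.0], [1.0]])
b = np.zeros(2)
N = lambda x: np.zeros(2)  # nonlinear term dropped in this sketch

x = euler_step(np.array([0.1, 0.0]), np.array([0.5]), A, B, b, N)
```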
For the single-axis (pitch-axis) flight control problem considered here, recovered flight was defined by: \n\nq = q̇ = ḣ = ḧ = V̇_T = 0.    (1) \n\nSuccessful recovery was achieved when all conditions in Eq. 1 were satisfied simultaneously within pre-specified tolerances. \n\nBecause we wished to distinguish between recovery supplied by the LQ tracker and that learned by the RL system, special attention was given to formulating a meaningful test to avoid falsely attributing successes to the RL system. For example, if initial conditions were specified as off-trim perturbations in body-axes pitch rate, pitch acceleration, and true airspeed, the RL system may not have been required because the LQ controller would provide all the necessary recovery, i.e., zero longitudinal-stick input would result in a commanded body-axes pitch rate of zero deg./sec. Because this controller is designed to be highly responsive, its tracking and integral-error penalties usually ensure that the aircraft responses attain the desired state in a relatively short time. The problem was therefore formulated to demand recovery from aircraft orientations where the RL system was primarily responsible for recovery, and the goal state was not readily achieved via the stabilizing action of the LQ control law. \n\nA pitch-axis recovery problem of interest is one in which initial pitch attitude, Θ_0, is selected to equal Θ_trim + U(Θ_0min, Θ_0max), where Θ_trim ≡ α_trim (by definition), U is a uniformly distributed random number, Θ_0min and Θ_0max define the boundaries of the training region, and other variables are set so that when the aircraft is parallel to the earth (Θ_0 = 0), it is \"pancaking\" toward the ground (with positive trim angle of attack). Other initial conditions correspond to purely-translational climb or descent of the aircraft. 
For initial conditions where Θ_0 < α_trim, the flight vehicle will descend, and in the absence of any corrective longitudinal-stick force, strike the ground or water. Because it imposes no constraints on altitude or pitch-angle variations, the stabilizing response of the LQ controller is inadequate for providing the necessary recovery. \n\n4 REINFORCEMENT LEARNING ALGORITHM \n\nSeveral candidate RL algorithms were evaluated for the ARS. Initial efforts focused primarily on (1) Q-Learning, (2) alternative means for approximating the action-value function (Q function), and (3) use of discrete versus continuous-action controls. During subsequent investigations, an extension of Q-Learning called Residual Advantage Learning (Baird, 1995; Harmon & Baird, 1996) was implemented and successfully applied to the pitch-axis ARS problem. As with action-values in Q-Learning, the advantage function, A(x, u), may be represented by a function approximation system of the form \n\nA(x, u) = φ(x, u)ᵀθ,    (2) \n\nwhere φ(x, u) is a vector of relevant features and θ are the corresponding weights. Here, the advantage function is linear in the weights, θ, and these weights are the modifiable, learned parameters. \n\nFor advantage functions of the form in Eq. 2, the update rule is: \n\nθ_{k+1} = θ_k − α ( (r + γ^Δt A(y, b*)) (1/(KΔt)) + (1 − 1/(KΔt)) A(x, a*) − A(x, a) ) \n    · ( Φ γ^Δt φ(y, b*) (1/(KΔt)) + Φ (1 − 1/(KΔt)) φ(x, a*) − φ(x, a) ), \n\nwhere a* = argmin_a A(x, a) and b* = argmin_b A(y, b), Δt is the system rate (0.02 sec. in the ARS), γ^Δt is the discount factor, and K is a fixed scale factor. In the above notation, y is the resultant state, i.e., the execution of action a results in a transition from state x to its successor y. \n\nThe Residual Advantage Learning update collapses to the Q-Learning update for the case Φ = 0, K = 1. 
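With the linear form of Eq. 2, the update rule above reduces to arithmetic on the weight vector. The sketch below implements one such step for a fixed mixing parameter (written phi_mix to distinguish it from the feature map φ); the feature function, step size, and constants are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def advantage_update(theta, feats, r, x, a, y, actions,
                     alpha=0.1, gamma=0.9, dt=0.02, K=10.0, phi_mix=0.5):
    """One residual advantage-learning step for A(x, u) = feats(x, u) . theta.

    a* = argmin_a A(x, a) and b* = argmin_b A(y, b); 1/(K dt) scales
    the Bellman term as in the update rule above.  alpha, gamma, K,
    and phi_mix (the residual/direct mixing parameter, Phi in the
    text) are illustrative constants.
    """
    A = lambda s, u: feats(s, u) @ theta
    a_star = min(actions, key=lambda u: A(x, u))
    b_star = min(actions, key=lambda u: A(y, u))
    g, kdt = gamma ** dt, 1.0 / (K * dt)

    # Bellman residual for the advantage function.
    delta = (r + g * A(y, b_star)) * kdt + (1.0 - kdt) * A(x, a_star) - A(x, a)
    # Mixed residual/direct gradient (linear case: gradients are features).
    grad = (phi_mix * g * feats(y, b_star) * kdt
            + phi_mix * (1.0 - kdt) * feats(x, a_star)
            - feats(x, a))
    return theta - alpha * delta * grad
```

Setting phi_mix = 0 recovers the direct (fast) algorithm and phi_mix = 1 pure residual-gradient descent, mirroring the trade-off controlled by Φ.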
The parameter ~ is a scalar that controls the trade-off between residual(cid:173)\ngradient descent when ~ = 1, and a faster, direct algorithm when ~ = O. Harmon & Baird \n(1996) address the choice of ~, suggesting the following computation of ~ at each time step: \n\n;F,. \n'J!'= \n\nl:o WdWrg \n\nl:o(Wd - wrg)wrg \n\n+J-L \n\nwhere Wd and Wrg are traces (one for each (J of the function approximation system) associ(cid:173)\nated with the direct and residual gradient algorithms, respectively, and J-L is a small, positive \nconstant that dictates how rapidly the system forgets. The traces are updated during each \ncycle as follows \n\nWd \n\nf -\n\n(1-J-L)Wd-J-L[(r+'Y~tA(y,b*)) K~t+(1- K~t)A(X,a*)] \n\n\u2022 [- :(JA(x, a*)] \n\nwrg \n\nf -\n\n(1-J-L)Wrg-J-L[(r+'Y~tA(y,b*\u00bbK~t+(1- K~t)A(x,a*)-A(X,a)] \n\n\u2022 ['Y~t ;(JA(y,b*) K~t + (1- K~t) ;(JA(x,a*) -\n\n;(JA(X, a)] . \n\nAdvantage Learning updates of the weights, including the calculation of an adaptive ~ as \ndiscussed above, were implemented and interfaced with the aircraft simulation. The Ad(cid:173)\nvantage Learning algorithm consistently outperformed its Q-Learning counterpart. For this \nreason, most of our efforts have focused on the application of Advantage Learning to the \nsolution of the ARS. The feature vector 4>(x, u) consisted of normalized (dimensionless) \nstates and controls, and functions ofthese variables. Use ofthese nondimensionalized vari(cid:173)\nables (obtained via the Buckingham 7r-theorem; e.g., Langharr, 1951) was found to enhance \ngreatly the stability and robustness of the learning process. Furthermore, the RL system ap(cid:173)\npeared to be less sensitive to changes in parameters such as the learning rate when these \ntechniques were employed. \n\n5 TRAINING \n\nTraining the RL system for arbitrary orientations was accomplished by choosing random \ninitial conditions on e as outlined above. 
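The episodic training regime described in this section (a uniformly random initial pitch attitude each trial, a reward of -1 per time step until the recovery tolerances are met) can be sketched as follows; the environment and agent objects are hypothetical stand-ins for the aircraft simulation and the RL system, not the authors' software.

```python
import random

def train(env, agent, theta_min, theta_max, episodes=100, max_steps=500):
    """Episodic training sketch: each trial draws an initial pitch
    attitude uniformly from the training region, and the agent is
    rewarded -1 per time step until the goal state is reached.
    `env` and `agent` are hypothetical stand-ins for the aircraft
    simulation and the RL system."""
    for _ in range(episodes):
        state = env.reset(pitch0=random.uniform(theta_min, theta_max))
        for _ in range(max_steps):
            action = agent.act(state)
            next_state, done = env.step(action)
            agent.learn(state, action, -1.0, next_state)  # -1 per step
            state = next_state
            if done:  # all goal conditions satisfied within tolerance
                break
```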
With the exception of ḣ, all other initial conditions corresponded to trim values for a Mach 0.6, 5 kft. flight condition. Rewards were -1 per time step until the goal state was reached. In preliminary experiments, the training region was restricted to ±0.174 rad. (10 deg.) from the trim pitch angle. For this range of initial conditions, the system was able to learn an appropriate policy given only a handful of features (approximately 30). The policy was significantly mature after 24 hours of learning on an HP-730 workstation and appeared to be able to achieve the goal for arbitrary initial conditions in the aforementioned domain. \n\nWe then expanded the training region and considered initial Θ values within ±0.785 rad. (45 deg.) of trim. The policy previously learned for the more restricted training domain performed well here too, and learning to recover from these more drastic off-trim conditions was trivial. No boundary restrictions were imposed on the system, but a report of whether the aircraft would have struck the ground was maintained. It was noted that recovery from all possible initial conditions could not be achieved without hitting the ground. Episodes in which the ground would have been encountered were a result of inadequate control authority and not an inadequate RL policy. For example, when the initial pitch angle was at its maximum negative value, maximum-allowable positive stick (6 lbf.) was not sufficient to pull up the aircraft nose in time. To remedy this in subsequent experiments, the number of admissible actions was increased to include larger-magnitude commands: {-12, -9, -6, -3, 0, +3, +6, +9, +12} lbf. \n\nEarly attempts at solving the pitch-axis recovery problem with the expanded initial conditions in conjunction with this augmented action set proved challenging. 
The policy that worked well in the two previous experiments was no longer able to attain the goal state; it was only able to come close and oscillate indefinitely about the goal region. The agent learned to pitch up and down appropriately, e.g., when ḣ was negative it applied a corrective positive action, and vice versa. However, because of system and actuator dynamics modeled in the simulation, the transient response caused the aircraft to pass through the goal state. Once beyond the goal region, the agent applied an opposite action, causing it to approach the goal state again, repeating the process indefinitely (until the system was reset and a new trial was started). Thus, the availability of large-amplitude commands and the presence of actuator dynamics made it difficult for the agent to formulate a consistent policy that allowed all goal-state criteria to be satisfied simultaneously. One might remedy the problem by removing the actuator dynamics; however, we did not wish to compromise simulation fidelity, and chose to use an expanded feature set to improve RL performance. Using a larger collection of features with approximately 180 inputs, the RL agent was able to formulate a consistent recovery policy. The learning process required approximately 72 hours on an HP-730 workstation. (On this platform, the combined aircraft simulation and RL software execution rate was approximately twice that of real-time.) At this point performance was evaluated. The simulation was run in evaluation mode, i.e., learning rate was set to zero and random exploration was disabled. Performance is summarized below. \n\n6 RESULTS \n\n6.1 UNCONSTRAINED PITCH-AXIS RECOVERY \n\nFig. 1 shows the transition times from off-trim orientations to the goal state as a function of initial pitch (inclination) angle. Recovery times were approximately 11-12 sec. for the worst-case scenarios, i.e., |Θ_0| = 45 deg. 
off-trim, and decrease (almost) monotonically for points closer to the unperturbed initial conditions. The occasional \"blips\" in the figure suggest that additional learning would have improved the global RL performance slightly. For |Θ_0| = 45 deg. off-trim, maximum altitude loss and gain were each approximately 1667 ft. (0.33 × 5000 ft.). These excursions may seem substantial, but when one looks at the time histories for these maneuvers, it is apparent that the RL-derived policy was performing well. The policy effectively minimizes any altitude variation; the magnitude of these changes is principally governed by available control authority and the severity of the flight condition from which the policy must recover. \n\nFig. 2 shows time histories of relevant variables for one of the limiting cases. The first column shows body-axes pitch rate (Qb) and commanded body-axes pitch rate (Qbmodel) in (deg./sec.), pilot-station normal acceleration (Nz) in (g), angle of attack (Alpha) in (deg.), and pitch attitude (Theta) in (deg.), respectively. The second column shows the longitudinal stick action executed by the RL system (lbf.), the left and right elevator deflections (deg.), total airspeed (ft./sec.), and altitude (ft.). The majority of the 1600+ ft. altitude loss occurs between zero and five sec.; during this time, the RL system is applying maximum (allowable) positive stick. Thus, this altitude excursion is principally attributed to limited control authority as well as significant off-trim initial orientations. 
\nFigure 1: Simulated Aircraft Recovery Times for Unconstrained Pitch-Axis ARS (recovery time, sec., versus initial pitch-angle offset from trim, deg.) \n\nFigure 2: Time Histories During Unconstrained Pitch-Axis Recovery for Θ_0 = Θ_trim − 45 deg. \n\n6.2 CONSTRAINED PITCH-AXIS RECOVERY \n\nThe requirement to execute aircraft recoveries while adhering to pilot-safety constraints was a deciding factor in using RL to demonstrate the automated recovery system concept. The need to recover an aircraft while minimizing injury and, where possible, discomfort to the flight crew, requires that the controller incorporate constraints that can be difficult or impossible to express in forms suitable for linear and nonlinear programming methods. \n\nIn subsequent ARS investigations, allowable pilot-station normal acceleration was restricted to the range -1.5 g ≤ N_z ≤ 3.5 g. These values were selected because the unconstrained ARS was observed to exceed these limits. Several additional features (for a total of 189) were chosen, and the learning process was continued. Initial weights for the original 180 inputs corresponded to those from the previously learned policy; the new features were chosen to have zero weights initially. 
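Warm-starting the enlarged approximator as described amounts to padding the learned weight vector with zeros, so that the expanded advantage function initially reproduces the previously learned policy exactly. A minimal sketch (the dimensions follow the text; the weight values are placeholders):

```python
import numpy as np

def expand_weights(theta_old, n_new):
    """Warm-start an enlarged linear approximator: retain the learned
    weights and initialize the added features' weights to zero, so the
    expanded advantage function initially reproduces the old policy."""
    return np.concatenate([theta_old, np.zeros(n_new)])

theta_180 = np.random.randn(180)          # previously learned weights (placeholder values)
theta_189 = expand_weights(theta_180, 9)  # 9 added constraint-related features
```

Because the appended weights are zero, any feature vector padded with the new features contributes nothing until learning resumes.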
Here, the RL system learned to avoid the normal-acceleration limits and consistently reach the goal state for initial pitch angles in the region [-45 + Θ_trim, 35 + Θ_trim] deg. Additional learning should result in improved recovery policies in this bounded-acceleration domain for all initial conditions. Nonetheless, the results showed how an RL system can learn to satisfy these kinds of constraints. \n\n7 CONCLUSION \n\nIn addition to the results reported here, we conducted extensive analysis of the degree to which the learned policy successfully generalized to a range of initial conditions not experienced in training. In all cases, aircraft responses to novel recovery scenarios were stable and qualitatively similar to those previously executed in the training region. We are also conducting experiments with a multi-axis ARS, in which longitudinal-stick and lateral-stick sequences must be coordinated to recover the aircraft. Initial results are promising, but substantially longer training times are required. In summary, we believe that the results presented here demonstrate the feasibility of using RL algorithms to develop robust recovery strategies for high-agility aircraft, although substantial further research is needed. \n\nAcknowledgments \n\nThis work was supported by the Naval Air Warfare Center Aircraft Division (NAWCAD), Flight Controls/Aeromechanics Division under Contract N62269-96-C-0080. The authors thank Marc Steinberg, the Program Manager and Chief Technical Monitor. The authors also express appreciation to Rich Sutton and Mance Harmon for their valuable help, and to Lockheed Martin Tactical Aircraft Systems for authorization to use their ATLAS software, from which F-16 parameters were extracted. \n\nReferences \n\nBaird, L. C. (1995) Residual algorithms: reinforcement learning with function approximation. In A. Prieditis and S. 
Russell (eds.), Machine Learning: Proceedings of the Twelfth International Conference, pp. 30-37. San Francisco, CA: Morgan Kaufmann. \n\nHarmon, M. E. & Baird, L. C. (1996) Multi-agent residual advantage learning with general function approximation. Wright Laboratory Technical Report, WPAFB, OH. \n\nLanghaar, H. L. (1951) Dimensional Analysis and Theory of Models. New York: Wiley and Sons. \n\nWard, D. G., Monaco, J. F., Barron, R. L., Bird, R. A., Virnig, J. C., & Landers, T. E. (1996) Self-designing controller. Final Tech. Rep. for Directorate of Mathematics and Computer Sciences, AFOSR, Contract F49620-94-C-0087. Barron Associates, Inc. \n", "award": [], "sourceid": 1386, "authors": [{"given_name": "Jeffrey", "family_name": "Monaco", "institution": null}, {"given_name": "David", "family_name": "Ward", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}