
Application of Reinforcement Learning for

the Control of a Control Moment Gyroscope

presented to the

Department of Mechanical Engineering

by

Daniel Fernando Gómez Berdugo

Advisor: Juan Sebastián Nuñez

For the title of Mechanical Engineer

Department of Mechanical Engineering, Universidad de Los Andes


Application of Reinforcement Learning for

the Control of a Control Moment Gyroscope

Approved by:

Juan Sebastián Nuñez, Advisor


To my parents who made me who I am and made everything

possible

A mis padres, que me hicieron quien soy y quienes hicieron todo posible


Preface

The greatest motivation for this project was to learn about gyroscopes, whose motion has always sparked great curiosity in me, and about machine learning, a cutting-edge topic that raises deep questions about the nature of consciousness.

Prefacio

La mayor motivación para este proyecto fue aprender sobre giroscopios, cuyo movimiento siempre me ha causado una gran curiosidad, y sobre aprendizaje automático, un tema novedoso que plantea preguntas profundas sobre la naturaleza de la consciencia.


Acknowledgments

I would like to thank my advisor for his encouragement and enthusiasm, especially when results were not as expected; the technicians of the manufacturing lab for their help finding ideas to improve the robustness of the prototype; my parents for their moral support; my sisters for proofreading this document; and my friends for their feedback on the presentation of the project.

Agradecimientos

Agradezco a mi asesor por su apoyo y entusiasmo, especialmente cuando los resultados no eran los esperados; a los técnicos del laboratorio de manufactura por su ayuda para encontrar ideas para mejorar la robustez del prototipo; a mis padres por su apoyo moral; a mis hermanas por revisar este documento; y a mis amigos por sus comentarios sobre la presentación del proyecto.


Table of Contents

Dedicatoria

Preface

Acknowledgments

List of Tables

List of Figures

Abstract

1 Introduction

2 Objectives

2.1 General Objective

2.2 Specific Objectives

3 Nomenclature

4 Control Moment Gyroscope Dynamics

4.1 Kinematic equations

4.2 Dynamic equations

5 Description of Prototype

5.1 General description

5.2 Design of the electronic control

5.3 Measurement of friction coefficient

6 Simulation Model of the System's Dynamics

7 Machine Learning

7.1 Reinforcement learning example: Inverted pendulum control problem

7.2 Q-learning

8 Simulation of Q-learning algorithm

8.1 Methodology

8.2 Visualization Tools

8.3 Base Parameters

9 Experimental Tests

9.1 Technical Issues of the prototype

9.2 Control scheme

9.3 PID Tests

9.4 Q-learning Tests

10 Conclusions

11 Future Work

Bibliography

A Circuit schematic


List of Tables


List of Figures

1 Photo of the BATC M-95 CMG four-wheel pyramid configuration present on the WorldView-1 and WorldView-2 satellites [1].

2 CMG prototype built at Universidad de los Andes by Juan David Lopez [2].

3 Basic components of a CMG.

4 Diagram showing how a CMG generates torque to move a body.

5 Reference frames for an array of 3 SGCMG. Adapted from [2].

6 CMG system used.

7 Measured data and fits used to calculate the friction factor of the prototype along the z axis.

8 (a) CMG system. (b) Simmechanics simulation of the system.

9 Angular velocity of the body with a gimbal rate of 0.1 rad/s in one of the gimbals.

10 Angular velocity of the body with all gimbals at the same gimbal rate of 0.1 rad/s.

11 Roll trajectory.

12 Comparison between (a) no gimbal inertia and (b) a gimbal inertia of 200 kg·mm².

13 Reinforcement learning flow diagram. Taken from [3].

14 Diagram of an inverted pendulum with a cart to provide control.

15 Reward function used for the Q-learning algorithm.

16 Simulation of Q-learning algorithm for different levels of friction of the model.

17 Controller trained with 300 learning episodes tasked with following a sine trajectory.

18 Controller trained with 600 learning episodes tasked with following a sine trajectory.

19 Action maps learned by the Q-learning algorithm for different numbers of episodes.

20 Controller trained with a finer discretization angle following a sinusoidal trajectory.

21 Experimental setup for the tests performed.

22 Different flywheel alternatives used.

23 (a) Original support. Some epoxy mastic was added to hold the gimbal shaft and the motor support together but it proved to be ineffective. (b) New support.

24 Block diagram for the control of the yaw angle.

25 Response to a step input to turn 90°.

26 Schematic view of the circuit used to control the CMG prototype.


Abstract

This project is a first approach to a reinforcement learning based controller that provides attitude control using a control moment gyroscope (CMG) array. A multibody simulation of a CMG array was built in Simmechanics and compared with a mathematical model in MATLAB. This model was used to simulate a controller based on a Q-learning algorithm and a PID controller, both tasked with controlling the yaw angle of the system. The Q-learning controller was able to stabilize the system after a step input within 5 to 10 seconds, under very specific parameters for the learning algorithm. Both controllers were tested on a CMG prototype and responded satisfactorily. The PID showed a better performance, with a shorter stabilization time.


Resumen

Este proyecto consiste en una primera aproximación a un controlador basado en aprendizaje por refuerzo para proveer control de actitud utilizando un arreglo de giroscopios de control de momento (CMG). Una simulación multicuerpo de un arreglo de CMGs fue creada utilizando Simmechanics y fue comparada con un modelo matemático en MATLAB. Este modelo fue usado para simular un controlador basado en un algoritmo Q-learning y un PID, ambos encargados de controlar el ángulo de dirección (yaw) del sistema. El controlador basado en Q-learning fue capaz de estabilizar el sistema luego de una entrada de escalón entre 5 y 10 segundos usando parámetros muy específicos para el algoritmo de aprendizaje. Ambos controladores fueron probados en un prototipo de CMG y respondieron satisfactoriamente. El PID mostró un mejor desempeño, con un menor tiempo de estabilización.


1 Introduction

Attitude control is a fascinating problem in rigid body dynamics which refers to the control of a body’s orientation in space, or how to maintain an angular position or trajectory. The most common application of attitude control is in the aerospace industry, in which precise control of roll, pitch and yaw is needed to achieve stability and to follow a desired trajectory. Attitude control can be provided through external actuators such as ailerons or thrusters or through internal actuators in the form of angular momentum exchange devices. These internal devices are divided into reaction wheels and control moment gyroscopes (CMG). Reaction wheels generate torque by changing the speed of a flywheel while CMGs work by changing the orientation of a flywheel running at high speed [4]. CMGs are generally more efficient than reaction wheels in terms of energy consumption per torque generated which makes them an attractive solution to achieve attitude control.

Figure 1: Photo of the BATC M-95 CMG four-wheel pyramid configuration present on the WorldView-1 and WorldView-2 satellites [1].


However, CMG dynamics are much more complex, which makes the implementation of a controller a challenging problem. As the flywheels change their orientation, the direction of the torque generated also changes. This makes the relation between the input torque and the movement of the body highly non-linear. In addition, the dynamic response of the system is affected by the moments of inertia of all its components, by the presence of friction forces and by many other hard-to-model perturbations. This compels the use of sophisticated and robust control techniques that can deal with non-linearity and unknown parameters, such as adaptive or sliding mode control [5].

The proposal of this thesis is to use an alternative approach and derive a control scheme for a CMG array without precise knowledge of the system’s dynamical equations by using machine learning methods.

Machine learning is the field of programming a computer to learn how to do a task without being explicitly programmed on how the task should be done. It is especially effective for solving problems that are hard to program but are routinely solved by humans or animals, such as recognizing visual elements or coordinating limb movement. The problem of CMG control falls into this category. While it is a complex problem that requires sophisticated methods, the human mind frequently solves this control problem when learning how to ride a bike. The front wheel acts as a flywheel, and when it is turned it produces a reaction torque which is carefully controlled by the cyclist to maintain balance without consciously calculating the torque needed.

Machine learning has been implemented successfully for complex control problems such as helicopter flying [6]. It should be noted that although machine learning can in theory tackle any control task, in practice the implementation is usually easier by programming the controller explicitly, even for a complex control problem like the CMG. However, it is interesting to apply machine learning to study its advantages and limitations and to acquire knowledge that facilitates its implementation in future projects. Furthermore, the relation between input and output varies drastically with the configuration of the gimbals in a CMG array, so it would be useful to obtain a learning algorithm that is able to generate a controller for different gimbal configurations.

A CMG array prototype was built in an earlier graduation project at Universidad de los Andes. This prototype lacks any form of control,


Figure 2: CMG prototype built at Universidad de los Andes by Juan David Lopez [2].

which provides an excellent opportunity to study the dynamics of the system and apply machine learning techniques as a way to provide attitude control. This bachelor thesis describes the process used for the implementation and testing of a Q-learning algorithm to provide control of the yaw angle of this CMG array (1 degree of freedom). To achieve this objective, first a mathematical model of the CMG array prototype was developed and compared with a multibody simulation in Simmechanics. Several assumptions in the model were tested and trajectories were studied to establish a practical way to implement the controller. Then a machine learning algorithm was tested in simulation using the previous model. Finally, the control law obtained in the simulation was tested on the physical prototype and its performance was compared to that of a PID controller.

2 Objectives

2.1 General Objective

To demonstrate the use of machine learning techniques for the attitude control of an array of control moment gyroscopes by implementing a Q-learning algorithm.


2.2 Specific Objectives

1. Create a simplified mathematical model of the system's dynamics and corroborate the approximations with a Simmechanics model.

2. Design a controller scheme based on machine learning for the CMG system.

3. Implement a learning algorithm to control the system's yaw in a computer simulation.

4. Set up the CMG system to enable the implementation of a controller.

5. Perform experimental tests to evaluate the performance of the algorithm in 1 degree of freedom and compare it with a PID controller.

3 Nomenclature

\(a\): scalar
\(\vec{a}\): vector
\(\hat{a}\): matrix
\(\hat{A}_{ab}\): transformation matrix from frame a to frame b
\(\dot{\theta}_i\): gimbal rate of the i-th gimbal
\(\Omega\): flywheel speed
\(J\): moment of inertia of the flywheel
\(\hat{I}^{B}_{a}\): moment of inertia of the body in reference frame a
\(\vec{w}_i\): unit vector parallel to the i-th flywheel spin axis
\(\vec{g}_i\): unit vector parallel to the i-th gimbal axis
\({}^{a}\vec{\omega}^{b}\): angular velocity of frame b with respect to frame a
\(\vec{H}^{a}_{b}\): angular momentum of a in the b frame

4 Control Moment Gyroscope Dynamics

A CMG consists of three parts: a body, gimbals and flywheels. The flywheels spin at great speeds, generating a considerable angular momentum along their spin axes \(\vec{w}\). The gimbals change the orientation of the flywheels by rotating them around the gimbal axes \(\vec{g}\), which generates a torque that moves the body.

Figure 3: Basic components of a CMG

According to the rotational version of Newton's Second Law of Motion, the total torque applied is equal to the change in angular momentum. When the gimbal rotates, the angular momentum of the flywheel changes slightly in a direction perpendicular to both the angular momentum vector and the gimbal axis. The torque the CMG applies to the body is the reaction, so its direction is opposite to the change of the angular momentum of the flywheel.

Figure 4: Diagram showing how a CMG generates torque to move a body


When a CMG has a single gimbal and a fixed-speed flywheel it is called an SGCMG (Single Gimbal Control Moment Gyroscope), and it provides a single degree of freedom of control by moving the gimbal on its axis. In order to obtain all the degrees of freedom needed for the attitude control of a rigid object in space, an array of 3 or more SGCMGs is used. SGCMG arrays are specified by the direction of each gimbal axis; the position of each SGCMG has no effect on the dynamics [7]. There are several types of CMG, with two gimbals per flywheel or with a variable flywheel speed, which provide more degrees of freedom of control per device, but for this project only SGCMGs will be used.

To formulate the equations of motion of a SGCMG, three reference frames are used, the inertial reference frame (I), the body reference frame (B) and the gimbal reference frame (G) (see figure 5). Note that the spin axis of the flywheel is fixed in the gimbal reference frame and because the flywheel is radially symmetric around this axis there is no need to consider the flywheel reference frame separately.

4.1 Kinematic equations

The angular velocity of the flywheel around \(\vec{w}\) (see figure 3) will be referred to as the flywheel's speed; the angular velocity of the flywheel around \(\vec{g}\) will be referred to as the gimbal rate. The angular velocity of the gimbal frame (\(G_i\)) as seen by the inertial frame (\(I\)) is given by:

\[ {}^{I}\vec{\omega}^{G_i} = {}^{I}\vec{\omega}^{B} + {}^{B}\vec{\omega}^{G_i} = {}^{I}\vec{\omega}^{B} + \dot{\theta}_i\,\vec{g}_i \qquad (1) \]

The angular velocity of the flywheel is given by:

\[ {}^{I}\vec{\omega}^{W_i} = {}^{I}\vec{\omega}^{B} + {}^{B}\vec{\omega}^{G_i} + {}^{G_i}\vec{\omega}^{W_i} = {}^{I}\vec{\omega}^{B} + \dot{\theta}_i\,\vec{g}_i + \Omega\,\vec{w}_i \qquad (2) \]

where \(\dot{\theta}_i\) is the gimbal rate of the i-th gimbal, \(\Omega\) is the flywheel speed and \({}^{a}\vec{\omega}^{b}\) is the angular velocity of frame \(b\) as seen by frame \(a\).
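As a small numerical illustration of equations 1 and 2, the following Python sketch composes the angular velocities with all vectors expressed in a common frame. The numbers are arbitrary example values, not prototype data, and the thesis itself works in MATLAB.

import numpy as np

omega_B = np.array([0.0, 0.0, 0.01])   # body angular velocity [rad/s], example value
g_i = np.array([1.0, 0.0, 0.0])        # gimbal axis unit vector, example
w_i = np.array([0.0, 1.0, 0.0])        # flywheel spin axis unit vector, perpendicular to g_i
theta_dot_i = 0.1                      # gimbal rate [rad/s], example
Omega = 700.0                          # flywheel speed [rad/s], example

omega_gimbal = omega_B + theta_dot_i * g_i                   # equation 1
omega_flywheel = omega_B + theta_dot_i * g_i + Omega * w_i   # equation 2
print(omega_gimbal, omega_flywheel)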


4.2 Dynamic equations

The motion of the multiple reference frames and the moments of inertia of the different bodies involved yield an equation for the system's dynamics with hundreds of terms. Fortunately, most of these terms are negligible when compared to the effect of the flywheels. The following assumptions and simplifications were used to obtain a compact equation that is able to model the motion of the system:

1. The inertias of all the flywheels are equal.

2. The flywheel speeds are all equal.

3. The angular momentum of the flywheel along components other than its spin axis is negligible.

4. The flywheel speed remains constant.

5. The torque generated by the angular acceleration of components in the gimbal frame other than the flywheels is negligible.

6. The center of mass of the complete system lies on its center of rotation.

To obtain the dynamical equations of the system it is convenient to start with Euler's equation for rigid body motion in the body reference frame. Then, to obtain the torque the flywheels exert on the body, the angular momentum of the flywheels is found and differentiated with respect to time in the inertial frame. Finally, the kinematic equations are introduced to write the equation in terms of the body angular velocity, the flywheel speed and the gimbal rate. Starting with Euler's equation:

\[ \hat{I}\,\dot{\vec{\omega}} = \vec{T} - \vec{\omega}\times\vec{H} \qquad (3) \]

\[ {}^{I}\dot{\vec{\omega}}^{B}_{B} = \left(\hat{I}^{B}_{B}\right)^{-1}\left[\vec{T}_{B} - {}^{I}\vec{\omega}^{B}_{B}\times\hat{I}^{B}_{B}\,{}^{I}\vec{\omega}^{B}_{B}\right] \qquad (4) \]

where \(\vec{T}\) is the torque generated by the SGCMGs. To find this torque it is necessary to find the angular momentum of the flywheels in the body frame and differentiate it with respect to time in the inertial frame. The torque each flywheel exerts on the body is equal to the negative of the rate of change of the angular momentum of that flywheel [8]. It can easily be calculated in the gimbal frame using:

\[ \vec{T}_{G} = -\frac{{}^{I}d}{dt}\vec{H}^{w}_{G} = -\frac{{}^{G}d}{dt}\vec{H}^{w}_{G} - {}^{I}\vec{\omega}^{G}\times\vec{H}^{w}_{G} \qquad (5) \]

The angular momentum of any of the flywheels is given by:

\[ \vec{H}^{w}_{G} = \hat{J}_{G}\left({}^{I}\vec{\omega}^{G}_{G} + \Omega\,\vec{w}_{G}\right) + m\,\vec{r}\times\vec{v} \approx J\Omega\,\vec{w} \qquad (6) \]

For this approximation the term \(m\,\vec{r}\times\vec{v}\) is ignored because the effect of the dead mass of the flywheel can easily be incorporated into the moment of inertia of the body, since it is assumed that the body rotation occurs around its center of mass. The components of the angular momentum that come from \({}^{I}\vec{\omega}^{G}\) are neglected because the angular velocity of the body and the gimbal rate are typically three to four orders of magnitude lower than the flywheel's speed [9]. Then, because only the component parallel to \(\vec{w}\) is taken into account and this is a principal axis of the flywheel, \(\hat{J}\) can be replaced by the scalar \(J\), the moment of inertia of the flywheel about its spin axis.

Next, equation 6 is replaced into equation 5.

\[ \vec{T}_{G} = -\frac{{}^{G}d}{dt}\left(J\Omega\,\vec{w}_{G}\right) - {}^{I}\vec{\omega}^{G}_{G}\times J\Omega\,\vec{w}_{G} \qquad (7a) \]

\[ \vec{T}_{G} = -J\dot{\Omega}\,\vec{w}_{G} - \left({}^{I}\vec{\omega}^{B}_{G} + {}^{B}\vec{\omega}^{G}_{G}\right)\times J\Omega\,\vec{w}_{G} \qquad (7b) \]

\[ \vec{T}_{G} = -J\Omega\left({}^{I}\vec{\omega}^{B}_{G}\times\vec{w}_{G} + \dot{\theta}\,\vec{g}_{G}\times\vec{w}_{G}\right) \qquad (7c) \]

\[ \vec{T}_{B} = -J\Omega\left({}^{I}\vec{\omega}^{B}_{B}\times\vec{w}_{B} + \dot{\theta}\,\vec{g}_{B}\times\vec{w}_{B}\right) \qquad (8) \]

It is assumed that \(\Omega\) is constant for the type of CMG used, so the first term in equation 7b is neglected in equation 7c. For the final step, the torque is written in the body coordinate frame, giving equation 8.


Finally, equation 8 is substituted into equation 4, taking into account that a summation over the different gyroscopes must be made.

\[ {}^{I}\dot{\vec{\omega}}^{B}_{B} = -\left(\hat{I}^{B}_{B}\right)^{-1}\left[{}^{I}\vec{\omega}^{B}_{B}\times\hat{I}^{B}_{B}\,{}^{I}\vec{\omega}^{B}_{B} + \sum_{i} J\Omega\left({}^{I}\vec{\omega}^{B}_{B}\times\vec{w}_{iB} + \dot{\theta}_{i}\,\vec{g}_{iB}\times\vec{w}_{iB}\right)\right] \qquad (9a) \]

\[ {}^{I}\dot{\vec{\omega}}^{B}_{B} = -\left(\hat{I}^{B}_{B}\right)^{-1}\left[{}^{I}\vec{\omega}^{B}_{B}\times\left(\hat{I}^{B}_{B}\,{}^{I}\vec{\omega}^{B}_{B} + \sum_{i} J\Omega\,\vec{w}_{iB}\right) + \sum_{i} J\Omega\,\dot{\theta}_{i}\,\vec{g}_{iB}\times\vec{w}_{iB}\right] \qquad (9b) \]

An interesting aspect of this equation is that the torque the flywheels exert on the body depends not only on the gimbal rate but also on the motion of the body. If the flywheels have a resultant total angular momentum, then the motion of the body will cause a change in this angular momentum and thus a reaction torque will be generated. This coupling is essentially a second order differential equation and causes oscillations in the angular velocity of the body similar to a spinning top that is perturbed perpendicularly to its spinning axis.

An effective way to reduce the effect of this term is to set all initial gimbal angles in a way that the sum of the angular momentum of the flywheels cancels. This avoids nutation and precession (angular oscillations along the x and y axes, respectively) and facilitates the inclination of the body along axes other than z.

Another way to interpret this equation is that CMGs work through the principle of conservation of angular momentum. When a gimbal is rotated, the angular momentum vector of that gimbal's flywheel is changed, so the angular momentum of the other flywheels and the body must change so that the total is conserved. This offers another method to obtain the angular velocity of the body in terms of the gimbal positions and the orientation of the body. The difference between the initial and current sum of the angular momentum of the flywheels is equated to the angular momentum of the body:


\[ \hat{I}^{B}_{I}\cdot{}^{I}\vec{\omega}^{B}_{I} = J\Omega\,\hat{A}_{BI}\sum_{i}\left(\vec{w}_{iB}(t=0) - \vec{w}_{iB}\right) \qquad (10b) \]

In practice, equation 10b is limited because it is hard to include the effect of external perturbations; however, it proved to be useful as a way to visualize the motion of the system in simple trajectories and to double-check computations made with equation 9a.

The above equations are general for any array of SGCMGs as long as the approximations described are applicable, which is a common scenario for SGCMGs. With knowledge of the configuration of the SGCMG array, the terms for the vectors \(\vec{g}\) and \(\vec{w}\) can be replaced so that the equation is written completely in terms of the gimbal angles and rates.
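To make equation 9a concrete, the following is a minimal numerical sketch in Python (the project itself used MATLAB and Simmechanics). The inertia tensor, flywheel speed and the zero-angle orientation of the spin axes are placeholder assumptions, not the prototype's exact values, so the output is only illustrative.

import numpy as np
from scipy.integrate import solve_ivp

# Placeholder parameters (not the prototype values)
I_B = np.diag([0.044, 0.044, 0.044])       # body inertia tensor in the body frame [kg m^2]
J = 45.9e-6                                # flywheel spin-axis inertia [kg m^2]
Omega = 7000 * 2 * np.pi / 60              # flywheel speed [rad/s]

# Pyramid configuration: gimbal axes in the xy plane, 120 degrees apart
g_axes = np.array([[np.cos(a), np.sin(a), 0.0]
                   for a in np.deg2rad([0.0, 120.0, 240.0])])

def w_axis(g, theta):
    """Flywheel spin axis for gimbal angle theta: assumed to lie in the xy plane
    at theta = 0 and rotate toward z as the gimbal turns."""
    in_plane = np.cross([0.0, 0.0, 1.0], g)          # spin axis at theta = 0
    return np.cos(theta) * in_plane + np.sin(theta) * np.array([0.0, 0.0, 1.0])

def body_accel(omega_B, thetas, theta_dots):
    """Right-hand side of equation 9a: body angular acceleration."""
    torque_sum = np.zeros(3)
    for g, th, thd in zip(g_axes, thetas, theta_dots):
        w = w_axis(g, th)
        torque_sum += J * Omega * (np.cross(omega_B, w) + thd * np.cross(g, w))
    return -np.linalg.solve(I_B, np.cross(omega_B, I_B @ omega_B) + torque_sum)

# Integrate body rates and gimbal angles for constant, equal gimbal rates
def rhs(t, x, theta_dots):
    omega_B, thetas = x[:3], x[3:]
    return np.concatenate([body_accel(omega_B, thetas, theta_dots), theta_dots])

sol = solve_ivp(rhs, (0.0, 10.0), np.zeros(6), args=([0.1, 0.1, 0.1],), max_step=0.01)
print(sol.y[:3, -1])   # body angular velocity after 10 s

The function body_accel is a direct transcription of the bracketed terms of equation 9a; swapping in the measured inertia tensor and spin-axis convention of the prototype would reproduce the model denoted A in section 6.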

5 Description of Prototype

A detailed description of the prototype and its design can be found at [2].

5.1 General description

The prototype used in this project is a system of 3 SGCMGs in a pyramid configuration; all three gimbal axes lie in a plane and are arranged symmetrically, rotated 120° from one another. It consists of a body that has three rotational degrees of freedom given by a radial bearing and a universal joint, three gimbals attached to the body, and three flywheels, one on each gimbal. The body is made of aluminum for the base that connects to the universal joint and steel for the support of the gimbals. The flywheels are powered by Cheetah H2217-4 brushless outrunner motors, which can reach speeds of up to 42000 RPM. The gimbal angle of each gyroscope is controlled by a Hitec servo motor with a top speed of 6.54 rad/s. The prototype is powered by a 12 V, 18 Ah lead-acid battery.

The flywheels are made of brass, have a diameter of 6 cm, a mass of 102 g and an inertia along the principal axis of 45.9 kg·mm².


Table 1: Summary of CMG details

Flywheel motor: reference H2217 KV3500; maximum speed 42000 RPM; maximum power 333 W; voltage 10-12 V.

Flywheel: material brass; mass 102 g; diameter 60 mm; inertia 45.9 kg·mm².

Gimbal motor: reference Hitec HS-422; maximum torque 4.1 kg·cm; maximum speed 6.54 rad/s; voltage 4.8-6.0 V.

This equates to a maximum possible torque of 1.32 N·m for each flywheel. The moment of inertia of the body is 44061 kg·mm². The inertia tensor of the body was estimated using a CAD model in Autodesk Inventor.

Several modifications were made from the original design to increase safety and provide a more robust actuation of the gimbals:

• Plastic protectors were added covering the gimbal and flywheel.

• The flywheel motors were replaced with motors that are slower but more robust.

• The gear transmission from the servo to the gimbal axis was replaced by a direct connection. This sacrificed part of the gimbal range of motion but eliminated a source of error caused by the gears skipping.

• The metal structure that supported the servos was replaced to better secure the servos to the structure.

• A metal framework was built to support the prototype and provide additional safety.


5.2 Design of the electronic control

Two schemes were proposed to provide control commands to the prototype: sending commands from MATLAB to a microcontroller connected to the prototype, or writing the complete control code on the microcontroller. An Arduino Mega 2560 was chosen for the task because, if the complete control algorithm were to be written on the microcontroller, the system would need enough memory to hold the variables of the algorithm. Ultimately both schemes were used: the PID controller was implemented fully on the microcontroller, while the Q-learning based controller was set up using a MATLAB script that sent the actions to the Arduino microcontroller through serial communication.

The servos are controlled through PWM using the VarSpeedServo library, which allows control of both the position and the speed of the servo. The brushless motors are also controlled through PWM using the same library, with a pulse width of 1 ms corresponding to no throttle and 2 ms corresponding to full throttle.

An external circuit was built to allow the reading of a rotary encoder that measures the yaw angle. The circuit also has ports to ease the connection between the Arduino and the servos and motors, and a potentiometer that allows manual control of either the motor or the servo signal, depending on the code uploaded. The circuit schematic can be found in appendix A.

5.3

Measurement of friction coefficient

In order to add more realism to the mathematical model of the system, the coefficient of friction of the prototype was measured. A simple linear friction model was used in which the friction torque is proportional to the angular velocity. It was written in the form \(T_{friction}/I = -c\,\omega\) because both the torque (\(T\)) and the inertia (\(I\)) are difficult to measure independently, but their ratio can be found from the angular acceleration. The experimental setup consisted of disconnecting the device so the cables would not induce an external torque, applying an impulse to the system so that it starts rotating about the z axis, and then measuring the time and angle until it stops using a rotary encoder.


Figure 7: Measured data and fits used to calculate the friction factor of the prototype along the z axis.

The trajectory was fitted to the function \(\phi_0 e^{-c\phi}\), where \(\phi\) is the yaw angle, and the parameter was found for 10 different experimental measurements. Figure 7 shows the data points and the fits used. The average value of the friction factor was 0.2275 s⁻¹.
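As an illustration of how the decay constant \(c\) can be extracted under the linear friction model \(\dot\omega = -c\,\omega\), which implies an exponential decay of the yaw rate, here is a minimal Python sketch. The sample data below is invented for illustration only; the thesis processing was done in MATLAB.

import numpy as np

# Invented example data: yaw rate samples (rad/s) at the given times (s),
# as would be obtained by differentiating the encoder angle.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
omega = np.array([0.80, 0.63, 0.50, 0.40, 0.32, 0.25])

# Linear friction model: omega(t) = omega0 * exp(-c t)
# => log(omega) = log(omega0) - c t, so c follows from a linear fit.
slope, intercept = np.polyfit(t, np.log(omega), 1)
c = -slope
print(f"friction factor c = {c:.3f} 1/s")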



Figure 8: (a) CMG system. (b) Simmechanics simulation of the system

6 Simulation Model of the System's Dynamics

It is important to have an accurate computational model of the CMG as a way to validate the equations that were obtained for the motion of the CMG and to be able to test the performance of a controller. A simple model was created using Simmechanics second generation, a multibody simulation environment for 3-D mechanical systems within Simulink. The bodies used are the main body of the CMG, the three gimbals, and the three flywheels. Each body is defined by its mass and inertia tensor. The gimbal and flywheel configuration is the same as in the prototype.

The inputs of the simulation are the initial angles of the gimbals and the gimbal rates, the output is the angular velocity and orientation of the body.

The model has several purposes:

• Verify if the assumptions used in the mathematical model are valid.

• Find conditions and angular trajectories of the body that ease the attitude control of the CMG prototype.

• Observe the effect of changing physical factors such as friction, inertia of the flywheels and body, and flywheel speed.


Three models were compared: the Simmechanics simulation, a differential equation for the angular velocity based on equation 9a (which will be denoted A), and a differential equation for the orientation based on equation 10b (which will be denoted B).

The computational model does not include the simplifications applied to the angular momentum of the flywheel or to the inertia of the gimbal. This is useful to verify the validity of the approximations used, as it will be shown that the equations and the simulation model predict the same trajectory. To compare the results, the angular velocity of the system was found through the simulation and through models A and B for several input gimbal rates. For a gimbal rate of 0.1 rad/s on only one of the gimbals (the one with the red flywheel as shown in figure 8b) and an initial gimbal angle of 45° from the xy plane, we obtain the result shown in figure 9.

Figure 9: Angular velocity of the body with a gimbal rate of 0.1 rad/s in one of the gimbals


The three models give nearly identical results. In this configuration the initial torque generated by the motion of the gimbal has components of equal magnitude on x and z. However, the motion on the z axis is much greater than on the x axis. This occurs because at the initial position of the flywheels (oriented at 45°), the angular momentum of the complete system points along the z axis, so the system's natural reaction is to spin mostly around this axis while precessing slightly, just as a spinning top would do. One way to obtain a trajectory without precession is to make the angular momentum of the flywheels parallel to the angular velocity vector of the body. Under this condition the cross product with the angular velocity of the body cancels out, leaving a much simpler equation. This can be done by having equal gimbal rates and gimbal angles on the three SGCMGs. The resulting trajectory is shown in figure 10.

Figure 10: Angular velocity of the body with all gimbals at the same gimbal rate of 0.1 rad/s

In figure 10 the x and y velocities seem to differ between the equations and the model; however, this is on a scale of 10⁻⁴ rad/s, so the disparity is


Figure 11: Roll trajectory

negligible and both velocities can be approximated to 0. The z axis is also a principal axis of inertia of the body because of the symmetry of the prototype, which means that the motion will remain restricted to this axis.

Another way to obtain a clean trajectory, without oscillations, is to start with gimbal angles that cause the sum of the angular momentum of the flywheels to cancel out. This can be achieved with gimbal angles of 0°, which means all vectors \(\vec{w}\) lie in the xy plane. With this condition it is easy to get a trajectory with motion exclusively on the roll axis by using an input of gimbal rates proportional to [0 1 -1]. As shown in figure 11, the components of angular velocity along y and z are negligible.

However, there are two problems with this trajectory. The first is that at the start there is no torque generated on the roll axis; as can be seen in figure 11, the angular velocity on the x axis does not increase initially, even though the gimbals are moving. The second problem is that roll can only be achieved in one direction: using an input of [0 -1 1] will result in the same


motion. Both problems occur because the configuration with all vectors \(\vec{w}\) lying in the plane is a singular state; all torque vectors are parallel. In this state a degree of freedom is lost.

Other initial gimbal angles could be used so that the system does not start in a singular state, but then there would be a net angular momentum in the gimbals. As seen in figure 9, this causes oscillations if the motion is not parallel to the net angular momentum of the gimbals. To solve this problem a change of the gimbal configuration of the prototype is proposed. This is further analysed in appendix B but was left as future work, as it goes beyond the scope of this project.

To test the robustness of the assumption that the inertia of the gimbal is negligible, different values of inertia were added to the gimbal frame and the trajectories were compared. For reference, the inertia of the bodies attached along the gimbal axis of the prototype used in this project was estimated to be between 70 kg·mm² and 80 kg·mm². It can be seen in figure 12 that at some points of the trajectory the difference between the simulation (which takes into account the moment of inertia of the gimbal) and the mathematical models (which do not) is noticeable. However, the general form of the response remains the same, so the discrepancy should pose no problems for testing a controller because very high precision is not required.

It was decided that, as a first approximation to the attitude control problem, only the trajectory shown in figure 10 would be used. This trajectory is useful for testing a controller implemented on the device in a single degree of freedom to control the yaw angle. The model developed provides an environment in which control algorithms can be tested. The next step for the development of a controller is to select a machine learning algorithm that is suitable for the problem at hand.

The MATLAB scripts and Simulink models used for the dynamic simulation can be found in a GitHub repository at https://github.com/DanielFGomez/CMG-project/tree/master.



Figure 12: Comparison between (a) no gimbal inertia and (b) a gimbal inertia of 200 kg·mm²


7 Machine Learning

Machine learning is a field that concerns the use of algorithms that learn from given data to find patterns, extract information or find the optimal solution to a problem. It is divided into supervised learning, unsupervised learning and reinforcement learning. In supervised learning the data points include examples that show the correct solution to a problem and the algorithm learns to generalize into similar cases. In unsupervised learning the correct solution is not shown and the algorithm searches for general patterns in the data. Reinforcement learning falls in between both schemes; the correct solution is not known and correct examples are not shown but the algorithm tries out possible solutions and receives feedback to evaluate its performance. This type of reinforcement learning mimics the neural pathways in biology that enable learning from experience. Through repeated trial and error, an approximate solution to the problem is ultimately found. This makes reinforcement learning a powerful tool for problems involving unmodeled dynamics.

The following section gives a broad description of the learning algorithm used in this project and a brief review of the most relevant variables involved in reinforcement learning. An in-depth introduction to reinforcement learning can be found in [3]. A shorter but still complete introduction to reinforcement learning, with a specific example of motor control, can be found in [10].

Reinforcement learning is defined in terms of chains of actions and states, which follow one another iteratively, as shown in figure 13.

The agent is the learner, decision maker and action taker. It is able to observe its state within the environment and take an action accordingly.

The environment is what the agent interacts with; it determines the next state given the current state and the action taken by the agent.

The reward is a signal that the environment sends to the agent which codifies what the agent should accomplish. The agent chooses its actions based on a policy which maps states to actions. It is important that the agent takes into account not only the current reward but also the future


Figure 13: Reinforcement learning flow diagram. Taken from [3].

rewards it can achieve by following a certain policy. To do so we define the concept of expected return as the sum of all future rewards until a termination state is reached. However, it should be noted that in problems like a control task there is no final step, so the sum of rewards grows without bound. To avoid this problem the concept of discount rate is introduced and used to reduce the importance of a reward given at a later time step, until rewards that are too far in the future become negligible. Thus the expected return is defined as:

\[ R_{n} = r_{n+1} + r_{n+2}\,\gamma + r_{n+3}\,\gamma^{2} + \cdots + r_{n+i}\,\gamma^{i-1} + \cdots \qquad (11) \]

where \(R_n\) is the expected return at the n-th time step, \(r_i\) is the reward signal at the i-th time step and \(\gamma\) is the discount rate. With these concepts, the goal of reinforcement learning can be formally defined as finding the policy that maximizes the expected return for every state.
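As a small worked example of equation 11, the following Python lines compute the discounted return of a short, arbitrary reward sequence (the rewards and \(\gamma\) are invented for illustration):

gamma = 0.9
rewards = [1.0, 0.0, 0.5, 1.0]          # r_{n+1}, r_{n+2}, ...
R_n = sum(r * gamma**k for k, r in enumerate(rewards))
print(R_n)   # 1.0 + 0.0*0.9 + 0.5*0.81 + 1.0*0.729 = 2.134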

7.1 Reinforcement learning example: Inverted pendulum control problem

It is helpful to use the control of an inverted pendulum to illustrate the concepts described.


Figure 14: Diagram of an inverted pendulum with a cart to provide control.

Let us consider a simple cart free to move along the x axis with a pendulum on top. The pendulum tends to fall over due to gravity, and the controller is tasked with applying a force \(\vec{F}\) to move the cart in a way that the pendulum stays upright. The state must include the information necessary to decide an effective action; for this problem the state can be specified by the angle \(\theta\) and the angular velocity \(\dot{\theta}\). The available actions are the different magnitudes and directions the force \(\vec{F}\) can take. In defining a reward function, it is important that the reward represents what the pendulum controller should accomplish. The system needs to reach and keep an angle of zero, so the reward should be a function of the angle that is high for small angles and low for large angles. Commonly used functions for this type of problem are \(-\theta^2\) and Gaussian functions centered at 0.

An interesting aspect of this learning approach is that it is not necessary to specify how the task should be accomplished. For example, given that the reward is increased as the angle reaches 0, one might expect that the controller would rush to minimize the angle. However, because total reward


is considered by the algorithm, it actually slows down as it approaches the target to avoid overshooting.

The inverted pendulum control is an example of a problem that can go on indefinitely with no termination state. However, it is unlikely that the system will obtain the optimal policy on the first try, so a stopping criterion is imposed to allow for several episodes of learning. In each episode the pendulum starts at a random angle, many steps of observing the state and responding with an action occur, and finally a terminal state is reached. In this case a terminal state can be defined as leaving a range of angles or reaching a time limit.

For this scheme to work, it is important that the evolution of the states depends only on the previous state and on the action chosen, so that the correct action can be inferred from the state information alone. This is called the Markov property. Most real-world systems do not satisfy this property, because the past trajectory, as well as the current state, matters in determining which action should be taken to get to the desired state. Another factor that makes it difficult to satisfy the Markov property is that the state information is not always known exactly. To apply reinforcement learning algorithms to any problem it is necessary to encode into the state all the information required to predict the evolution of the system. A system that satisfies the Markov property is called a Markov decision process, which is the framework on which the theory of reinforcement learning is based.

A Markov decision process is defined as a tuple [11] of states S, actions A, rewards R and transition probabilities P. The rewards and probabilities are expressed in terms of going from one state to another using a certain action. Notice that this description is more naturally understood using discrete states rather than continuous ones. Indeed, reinforcement learning is based on discrete states, and the translation to a problem that involves continuous states is a complicated problem that is still under study.

One of the solutions to extend reinforcement learning algorithms to a continuous state space is to use a grid-like discretization of the state space. This is an effective strategy for a small number of dimensions, but as this number increases the number of possible states becomes unmanageable. A possible method to allow reinforcement learning algorithms to work with a


large number of states is to use existing knowledge of the problem to focus the learning iterations on the states critical to the solution of the problem.

For example, for the pendulum problem we know that fine-tuning the input force is only needed in a small range near the equilibrium point, so a viable discretization is to divide the input \(\theta\) into four regions delimited by the angles -90°, -15°, 0°, 15°, 90°, and the input \(\dot{\theta}\) into 20 equally spaced parts between -20 rad·s⁻¹ and 20 rad·s⁻¹ [10]. These states cover the complete range of operation of the pendulum and, for the angle, focus on the most relevant states.
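A minimal Python sketch of this kind of discretization is given below; the bin edges follow the values quoted above, and the function name is only illustrative (the thesis implementation was in MATLAB).

import numpy as np

# Bin edges for the pendulum state, as quoted above (angles in degrees, rates in rad/s)
theta_edges = np.deg2rad([-90.0, -15.0, 0.0, 15.0, 90.0])    # 4 angle regions
theta_dot_edges = np.linspace(-20.0, 20.0, 21)               # 20 velocity regions

def discretize(theta, theta_dot):
    """Map a continuous state (rad, rad/s) to a pair of bin indices."""
    i = np.digitize(theta, theta_edges[1:-1])           # 0..3
    j = np.digitize(theta_dot, theta_dot_edges[1:-1])   # 0..19
    return i, j

print(discretize(0.05, -1.0))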

7.2 Q-learning

Once the problem is expressed using the reinforcement learning framework, an algorithm is needed for finding the policy that maximizes the expected return for the system. There is a myriad of algorithms used for this purpose, but for this project only Q-learning will be considered. This algorithm was chosen because it is simple to implement and can test and improve the policy online, in parallel with the corresponding Markov decision process, which makes it easier to visualize the learning process. Q-learning employs the concept of the state-action value \(Q^{\pi}(s,a)\), which is defined as the expected return \(R_t\) for a policy \(\pi\), given that a certain action \(a\) was chosen at a certain state \(s\):

\[ Q^{\pi}(s,a) = E_{\pi}\left[R_t \mid s_t = s,\ a_t = a\right] \qquad (12) \]

Conceptually, this measures how valuable it is to perform action \(a\) when in state \(s\). This provides a useful way to obtain a policy: simply choose the action with the highest Q value for the given state. This method of choosing the next action is known as a greedy policy, since only the actions expected to yield the most reward are chosen. In practice, it is better to occasionally choose actions with a lower value because the correct value function is not known in advance. It is important to explore different actions so the algorithm does not get stuck on local maxima. This can be done by defining an exploration parameter \(\epsilon\) and choosing the action with the highest value with probability \(1-\epsilon\), and a random action otherwise. This method of choosing the action based on Q is known as an \(\epsilon\)-greedy policy.

The Q-learning algorithm consists of updating Q on every iteration with information about the reward that was given and the maximum Q value that can be found in the new state. The way Q is updated guarantees convergence to the true action-value function of the optimal policy. It can be better described with the following pseudocode:

Initialize Q(s, a) arbitrarily;
for each learning episode do
    Initialize state s;
    for each step of the episode do
        Choose action a based on the current state;
        Take action a and observe the new state s' and the reward r given;
        Update Q using the rule:
            Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ];
        Update state s to s';
    until s is a terminal state;
end

The parameter \(\alpha\) is called the learning rate. It determines the importance of new information relative to current information. It ranges from 0 to 1, with 0 meaning that new information is completely ignored, and 1 meaning that only the new information is used, ignoring all previous knowledge. The proof of convergence is beyond the scope of this text, but it can be shown that the Q-learning algorithm converges to the optimal policy if the process being learned satisfies the Markov property and each state-action pair is visited an infinite number of times. Naturally, these conditions are only met approximately when implementing the algorithm in practice, but convergence can be achieved nonetheless if the conditions are good enough.
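To make the update rule concrete, here is a minimal, runnable Q-learning loop in Python (the thesis implementation was in MATLAB). The environment is a toy discrete problem invented for illustration, a one-dimensional "move to state 0" task, not the CMG model.

import numpy as np

n_states, n_actions = 11, 3          # toy problem: positions 0..10, actions {-1, 0, +1}
alpha, gamma, epsilon = 0.2, 0.95, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Toy environment: move left/stay/right, with the highest reward at state 0."""
    s_next = int(np.clip(s + (a - 1), 0, n_states - 1))
    reward = 1.0 if s_next == 0 else -0.1
    done = s_next == 0
    return s_next, reward, done

for episode in range(500):
    s = rng.integers(1, n_states)            # random non-terminal start
    done = False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update rule
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # greedy action map: should mostly point toward state 0

The final line prints the learned greedy action for each state, which is the discrete analogue of the action maps discussed in section 8.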


8 Simulation of Q-learning algorithm

The main challenge in the implementation of a learning algorithm is that it is a mostly heuristic process. There are several techniques that can be used, but there are no clear-cut methods for applying reinforcement learning algorithms to mechanical control.

The following assumptions must be true in order to generate a policy that can effectively control a physical plant [12]:

1. The model of the environment used to derive the learning policy is accurate.

2. The algorithm is able to maximize the expected return.

3. Maximizing the expected return corresponds to the desired behavior.

In this section only the second and third assumptions were considered.

For this project it was decided that only the yaw axis would be controlled. This is the simplest case, as only one degree of freedom is taken into account, which reduces the number of states, and, because of the symmetry of the prototype, it is also the simplest trajectory to reproduce.

8.1 Methodology

The methodology used was to start from a base problem that was known to work, in this case the control of an inverted pendulum as described in [10], and then change the equations of motion and vary the algorithm parameters until a suitable control policy was obtained. Although the CMG and the inverted pendulum are very different in terms of their dynamics, both problems involve actuation signals that must take a body to an angular position and velocity of zero with care taken to avoid overshooting, so it was considered a viable starting point. Initially, not much importance was placed on achieving great precision or a short control time, but rather on having the learning algorithm work and, after the learning process, obtain a policy that allowed the CMG to take the yaw angle to approximately 0 consistently. Once the algorithm could achieve this undemanding level of control, the parameters were tweaked to improve performance.

8.2 Visualization Tools

In order to debug the learning algorithm, it is important to visualize the learning process. For this purpose several methods were devised and used to hypothesize about the functioning of the learning algorithm.

Methods of visualization include:

• Plotting the number of times each state was visited.

• Plotting the action taken for each state (see the sketch after this list).

• Animating the system during the learning episodes.

• Animating the system once the learning has concluded.

• Visualizing the Q value for different states.

• Plotting the total return and the time needed to reach the origin against the number of episodes.
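For instance, a state-to-action map like the ones in figure 19 can be rendered with a few lines of Python using matplotlib (the thesis used MATLAB plots). The Q table here is random, purely to make the snippet self-contained; in practice it would be the trained table.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical Q table: 20 angle bins x 10 velocity bins x 5 actions
Q = np.random.rand(20, 10, 5)
best_action = Q.argmax(axis=-1)          # greedy action index per state

plt.imshow(best_action.T, origin='lower', aspect='auto',
           extent=[-45, 45, -0.05, 0.05], cmap='viridis')
plt.xlabel('yaw angle [deg]')
plt.ylabel('yaw rate [rad/s]')
plt.colorbar(label='greedy action index')
plt.title('Learned action map')
plt.show()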

Based on what could be deduced from these visualization methods, several possible fixes for the learning simulation were tried, and kept if the results got better. For example, if the states were only visited a handful of times over many episodes, the discretization was made broader. If the system was not able to avoid overshooting, an action of smaller magnitude was added.

For some algorithm parameters, such as the learning rate, discount rate and exploration factor, a brute-force exploration was made to find the parameters that maximized the performance of the algorithm and controller. The performance of the algorithm is evaluated using the total reward received per episode; this amount should increase with the number of episodes. The performance of the controller was assessed quantitatively with the time needed to reach and stay at a yaw angle of 0°, and qualitatively by observing an animation of the response of the system and comparing it to how a traditional controller is expected to respond.

There are too many variables involved in Q-learning to describe here the effect of each one on the process of achieving yaw control. Only the factors that showed a remarkable improvement in performance or that illustrate an important aspect of reinforcement learning will be presented.

8.3 Base Parameters

The system is trained to always reach an angle of 0. To make the system follow a given trajectory, the error from a desired trajectory instead of the actual yaw angle can be fed into the system. The algorithm will drive this error to 0, because it is trained to make its input reach 0. This way the system can be made to follow any trajectory.

To make the algorithm seek a yaw angle of 0, the reward function was a Gaussian centered at the origin, as shown in figure 15. Additionally, failure and success conditions were defined. The success condition was to remain stable in the range from -5° to 5° for more than 5 seconds; if it was achieved, the episode was ended and a reward of 20 was given. The failure condition consisted of either leaving the range from -90° to 90° or having an episode last more than 50 s without achieving success; in both failure cases a penalty of -20 was given. Each episode starts by initializing the yaw angle randomly and ends with either a failure or a success condition.

The states representing the angular position of the system were divided into steps of 5° in the range from -45° to 45°, with two additional states for angles above and below this range. The states representing the angular velocity were divided into steps of 0.02 rad·s⁻¹ between -0.05 rad·s⁻¹ and 0.05 rad·s⁻¹.

Q-learning uses discrete states, time steps and actions, which means that the angle, velocity, sample time and gimbal rate must be discretized. The state discretization had to be such that the system could reach the desired state; this means that the discretization step had to be within the margin of error desired for the controller. This error margin was chosen to be 5° because only broad control is being tested at this stage. The discretization should also be such that the system approximates a Markov decision process accurately enough for reinforcement learning algorithms to apply. A compromise between the velocity discretization, the sample time and the actions available had to be reached: the speed had to be such that the system could typically change its angular state between samples, and the actions should be such that they change the speed state between samples. An intuitive approach would be to increase the state resolution and decrease the control time; however, this makes it harder for the simulation to reach every state. More training episodes can be used to increase the number of times each state is visited, but the combined increase in resolution and number of episodes greatly increases the running time of the learning algorithm.
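A compact Python sketch of these base parameters is shown below as a hedged approximation of the reward, termination conditions and state discretization just described. The Gaussian width is a guess, since the thesis does not give it, and the actual implementation was in MATLAB.

import numpy as np

ANG_EDGES = np.arange(-45.0, 45.0 + 5.0, 5.0)     # 5 deg steps, plus two outer states
VEL_EDGES = np.arange(-0.05, 0.05 + 0.02, 0.02)   # 0.02 rad/s steps

def reward(yaw_deg):
    """Gaussian reward centered at 0 deg; the width (10 deg) is an assumed value."""
    return np.exp(-(yaw_deg / 10.0) ** 2)

def terminal(yaw_deg, elapsed_in_band_s, elapsed_total_s):
    """Success: stable within +/-5 deg for more than 5 s (reward +20).
       Failure: leaving +/-90 deg or exceeding 50 s (penalty -20)."""
    if elapsed_in_band_s > 5.0:
        return True, 20.0
    if abs(yaw_deg) > 90.0 or elapsed_total_s > 50.0:
        return True, -20.0
    return False, 0.0

def state_index(yaw_deg, yaw_rate):
    """Discretize the yaw error and rate into Q-table indices."""
    i = int(np.digitize(yaw_deg, ANG_EDGES))    # index 0 and len(ANG_EDGES) are the outer states
    j = int(np.digitize(yaw_rate, VEL_EDGES))
    return i, j

To follow a trajectory rather than regulate to zero, the tracking error would be passed in place of yaw_deg, as explained above.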

Figure 15: Reward function used for the Q-learning algorithm.

A key aspect which greatly improved the performance of the algorithm was including the effect of friction in the simulation model. This acted as a way to smooth out the discrete actions and allow the system to stop without the need of precise actuation. Figure 16a shows the performance of the


(a) No friction. (b) Friction measured for the prototype.

(c) Doubled the coefficient of friction measured for the prototype.

Figure 16: Simulation of Q-learning algorithm for different levels of friction of the model.

Q-learning algorithm on the simulation model without including friction. When friction is included with the magnitude measured for the CMG prototype (see figure 16b), the time to success decreases but is not kept low consistently. When the friction is increased to twice the measured value, the algorithm is able to develop a policy that consistently reaches the target angle in less than 10 s after 200 learning episodes.

The performance of the controller was further evaluated by having it follow a sinusoidal trajectory, as shown in figure 17. There are several moments in which the system does not respond as it should, taking an action that


Figure 17: Controller trained with 300 learning episodes tasked with following a sine trajectory.


Figure 18: Controller trained with 600 learning episodes tasked with following a sine trajectory.

separates it from the target, as indicated in figure 17. Using 600 iterations solves some of the problems presented previously, so the system follows the target more closely, as shown in figure 18. There was no significant improvement in performance with a further increase in the number of iterations.

What is noticeable with an increase in the number of iterations is that the mapping from states to actions becomes more organized, as seen in figure 19. With only 300 iterations the system can get consistently to a yaw angle of zero. However, certain states, particularly those where both the angle and the velocity are negative, are visited far fewer times, so the best action is not well determined. This explains the behavior of the system at the points marked in figure 17. Visualizing the evolution of the action map proved to be a valuable tool in the implementation of the learning algorithm: it provides a qualitative method to judge the performance of the learning algorithm and helps in identifying the states in which the controller struggles to provide a useful action.


(a) 300 Episodes (b) 600 Episodes

(c) 1500 Episodes (d) 5000 Episodes

Figure 19: Action maps learned by the Q-learning algorithm for different number of episodes.


Other aspects that can be varied to obtain better performance are the state discretization, the control time and the number of episodes.

Starting from parameters that are known to converge to an acceptable policy, it is easy to increase the control frequency and the state resolution in small steps to obtain a learning algorithm that is closer to having continuous states and still converges to a control policy. Figure 20 shows a controller that was trained using an angular resolution of 1°, a shorter control time and a broader range of actions. Initially the controller struggles to match the trajectory, but then it follows the sine closely. The disadvantage of making the discretization finer is that the number of learning episodes needed to obtain a suitable controller increases, because many more states have to be visited to fill the action map.

Episodes in which the controller was tasked with following sinusoidal trajectories were added during the learning process as an attempt to improve the performance of the controller on such trajectories. However, the learning algorithm could not be made to converge when a significant number of sinusoidal trajectories were added during the learning process.

The scripts used for the Q-learning simulation can be found in a GitHub repository at https://github.com/DanielFGomez/CMG-project/tree/master.

9 Experimental Tests

9.1 Technical Issues of the prototype

Few experimental tests could be performed due to recurring problems with the prototype. Attaching the flywheels to the motor shaft was a constant problem: the adhesives used were not always effective, which caused flywheels to detach. The alignment of the flywheels was also a critical factor; even if a flywheel appeared to be centered by eye inspection, vibrations could be high enough to break the motor shaft. These problems demonstrate a possible hazard in operating the device. The plastic protectors that were added proved effective at keeping the flywheel


Figure 20: Controller trained with a finer discretization angle following a sinusoidal trajectory.


from flying off in case it detached; however, eye protection should always be worn when operating the prototype.

The critical speed of the shaft was calculated for the deflection caused by the weight of the flywheel. It was found to be 29630 RPM, which lies in the operating range of the motors. It was decided to only power the motors at low throttle to keep the flywheel speed below this level; the speed of the motors was limited to around 7000 RPM for all tests.

Three sets of flywheels were built: the flywheels of the first set were attached with an interference fit between the shaft and the flywheel, those of the second set were attached using a set screw, and the final set was attached using a mandrel for an RC plane propeller. The flywheels with the set screw immediately showed very high vibrations; one shaft was bent and another was broken in under a minute of running at speeds around 7000 RPM. The flywheels attached with the interference fit held to the shaft, but one of the shafts fractured, so this set had to be discarded. The best results were found using the flywheels attached with the mandrel.

Figure 22: Different flywheel alternatives used.

Another constant problem was that the intensity of the vibrations loosened the threads that join the motors to the gimbal axes. The supports that join the brushless motors to the gimbal axes were remade to be tougher and to allow a shaft of larger diameter to be used, as shown in figure 23b.

It is suggested to modify the motor support to include a bearing that supports the shaft at its end, together with a gimbal axis of larger diameter. The bearing reduces the deflection of the shaft and thus raises the critical speed. It was calculated that by placing a bearing 3 mm from the flywheel, at the end opposite the motor, the critical speed of the shaft is increased



Figure 23: (a) Original support. Some epoxy mastic was added to hold the gimbal shaft and the motor support together but it proved to be ineffective. (b) New support.


to 172800 RPM.

9.2 Control scheme

For both controllers a rotary encoder was used to provide feedback of the yaw angle of the system. The only difference was in the code uploaded to the microcontroller.

Figure 24: Block diagram for the control of the yaw angle.

9.3 PID Tests

The PID controller was tuned using the simulation. First, only a proportional gain was used, equal to the inverse of the relation between the input gimbal rate and the output angular acceleration. The proportional gain was tweaked until a response time between 5 and 10 seconds was obtained. Then a derivative term was added, initially equal to the proportional term and then adjusted to obtain a slightly overdamped response. The parameters found were then programmed into a PID running on an Arduino microcontroller, which sent the corresponding gimbal rate signal to the servos. A summary of the results obtained for a step input is shown in figure 25.
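For reference, here is a minimal sketch of the discrete PID law that such a control loop computes, written in Python for illustration only; the actual controller ran on the Arduino, and the gains and sample time below are placeholders, not the tuned values.

# Placeholder gains and sample time, not the tuned values from the thesis
KP, KI, KD = 1.0, 0.0, 1.0
DT = 0.05                      # control period [s]

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        """Return the commanded gimbal rate for the current yaw error."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(KP, KI, KD, DT)
gimbal_rate_cmd = pid.update(setpoint=90.0, measurement=0.0)   # e.g. a 90 degree step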


Figure 25: Response to a step input to turn 90◦.

9.4 Q-learning Tests

A Q-learning based controller was implemented using an action map trained in simulations in Matlab. The Matlab script received the position and velocity from the Arduino microcontroller and responded with the appropriate action according to the action map obtained. Training on the physical plant was not possible because of the limited robustness of the device. Unfortunately, this removes one of the biggest advantages of reinforcement learning: that it adapts specifically to the plant where it is being used. A summary of the results obtained for a step input is found in figure 25.
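The action-map lookup performed at run time can be sketched as follows. This is an illustration in Python, not the Matlab script used in the project; the bin edges, action list and zero-initialized table are placeholders (in practice the table would be the one trained offline in simulation):

import numpy as np

# Placeholder discretization of the yaw error and yaw rate, and a placeholder
# list of gimbal-rate actions; a trained Q-table would be loaded instead of zeros.
angle_bins = np.linspace(-180.0, 180.0, 37)            # degrees
rate_bins = np.linspace(-60.0, 60.0, 13)               # degrees per second
actions = np.array([-30.0, -10.0, 0.0, 10.0, 30.0])    # gimbal rates, deg/s
Q = np.zeros((len(angle_bins) + 1, len(rate_bins) + 1, len(actions)))

def greedy_action(angle_error, yaw_rate):
    """Map the measured state to its discrete cell and return the
    gimbal-rate command with the highest learned value."""
    i = np.digitize(angle_error, angle_bins)
    j = np.digitize(yaw_rate, rate_bins)
    return actions[np.argmax(Q[i, j])]

command = greedy_action(angle_error=45.0, yaw_rate=-5.0)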

Both controllers were able to reach the desired angle after a step input, so the implementation of the controller can be considered successful, even if the time and shape of the response could be improved. The PID showed a better response in both simulation and experiment. For both controllers the experimental response presented more damping than the simulation; in particular, the PID response, which was slightly underdamped in simulation, came out clearly damped on the physical plant. This indicates that the real coefficient of friction of the system is higher than the value that was measured.

The Arduino and Matlab codes used for the control can be found in a GitHub repository at https://github.com/DanielFGomez/CMG-project/tree/master.

10 Conclusions

Two controllers were successfully implemented to control the yaw angle of the CMG system built at the university. The PID controller showed better performance and was easier to set up than the Q-learning based controller. For the control problem studied it is convenient to have a control action that varies smoothly with the difference from the target, which makes the PID better suited for the task than the discrete Q-learning algorithm. Q-learning is more useful for problems in which a discrete set of actions can generally provide the desired result without much tuning.

The computational simulation demonstrated the general complexity of the motion of a CMG array, but allowed finding trajectories and conditions that are easier to work with. Additionally, the assumptions used in deriving the mathematical equations that describe the motion of the CMG array were confirmed. The discrepancies encountered were small enough for the purpose of this project, but might be of interest if very high accuracy in the trajectory must be achieved. This model can be used in future projects to attempt complete attitude control.

The simulation also helped identify a flaw in the current configuration of the CMG prototype, which should be addressed to allow attitude control in three axes. The current gimbal configuration tends to produce a very high angular momentum along the z axis. This makes it very difficult to control the roll and pitch of the system, since the body will precess rather than tilt when subject to a torque from the CMG array.

The Q-learning algorithm for the control of the yaw angle of the CMG prototype showed clear evidence that it was able to learn an action map that controls the system satisfactorily. It also exemplified some of the challenges involved in applying a discrete reinforcement learning algorithm to a continuous control problem. The convergence of the learning algorithm was highly dependent on the relation between the resolution of the state discretization and the discrete actions (gimbal rates) available.

The Q-learning controller presented difficulties following a trajectory because in that case the transition of states depended not only on the current state but also on the change in the target. The controller could respond to the current situation, but the state transitions did not obey the Markov property, so the learning algorithm struggled in certain states.

The Q-learning controller was harder to implement than a regular PID controller, its performance varied greatly with the list of actions available, and sometimes a very small change in parameters made the algorithm unable to learn. Nonetheless, a controlled response was achieved, and the results suggest that with a sufficiently fine discretization and enough training episodes a high quality controller could be implemented. However, it is unlikely that this strategy would work for complete attitude control: as the system gains degrees of freedom, the curse of dimensionality causes a huge increase in the number of states even at reduced resolution. Algorithms that allow the usage of continuous state spaces should be employed to extend the application to attitude control on three axes.
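As a rough illustration of this scaling, with hypothetical numbers rather than the ones used in this project: a tabular controller that discretizes the yaw error and the yaw rate into 36 bins each, with an action set A, stores

\[
36 \times 36 \times |A| = 1296\,|A|
\]

values, whereas keeping the same resolution for three attitude angles and three angular rates would require

\[
36^{6}\,|A| \approx 2.2\times 10^{9}\,|A|
\]

values, far too many to visit during episodic training.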

The robustness of the CMG prototype should be enhanced before the implementation of a controller is attempted. The current flywheel speed and mass are too high, which leads to constant mechanical failures due to the vibration of the flywheels. This prevents the repeated tests needed to run the learning episodes on the physical plant, and therefore takes away the biggest advantage of reinforcement learning: that the algorithm can adapt to the real plant.

11 Future Work

Future work related to this project can be divided into four categories: improve the structural integrity of the system, add sensors to obtain more feedback from the system, study and implement traditional methods for attitude control of a CMG array on three axes, and research and test reinforcement learning algorithms that work with continuous states and actions.

To further improve the durability of the system, the gimbals should be redesigned so that the flywheel is not held in cantilever. Supporting the flywheel shaft on both sides of the flywheel reduces deflections and therefore increases the critical speed of the shaft and decreases vibrations. Another possibility to consider is to use lighter flywheels and run them at a lower speed. Under the current conditions the body can reach high angular velocities about the yaw axis, which are not needed to study attitude control. Using lighter flywheels at a lower speed can instead reduce the mechanical strain and increase the durability of the system.

Regarding sensors, it is suggested to include an IMU (inertial measurement unit) to enable full attitude control. Another helpful addition would be encoders on the gimbals to provide accurate feedback of the gimbal positions; at low gimbal rates the servos either do not move or start moving at an arbitrary point, which is problematic when the gimbals must be synchronized to achieve a desired trajectory. Additionally, Hall-effect or other sensors could be added to provide feedback of the flywheel speed and control it accurately.

To provide attitude control on three axes it is important to change the gimbal configuration as shown in appendix B and to replace the universal joint to reduce the friction of rotation about the x and y axes. The general problem of attitude control using CMGs is quite complex, but simple case studies for small changes in the gimbal angles could be examined. The matrices obtained in appendix B can be of great help for this purpose.

Regarding reinforcement learning, the future goal is to find an algorithm that is better suited for the control task. In particular, it should be able to handle continuous states and actions and achieve acceptable control in a small number of episodes, so that it can realistically be tested on the physical plant.


Appendix A

Circuit schematic

Appendix B

Analysis of gimbal configuration

The original gimbal configuration was given by the following vectors.

\[
\vec{w}_1(\theta_1 = 0) =
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix},
\qquad
\vec{g}_1 =
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}
\tag{13}
\]

\[
\vec{w}_2(\theta_2 = 0) =
\begin{bmatrix} -1/2 \\ \sqrt{3}/2 \\ 0 \end{bmatrix},
\qquad
\vec{g}_2 =
\begin{bmatrix} -\sqrt{3}/2 \\ -1/2 \\ 0 \end{bmatrix}
\tag{14}
\]

\[
\vec{w}_3(\theta_3 = 0) =
\begin{bmatrix} -1/2 \\ -\sqrt{3}/2 \\ 0 \end{bmatrix},
\qquad
\vec{g}_3 =
\begin{bmatrix} \sqrt{3}/2 \\ -1/2 \\ 0 \end{bmatrix}
\tag{15}
\]

The corresponding rotations around g_i are performed to express the vectors w_i in terms of the angles θ_i. Then all three vectors are added and multiplied by JΩ to obtain the total angular momentum:

Figure 27: (a) Original and (b) proposed gimbal configurations.

\[
\vec{H}_{Bw} = J\Omega
\begin{bmatrix}
\cos\theta_1 - \tfrac{1}{2}\cos\theta_2 - \tfrac{1}{2}\cos\theta_3 \\[2pt]
\tfrac{\sqrt{3}}{2}\cos\theta_2 - \tfrac{\sqrt{3}}{2}\cos\theta_3 \\[2pt]
\sin\theta_1 + \sin\theta_2 + \sin\theta_3
\end{bmatrix}
\tag{16}
\]

It can be seen that for θ1 = θ2 = θ3 = 0 the total angular momentum is zero, which is desired to avoid oscillations. By differentiating the total angular momentum, the torque generated by the motion of the gimbals can be found:

\[
\vec{T} = J\Omega
\begin{bmatrix}
-\sin\theta_1 & \tfrac{1}{2}\sin\theta_2 & \tfrac{1}{2}\sin\theta_3 \\[2pt]
0 & -\tfrac{\sqrt{3}}{2}\sin\theta_2 & \tfrac{\sqrt{3}}{2}\sin\theta_3 \\[2pt]
\cos\theta_1 & \cos\theta_2 & \cos\theta_3
\end{bmatrix}
\begin{bmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \\ \dot{\theta}_3 \end{bmatrix}
\tag{17}
\]

The matrix that relates gimbal rates to torque becomes singular at θ1 = θ2 = θ3 = 0. The gimbal angles that cause the initial angular momentum to be zero, which makes attitude control easier, also cause the system to lose a degree of freedom of control. Therefore it is not convenient to use this gimbal configuration.
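As a quick check, evaluating the matrix of equation (17) at θ1 = θ2 = θ3 = 0 (using the form reconstructed above) gives

\[
J\Omega
\begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & 0 \\
1 & 1 & 1
\end{bmatrix},
\]

which is rank deficient: from this configuration the gimbal rates can only produce torque along the z axis.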

As an alternative it is proposed to rotate the gimbal axes 90◦ as shown in figure 27.
