
CS285 | Lecture 2 Supervised Learning of Behaviors


Lecture 2 is about supervised learning of behaviors.

Untitled 2

Let's think of the object recognition problem (an SL problem): an input image goes into a deep neural network and the output is a label. We'll use RL terminology on this standard SL example and turn it into an RL problem: we call the input o the observation and a the action. The policy $\pi_{\theta}$ maps o to a, where theta denotes the weights of the NN. In RL we are concerned with sequential decision making, so these inputs and outputs occur at particular points in time.

Untitled 3

Usually in RL, we deal with discrete-time problems. t represents the time step at which you observe o and emit a. So $\pi_{\theta}$ gives a distribution over $a_t$ conditioned on $o_t$, and unlike in the SL problem, $a_t$ influences $o_{t+1}$.

Untitled 4 Instead of outputting labels, you would output something that looks a lot more like an action, but it could also be a discrete action using a softmax.

Untitled 5 But you can also choose continuous actions, using some continuous distribution like a multivariate normal distribution.
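
A minimal sketch (my own, not from the lecture) of what $\pi_{\theta}(a_t\mid o_t)$ could look like as a neural network, with a softmax head for discrete actions and a Gaussian head for continuous actions; the layer sizes and dimensions are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """pi_theta(a_t | o_t) as a categorical (softmax) distribution."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """pi_theta(a_t | o_t) as a normal distribution with a learned, state-independent std."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

obs = torch.randn(1, 8)                       # hypothetical 8-dimensional observation o_t
action = GaussianPolicy(8, 2)(obs).sample()   # a_t ~ pi_theta(a_t | o_t)
```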

Untitled 6 We also have the state $s_t$. Sometimes we write the policy as $\pi_{\theta}(a_t\mid s_t)$. The state is typically assumed to be Markovian, and $o_t$ is the observation that results from the state. So in most cases we write the policy as $\pi_{\theta}(a_t\mid o_t)$, but sometimes as $\pi_{\theta}(a_t\mid s_t)$, which is the more restrictive case.

Untitled 7 Let's look at the difference between observation and state.

Let's say you observe this scene: a cheetah chasing a gazelle. The observation consists of an image, and the image is made of pixels. The image is produced by the underlying physics of the system, and the system has a state, which is a minimal representation of it. So the image is the observation $o_t$, and the state is a representation of the current configuration of the system, which in this case might be the positions of the cheetah and the gazelle and their velocities.

Untitled 8 The observation might be altered in some way, like above: a car is in front of the cheetah and you can't see it, so the full state cannot be inferred exactly. The observation is insufficient to deduce the state, but the state hasn't changed.

The state is the true configuration of the system, and the observation is something that results from that state.

Untitled 9 We can also tell the difference between observation and state using a graphical model.

  • The observation comes from the state at every time step.
  • The policy uses the observation to choose an action (from o to a).
  • The state and action at the current time step determine the state at the next time step.

From the graphical model we can read off the dependencies present in the system. $p(s_{t+1}\mid s_t,a_t)$ is the transition probability.

And there is the Markov property, which means $s_{t+1}$ is independent of $s_{t-1}$ given $s_t$ and $a_t$. You don't have to consider how you reached the state; it's enough to consider the current state and forget about the previous ones.
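
Written out, the Markov property from the graphical model says the next state depends on the past only through the current state and action:

$$p(s_{t+1}\mid s_1,a_1,\dots,s_t,a_t) = p(s_{t+1}\mid s_t,a_t)$$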

Without the Markov property, we would not be able to formulate an optimal policy without considering the entire history.

However, if the policy is conditioned on observations rather than states, we can ask: are the observations also conditionally independent in this way?

Observations, in general, are not going to satisfy the Markov property, meaning the current observation might not be enough to fully determine the future without also observing the past. You can see this in the cheetah example: when the car is in front of the cheetah, you cannot see where the cheetah is in the image, but if you saw an earlier image where the cheetah was visible, you might be able to figure out where it is going. So in general, past observations can give you additional information, beyond what you get from the current observation, that is useful for decision making.

Many RL algorithms that we'll discuss in this course will actually require Markovian observations, in which case I will write $\pi(a\mid s)$. In some cases I will also mention how a particular algorithm could be modified to handle non-Markovian observations.

In the lecture, the professor said "Markovian state", but in this context "observation" seems right. Then the question is: is $\pi(a\mid s)$ meant for Markovian observations? (The state always has the Markov property.)


Untitled 10 Bellman (from the USA) / Lev Pontryagin (from the Soviet Union).


Untitled 11 Let's talk about how we actually learn policies. In this lecture we'll start with a simple way of learning policies that doesn't require a very sophisticated RL algorithm, but instead learns policies the same way as SL, by utilizing data.

Collect a large dataset consisting of observation-action tuples, and then use supervised learning to learn a mapping from observations to actions. This is called imitation learning, sometimes behavior cloning: we are cloning the behavior of a human (or demonstrator/expert). We can ask: does it really work?
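
As a minimal sketch of what this looks like in code (assuming placeholder demonstration data and an MSE regression variant of behavior cloning; the probabilistic version would maximize $\log \pi_{\theta}(a\mid o)$ instead):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# hypothetical demonstration data: 8-dim observations and 2-dim expert actions
demo_obs, demo_act = torch.randn(1000, 8), torch.randn(1000, 2)
loader = DataLoader(TensorDataset(demo_obs, demo_act), batch_size=64, shuffle=True)

# a small deterministic policy trained by regression onto the expert's actions
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(10):
    for obs, act in loader:
        loss = ((policy(obs) - act) ** 2).mean()  # supervised behavior-cloning loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```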


Untitled 12 This started back in 1989, using a tiny network with 5 hidden units.


Untitled 13 We can ask: does this basic principle work? In general, the answer is no!

Why does behavior cloning go wrong (while regular supervised learning would work)? Assume the state is one-dimensional, which is easier to visualize. The black curve is a trajectory through time and represents our training data. We use the black curve to learn a policy that maps s to a, and the red curve is what that policy does when we run it.

Initially the policy stays pretty close to the training data, because we're using a large NN and training it well. But it makes some small mistake. Every learned model makes at least small mistakes, and this is inevitable. The trouble is that when the model makes a small mistake, it finds itself in a state that is a little bit different from the states it was trained on.

And when it finds itself in an unusual state, it makes a bigger mistake, because it doesn't know what to do there. As these mistakes compound, the states become more and more different and the mistakes get bigger and bigger.

Untitled 14 In practice, it does work (lol), with a very large dataset and some tricks.

So why is it that we can use behavior cloning in practice to train policies that actually do something fairly decent?

Untitled 15 We'll discuss more details in part two, but one thing worth mentioning briefly now is a particular technique used to address this issue in this paper by NVIDIA.

There are left/right/center cameras, which turns out to be quite important. You record three different camera images at the same time. The forward-facing image is supervised with whatever steering angle the person commanded. The image looking to the left is supervised with a steering angle a little bit to the right of what the person did (and symmetrically for the right camera).

This particular trick mitigates the drifting problem, because the left/right images are essentially teaching the policy how to correct little mistakes. And if it corrects its little mistakes, then maybe they won't accumulate as much.
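
A rough sketch of this trick as a data-augmentation step; the `correction` value of 0.25 is a made-up hyperparameter, not something given in the lecture or the paper.

```python
def augment_with_side_cameras(center_img, left_img, right_img, steering, correction=0.25):
    """Turn one time step of driving data into three (image, steering) training pairs.

    The left camera sees the road as if the car had drifted left, so its label is
    nudged to the right (+correction); symmetrically for the right camera.
    """
    return [
        (center_img, steering),
        (left_img,   steering + correction),  # teaches the policy to steer back to the right
        (right_img,  steering - correction),  # teaches the policy to steer back to the left
    ]
```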

Untitled 16 The more general principle: errors in the trajectory will compound, but if you can somehow modify your training data so that it illustrates little mistakes and the feedback that corrects those mistakes, then perhaps the policy can learn those corrections and stabilize.

Other case: if you can train a stable optimal feedback controller around the demonstration and use that feedback controller as supervision, you can actually get a stable policy that inherits its stability.


Untitled 17 Something we could ask in order to derive a more general solution is: what is the underlying mathematical principle behind this drift? (What's really going on here?)

When we run the policy, we're sampling from $\pi_{\theta}(a_t\mid o_t)$, and this policy is trained on some data distribution, which we call $p_{data}(o_t)$. This is basically the distribution of observations seen in our training data.

Now we know from SL theory that when you train a model on a particular training distribution, get good training error, and don't overfit, you can expect to also get good test error, provided the test points are drawn from the same distribution.

If we saw a new observation that came from the same distribution as our training data, we would expect our learned policy to produce the right action on it.

However, when we run our policy, the distribution over observations it actually sees is different, because the policy takes different actions, which result in different observations.

After a while, $p_{\pi_{\theta}}(o_t)$ is different from $p_{data}(o_t)$. This is the reason for the compounding error problem.

So can we make $p_{\pi_{\theta}}(o_t)$ equal to $p_{data}(o_t)$? If we can, then we know our policy will produce good actions, simply from standard results in supervised learning theory.

Now, one way to make $p_{\pi_{\theta}}(o_t)$ equal to $p_{data}(o_t)$ is to simply make the policy perfect (lol). If the policy is perfect, it never makes mistakes and the distributions will match, but that's very, very hard.


Untitled 18 Instead of being clever about the policy, we can be clever about our data distribution: don't change the policy, change the data, to avoid the distributional shift problem.

That's the basic idea behind a method called DAgger.

Goal: collect training data that comes from $p_{\pi_{\theta}}(o_t)$ instead of $p_{data}(o_t)$. If we have observation-action tuples from $p_{\pi_{\theta}}(o_t)$ and train on those tuples, then the distributional shift problem goes away.

Repeating this process enough times eventually converges, resulting in a final dataset that does come from the same distribution as the policy.
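
A sketch of the DAgger loop as described on the slide; `train`, `run_policy`, and `expert_label` are hypothetical callbacks standing in for supervised training, rollout collection, and human labeling.

```python
def dagger(initial_dataset, train, run_policy, expert_label, n_iterations=10):
    dataset = list(initial_dataset)             # D: human demonstration data
    policy = train(dataset)                     # 1. train pi_theta on D
    for _ in range(n_iterations):
        observations = run_policy(policy)       # 2. run pi_theta to get o_t ~ p_pi_theta(o_t)
        dataset += expert_label(observations)   # 3. expert labels each o_t with a_t; 4. aggregate
        policy = train(dataset)                 #    retrain on the aggregated dataset and repeat
    return policy
```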


Untitled 19 So then what’s the issue with DAgger?

Step three is the problem. In many cases, asking a human to manually label $D_{\pi}$ with optimal actions can be quite onerous.

Imagine you're watching a video of a drone flying through a forest, and you have to steer that drone and provide optimal actions without your actions actually affecting the drone in real time. (That's very unnatural.)

Untitled 20 We can address the distributional shift problem with the DAgger algorithm, but DAgger has a few practical challenges that make it a little tricky to use, because of the additional labeling step.

Untitled 21 What if the model doesn't deviate far from $p_{data}(o_t)$ and stays good? In order to do that, we need to mimic the expert behavior very accurately (and also not overfit).


Why might we fail to fit the expert? (Or: why might our model not be perfect?)

  1. Even if the observation is fully Markovian, meaning the observation is enough to perfectly infer the state, human behavior might still be non-Markovian.
  2. If we have continuous actions, the demonstrator's behavior might be multimodal. This means they might inconsistently select from multiple different modes of the distribution, which makes it hard to imitate.

Untitled 22

Untitled 23 We could use the whole history in imitation learning. But if we feed in a variable number of frames directly, there are too many weights.

Untitled 24 To address this issue, we can employ some kind of RNN. An RNN-type architecture can greatly mitigate the non-Markovian problem.
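
A minimal sketch of a recurrent policy that conditions on the whole observation history; the architecture and sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """pi_theta(a_t | o_1, ..., o_t): an LSTM summarizes the observation history."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):
        # obs_seq: (batch, time, obs_dim); read out an action (mean) at every step
        hidden_states, _ = self.lstm(obs_seq)
        return self.head(hidden_states)          # (batch, time, act_dim)
```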

Untitled 25 In general, this RNN approach mitigates the non-Markovian policy issue, but there are reasons why it might still work poorly.

Let's consider the following scenario. Say you're driving a car that has a camera inside it, and the camera can see out the front windshield. Every time you step on the brake, a little light on the dashboard turns on. If a person is standing on the road, you need to press the brake. And let's say your data contains these examples.

The NN has to figure out whether braking is caused by the presence of the person or by the light turning on. It's easy for the NN to associate the light with the brake, because the light always turns on when you're stepping on the brake, whereas the person standing in front of the car is a little more complicated to figure out. If that information were removed from the observation, the issue would go away.

Adding more observations can make imitation learning harder, because it creates the causal confusion problem: the cause-and-effect relationship between parts of the observation and the action becomes difficult to deduce from the data alone.

This is not limited to the non-Markovian/Markovian issue; it can happen in all sorts of imitation learning scenarios. But here are two questions you might ponder.

(This is my opinion) A1) I think history makes it worse, because in that scenario there are lots of situations where you press the brake without a person on the road, but the light always turns on when you press the brake.

A2) I think the original dataset already has this causal confusion, so DAgger can't mitigate it.

→ We'll discuss this in a later lecture!


Untitled 26 Let's talk about the other potential reason: multimodal behavior.

Let's say you're flying the drone in that video and you need to fly around a tree, and maybe the human expert sometimes goes left and sometimes right. If your actions are discrete (left/right/straight), this is not a problem, because a softmax output producing a categorical distribution can easily capture the notion that the left action has high probability, the right action has high probability, and the straight action has low probability.

But if we have continuous actions, we typically parameterize the output distribution as a multivariate normal distribution determined by a mean and a variance. If you pick just one mean and one variance, how do you set them to model the fact that you can go left or right but never straight? That causes a big problem.

You average the possibilities and get exactly the wrong thing: you go straight, because it's the average between left and right. This is why multimodal behavior causes problems.

And in general, humans do tend to be multimodal, exhibiting complicated distributions in their demonstrations, unless they're performing a very simple task.

There are a few possible solutions.

(Yesterday, I learned about mixture of Gaussians, which is more powerful than K-means in unsupervised learning lol.)

  1. Output a mixture of Gaussians (instead of a single Gaussian output distribution). This can capture multiple modes.
  2. Latent variable models
  3. Autoregressive discretization

Untitled 27 A mixture of Gaussians output is sometimes called a mixture density network. The idea is that instead of outputting the $\mu$ and $\Sigma$ of one Gaussian, we output multiple $\mu$'s, multiple $\Sigma$'s, and multiple weights $w$.

The trade-offs of mixture of Gaussian models are that you need more output parameters, and modeling multimodal distributions in very high dimensions can be challenging: in general, for arbitrary distributions, the number of mixture elements needed to model them well grows, in theory, exponentially with the dimensionality.
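
A sketch of a mixture density network head, assuming K diagonal Gaussian components; the sizes are arbitrary and a real implementation would sit on top of image features.

```python
import torch
import torch.nn as nn

class MixtureDensityHead(nn.Module):
    """Outputs K mixture weights, K means, and K diagonal stds for the action."""
    def __init__(self, feat_dim, act_dim, n_components=5):
        super().__init__()
        self.n, self.act_dim = n_components, act_dim
        self.out = nn.Linear(feat_dim, n_components * (1 + 2 * act_dim))

    def forward(self, features):
        params = self.out(features)
        w_logits, mu, log_std = params.split(
            [self.n, self.n * self.act_dim, self.n * self.act_dim], dim=-1)
        mix = torch.distributions.Categorical(logits=w_logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu.view(-1, self.n, self.act_dim),
                                       log_std.view(-1, self.n, self.act_dim).exp()), 1)
        # a single distribution object whose log_prob sums over the mixture for us
        return torch.distributions.MixtureSameFamily(mix, comp)
```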

Untitled 28 In a latent variable model, the output distribution is still Gaussian, but in addition to inputting the image, we also input a latent variable into our model, which might be drawn from some prior distribution. Our model takes an image and some noise and turns them into a Gaussian distribution over actions. A later lecture on variational autoencoders will cover this.

Untitled 29 The last option strikes a good balance between simplicity and expressivity.

Mixture of Gaussians is very simple but struggles with very complex distributions; latent variable models are very expressive but more complex to implement; autoregressive discretization is perhaps a nice middle ground between the two.

It can represent arbitrary distributions but is easier to use than latent variable models.

The idea of autoregressive discretization is the following.

  • If we have discrete actions, this multimodality problem is not an issue, because a softmax categorical distribution can easily represent any distribution over a discrete set.
  • But if we have continuous actions, discretizing is challenging: in general, the number of bins you need to discretize an n-dimensional action space is exponential in n.

Autoregressive discretization discretizes one dimension at a time, but can still represent arbitrary distributions by using a clever NN trick.

First we discretize the first dimension of the action and sample from its softmax. Now we have a value for the first action dimension, and we feed this value into another NN that outputs a distribution over the second action dimension; we sample that and repeat.

So we discretize one dimension at a time, which means we never incur the exponential cost, but because we model the distribution over each dimension conditioned on the previous ones, by the chain rule of probability we can still represent the full joint distribution over all of the action dimensions. (Easy to implement and quite powerful in practice.)
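
A rough sketch of autoregressive discretization for a low-dimensional continuous action; the bin count is an assumption, and a real implementation would feed the previous dimensions in through an embedding rather than as raw indices.

```python
import torch
import torch.nn as nn

class AutoregressivePolicy(nn.Module):
    """Discretize each action dimension into n_bins; head d is conditioned on the
    observation features plus the sampled bins of dimensions < d (chain rule)."""
    def __init__(self, feat_dim, act_dims=2, n_bins=21):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim + d, n_bins) for d in range(act_dims)])

    def sample(self, features):
        bins = []
        for d, head in enumerate(self.heads):
            prev = (torch.stack(bins, dim=-1).float() if bins
                    else features.new_zeros(features.shape[0], 0))
            logits = head(torch.cat([features, prev], dim=-1))
            bins.append(torch.distributions.Categorical(logits=logits).sample())
        # p(a_1, a_2, ...) = p(a_1 | o) * p(a_2 | o, a_1) * ...
        return torch.stack(bins, dim=-1)   # integer bin indices, one per dimension
```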

Untitled 30 While this in theory doesn't alleviate the distributional shift problem, in practice it can make even naive behavior cloning work, if we're careful about the tricks for handling non-Markovian policies and multimodality.


Untitled 31 While a human walks along a road, three cameras on their head collect data, and that data is used to train a quadrotor.


Untitled 32 In later slides we'll talk about the issues with imitation learning and motivate more general objectives.

Untitled 33 To make RL learn autonomously, we need to define our objective (like below).

Untitled 34 Adding more terminology.

Untitled 35

Untitled 36 Let's talk about imitation learning theory, building on this notion of a cost function and a reward function.


Untitled 37

  • $\pi^{\star}$ : the unknown policy of the expert.
  • $c(s,a)$ : the zero-one loss.

Remember that in reinforcement learning, the reward or cost must be evaluated in expectation under the learned policy, not under the expert's policy.

This is why behavior cloning is not actually optimizing quite the right objective. Behavior cloning typically maximizes the log-likelihood in expectation under the state/observation distribution of the expert, whereas we want to maximize it in expectation under the state/observation distribution of our policy. → the distributional mismatch problem (which DAgger tries to correct).

Now I will explain some bounds for behavior cloning and formalize the distributional shift problem.


Untitled 38

  • T : length of the trajectory (number of steps)

We want to analyze the expected total value of the zero-one loss.

Assumption: SL works. → For all states in the training set, the probability of making a mistake at those states is less than or equal to $\epsilon$.

To bound the total number of mistakes, we consider the most pessimistic scenario possible (the worst case: the tightrope-walker scenario, where falling off into the red region leads to states with no demonstration data, because the expert never falls off, so the policy doesn't know what to do there).

Untitled 39

  • At each time step there is a probability $\epsilon$ of making a mistake, and once a mistake is made, the policy may keep making mistakes for the rest of the trajectory: $\epsilon \times (\text{remaining time steps})$.
  • It's a pretty bad bound, because it means that as the trajectory length increases, our error increases quadratically (see the sum written out below). → That's why naive behavior cloning is a bad idea theoretically.
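
Written out (my paraphrase of the slide's argument): a mistake made at step t can cost up to the remaining steps of the trajectory, so

$$E\left[\sum_{t=1}^{T} c(s_t,a_t)\right] \;\le\; \epsilon T + \epsilon (T-1) + \cdots + \epsilon \;=\; \epsilon\,\frac{T(T+1)}{2} \;=\; O(\epsilon T^2)$$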

Untitled 40 In the previous slide, the analysis was a little bit naive because we didn't assume anything about generalization. We just said that for all states in the training set, our error is bounded by epsilon.

With generalization, your policy does well not just on training points but on other points from the same distribution. So our error is bounded by epsilon for any state sampled from $p_{train}$. (This is easy to extend to an expectation under $p_{train}$, as in the DAgger paper.)

The bound is still quadratic if we don't do DAgger. (The first term corresponds to never making a mistake, the second to having made a mistake.)


Untitled 41 So what we can do is bound the total variation divergence between $p_{\theta}$ and $p_{train}$, which is the sum of the absolute values of the differences.

We cannot say anything about $p_{mistake}$, so the best we can do is bound the total variation distance by $2(1-(1-\epsilon)^t)$. (The worst case is two completely disjoint distributions; each sums to 1, so their difference is at most 2.)
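
Putting the argument together (my reconstruction of the slide):

$$p_{\theta}(s_t) = (1-\epsilon)^t\, p_{train}(s_t) + \big(1-(1-\epsilon)^t\big)\, p_{mistake}(s_t)$$

$$\sum_{s_t}\big|p_{\theta}(s_t) - p_{train}(s_t)\big| = \big(1-(1-\epsilon)^t\big)\sum_{s_t}\big|p_{mistake}(s_t) - p_{train}(s_t)\big| \le 2\big(1-(1-\epsilon)^t\big) \le 2\epsilon t$$

using $(1-\epsilon)^t \ge 1-\epsilon t$ for $\epsilon \in [0,1]$. Summing $2\epsilon t$ over the trajectory again gives the $O(\epsilon T^2)$ bound.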


Untitled 42 Another angle on imitation: data that is not necessarily optimal for one task could be optimal for another one.

Untitled 43 You might not have enough data for any one of these goal points by itself (maybe if you just use the demonstrations for p1, you won't be able to learn to reach p1 successfully). What if you condition your policy on the point that was actually reached?

Now, even if you have only a single demonstration for each point, you can learn a conditional policy that is able to reach any point. By conditioning your policy on additional information, you can train policies to perform a task even if you don't have enough data for that task by itself.
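
A minimal sketch of goal-conditioned behavior cloning: the policy simply takes a goal (for example, the final state the demonstration actually reached) as an extra input. The trajectory format and shapes here are my own assumptions.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi_theta(a_t | o_t, g): an ordinary policy with the goal concatenated to the input."""
    def __init__(self, obs_dim, goal_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

def relabel(trajectory):
    """Relabel a demonstration with the goal it actually reached (its last observation),
    producing (o_t, g, a_t) tuples for ordinary behavior cloning."""
    goal = trajectory["obs"][-1]
    return [(o, goal, a) for o, a in zip(trajectory["obs"], trajectory["acts"])]
```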

Untitled 44 Untitled 45 Goal-conditioned behavior cloning has been used in various papers.

Untitled 46 Untitled 47