-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Fix termination vs truncation mixup #951
Comments
Hi @sash-a, is this fixed? |
Right now no as all the environments we currently use there is no truncation and so there's no issue. However, it is high on the list, but unfortunately the whole team is off this week and we are quite busy with some other research right now, but we will try to fix this as soon as possible. If you have capacity we're more than happy to accept pull requests. Note this is only an issue in the PPO systems |
Thanks @sash-a Also, is the |
Yup, both IQL and SAC handle it correctly. It's only the PPO systems that have the issue. All that needs to happen in ppo is that discounts need to be used when calculating advantage and Yup I'm pretty sure that branch works, the problem is that it just got very out of date with develop and then it became difficult to merge and we got a bit busy, but it should work, it's just missing some of Mava's latest features, but the algorithm itself should be correct. |
@sash-a Thanks and have a good day! |
After some experimentation this helps in certain envs but hurts others and needs quite a lot more investigation, unfortunately I don't have the time right now. The current progress is on the I see that this PR on CleanRL seems to agree that this is important for certain envs, but not all, which makes sense as not all envs truncate. My current solution to the GAE is to use both terminal and truncated which I see is the same as the above issue:
|
Describe the bug
Seem like we are not using the correct termination vs truncation values, we're always using the condition
termination or truncation
(timestep.last()
) when we often want to use the condition of onlytermination
(1 - discount
). It especially tricky in the recurrent systems.Expected behavior
What we should do is that when calculating advantages we should use
termination
(1 - discount
) and in the recurrent systems when passing inputs to the networks during training we should usetermination or truncation
in order to correctly reset the hidden state.Possible Solution
Always put
1-discount
in thePPOTimestep.done
and always puttimestep.last()
in theRNNLearnerState.done
.To avoid issues like this in future I think we should rename
RnnLearnerState.done
toRnnLearnerState.truncation
.Looks like there are a couple places where we use
PPOTimestep.done
when it should beRNNLearnerState.done
so we'd have to go through and make sure we're always using the correct one. An example is here and here where we're using thePPOTimestep.done
(which should be1 - discount
) in order to reset the hidden state, instead we should pass inRnnLearnerState.truncation
to the loss functions and use that.The text was updated successfully, but these errors were encountered: