Blind Submission by Conference • Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning
Shared Reply, Parts 1 and 2
Official Comment by Paper5886 Authors • Shared Reply, Parts 1 and 2
We've updated the paper, as described in both the individual replies and our shared reply below, highlighting significantly changed text in blue.
Part 1: Novelty, e.g. "the methodology is a direct application of RA on SAC-N"
To recap, the most important of our novel findings is that RA works disproportionately well in the offline setting, increasing SAC's median score on D4RL by a factor of 54. The second most important novel finding is that this also enables SAC-N to retain its performance with far fewer critics.
- Referring to Raisin as "RA + SAC-N" is only a rough explanation we use; Raisin is not the only possible way to combine RA and SAC-N. For example, we only briefly tested an alternative placement of the minimum over critics, in which the second (residual) term steps towards the average of the critics rather than their minimum. There are certainly reasons to think that formulation could be worse than Raisin (for example, it might reduce the possible range of pessimism), and Raisin indeed outscored that algorithm in preliminary experiments (not shown). That said, we don't yet consider alternative RA + SAC-N approaches such as that one fully ruled out. (For concreteness, a rough sketch of the combination we do use follows this list.)
- We find that no single setting of the residual weight universally works well: you must tune it per dataset. If you only tuned Raisin's residual weight on, say, hopper-expert, you would find that small residual weights catastrophically fail there and that only a large residual weight works. However, that large setting catastrophically fails at, e.g., halfcheetah-expert. This is why another important direction for future work is automatically tuning the residual weight.
- The whole of Raisin is greater than the sum of its parts: SAC-10 catastrophically fails at, e.g., hopper-expert, and RA on its own (i.e., with a single critic) doesn't even solve hopper-expert one-third of the way, yet their careful combination (see the previous bullet) reaches SOTA. Furthering this notion, RA without the minimum-of-N-critics trick performs only comparably to TD3+BC, yet Raisin outperforms TD3+BC-10.
- We also mention additional negative algorithmic findings: most importantly, ensemble aggregation techniques for imparting optimism to pure RG (i.e., full residual weight) don't seem to improve its scores on, e.g., halfcheetah-random (where we speculated RG might be "too pessimistic" in some sense). We tested those algorithms extensively and have now gone into more detail on this in the paper, in Appendix G.
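For readers who want the combination spelled out, here is a minimal sketch of a Raisin-style critic loss in PyTorch. It is illustrative only: the network sizes, the exact placement of the ensemble minimum, and the names (`QEnsemble`, `critic_loss`, `eta` for the residual weight) are shorthand for this comment, not the paper's reference implementation.

```python
# Illustrative sketch only: one plausible way to add a residual-gradient term
# to a SAC-N-style critic update. Names, sizes, and the exact placement of
# the min over critics are assumptions, not the paper's reference code.
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """N independent critics Q_i(s, a)."""
    def __init__(self, state_dim, action_dim, n_critics=10, hidden=256):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_critics)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return torch.cat([net(x) for net in self.nets], dim=-1)  # (batch, N)


def critic_loss(q_ens, batch, next_action, next_log_pi, gamma, alpha, eta):
    """eta = 0 gives the usual semi-gradient SAC-N loss;
    eta = 1 gives a pure residual-gradient loss (target not detached)."""
    s, a, r, s2, done = batch
    q = q_ens(s, a)                                   # (batch, N)

    # Pessimistic next-state value: minimum over the N critics.
    q_next = q_ens(s2, next_action)                   # gradients can flow here
    v_next = q_next.min(dim=-1, keepdim=True).values - alpha * next_log_pi
    target = r + gamma * (1.0 - done) * v_next        # (batch, 1)

    # Semi-gradient term: target treated as a constant (detached).
    sg = ((q - target.detach()) ** 2).mean()
    # Residual-gradient term: gradient also flows through the target.
    rg = ((q - target) ** 2).mean()

    return (1.0 - eta) * sg + eta * rg
```

Blending the two losses this way reproduces the classic residual-algorithm update direction: `eta = 0` recovers the semi-gradient SAC-N update, `eta = 1` is pure residual gradient, and Raisin lives in between, with the residual weight tuned per dataset as described above.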
We've also added new experiments, discussed in Parts 3 and 4 of our shared reply.
Part 2: Why does Raisin work?
Our new experiments on fixed points (discussed in Part 4 of our shared reply) suggest they are likely important for Raisin's effectiveness. Section 4.3 confirms that value function initialization can also be important. Regarding value initialization, Figure 2 supports Wang & Ueda (2021)'s argument that RG tends to maintain its average value prediction; with a standard near-zero initialization and positive rewards, this makes RG naturally pessimistic, which is a good fit for offline RL.
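As a quick reminder of the distinction this argument relies on, here are the standard update directions for the squared Bellman error; we write the residual weight as η purely for illustration, and nothing below is specific to our implementation.

```latex
% Let \delta = Q_\theta(s,a) - \bigl(r + \gamma\,Q_\theta(s',a')\bigr) denote the Bellman error.
\begin{align*}
  \text{semi-gradient (SG):}\quad
    \Delta\theta &\propto -\,\delta\,\nabla_\theta Q_\theta(s,a) \\
  \text{residual gradient (RG):}\quad
    \Delta\theta &\propto -\,\delta\,\bigl(\nabla_\theta Q_\theta(s,a)
        - \gamma\,\nabla_\theta Q_\theta(s',a')\bigr) \\
  \text{residual algorithm (RA):}\quad
    \Delta\theta &\propto -\,\delta\,\bigl(\nabla_\theta Q_\theta(s,a)
        - \eta\,\gamma\,\nabla_\theta Q_\theta(s',a')\bigr),
    \qquad \eta \in [0,1]
\end{align*}
% Because RG pushes Q_\theta(s,a) and \gamma\,Q_\theta(s',a') towards each other,
% rather than moving only the prediction, it tends to preserve the average
% prediction, which is the behavior Figure 2 illustrates.
```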
Granted, it's less clear how value function initialization interacts with the minimum-of-N-critics trick, nor what the full mechanism behind Raisin's gains is.
We've now stated all the above more explicitly in the paper and also further emphasized that future work in this direction is worthwhile.
References
Zhikang T. Wang and Masahito Ueda. Convergent and Efficient Deep Q Network Algorithm. arXiv, June 2021. doi: 10.48550/arXiv.2106.15419.
Seyed Kamyar Seyed Ghasemipour, Shixiang Shane Gu, and Ofir Nachum. Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters. OpenReview, October 2022. URL https://openreview.net/forum?id=z64kN1h1-rR.
Shared Reply, Parts 3 and 4
Official Comment by Paper5886 Authors • Shared Reply, Parts 3 and 4
Part 3: A Raisin analog for TD3
We've also now implemented a Raisin analog for TD3 to examine the generality of our approach and to show the advantage of Raisin over a method approximately equivalent to the prior work Bi-Res-DDPG (since TD3 is essentially an improved DDPG). As we expected, RA-TD3-1 (roughly an improved Bi-Res-DDPG) scores poorly, while adding the ensemble (RA-TD3-10) recovers most, but not all, of Raisin-10's score:
| Dataset | RA-TD3-1 | RA-TD3-10 | Raisin-10 |
|---|---|---|---|
| hopper-expert | 12 ± 15 | 103 ± 16 | 110 ± 0.4 |
| hopper-random | - | 27 ± 10 | 31 ± 0.1 |
| walker2d-random | - | 12 ± 6.3 | 18 ± 9.2 |
Part 4: Fixed point experiments
We've additionally added new experiments in Appendix F that resume training a single critic while switching between the RG and SG objectives, to probe whether the algorithms' different fixed points matter.
Resuming from SG on halfcheetah-random:
| Dataset | SG-SAC-1 | RG-SAC-1 resume | SG-SAC-1 resume |
|---|---|---|---|
| halfcheetah-random | 17 ± 15 | -2.0 ± 0.3 | 32 ± 3.5 |
Resuming from RG on hopper-expert:
| Dataset | RG-SAC-1 | RG-SAC-1 resume | SG-SAC-1 resume |
|---|---|---|---|
| hopper-expert | 39 ± 20 | 63 ± 24 | 0.7 ± 0.1 |
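For clarity, the protocol behind these tables is simply: train a single critic to near convergence under one objective, then continue training those same weights under either objective and compare. Below is a hedged sketch of that protocol; the helper names (`make_critic`, `sample_batch`), step counts, and optimizer settings are placeholders, not our actual configuration.

```python
# Sketch of the "resume" protocol only; all specifics below are placeholders.
import copy
import torch


def bellman_loss(q_net, batch, gamma, residual):
    """residual=False: semi-gradient (SG); residual=True: residual gradient (RG)."""
    s, a, r, s2, a2, done = batch          # next actions a2 assumed given, for simplicity
    target = r + gamma * (1.0 - done) * q_net(s2, a2)
    if not residual:
        target = target.detach()           # SG treats the bootstrapped target as constant
    return ((q_net(s, a) - target) ** 2).mean()


def train(q_net, sample_batch, residual, steps, gamma=0.99, lr=3e-4):
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(steps):
        loss = bellman_loss(q_net, sample_batch(), gamma, residual)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_net


# Phase 1: reach an approximate fixed point under one objective, e.g. SG.
# q_sg = train(make_critic(), sample_batch, residual=False, steps=1_000_000)
# Phase 2: resume from that checkpoint under either objective and compare scores.
# q_rg_resume = train(copy.deepcopy(q_sg), sample_batch, residual=True,  steps=500_000)
# q_sg_resume = train(copy.deepcopy(q_sg), sample_batch, residual=False, steps=500_000)
```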
Official Review of Paper5886 by Reviewer foJV
This paper considers the problem of offline reinforcement learning and investigates the performance of the residual gradient algorithm (RG) and an improved residual algorithm (RA) on SAC. Empirically, the simple trick increases the score on D4RL gym tasks. By adding a residual component to SAC-N, the algorithm (with a much smaller N) achieves comparable performance.
Overall, this paper provides a clear explanation of the RAISIN algorithm (which is simply RA on top of SAC-N) and provides experimental results demonstrating the performance improvement, along with ablation studies.
The methodology is simple, and the only parameter to tune is the pessimism parameter, as everything else is built on top of SAC. The proposed algorithm achieves comparable performance with far fewer ensemble members, which makes the algorithm promising.
The weaknesses of this paper are listed as follows:
- It seems that the novelty of this paper is limited: the methodology is a direct application of RA on SAC-N, and there is little theoretical discussion of how the performance gets improved.
- In the Section 4.1 comparison with TD3+BC-10, the authors simply claim that Raisin outscores TD3+BC-10. However, in Table 1, there are some tasks on which RAISIN scores lower than TD3+BC, for example hopper-expert, walker2d-expert, walker2d-medium-expert, hopper-medium-expert, hopper-medium, halfcheetah-random, etc. The authors do not provide sufficient explanation of what properties of these tasks make RAISIN's performance worse.
- This paper does not provide new techniques or methodologies that shed light on future research in offline RL, but only provides experimental studies of a combination of two existing methods. It is therefore important for the authors to provide more thorough empirical explanations and interpretations. The experimental results are good for a general paper, but for a paper without enough technical novelty, a higher standard of experiments should be expected.
The paper is written clearly; the methodology lacks some novelty; the empirical results do provide some useful observations.
This paper provides a novel observation that RA + SAC-N has good performance and reduced computational cost. The algorithm becomes more robust on a number of gym tasks. However, RAISIN does not outperform SOTA results on all tasks (when compared with TD3+BC), and the experiments are not interpreted sufficiently.
Following up
Official Comment by Paper5886 Authors • Following up
Thanks again for your review! In light of our additions and replies, might you consider raising your score?
Reply to Reviewer foJV
Official Comment by Paper5886 Authors • Reply to Reviewer foJV
Thanks a ton for the review! To recap the strengths you highlight:
- Raisin is simple
- The paper is written clearly
- Raisin roughly matches SOTA with dramatically reduced compute
- Aside from a fixed N and a tuned residual weight, we left all other hyperparameters untouched
It seems that the novelty of this paper is limited: the methodology is a direct application of RA on SAC-N,
Please see Part 1 of our shared reply, "Novelty." To our knowledge, RA has strictly been proposed and tested as an online algorithm. The only use of deep RA we are aware of is the RA-DDPG work (Zhang et al. 2019), where even the highest improvement in AUC over all tasks is a factor of about 3.
Please also see Parts 3 and 4 of our shared reply, where we discuss new experiments we've added.
there is little theoretical discussion of how the performance gets improved.
Please see Part 2 of our shared reply, "Why does Raisin work?" Also, please see our new experiments in Part 4 of our shared reply, which suggest the importance of fixed points.
In the Section 4.1 comparison with TD3+BC-10, the authors simply claim that Raisin outscores TD3+BC-10. However, in Table 1, there are some tasks on which RAISIN scores lower than TD3+BC, for example hopper-expert, walker2d-expert, walker2d-medium-expert, hopper-medium-expert, hopper-medium, halfcheetah-random, etc.
We wrote it that way in our initial draft because we did not consider 1% score differences significant. However, we have updated our draft to be more precise. For example, Raisin scores at least 90% of the best score on twelve tasks, whereas TD3+BC-10 achieves this on only seven tasks. And TD3+BC-10's worst relative score on any task is only 18% of that task's best (walker2d-random), whereas Raisin's worst is still 73% of that task's best (hopper-medium).
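To be explicit about the aggregate comparison we are describing, here is how the per-task relative scores are computed. The numbers below are placeholders rather than our Table 1 values.

```python
# Relative score = a method's score divided by the best score on that task.
# Placeholder numbers only; see Table 1 in the paper for the real values.
scores = {
    "task_a": {"Raisin-10": 105.0, "TD3+BC-10": 98.0},
    "task_b": {"Raisin-10": 20.0, "TD3+BC-10": 27.0},
}

relative = {
    task: {method: s / max(by_method.values()) for method, s in by_method.items()}
    for task, by_method in scores.items()
}

for method in ("Raisin-10", "TD3+BC-10"):
    near_best = sum(rel[method] >= 0.9 for rel in relative.values())
    worst = min(rel[method] for rel in relative.values())
    print(f"{method}: within 90% of best on {near_best} tasks; worst relative score {worst:.0%}")
```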
The authors do not provide sufficient explanation of what properties of these tasks make RAISIN's performance worse.
We agree that it would be nice to know the fundamental reason for the slightly lower scores on hopper-medium and walker2d-random. One possible explanation is that we left every hyperparameter other than the residual weight untouched, so the defaults inherited from SAC-N may be mildly suboptimal on those particular datasets.
This paper does not provide new techniques or methodologies that shed light on future research in offline RL, but only provides experimental studies of a combination of two existing methods. It is therefore important for the authors to provide more thorough empirical explanations and interpretations. The experimental results are good for a general paper, but for a paper without enough technical novelty, a higher standard of experiments should be expected.
We argue that our work has sufficient novelty in method design: Raisin is not just a combination of two existing methods. We also have negative algorithmic results, where ensemble approaches attempting to give pure RG optimism don't improve its score on tasks like halfcheetah-random; we've expanded on this in Appendix G. Please see Part 1 of our shared reply for further discussion of novelty. Furthermore, we have added more experiments, as discussed in Parts 3 and 4 of our shared reply, and provided more thorough empirical explanations of why Raisin works; please see Parts 2 and 4 of our shared reply.
References
Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Deep Residual Reinforcement Learning. arXiv, May 2019. doi: 10.48550/arXiv.1905.01072.
Official Review of Paper5886 by Reviewer vwJR
This paper incorporates the residual algorithm (RA) into SAC-N in the offline RL setting, and finds that the additional RA objective can not only improve on SOTA performance, but also reduce the number of ensemble members for more efficient computation.
Strength
- The paper empirically shows that adding RA to SAC-N is helpful in reducing the number of ensemble members.
Weakness:
The proposed RAISIN method can be viewed as a combination of RA and SAC-N, but in some sense it lacks adequate motivation: the RA method helps stabilize policy training, and the main issue with SAC-N is the large number of ensemble members, so why would simply combining the two be good?
The proposed method could easily be extended to most currently popular offline RL algorithms, so it would be better to add more experiments to verify its scalability.
About the pessimism settings in Appendix A: when N=10, this paper chooses a residual weight of zero on half of the tasks, but this choice means an exact equivalence with SAC-10, so I doubt whether the proposed method will work in this case.
fair
Although the proposed method achieves a significant performance improvement, the paper lacks adequate analysis and explanation of why combining RA and SAC-N is useful.
Following up
Official Comment by Paper5886 Authors • Following up
Thanks again for the feedback! Given our additions and replies, might you consider raising your score?
Reply to Reviewer vwJR
Official Comment by Paper5886 Authors • Reply to Reviewer vwJR
Thanks a bunch for the feedback! To recap the strengths you emphasize:
- Raisin is performant
- Raisin achieves SOTA scores
- The paper is fairly clear and/or reproducible
The proposed RAISIN method can be viewed as a combination of RA and SAC-N, but in some sense it lacks adequate motivation: the RA method helps stabilize policy training, and the main issue with SAC-N is the large number of ensemble members, so why would simply combining the two be good?
Please see Part 2 of our shared reply, "Why does Raisin work?" Also, please see our new experiments in Part 4 of our shared reply, which suggest the importance of fixed points.
The proposed method could easily be extended to most currently popular offline RL algorithms, so it would be better to add more experiments to verify its scalability.
We agree our approach can be applied beyond SAC: we have added experiments showing that an analogous approach for TD3 works similarly. Please see Part 3 of our shared reply for more details. If you're referring instead to adding the residual weight or the minimum-of-N-critics trick to other popular offline RL algorithms, we agree that would be a valuable direction for future work.
About the pessimism settings in Appendix A: when N=10, this paper chooses a residual weight of zero on half of the tasks, but this choice means an exact equivalence with SAC-10, so I doubt whether the proposed method will work in this case.
We agree that a residual weight of zero makes Raisin-10 equivalent to SAC-10 on those tasks. On the other half of the tasks, however, a nonzero residual weight is essential: SAC-10 on its own catastrophically fails at, e.g., hopper-expert (see Part 1 of our shared reply).
Technical Novelty And Significance: 2: The contributions are only marginally significant or novel. Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Please see Part 1 of our shared reply, "Novelty." To our knowledge, RA has strictly been proposed and tested as an online algorithm. The only use of deep RA we are aware of is the RA-DDPG work (Zhang et al. 2019), where even the highest improvement in AUC over all tasks is a factor of about 3.
Please also see Parts 3 and 4 of our shared reply, where we discuss new experiments we've added.
References
Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Deep Residual Reinforcement Learning. arXiv, May 2019. doi: 10.48550/arXiv.1905.01072.
Official Review of Paper5886 by Reviewer kejP
In this paper, the author integrated the Residual gradient algorithm RG into SAC and conducted extensive experiments to prove this new approach can achieve the same level of SAC-N’s performance and, at the same time, greatly reduce the critic number N. The newly proposed method is more “versatile,” which means it can converge under a diverse range of datasets.
The main contribution of this paper is that it gives a comprehensive study of the RA approach's effect on SAC in the offline learning framework. Through comparison to state-of-the-art algorithms, it shows that combining RA with SAC can achieve the same performance while significantly reducing the number of critics.
Strength:
The strengths of this paper are:
1. It conducts comprehensive experiments to demonstrate the effectiveness of RA with the SAC algorithm and its generality to different datasets.
2. The paper gives a very clear motivation for why the authors investigated a relatively old algorithm that had not been proven effective, and it conducts experiments from different angles, such as the reduction of N compared to SG-based SAC methods and different value initializations.
Weakness:
Most of this paper's efforts are on the experimental comparison of SG-based and RA-based SAC algorithms, and the difference between them is adding the gradient of the next states into the loss function. Though these two approaches have significant differences, the RA approach has already been investigated by previous researchers, as in the RA-DDPG work.
In the baseline comparison, the authors should also include past work that extended RA to DDPG to prove their model's effectiveness and novelty. Since the authors claim in detail in the introduction that RA is inherently beneficial to offline reinforcement learning, they should provide more evidence for this, not just comparisons between SAC methods.
This paper aims to show that RA-SAC is more versatile than its SG counterparts and, compared to traditional SAC, needs fewer critics, thus significantly reducing the computational cost. However, as introduced in Section 1, the RA method is slower to converge, so the authors should also provide a convergence curve comparison, not just RA-SAC's.
The clarity of this paper is good and it is easy to follow. It provides a detailed discussion of the differences between SG and RG. However, some of the propositions are not clearly explained. For example, on page 2, theoretical concerns are mentioned as important but not explained, though it seems this paper focuses on experiments.
Quality: this work's quality is good but not perfect. The authors conducted many experiments to support their claims, but in order to show that RG can outperform SG, more investigation, such as different agents or tasks, should be included.
Novelty: the novelty is good, since the combination of RG and SG approaches has not been investigated enough, but the technical contribution is not large.
Reproducibility: the technical description is okay but not detailed. Reproducibility could be improved if the authors release their codebase.
This paper investigates the RA approach, which has not attracted much attention in the RL field. Though the authors provide many experiments to support their claims, the comparisons are not enough to justify them (as described under weaknesses and quality).
Following up
Official Comment by Paper5886 Authors • Following up
Thanks again for reviewing! In light of our replies and additions, might you consider increasing your score?
Reply to Reviewer kejP
Official Comment by Paper5886 Authors • Reply to Reviewer kejP
Thanks a ton for the review! To summarize the strengths you point out:
- We give extensive experiments to show that Raisin is computationally efficient and scores well on a diverse range of datasets
- The paper is clear and easy to follow, includes a detailed discussion of SG vs. RG, and experiments from different angles
- The novelty is "good", and the reproducibility is "okay" even before we release the code
Most of this paper's efforts are on the experimental comparison of SG-based and RA-based SAC algorithms, and the difference between them is adding the gradient of the next states into the loss function. Though these two approaches have significant differences, the RA approach has already been investigated by previous researchers, as in the RA-DDPG work.
Please see Part 1 of our shared reply, "Novelty." To our knowledge, RA has strictly been proposed and tested as an online algorithm. Additionally, the only use of deep RA we are aware of is indeed the RA-DDPG work (Zhang et al. 2019), where even the highest improvement in AUC over all tasks is a factor of about 3.
In the baseline comparison, the authors should also include past work that extended RA to DDPG to prove their model's effectiveness and novelty. Since the authors claim in detail in the introduction that RA is inherently beneficial to offline reinforcement learning, they should provide more evidence for this, not just comparisons between SAC methods. [...] The authors conducted many experiments to support their claims, but in order to show that RG can outperform SG, more investigation, such as different agents or tasks, should be included.
We've added new experiments with a Raisin analog for TD3 (which also show that even an improved version of the prior work Bi-Res-DDPG does not score very well). Please see Part 3 of our shared reply.
This paper aims to show that RA-SAC is more versatile than its SG counterparts and, compared to traditional SAC, needs fewer critics, thus significantly reducing the computational cost. However, as introduced in Section 1, the RA method is slower to converge, so the authors should also provide a convergence curve comparison, not just RA-SAC's.
While slow convergence indeed seems to arise when using only a single critic and an initialization that is too high (as shown in Figure 2 of our paper), we did not observe such slow convergence with our proposed algorithm, which uses an ensemble of critics and a tuned residual weight.
However, some of the propositions are not clearly explained. For example, on page 2, theoretical concerns are mentioned as important but not explained, though it seems this paper focuses on experiments.
Thank you for this comment. We've now clarified what those theoretical concerns are (and some counterpoints). Please let us know if there are other propositions you'd like us to explain more clearly.
Reproducibility: the technical description is okay but not detailed. Reproducibility could be improved if the authors release their codebase.
We present the pseudocode in Appendix B. We will also release clean code upon acceptance.
References
Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Deep Residual Reinforcement Learning. arXiv, May 2019. doi: 10.48550/arXiv.1905.01072.
Paper Decision
Decision by Program Chairs • Paper Decision
The paper is an empirical investigation of the residual algorithm, i.e., a weighted average of the residual gradient (RG) and semi-gradient (SG) algorithms, for SAC. It shows empirically that this simple trick works well for SAC in the offline setting, increasing SAC's score on D4RL gym tasks and achieving performance comparable to SAC-N with far fewer ensemble members. All the reviewers expressed concerns about the limited novelty, i.e., applying a well-established technique. The scope of the work is also quite limited; for example, does the conclusion generalize to other RL algorithms involving bootstrapping (solving the mean squared Bellman error) beyond SAC? In addition, the work missed comparisons to more recent offline RL algorithms. The performance seems at best comparable to, and sometimes worse than, CQL or Decision Transformer, so it is hard to justify the cost of using an ensemble of 10 critics.
Limited novelty and scope of the work. Missing comparisons to more recent offline RL algorithms that showed better or comparable performance with much lower compute.
Thoughts for future readers
Official Comment by Paper5886 Authors • Thoughts for future readers
Thanks for the metareview! Here are our thoughts for future readers.
For our thoughts on novelty, see Part 1 of our shared reply.
Regarding the comparison to more recent algorithms: Raisin outscores CQL on 12 of the 15 tasks [1] and otherwise scores within 10%. (Similarly vs. DT [2,3].) Further, just like SAC-N, Raisin often beats CQL and DT by large margins, e.g. by ~500% on hopper-random [1].
Moreover, Raisin likely runs faster than CQL and DT since we use EDAC's code but with 5 fewer critics, and EDAC is already faster than CQL [1], which is faster than DT [3]. We are unsure which method uses less of the GPU, but [1] suggests the memory usage of CQL and Raisin roughly match.
Regarding whether the conclusion generalizes to other RL algorithms beyond SAC: we think our TD3 version of Raisin in Appendix E suggests it does. Granted, we only tested the datasets we thought would be most difficult, and still only D4RL gym.
All that said, we also wish we had emphasized Raisin's empirical success mostly as further motivation for fixing pure RG.
[1] Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble, An et al.
[2] RvS: What is Essential for Offline RL via Supervised Learning?, Emmons et al.
[3] Offline Reinforcement Learning with Implicit Q-Learning, Kostrikov et al.