Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning

Braham Snyder; Yuke Zhu

Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning

Blind Submission by Conference • Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning

Published: 02 Feb 2023, Last Modified: 14 Feb 2023Submitted to ICLR 2023Readers: EveryoneShow BibtexShow Revisions

Keywords: reinforcement learning, offline RL, residual algorithms, residual gradient

Abstract: The residual gradient algorithm (RG), gradient descent of the Mean Squared Bellman Error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), weighted averaging of RG and SG, to combine RG's robust convergence and SG's speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC increases its score on D4RL gym tasks by a median factor of 54. We further show that using the minimum of ten critics lets our algorithm match SAC-

N

's state-of-the-art returns using 50

\times

less compute and no additional hyperparameters. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-

N

's returns on a handful of environments.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)

[–][+]

Paper Decision

Decision by Program Chairs • Paper Decision

ICLR 2023 Conference Program Chairs

21 Jan 2023, 04:09ICLR 2023 Conference Paper5886 DecisionReaders: EveryoneShow Revisions

Decision: Reject

Metareview: Summary, Strengths And Weaknesses:

The paper did an empirical investigation of residual algorithm, i.e., weighted averaging of residual gradient algorithm (RG) and semi-gradient algorithm (SG), for SAC. It showed empirically, the simple trick works great for SAC in offline setting, increasing the score of SAC on D4RL gym tasks, and achieving comparable performance for SAC-N with much less ensembles. All the reviewers expressed concerns on the limited novelty, i.e., applying a well-established technique. The scope of the work is quite limited. For example, does the conclusion generalizes to other RL algorithms involving bootstrapping (solving the Mean Squared Bellman Error) beyond SAC. In addition, the work missed comparison to more recent offline RL algorithms. The performance seems comparable at best or worse than CQL or decision transformer. It is thus hard to justify the cost of using ensemble of 10 critics.

Justification For Why Not Higher Score:

Limited novelty and scope of the work. Missed comparison to more recent offline RL algorithms which showed better or comparable performance with much lower compute.

Justification For Why Not Lower Score:

N/A

[–][+]

Thoughts for future readers

Official Comment by Paper5886 Authors • Thoughts for future readers

ICLR 2023 Conference Paper5886 Authors

26 Jan 2023, 06:55 (modified: 03 Feb 2023, 02:11)ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks for the metareview! Here are our thoughts for future readers.

For our thoughts on novelty, see Part 1 of our shared reply.

CQL and DT showed better or comparable performance with much lower compute

Raisin outscores CQL on 12 of the 15 tasks [1] and otherwise scores within 10%. (Similarly vs. DT [2,3].) Further, just like SAC- $N$ , Raisin often beats CQL and DT by large margins, e.g. by ~500% on hopper-random [1].

Moreover, Raisin likely runs faster than CQL and DT since we use EDAC's code but with 5 $\times$ fewer critics, and EDAC is already faster than CQL [1], which is faster than DT [3]. We are unsure which method uses less of the GPU, but [1] suggests the memory usage of CQL and Raisin roughly match.

does the conclusion generalizes to other RL algorithms involving bootstrapping (solving the Mean Squared Bellman Error) beyond SAC

We think our TD3 version of Raisin in Appendix E suggests this. Granted, we only tested the datasets we thought would be most difficult, and still only D4RL gym.

All that said, we also wish we had emphasized Raisin's empirical success mostly as further motivation for fixing pure RG.

[1] Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble, An et al.

[2] RvS: What is Essential for Offline RL via Supervised Learning?, Emmons et al.

[3] Offline Reinforcement Learning with Implicit Q-Learning, Kostrikov et al.

[–][+]

Shared Reply, Parts 1 and 2

Official Comment by Paper5886 Authors • Shared Reply, Parts 1 and 2

ICLR 2023 Conference Paper5886 Authors

16 Nov 2022, 03:44 (modified: 16 Nov 2022, 03:49)ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

We've updated the paper, as described in both individual replies and our shared reply below, highlighting in the paper the significantly changed text in blue.

Part 1: Novelty, e.g. "the methodology is a direct application of RA on SAC-N"

To recap, the most important of our novel findings is that RA works disproportionately well in the offline setting, increasing SAC's median score on D4RL by a factor of 54, and the second most important novel finding is that this also enables SAC- $N$ to roughly keep its SOTA scores with 50 $\times$ less compute. We've now clarified the additional novelty more in the paper:

Referring to Raisin as RA + SAC- $N$ is only a rough explanation we use. Raisin is not the only possible way to combine RA and SAC- $N$ . For example, we only briefly tested the alternative minimum placement $(Q - min \bar{y})^{2} + (\bar{Q} - min y)^{2}$ . (Where we step $min y$ in the second term towards the average of $\bar{Q}$ .) There are certainly reasons to think that formulation could be worse than Raisin (for example, it might reduce the possible range of pessimism), and Raisin indeed outscored that algorithm in preliminary experiments (not shown). That said, we don't yet consider alternative RA + SAC- $N$ approaches such as that one fully ruled out.
We find no single setting for $η$ universally works well: you must tune it per dataset. If you only tuned Raisin's $η$ on, say, hopper-expert, you would find that $η = 0 %$ catastrophically fails and that only a large $η$ around $30 %$ works. However, $η = 30 %$ catastrophically fails at, e.g., halfcheetah-expert. This is why another important direction for future work is automatically tuning $η$ .
The whole of Raisin is greater than the sum of its parts: SAC-10 catastrophically fails at, e.g., hopper-expert, and RA on its own (i.e., $N = 1$ ) doesn't even solve hopper-expert one-third of the way, yet their careful combination (see the previous bullet) reaches SOTA. Furthering this notion, RA without the minimum-of-N-critics trick performs only comparably to TD3+BC, yet Raisin outperforms TD3+BC-10.
We mention additional negative algorithmic findings as well: most importantly, ensemble aggregation techniques for imparting optimism to pure RG ( $η = 100 %$ ) don't seem to improve its scores for, e.g., halfcheetah-random (where we speculated RG might be "too pessimistic" in some sense). We tested those algorithms extensively. We've now gone into more detail on this in the paper, in Appendix G.

We've also added new experiments, discussed in Parts 3 and 4 of our shared reply.

Part 2: Why does Raisin work?

Our new experiments on fixed points (discussed in Part 4 of our shared reply) suggest they are likely important for Raisin's effectiveness. Section 4.3 confirms that value function initialization can also be important. Regarding value initialization, Figure 2 supports Wang & Ueda (2021)'s argument that RG tends to maintain its average value prediction, which means that a standard near-zero initialization and positive rewards can make RG naturally pessimistic, which is a natural fit for offline RL.

Granted, it's less clear how value function initialization interacts with the minimum-of-N-critics trick, nor what the $Q$ -target overshooting effect in Figure 3 is precisely. That said, we found SAC- $N$ similarly suffers from unexplained $Q$ -target overshooting issues even with its standard, high-scoring settings. So analyzing SAC- $N$ 's behavior in-depth first would be necessary. For future work, we hypothesize independent targets might be worth investigating for this purpose (Ghasemipour et al. 2022).

We've now stated all the above more explicitly in the paper and also further emphasized that future work in this direction is worthwhile.

References

Zhikang T. Wang and Masahito Ueda. Convergent and Efficient Deep Q Network Algorithm. arXiv, June 2021. doi: 10.48550/arXiv.2106.15419.

Seyed Kamyar Seyed Ghasemipour, Shixiang Shane Gu, and Ofir Nachum. Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters. OpenReview, October 2022. URL https://openreview.net/forum?id=z64kN1h1-rR.

[–][+]

Shared Reply, Parts 3 and 4

Official Comment by Paper5886 Authors • Shared Reply, Parts 3 and 4

ICLR 2023 Conference Paper5886 Authors

16 Nov 2022, 03:48 (modified: 18 Nov 2022, 12:38)ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Part 3: A Raisin analog for TD3

We've also now implemented a Raisin analog for TD3 to examine the generality of our approach and to show the advantage of Raisin over a method approximately equivalent to the prior work Bi-Res-DDPG (since TD3 is essentially an improved DDPG). As we expected, RA-TD3 with $N = 1$ (so roughly the prior work Bi-Res-DDPG) hardly scores at all on hopper-expert, but adding our minimum-of- $N$ -critics residual formulation gives scores similar to Raisin (the only difference is SAC vs TD3 as a base). We've added those results in Appendix E and show them here as well:

	RA-TD3-1	RA-TD3-10	Raisin-10
hopper-expert	12 ± 15	103 ± 16	110 ± 0.4
hopper-random	-	27 ± 10	31 ± 0.1
walker2d-random	-	12 ± 6.3	18 ± 9.2

Part 4: Fixed point experiments

We've additionally added new experiments in Appendix F training RG ( $η = 100 %$ ) and SG in new runs starting from weights at high scores to investigate the role of fixed points. When $N = 1$ , RG and SG attain high scores on hopper-expert and halfcheetah-random, respectively. We then restart training from those high-scoring weights using both RG and SG on each environment. On hopper-expert, SG trained from RG's high-scoring weights reverts to low scores, while new RG runs trained from those same weights retain high scores. Similarly, on halfcheetah-random, RG trained from SG's high-scoring weights falls to low scores while new SG runs starting from those same weights keep their high scores. Our results provide evidence in favor of fixed points playing an important role, though future work should study this in-depth. We also show those results here.

Resuming from SG on halfcheetah-random:

	SG-SAC-1	RG-SAC-1 resume	SG-SAC-1 resume
halfcheetah-random	17 ± 15	-2.0 ± 0.3	32 ± 3.5

Resuming from RG on hopper-expert:

	RG-SAC-1	RG-SAC-1 resume	SG-SAC-1 resume
hopper-expert	39 ± 20	63 ± 24	0.7 ± 0.1

[–][+]

Official Review of Paper5886 by Reviewer foJV

ICLR 2023 Conference Paper5886 Reviewer foJV

28 Oct 2022, 11:39ICLR 2023 Conference Paper5886 Official ReviewReaders: EveryoneShow Revisions

Summary Of The Paper:

This paper considers the problem of offline reinforcement learning and investigates the performance of the residual gradient algorithm (RG) and an improved residual algorithm (R) on SAC. Empirically, the simple trick increases the score on D4RL gym tasks. By adding a residual component to SAC-N, the algorithm (when $N = 10$ ) matches the state of the art performance on most of the tasks compared with SAC-N (which requires hundreds of ensembles). It reduces the computational cause dramatically.

Overall, this paper provides a clear explanation of the RAISIN algorithm (which is simply RA on top of SAC-N) and provides experimental results comparing the performance improvement and the ablation studies.

Strength And Weaknesses:

The methodology is simple, and the only parameter to tune is the pessimistic, as everything else is based on top of SAC. The proposed algorithm achieves comparable performance with much less ensembles, which makes the algorithm promising.

The weaknesses of this paper are listed as follows:

It seems that the novelty of this paper is limited, the methodology is a direct application of RA on SAC-N, and there are little theoretical discussion on how the performance get improved.
In Section 4.1 comparison with TD3+BC-10, the authors simply claim that Raisin outscores TD3+BC-10. However, in table 1, there are some tasks that RAISIN scores lower than TD3+BC, for example on hopper-expert, walker2d-expert, walker2d-medium-expert, hopper-medium-expert, hopper-medium, halfcheetah-random, etc. The authors do not provide sufficient explanation regarding what are the properties in these tasks that makes the performance of RAISIN worse.
Since this paper does not provide new techniques or methodologies that shed lights on future researches in offline RL, but only provides experimental studies on a combination of two existing methods. It is important for the authors to provide more thorough empirical explanations and interpretations. The experimental results are good for general papers, but for a paper without enough technical novelty, higher standard in experiments should be expected.

Clarity, Quality, Novelty And Reproducibility:

The paper is written clearly, the methodology lacks some novelty, the empirical results do provide some useful observations.

Summary Of The Review:

This paper provide a novel observation that RA + SAC-N has good performance and reduced computational cost. The algorithm becomes more robust on a number of gym tasks. However, the performance of RAISIN does not outperform all SOTA results on all tasks (when compared with TD3+BC), and the experiments are not interpreted sufficiently.

Correctness: 3: Some of the paper’s claims have minor issues. A few statements are not well-supported, or require small changes to be made correct.

Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.

Empirical Novelty And Significance: 3: The contributions are significant and somewhat new. Aspects of the contributions exist in prior work.

Flag For Ethics Review: NO.

Recommendation: 6: marginally above the acceptance threshold

Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

[–][+]

Following up

Official Comment by Paper5886 Authors • Following up

ICLR 2023 Conference Paper5886 Authors

18 Nov 2022, 12:24ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks again for your review! In light of our additions and replies, might you consider raising your score?

[–][+]

Reply to Reviewer foJV

Official Comment by Paper5886 Authors • Reply to Reviewer foJV

ICLR 2023 Conference Paper5886 Authors

16 Nov 2022, 03:51 (modified: 16 Nov 2022, 03:55)ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks a ton for the review! To recap the strengths you highlight:

Raisin is simple
The paper is written clearly
Raisin roughly matches SOTA with dramatically reduced compute
Aside from a fixed $N$ and tuning $η$ , we left all other hyperparameters untouched

It seems that the novelty of this paper is limited, the methodology is a direct application of RA on SAC-N,

Please see Part 1 of our shared reply, "Novelty." To our knowledge, RA has strictly been proposed and tested as an online algorithm. The only use of deep RA we are aware of is the RA-DDPG work (Zhang et al. 2019), where even the highest improvement in AUC over all tasks is a factor of about 3 $\times$ . In contrast, we have discovered that RA works disproportionately well in the offline setting (for which it was not designed), giving a median 54 $\times$ improvement in score over all tasks. Moreover, we find Raisin to be the best of multiple possible ways to combine RA and SAC- $N$ , and that this combination performs better than the sum of its parts. We also find that tuning the residual weight differently for every dataset is critical.

Please also see Parts 3 and 4 of our shared reply, where we discuss new experiments we've added.

there are little theoretical discussion on how the performance get improved.

Please see Part 2 of our shared reply, "Why does Raisin work?" Also, please see our new experiments in Part 4 of our shared reply, which suggest the importance of fixed points.

In Section 4.1 comparison with TD3+BC-10, the authors simply claim that Raisin outscores TD3+BC-10. However, in table 1, there are some tasks that RAISIN scores lower than TD3+BC, for example on hopper-expert, walker2d-expert, walker2d-medium-expert, hopper-medium-expert, hopper-medium, halfcheetah-random, etc.

We wrote it that way in our initial draft because we did not consider 1% score differences significant. However, we have updated our draft to be more precise. For example, Raisin scores 90% or higher than the best score on twelve tasks, whereas TD3+BC-10 only achieves this for seven tasks. And TD3+BC-10's worst relative score on a task is only 18% of that task's best (walker2d-random), whereas Raisin's worst score on a task is still 73% of that task's best (hopper-medium).

The authors do not provide sufficient explanation regarding what are the properties in these tasks that makes the performance of RAISIN worse.

We agree that it would be nice to know the fundamental reason for the slightly lower scores on hopper-medium and walker2d-random. Two possible explanations are: (i) we left every hyperparameter other than $η$ completely untouched — Raisin may fully match SAC- $N$ with more optimized hyperparameter choices, such as a different learning rate; (ii) for simplicity, we intentionally ignored the slight stochasticity of MuJoCo, i.e., the double sampling problem, though there are ways to fix this, see Section 5.3.

Since this paper does not provide new techniques or methodologies that shed lights on future researches in offline RL, but only provides experimental studies on a combination of two existing methods. It is important for the authors to provide more thorough empirical explanations and interpretations. The experimental results are good for general papers, but for a paper without enough technical novelty, higher standard in experiments should be expected.

We argue that our work has sufficient novelty in method design. Raisin is not just a combination of two existing methods. We also have negative algorithmic results, where ensemble approaches attempting to give pure RG optimism don't improve pure RG's score on tasks like halfcheetah-random. We've expanded on this in Appendix G. Please see Part 1 of our shared reply for more discussions on novelty. Furthermore, we have added more experiments, as discussed in Parts 3 and 4 of our shared reply, and provided more thorough empirical explanations of why Raisin works — please see Parts 2 and 4 of our shared reply.

References

Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Deep Residual Reinforcement Learning. arXiv, May 2019. doi: 10.48550/arXiv.1905.01072.

[–][+]

Official Review of Paper5886 by Reviewer vwJR

ICLR 2023 Conference Paper5886 Reviewer vwJR

24 Oct 2022, 22:11ICLR 2023 Conference Paper5886 Official ReviewReaders: EveryoneShow Revisions

Summary Of The Paper:

This paper incorporates the resisual algorithm (RA) with SAC-N used in offline RL setting, and finds that the additional RA objective can not only improve the SOTA performance, but also reduce the number of ensemble functions for a more efficient computation.

Strength And Weaknesses:

Strength

The paper empirically shows that adding RA to SAC-N is helpful in reducing the number of ensemble members.

Weakness:

The proposed RAISIN method can be viewed as a combination of RA and SAC-N, but in some sense lacking adequate motivation: RA method is helpful to the stable policy training, and the main issue of SAC-N is the large number of ensemble functions, but why simply combining the two would be good?
The proposed method can be easily extended to most current popular offline rl algorithms, so it's better to add more experiments to verify its scalability.
About the pessimism settings in Appendix.A, when N=10, this paper chooses $η = 0$ on half of tasks, but this choice means an actual equivalence with SAC-10, so I doubt whether the proposed method will work in this case.

Clarity, Quality, Novelty And Reproducibility:

fair

Summary Of The Review:

Although the proposed method can achieve significant performance improvement, there lacks adequate analysis and explaination on why combining the RA and SAC-N is useful.

Correctness: 3: Some of the paper’s claims have minor issues. A few statements are not well-supported, or require small changes to be made correct.

Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.

Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.

Flag For Ethics Review: NO.

Recommendation: 5: marginally below the acceptance threshold

Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

[–][+]

Following up

Official Comment by Paper5886 Authors • Following up

ICLR 2023 Conference Paper5886 Authors

18 Nov 2022, 12:25ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks again for the feedback! Given our additions and replies, might you consider raising your score?

[–][+]

Reply to Reviewer vwJR

Official Comment by Paper5886 Authors • Reply to Reviewer vwJR

ICLR 2023 Conference Paper5886 Authors

16 Nov 2022, 03:52 (modified: 16 Nov 2022, 07:20)ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks a bunch for the feedback! To recap the strengths you emphasize:

Raisin is performant
Raisin achieves SOTA scores
The paper is fairly clear and/or reproducible

The proposed RAISIN method can be viewed as a combination of RA and SAC-N, but in some sense lacking adequate motivation: RA method is helpful to the stable policy training, and the main issue of SAC-N is the large number of ensemble functions, but why simply combining the two would be good?

Please see Part 2 of our shared reply, "Why does Raisin work?" Also, please see our new experiments in Part 4 of our shared reply, which suggest the importance of fixed points.

The proposed method can be easily extended to most current popular offline rl algorithms, so it's better to add more experiments to verify its scalability.

We agree our approach can be applied beyond SAC — We have added experiments showing an analogous approach for TD3 works similarly. Please see Part 3 of our shared reply for more details. If you're referring instead to adding residual weight or the minimum-of- $N$ -critics trick to IQL, it is possible that they would improve its scores. Nonetheless, we chose to use TD3+BC and SAC- $N$ for their simplicity. Both approaches take a well-tested online algorithm and add a single source of pessimism. In contrast, IQL already has two different sources of pessimism (expectile regression and AWR), each with its own pessimism hyperparameter.

About the pessimism settings in Appendix.A, when N=10, this paper chooses $η = 0 %$ on half of tasks, but this choice means an actual equivalence with SAC-10, so I doubt whether the proposed method will work in this case.

We agree that $η = 0 %$ means an equivalence with SAC-10, as noted in Section 3. To clarify, our proposed method is to tune $η$ per task, including tasks where $η = 0 %$ (i.e., SAC-10) performs best. In other words, SAC-10 is part of our proposed method Raisin. We agree the need to tune $η$ per task is a potential weakness. In the Related Work section, we mention a few approaches that could eliminate the need to tune $η$ per dataset. As we mention in the paper, ultimately replacing SG entirely (for more robust convergence) may be ideal, but Raisin already gives SOTA scores even where the optimal setting is equivalent to SAC-10 (i.e., SG).

Technical Novelty And Significance: 2: The contributions are only marginally significant or novel. Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.

Please see Part 1 of our shared reply, "Novelty." To our knowledge, RA has strictly been proposed and tested as an online algorithm. The only use of deep RA we are aware of is the RA-DDPG work (Zhang et al. 2019), where even the highest improvement in AUC over all tasks is a factor of about 3 $\times$ . In contrast, we have discovered that RA works disproportionately well in the offline setting (for which it was not designed), giving a median 54 $\times$ improvement in score over all tasks. Moreover, we find Raisin to be the best of multiple possible ways to combine RA and SAC- $N$ , and that this combination performs better than the sum of its parts. We also find that tuning the residual weight differently for every dataset is critical.

Please also see Parts 3 and 4 of our shared reply, where we discuss new experiments we've added.

References

Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Deep Residual Reinforcement Learning. arXiv, May 2019. doi: 10.48550/arXiv.1905.01072.

[–][+]

Official Review of Paper5886 by Reviewer kejP

ICLR 2023 Conference Paper5886 Reviewer kejP

24 Oct 2022, 11:43 (modified: 29 Nov 2022, 03:02)ICLR 2023 Conference Paper5886 Official ReviewReaders: EveryoneShow Revisions

Summary Of The Paper:

In this paper, the author integrated the Residual gradient algorithm RG into SAC and conducted extensive experiments to prove this new approach can achieve the same level of SAC-N’s performance and, at the same time, greatly reduce the critic number N. The newly proposed method is more “versatile,” which means it can converge under a diverse range of datasets.

The main contribution of this paper is it gives a comprehensive study of the RA approach’s effect on SAC in the offline learning framework. And through comparing to state-of-the-art algorithms, it proves that combing RA into SAC can achieve the same performance while significantly reducing the number of critics.

Strength And Weaknesses:

Strength:

The strength of this paper is that 1. It conducted comprehensive experiments to prove the effectiveness of RA with the SAC algorithm and its generality to different datasets.
The paper gives a very clear motivation of why they investigated a relatively old and not proven effective algorithm and conducted experiments from different angles, such as the reduction of N compared to SG-based SAC methods and different value initialization.

Weakness:

Most of this paper’s efforts are on the experimental comparison of SG and RA SAC algorithms. And the difference between them is adding the gradient of the next states into the loss function. Though these two approaches have significant differences, the RA approach has already been investigated by previous researchers, as mentioned in the RA-DDPG work.
In the baseline comparison, the author should also involve past work that extended RA to DDPG to prove their models' effectiveness and novelty. Since in the introduction, the author talked in detail and claimed RA is in nature beneficial to offline learning reinforcement learning, they should provide more information to prove it, not just compare between SAC methods.
This paper aims to prove that RA-SAC is more versatile than the SG counterparts, and compared to traditional SAC, it needs fewer critics, thus significantly reducing the computing cost. However, as introduced in section 1, the RA method is slower to converge. So the author should also provide convergence curve comparison, not just RA-SAC’s.

Clarity, Quality, Novelty And Reproducibility:

The clarity of this paper is good and easy to follow. It provides a detailed discussion of SG and AG differences. However, some of the propositions are not clearly explained. For example, on page 2, the theoretical concerns are important but not mentioned though it seems this paper is focusing on experiments.

Quality: this work’s quality is good but not perfect. The author conducted many experiments to support their claims, but in order to prove that RG could outperform SG, more investigations, such as different agents or tasks, should be involved.

Novelty: the novelty is good since the investigation of combining RG and SG approaches is not enough. But the technical contribution is not that much.

Reproducibility: the technical description is okay and not in detail. The producibility could be increased if the author releases their codebase.

Summary Of The Review:

This paper investigates the RA approach, which does not attract much attention in the RL field. Though the author provides many experiments to support their claim, the comparison is not enough to justify their claims( as described in the weakness and quality).

Correctness: 3: Some of the paper’s claims have minor issues. A few statements are not well-supported, or require small changes to be made correct.

Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.

Empirical Novelty And Significance: 3: The contributions are significant and somewhat new. Aspects of the contributions exist in prior work.

Flag For Ethics Review: NO.

Recommendation: 6: marginally above the acceptance threshold

Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

[–][+]

Following up

Official Comment by Paper5886 Authors • Following up

ICLR 2023 Conference Paper5886 Authors

18 Nov 2022, 12:26ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks again for reviewing! In light of our replies and additions, might you consider increasing your score?

[–][+]

Increasing my score

Official Comment by Paper5886 Reviewer kejP • Increasing my score

ICLR 2023 Conference Paper5886 Reviewer kejP

29 Nov 2022, 03:02ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks for your reply. I've raised my score to 6.

[–][+]

Thanks!

Official Comment by Paper5886 Authors • Thanks!

ICLR 2023 Conference Paper5886 Authors

29 Nov 2022, 08:09ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks a bunch for the update! If you have any more questions, please do let us know.

[–][+]

Reply to Reviewer kejP

Official Comment by Paper5886 Authors • Reply to Reviewer kejP

ICLR 2023 Conference Paper5886 Authors

16 Nov 2022, 03:54ICLR 2023 Conference Paper5886 Official CommentReaders: EveryoneShow Revisions

Comment:

Thanks a ton for the review! To summarize the strengths you point out:

We give extensive experiments to show that Raisin is computationally efficient and scores well on a diverse range of datasets
The paper is clear and easy to follow, includes a detailed discussion of SG vs. RG, and experiments from different angles
The novelty is "good", and the reproducibility is "okay" even before we release the code

Most of this paper’s efforts are on the experimental comparison of SG and RA SAC algorithms. And the difference between them is adding the gradient of the next states into the loss function. Though these two approaches have significant differences, the RA approach has already been investigated by previous researchers, as mentioned in the RA-DDPG work.

Please see Part 1 of our shared reply, "Novelty." To our knowledge, RA has strictly been proposed and tested as an online algorithm. Additionally, the only use of deep RA we are aware of is indeed the RA-DDPG work (Zhang et al. 2019), where even the highest improvement in AUC over all tasks is a factor of about 3 $\times$ . In contrast, we have discovered that RA works disproportionately well in the offline setting (for which it was not designed), giving a median 54 $\times$ improvement in score over all tasks. Moreover, we find Raisin to be the best of multiple possible ways to combine RA and SAC- $N$ , and that this combination performs better than the sum of its parts. We also find that tuning the residual weight differently for every dataset is critical for performance.

In the baseline comparison, the author should also involve past work that extended RA to DDPG to prove their models' effectiveness and novelty. Since in the introduction, the author talked in detail and claimed RA is in nature beneficial to offline learning reinforcement learning, they should provide more information to prove it, not just compare between SAC methods. [...] The author conducted many experiments to support their claims, but in order to prove that RG could outperform SG, more investigations, such as different agents or tasks, should be involved.

We've added new experiments with a Raisin analog for TD3 (which also show that even an improved version of the prior work Bi-Res-DDPG does not score very well). Please see Part 3 of our shared reply.

This paper aims to prove that RA-SAC is more versatile than the SG counterparts, and compared to traditional SAC, it needs fewer critics, thus significantly reducing the computing cost. However, as introduced in section 1, the RA method is slower to converge. So the author should also provide convergence curve comparison, not just RA-SAC’s.

While slow convergence indeed seems to arise if using only a single critic and initialization that is too high (as shown in Figure 2 of our paper), with our proposed algorithm ( $N = 10$ and a standard near-zero initialization), we do not see this slow convergence empirically. The original SAC-N paper (which uses SG) requires 3M gradient steps to converge, which Raisin matches.

However, some of the propositions are not clearly explained. For example, on page 2, the theoretical concerns are important but not mentioned though it seems this paper is focusing on experiments.

Thank you for this comment. We've now clarified what those theoretical concerns are (and some counterpoints). Please let us know if there are other propositions you'd like us to explain more clearly.

Reproducibility: the technical description is okay and not in detail. The producibility could be increased if the author releases their codebase.

We present the pseudocode in Appendix B. We will also release clean code upon acceptance.

References

Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Deep Residual Reinforcement Learning. arXiv, May 2019. doi: 10.48550/arXiv.1905.01072.