1. Introduction

Large language models have been a central topic in AI ever since the introduction of the transformer architecture in 2017. Model sizes have grown exponentially in recent years, with models exceeding 100 billion parameters, making training extremely expensive. These pre-trained models have emerged as powerful tools capable of generating human-like text, understanding context and performing a wide range of other natural language processing tasks. However, there is often a need for more personalized models or for correcting undesirable behavior such as the generation of false statements and toxic or biased text. To address these needs, researchers have developed various techniques for fine-tuning large language models.

Even though traditional supervised fine-tuning is a powerful tool, it can become expensive because it requires a large amount of annotated data, which is why reinforcement learning has recently been utilized to align models with human preferences. Reinforcement learning aims to achieve optimal behavior in an environment by maximizing the output of a reward model; instead of relying on annotated training data, it learns through trial and error.

Some of the promising reinforcement learning methods for fine-tuning large language models are reinforcement learning from human feedback (RLHF) [1], Quantized Reward Konditioning (Quark) [2], the Reinforced Knowledge Introspector (Rainier) [3] and Reward Learning on Policy (RLP) [4].

2. InstructGPT - Reinforcement Learning from Human Feedback

2.1. Method

Reinforcement learning from human feedback improves the alignment of language models by training them to follow the user’s intentions, including explicit ones like following instructions and implicit ones like being truthful and avoiding bias, toxicity or any harm.

InstructGPT [1] was developed by fine-tuning the GPT-3 model with reinforcement learning from human feedback, using human preferences as a reward signal. This method can best be explained in three steps (illustrated in Figure 1.):

Collect demonstration data and train a supervised policy:
The initial step involves creating a dataset from labeler-written prompts and prompts submitted through the OpenAI API along with corresponding demonstrations of the desired model behavior provided by human labelers. This dataset is used to fine-tune the language model using supervised learning, forming the baseline model.
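
As a rough illustration of this step, supervised fine-tuning reduces to standard next-token cross-entropy on the labeler demonstrations, with the loss computed only over the demonstration tokens. The following is a minimal sketch in PyTorch, assuming a causal language model whose forward pass returns an object with a .logits field (a Hugging Face-style interface); all names are placeholders rather than the actual InstructGPT code.

    import torch
    import torch.nn.functional as F

    def sft_step(model, optimizer, prompt_ids, demo_ids):
        """One supervised fine-tuning step on a (prompt, demonstration) pair.

        prompt_ids and demo_ids are 1-D tensors of token ids; the loss is the
        next-token cross-entropy computed only over the demonstration part.
        """
        input_ids = torch.cat([prompt_ids, demo_ids]).unsqueeze(0)   # (1, T)
        logits = model(input_ids).logits                             # (1, T, vocab)

        # Position t predicts token t+1; mask out predictions of prompt tokens.
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:].clone()
        shift_labels[:, : prompt_ids.numel() - 1] = -100             # ignored by the loss

        loss = F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()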

Collect comparison data and train a reward model:
In the second step, labelers rank multiple outputs of the baseline model for the same prompt on a larger set of API prompts, generating \binom{K}{2} comparisons, where K is the number of ranked outputs for a single prompt. This dataset is then used to train a reward model whose goal is to predict which outputs the labelers prefer. The reward model is trained on comparisons between two model outputs for the same input using a cross-entropy loss, with the comparisons serving as labels. All \binom{K}{2} comparisons for a single prompt form one batch element, which avoids overfitting and improves computational efficiency.
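
Concretely, each comparison contributes a logistic (cross-entropy) loss on the difference between the reward model's scores for the preferred and the rejected output, and all \binom{K}{2} pairs of one prompt are averaged together as one batch element. The sketch below illustrates this with a hypothetical reward_model(prompt, output) interface returning a scalar score; it is not the original implementation.

    import itertools
    import torch
    import torch.nn.functional as F

    def reward_model_loss(reward_model, prompt, ranked_outputs):
        """Pairwise loss over all C(K, 2) comparisons for a single prompt.

        ranked_outputs is a list of K outputs ordered from most to least
        preferred by the labeler; reward_model returns a scalar tensor.
        """
        scores = [reward_model(prompt, y) for y in ranked_outputs]
        losses = []
        for i, j in itertools.combinations(range(len(scores)), 2):
            # Output i is ranked above output j, so its score should be higher.
            losses.append(-F.logsigmoid(scores[i] - scores[j]))
        # All pairs of one prompt are averaged as a single batch element,
        # as described above.
        return torch.stack(losses).mean()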

Optimize a policy against the reward model using reinforcement learning:
In the final step, the previously trained reward model is used as a reward function in a reinforcement learning setup, where a supervised fine-tuned baseline is further fine-tuned using Proximal Policy Optimization (PPO) [5] to maximize the reward given by the reward model. This way the behavior of the GPT-3 baseline is aligned with the preferences of the labelers.
Proximal Policy Optimization is a policy gradient algorithm used for reinforcement learning that iteratively samples data from the environment and performs several training epochs over sampled data. This method improves policy stability and efficiency while being simpler and easier to implement compared to previously used methods like Trust Region Policy Optimization (TRPO).
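
For reference, the clipped surrogate objective at the core of PPO [5] can be written as the short sketch below. This is the generic PPO loss rather than the exact InstructGPT training code; the per-token log-probabilities and advantage estimates are assumed to be computed elsewhere.

    import torch

    def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
        """Generic PPO clipped surrogate loss (returned as a value to minimize).

        new_logprobs / old_logprobs: log-probabilities of the sampled tokens
        under the current policy and the policy that generated the samples.
        advantages: advantage estimates for those tokens.
        """
        ratio = torch.exp(new_logprobs - old_logprobs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # PPO maximizes the minimum of the two terms; negate it to get a loss.
        return -torch.min(unclipped, clipped).mean()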


Figure 1. Illustration of the three steps of RLHF [1]

2.2. Experiments

The training tasks for InstructGPT models come from two main sources: prompts written by labelers and prompts submitted to the InstructGPT API. These prompts cover a wide range of natural language tasks, ensuring robust training for various applications. Prompts typically specify tasks directly through natural language instructions, few-shot examples, or implicit continuations.

2.2.1. Results on the API distribution

Labelers significantly prefer InstructGPT outputs over those from GPT-3 across all model sizes. InstructGPT shows considerable performance improvements, particularly when compared to GPT-3 outputs, which perform the worst, as can be seen in Figure 2.

Figure 2. Human evaluation on API prompt distribution [1]

The graphs in Figure 3. show that, compared to GPT-3, InstructGPT outputs are more appropriate for customer-assistant tasks, follow explicit constraints and instructions better, and hallucinate less often in closed-domain tasks. These findings indicate that InstructGPT models are more reliable and easier to control.

Figure 3. Metadata results on the API distribution [1]

Held-out labelers, who did not provide training data, show similar preferences, indicating that InstructGPT models are not overfitting to the training labelers’ preferences. Reward models also generalize well, with a slight decrease in accuracy when predicting preferences of held-out labelers compared to those in the training set.

A GPT-3 baseline fine-tuned on public NLP datasets, such as FLAN [10] and T0 [11], performs better than GPT-3, similarly to GPT-3 with well-crafted prompts, and worse than the InstructGPT supervised fine-tuned baseline (Likert scores are shown in Figure 4.). This suggests that these datasets do not reflect the diverse range of tasks required by real-world users. InstructGPT models outperform models fine-tuned on these datasets, as they capture a broader range of user needs, particularly in open-ended generation and brainstorming tasks, which constitute a significant portion of API usage.

Figure 4. Comparison of InstructGPT to FLAN and T0 [1]

 

2.2.2. Results on public NLP datasets

When measured with human evaluations on the TruthfulQA dataset [12], InstructGPT generates more truthful and informative outputs than GPT-3, without needing explicit instructions to be truthful. InstructGPT also performs better when evaluated with the "Instruction+QA" prompt, which tells the model to answer with "I have no comment" when it is unsure, being less overconfident in its answers (shown in Figure 5.).


Figure 5. Results on the TruthfulQA dataset [1]

Toxicity of the models was evaluated on the RealToxicityPrompts dataset [13]. InstructGPT models generate less toxic outputs when scored automatically with the Perspective API. InstructGPT also outperforms GPT-3 in human evaluation when instructed to be respectful, but performs similarly without such an instruction. These results are visible in Figure 6. It is worth mentioning that the InstructGPT model produces much more toxic outputs when instructed to do so, which indicates that it is better at following instructions.


Figure 6. Comparison of human and automatic evaluations on RealToxicityPrompts [1]

The models’ tendency to generate biased outputs was evaluated on Winogender [14] and CrowS-Pairs [15] datasets and showed no improvement over the GPT-3 baseline.

The PPO model shows a performance decrease on public NLP datasets because it was trained on the API distribution (an "alignment tax"). This effect can be reversed by mixing pretraining updates into the PPO fine-tuning (PPO-ptx).

2.2.3. Qualitative results

InstructGPT generalizes well even on prompts outside the training data distribution. It shows promising results on non-English prompts and code-related questions, whereas GPT-3 requires more careful prompting and performs worse. This is surprising considering that those kinds of prompts were not well represented in the training data.

However, InstructGPT is not perfect. For example, it sometimes overcomplicates the output for a very simple question, giving multiple answers when only one is correct. It can also struggle when multiple constraints are given simultaneously, or it can sometimes accept a false premise from the prompt as true.

3. RLP - Reward Learning on Policy

3.1. Method

Reinforcement learning from human feedback consists of collecting human preferences, reward learning and policy optimization. Policy optimization shifts the language model’s data distribution, causing the fixed reward model, which was trained on offline data, to become inaccurate when applied to new, off-distribution data. This inaccuracy in the reward model can degrade policy performance during the policy optimization phase. One way to address this issue is to gather new human preference data from the current version of the policy [7], but this makes the system more complex and harder to optimize, and maintaining high data quality over an extended period demands significant effort.

The paper Fine-Tuning Language Models with Reward Learning on Policy [4] proposes a solution that keeps the reward model on-distribution by optimizing it against the policy without collecting new human preference data. Reward Learning on Policy (RLP) uses unsupervised learning to retrain the reward model on samples from the trained policy.

Reward Learning on Policy starts just like RLHF, by fine-tuning a pre-trained language model with supervised learning to create a policy. Next, pairs of responses are generated by the policy and rated by human labelers to create a dataset of preferred and non-preferred responses. The preference data is used to fit a reward model that predicts which outputs the labelers prefer. The policy is then optimized against the reward model using proximal policy optimization to maximize the reward given by the reward model. Reward model retraining is where RLP differs from RLHF (a comparison of the two methods is shown in Figure 7.). The reward model is retrained on a dataset composed of samples gathered from the trained policy and the original human preference data. For this purpose, the paper proposes two methods: Unsupervised Multi-View Learning (UML) and Synthetic Preference Generation (SPG).

Unsupervised Multi-View Learning enhances policy sample representations by creating two different but semantically similar views for each pair of input and output (multiple outputs are generated for a single input). Each view's representation is modeled as a Gaussian distribution, characterized by a mean and a deviation computed by neural networks. UML uses a specialized loss function, the Multi-view Information Bottleneck (MIB) loss [8], to ensure the representations keep essential information while removing irrelevant details. The loss is optimized to preserve the mutual information between the views and to minimize the divergence between their distributions.
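
A rough sketch of this idea is given below, under several simplifying assumptions: the two views are encoded into diagonal Gaussians, a symmetrized KL divergence pulls the two posteriors together, and an InfoNCE-style term stands in for the mutual-information part of the MIB loss [8]. The encoders and all names are hypothetical, and the snippet is only meant to make the structure of the objective concrete, not to reproduce the paper's implementation.

    import torch
    import torch.nn.functional as F

    def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
        """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
        var_p, var_q = logvar_p.exp(), logvar_q.exp()
        return 0.5 * (
            logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
        ).sum(dim=-1)

    def mib_style_loss(encoder1, encoder2, view1, view2, beta=1.0):
        """Sketch of an MIB-style objective over two views of (input, output) pairs.

        encoder1 / encoder2 map a batch of views to the (mean, log-variance)
        of a Gaussian latent. The loss keeps the two posteriors close while
        keeping their samples mutually informative.
        """
        mu1, logvar1 = encoder1(view1)
        mu2, logvar2 = encoder2(view2)

        # Symmetrized KL divergence between the two view posteriors.
        skl = 0.5 * (gaussian_kl(mu1, logvar1, mu2, logvar2)
                     + gaussian_kl(mu2, logvar2, mu1, logvar1)).mean()

        # Reparameterized samples from each posterior.
        z1 = mu1 + torch.randn_like(mu1) * (0.5 * logvar1).exp()
        z2 = mu2 + torch.randn_like(mu2) * (0.5 * logvar2).exp()

        # InfoNCE-style stand-in for the mutual-information term: matching
        # (z1, z2) pairs from the same example should score highest.
        logits = z1 @ z2.t()
        targets = torch.arange(z1.size(0), device=z1.device)
        info_nce = F.cross_entropy(logits, targets)

        return info_nce + beta * skl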

In Synthetic Preference Generation, the most frequent output for a single input is taken as the correct prediction and its frequency is used as a confidence score [9]. Semantically equivalent samples are grouped, and the confidence score of the set of outputs is calculated as the ratio of the largest cluster size to the total size of the set. Synthetic preference data is generated by selecting pairs consisting of an output from the group with the highest reward and a randomly chosen output from the other groups. These pairs are then used to optimize the reward model with a pairwise ranking loss.
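
The grouping and pairing logic described above can be sketched roughly as follows. The semantic-equivalence check, the choice of the "preferred" group, and all names are placeholders; the snippet only illustrates the clustering, the confidence score, and the construction of preferred/rejected pairs, not the paper's exact procedure.

    import random

    def synthetic_preferences(outputs, rewards, same_meaning):
        """Build synthetic preference pairs from policy samples for one input.

        outputs:      sampled outputs for a single prompt
        rewards:      reward-model scores aligned with outputs
        same_meaning: callable(a, b) -> bool deciding semantic equivalence
        """
        # Group semantically equivalent outputs into clusters.
        clusters = []
        for out, rew in zip(outputs, rewards):
            for cluster in clusters:
                if same_meaning(out, cluster[0][0]):
                    cluster.append((out, rew))
                    break
            else:
                clusters.append([(out, rew)])

        # Confidence score: size of the largest cluster over the sample count.
        largest = max(clusters, key=len)
        confidence = len(largest) / len(outputs)

        # One plausible reading: the preferred output comes from the cluster
        # with the highest reward, and each rejected output is drawn at random
        # from one of the remaining clusters.
        best_cluster = max(clusters, key=lambda c: max(r for _, r in c))
        preferred = max(best_cluster, key=lambda item: item[1])[0]
        pairs = []
        for cluster in clusters:
            if cluster is best_cluster:
                continue
            rejected = random.choice(cluster)[0]
            pairs.append((preferred, rejected))
        return confidence, pairs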


Figure 7. Comparison of RLHF and RLP methods [4]

3.2. Experiments

The experiments testing Reward Learning on Policy were conducted on the instruction-following task using the AlpacaFarm [29], LLMBar [30], and Vicuna [31] datasets.

To evaluate the performance of the proposed Reward Learning on Policy (RLP) framework, the experiments used simulated win-rate and human win-rate metrics. The win-rates were computed against a reference model, text-davinci-003.

RLP was compared to LLaMA-7B [32], supervised fine-tuned LLaMA-7B, Best-of-n [33] (which samples n independent responses from the SFT model and selects the one with the highest inferred reward), PPO, ChatGPT and GPT-4 baselines.

Results show that GPT-4 and ChatGPT outperform the other models, which was expected, but both RLP-SPG and RLP-UML achieve higher simulated and human win-rates than all other LLaMA-based models on all three datasets. Comparing the best-performing RLP method (RLP-SPG) with the best-performing baseline (PPO) shows that RLP outperforms PPO in every evaluation (results shown in Table 1.).


Table 1. The win-rate (%) performance of RLP and baselines [4]

4. Quark - Quantized Reward Konditioning

4.1. Method

To unlearn undesirable behavior such as toxicity, negative sentiment and repetition, the authors of the paper [2] propose Quantized Reward Konditioning (Quark), a method for reward-based (un)learning with language models. Quark is initialized with a pre-trained language model and a data pool of (input, output, reward) examples sampled from the model and scored by the reward function. During training, the model alternates between exploration, quantization and learning steps (illustrated in Figure 8.), while a KL-divergence penalty keeps it from deviating too far from the original model.

During the exploration step, the model generates new samples conditioned on the highest-reward token, evaluates their rewards, and stores the generated results, expanding the data pool.

The quantization step includes sorting the data pool by reward and splitting it into quantiles, assigning a reward token to each quantile. By partitioning data into quantiles and focusing on high-reward samples, the model can prioritize learning from examples that are most aligned with desired behaviors.
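
The quantization step amounts to sorting the pool by reward and tagging each example with the special token of the quantile it falls into, as in the minimal sketch below (placeholder data structures, not the authors' code):

    def quantize_pool(pool, reward_tokens):
        """Sort (input, output, reward) triples by reward and attach quantile tokens.

        pool:          list of (input_text, output_text, reward) triples
        reward_tokens: list of special tokens, ordered from the lowest-reward
                       quantile to the highest-reward one
        """
        num_quantiles = len(reward_tokens)
        ordered = sorted(pool, key=lambda item: item[2])          # ascending reward
        per_quantile = max(1, len(ordered) // num_quantiles)
        quantized = []
        for idx, (inp, out, rew) in enumerate(ordered):
            q = min(idx // per_quantile, num_quantiles - 1)       # quantile index
            quantized.append((inp, out, rew, reward_tokens[q]))
        return quantized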

Learning entails updating the language model using samples from each quantile, conditioned on its reward token. The goal of the training is to maximize the likelihood of the samples given their reward tokens, while a KL-divergence term keeps the model close to the baseline, maintaining stability and preventing overfitting to specific reward samples.
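
Under some simplifying assumptions, this step can be summarized as minimizing the negative log-likelihood of each stored sample conditioned on its reward token, plus a token-level KL penalty towards the frozen reference model. The sketch below uses a hypothetical causal-LM interface (forward pass returning .logits); the exact KL direction and weighting used in the paper are simplified here.

    import torch
    import torch.nn.functional as F

    def token_logprobs(model, context_ids, output_ids):
        """Next-token log-probabilities for output_ids given context_ids."""
        ids = torch.cat([context_ids, output_ids]).unsqueeze(0)
        logits = model(ids).logits[0, :-1, :]            # position t predicts t+1
        logprobs = F.log_softmax(logits, dim=-1)
        start = context_ids.numel() - 1                  # first output prediction
        idx = torch.arange(start, start + output_ids.numel(), device=logprobs.device)
        return logprobs[idx, output_ids], logprobs[idx]  # (T,), (T, vocab)

    def quark_learning_loss(policy, reference, reward_token_id, input_ids,
                            output_ids, beta=0.05):
        """Conditional NLL plus a KL penalty to the frozen reference model.

        The policy conditions on [reward_token, input]; the reference model
        conditions on the input alone, since it has no reward tokens.
        """
        cond_ids = torch.cat([reward_token_id.view(1), input_ids])
        pol_lp, pol_dist = token_logprobs(policy, cond_ids, output_ids)
        with torch.no_grad():
            _, ref_dist = token_logprobs(reference, input_ids, output_ids)

        nll = -pol_lp.sum()
        # KL between the policy's and the reference's next-token distributions,
        # summed over the output positions.
        kl = (pol_dist.exp() * (pol_dist - ref_dist)).sum(dim=-1).sum()
        return nll + beta * kl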

During inference, the model is conditioned on the highest reward token to produce high-quality results. 

Figure 8. Quark method illustration [2]

4.2. Experiments

4.2.1. Unlearning Toxicity

In the following experiments, the Perspective API was used as a reward function, providing a score in the range from 0 to 1, where 0 corresponds to toxic and 1 to non-toxic output.

The Quark model was compared to the GPT-2 baseline, PPLM [20], GEDI [21], DAPT [22], DExperts [23] and the PPO model. Apart from toxicity, which is scored with the Perspective API, the fluency and diversity of the models are evaluated automatically, while toxicity, fluency and topicality are rated by human evaluators.

Quark shows a significantly lower rate of toxic outputs compared to all of the baselines in both in-domain and out-of-domain tests, without a decrease in fluency or diversity, unlike other detoxification methods, as shown in Table 2. Compared to the PPO model (an RLHF method), Quark achieves better results with fewer parameters and shorter training. Results in Table 3. show Quark's improvements in toxicity, topicality and fluency under human evaluation in both in-domain and out-of-domain tests.

Table 2. Automatic evaluation results of unlearning toxicity experiments [2]

Table 3. Human evaluation results of unlearning toxicity experiments [2]

4.2.2. Unlearning Unwanted Sentiment

One of the goals of the Quark method was to steer the model away from generating text with unwanted sentiment, which was evaluated by testing the model’s ability to produce output with either positive or negative sentiment when prompted with an opposite one. The OpenWebText Corpus dataset [24] was used for training and evaluation.

Results in Table 4. show that Quark reduces the generation of unwanted sentiment more effectively than the other baselines while maintaining the fluency and diversity of the text in automatic evaluation. Human evaluation in Table 5. shows Quark to be not only more effective in steering the model away from unwanted sentiment but also more fluent and more topical than the baselines.

Table 4. Automatic evaluation results of unlearning sentiment experiments [2]



Table 5. Human evaluation results of unlearning sentiment experiments [2]

4.2.3. Unlearning Degenerate Repetition

Language models tend to generate repetitive text. This experiment evaluates Quark’s ability to reduce such repetitions. The results were compared with maximum likelihood estimation (MLE), unlikelihood training (unlikelihood) [25], and contrastive training (SimCTG) [26]. Tests were done using the Wikitext-103 [27] test set, reporting perplexity, token prediction accuracy and prediction repetition for language modeling quality, and sequence-level repetition, diversity and MAUVE [28] for generation quality. Additionally, coherency, fluency and overall quality are tested by human evaluation.

As presented in Table 6., Quark shows better results than MLE and SimCTG, but worse than unlikelihood training. Since the Quark and unlikelihood methods can be combined, the combined approach was evaluated against all baselines and showed significant improvements on almost all metrics.


Table 6. Results of unlearning degenerate repetition experiments [2]

5. RAINIER - Reinforced Knowledge Introspector

5.1. Method

Commonsense reasoning poses a major challenge for modern NLP models because the underlying knowledge required for the reasoning process remains hidden. Unlike humans, who can introspect and articulate the reasons behind their conclusions [6], neural models struggle to express the reasoning behind their predictions. This limitation affects the models’ performance and robustness on commonsense tasks and makes it difficult to identify where they fail.

The authors of the paper RAINIER: Reinforced Knowledge Introspector for Commonsense Question Answering [3] propose RAINIER, a generative neural model trained to introspect and generate useful knowledge in response to given questions. The generated knowledge serves as additional context for a question-answering model, improving its performance. The paper proposes a two-stage training process, illustrated in Figure 9.:

In Stage I (Imitation Learning), RAINIER is trained on knowledge generated by GPT-3, termed "silver knowledge." This stage equips RAINIER with the basic ability to generate relevant knowledge by imitating GPT-3.

During Stage II (Reinforcement Learning), RAINIER is further trained using reinforcement learning, specifically Proximal Policy Optimization (PPO). The reward function in this stage is defined based on the performance improvement of a fixed QA model when prompted with the generated knowledge. This stage ensures the generated knowledge is both fluent and beneficial for answering questions.
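
One simplified reading of this reward is sketched below: the generated knowledge is prepended to the question, and the reward is the gain in the fixed QA model's score for the gold answer compared to answering without the knowledge. The qa_model interface is hypothetical, and this is a stand-in for the paper's exact reward definition rather than a reproduction of it.

    def knowledge_reward(qa_model, question, choices, gold_index, knowledge):
        """Reward for a generated knowledge statement w.r.t. a fixed QA model.

        qa_model(question, choices) is assumed to return a list of scores
        (e.g., probabilities) over the answer choices.
        """
        base_scores = qa_model(question, choices)
        augmented_scores = qa_model(knowledge + " " + question, choices)

        # The reward is the improvement in the score of the gold answer when
        # the knowledge is provided; it is positive only if the knowledge helps.
        return augmented_scores[gold_index] - base_scores[gold_index]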


Figure 9. Illustration of RAINIER method training [3]

5.2. Experiments

RAINIER was tested on 8 seen multiple-choice datasets that were used for training UnifiedQA [34], which serves as the QA model for this approach.

RAINIER uses T5-large [35] as the knowledge introspector, the GPT-3-Curie model [36] for "silver knowledge" generation, and UnifiedQA-large as a fixed QA model. During stage II, the language modeling head is replaced with a regression head.

Results are compared to the vanilla UnifiedQA-large model and to UnifiedQA-large using knowledge from few-shot GPT-3 [37], self-talk [38] (10 knowledge statements per question generated with GPT-3-Curie) and DREAM [39] (10 scene elaborations per question generated with the DREAM model).

Results show that the QA model using knowledge generated by RAINIER outperforms the vanilla QA model by more than 5% on average. It improves the scores on 5 out of 8 seen datasets, as visible in Table 7. Compared to other knowledge generation models it shows smaller improvements, but it is worth mentioning that RAINIER is significantly smaller, with 0.77B parameters compared to the 13B-parameter GPT-3 model. On unseen datasets, the RAINIER-enhanced model outperforms the vanilla QA model on all 4 datasets (results shown in Table 8.).

The RAINIER knowledge introspector was also tested with other QA models, including UnifiedQA models of different sizes and the Unicorn model, and it improves performance with every QA model used. The most significant gains appear with smaller QA models.


Table 7. Results on seen datasets [3]


Table 8. Results on unseen datasets [3]

6. Review

6.1. Methods comparison

The presented methods all use reinforcement learning to fine-tune language models but utilize it in different ways to solve different problems.

InstructGPT (RLHF), the oldest of the presented methods, laid the foundation of language model alignment using reinforcement learning from human feedback with proximal policy optimization. It showed significant improvements over the baseline and supervised fine-tuning without the need for additional annotated data.

Later the same year, a new paper presented Quark, a reinforcement learning method that uses output quantization for more effective learning. Quark outperforms the baselines, including InstructGPT, in unlearning undesirable behavior without sacrificing the fluency and diversity of the generated text.

A recent paper proposes an improvement to the RLHF approach by retraining the reward model on policy samples to keep it on-distribution. The resulting RLP method successfully improves on RLHF, showing better instruction-following abilities than the baselines. Unfortunately, the baselines did not include the Quark model, so the two cannot be compared directly.

RAINIER tackles a different problem than the other presented methods, making it difficult to compare. Nevertheless, a method that uses a generative neural model to generate useful knowledge and make the reasoning of LLMs more transparent is definitely interesting and worth mentioning.

6.2. Strengths

All reinforcement learning methods share some common advantages. They usually do not require annotated data, which can be hard and expensive to acquire. SFT models can only get as good as their training data, whereas reinforcement learning methods do not suffer from that limitation because they learn from experience. The cost of data collection and training for all four models is modest compared to the cost of pretraining the baselines, indicating that the investment is worthwhile given the improvements.

InstructGPT manages to follow instructions even on tasks it was not trained for, such as non-English or code-related tasks, which is a great advantage because it cannot be trained for every single task. Both Quark and InstructGPT improve performance on targeted tasks without sacrificing performance on other tasks, making them low-tax alignment techniques. RLP inherits the advantages of RLHF (InstructGPT) with one additional advantage: it keeps the reward model on-distribution. The Quark and RAINIER methods do not require any labeled data, which makes training easier thanks to the availability of unlabeled public datasets. RAINIER provides meaningful knowledge that improves the performance of any QA model; an additional advantage is that it outperforms the few-shot GPT-3 model, which is about 16 times larger in parameter count.

6.3. Weaknesses

A common disadvantage of reinforcement learning approaches is the need for large amounts of unlabeled data and computation. Reinforcement learning setups are usually more complicated than supervised fine-tuning setups, making bugs more likely and hyper-parameter tuning more difficult.

One weakness of InstructGPT is that the model is aligned to the preferences of the labelers who annotated and rated the data, which might not match the preferences of all users. Another disadvantage is that the reward model can become off-distribution due to the data distribution shift during reinforcement learning. The biggest weakness of this approach is that it follows any instruction, making the outputs of the model even more toxic or biased than the baseline when asked to. Like InstructGPT, Quark can also be misused to increase the model's toxicity and bias by conditioning on the lowest-quantile reward tokens. Another disadvantage is the inheritance of social biases from the reward function. RLP inherits all the weaknesses of InstructGPT except the off-distribution reward model. RAINIER's weakness is the gap between model performance and human reasoning, so it might not be sufficient for real-world applications.

7. References

[1]  L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. 2022. arXiv: 2203.02155 [cs.CL].

[2]  X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West, P. Ammanabrolu, and Y. Choi. Quark: Controllable Text Generation with Reinforced Unlearning. 2022. arXiv: 2205.13636 [cs.CL].

[3]  J. Liu, S. Hallinan, X. Lu, P. He, S. Welleck, H. Hajishirzi, and Y. Choi. Rainier: Reinforced Knowledge Introspector for Commonsense Question Answering. 2022. arXiv: 2210.03078 [cs.CL].

[4]  H. Lang, F. Huang, and Y. Li. Fine-Tuning Language Models with Reward Learning on Policy. 2024. arXiv: 2403.19279 [cs.CL].

[5]  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).

[6]  H. Mercier and D. Sperber. The enigma of reason. Harvard University Press, 2017.

[7]  D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. “Fine-tuning language models from human preferences”. In: arXiv preprint arXiv:1909.08593 (2019).

[8]  M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata. “Learning robust representations via multi-view information bottleneck”. In: arXiv preprint arXiv:2002.07017 (2020).

[9]  C. Si, Z. Gan, Z. Yang, S. Wang, J. Wang, J. Boyd-Graber, and L. Wang. “Prompting gpt-3 to be reliable”. In: arXiv preprint arXiv:2210.09150 (2022).

[10]  J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. “Finetuned language models are zero-shot learners”. In: arXiv preprint arXiv:2109.01652 (2021).

[11]  V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al. “Multitask prompted training enables zero-shot task generalization”. In: arXiv preprint arXiv:2110.08207 (2021).

[12]  S. Lin, J. Hilton, and O. Evans. “Truthfulqa: Measuring how models mimic human falsehoods”. In: arXiv preprint arXiv:2109.07958 (2021).

[13]  S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith. “Realtoxicityprompts: Evaluating neural toxic degeneration in language models”. In: arXiv preprint arXiv:2009.11462 (2020).

[14]  R. Rudinger, J. Naradowsky, B. Leonard, and B. Van Durme. “Gender bias in coreference resolution”. In: arXiv preprint arXiv:1804.09301 (2018).

[15]  N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman. “CrowS-pairs: A challenge dataset for measuring social biases in masked language models”. In: arXiv preprint arXiv:2010.00133 (2020).

[16]  R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. “Hellaswag: Can a machine really finish your sentence?” In: arXiv preprint arXiv:1905.07830 (2019).

[17]  D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. “DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs”. In: arXiv preprint arXiv:1903.00161 (2019).

[18]  P. Rajpurkar, R. Jia, and P. Liang. “Know what you don’t know: Unanswerable questions for SQuAD”. In: arXiv preprint arXiv:1806.03822 (2018).

[19]  A. Fan, M. Lewis, and Y. Dauphin. “Hierarchical neural story generation”. In: arXiv preprint arXiv:1805.04833 (2018).

[20]  S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu. “Plug and play language models: A simple approach to controlled text generation”. In: arXiv preprint arXiv:1912.02164 (2019).

[21]  B. Krause, A. D. Gotmare, B. McCann, N. S. Keskar, S. Joty, R. Socher, and N. F. Rajani. “Gedi: Generative discriminator guided sequence generation”. In: arXiv preprint arXiv:2009.06367 (2020).

[22]  S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. “Don’t stop pretraining: Adapt language models to domains and tasks”. In: arXiv preprint arXiv:2004.10964 (2020).

[23]  A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi. “DExperts: Decoding-time controlled text generation with experts and anti-experts”. In: arXiv preprint arXiv:2105.03023 (2021).

[24]  A. Gokaslan, V. Cohen, E. Pavlick, and S. Tellex. “OpenWebText Corpus”. In: Skylion007.github.io/OpenWebTextCorpus (2019).

[25]  S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. “Neural text generation with unlikelihood training”. In: arXiv preprint arXiv:1908.04319 (2019).

[26]  Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and N. Collier. “A contrastive framework for neural text generation”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 21548–21561.

[27]  S. Merity, C. Xiong, J. Bradbury, and R. Socher. “Pointer sentinel mixture models”. In: arXiv preprint arXiv:1609.07843 (2016).

[28]  K. Pillutla, S. Swayamdipta, R. Zellers, J. Thickstun, S. Welleck, Y. Choi, and Z. Harchaoui. “Mauve: Measuring the gap between neural text and human text using divergence frontiers”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 4816–4828.

[29]  Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. S. Liang, and T. B. Hashimoto. “Alpacafarm: A simulation framework for methods that learn from human feedback”. In: Advances in Neural Information Processing Systems 36 (2024).

[30]  Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal, and D. Chen. “Evaluating large language models at evaluating instruction following”. In: arXiv preprint arXiv:2310.07641 (2023).

[31]  C.-H. Chiang and H.-y. Lee. “Can large language models be an alternative to human evaluations?” In: arXiv preprint arXiv:2305.01937 (2023).

[32]  H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. “Llama: Open and efficient foundation language models”. In: arXiv preprint arXiv:2302.13971 (2023).

[33]  N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. “Learning to summarize with human feedback”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 3008–3021.

[34]  D. Khashabi, Y. Kordi, and H. Hajishirzi. UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training. 2022. arXiv: 2202.12359 [cs.CL].

[35]  C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2023. arXiv: 1910.10683 [cs.LG].

[36]  T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL].

[37]  J. Liu, A. Liu, X. Lu, S. Welleck, P. West, R. L. Bras, Y. Choi, and H. Hajishirzi. Generated Knowledge Prompting for Commonsense Reasoning. 2022. arXiv: 2110.08387 [cs.CL].

[38]  V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi. Unsupervised Commonsense Question Answering with Self-Talk. 2020. arXiv: 2004.05483 [cs.CL].

[39]  Y. Gu, B. D. Mishra, and P. Clark. DREAM: Improving Situational QA by First Elaborating the Situation. 2022. arXiv: 2112.08656 [cs.CL].
