Self-Alignment Instruction Backtranslation

Comments

1. *Prior work:*
- *Raw Language Model:* Raw language models are trained to predict text, such as continuing a Wikipedia article.
- *Use of Reinforcement Learning:* OpenAI applied reinforcement learning with a small dataset of back-and-forth communication to enable the model to have a conversation with users.

2. *Creation of "Humpback Llama2 70B":*
- *Web Crawl:* Started from a web crawl to gather initial data.
- *Instruction Generation:* Utilized text from the crawl to generate instructions, such as synthesized questions for information in Wikipedia.
- *Iterative Process:* Used the language model itself, over several iterations, to score the candidate pairs and keep only the good instruction-answer pairs.
- *Data Augmentation:* Augmented the data with synthetic examples that still have a base in originally human-generated information.
- *Fine-Tuning:* Fine-tuning led to "Humpback Llama2 70B" outperforming "Llama2 70B" on the Alpaca leaderboard.

3. *Alpaca Leaderboard:*
- *Evaluation Criteria:* Evaluates generation quality on 805 prompts.
- *Comparison Method:* Compares the pairwise win rate against a reference model (text-davinci-003).
- *Judgment Basis:* Uses GPT-4 judgments for evaluation.
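
A minimal sketch of how a pairwise win rate like the one in item 3 could be computed, assuming each prompt yields one judge verdict for the candidate model against the reference. The labels `"win"`, `"loss"`, and `"tie"`, the function name `win_rate`, and the choice to count a tie as half a win are assumptions for illustration, not the leaderboard's actual implementation.

```python
from typing import Sequence


def win_rate(judgments: Sequence[str]) -> float:
    """Return the candidate's win rate against the reference, in percent.

    Each entry is the judge's verdict for one prompt: "win", "loss" or "tie".
    Counting a tie as half a win is an assumption made for this sketch.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    score = 0.0
    for verdict in judgments:
        if verdict == "win":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
        elif verdict != "loss":
            raise ValueError(f"unknown verdict: {verdict!r}")
    return 100.0 * score / len(judgments)


# Toy verdicts for four prompts (made up, not real leaderboard data).
print(win_rate(["win", "loss", "tie", "win"]))
```

On the real benchmark the list would hold one verdict per prompt, so 805 entries.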

4. *Comparison with Distilled Models:*
- *Performance:* "Humpback Llama2 70B" did not perform as well as a distilled Llama2 model (Vicuna).
- *Advantages of Non-Distilled Models:*
  - Often larger and more complex, capturing nuanced patterns.
  - Trained on original data, avoiding biases or errors from relying on another model's predictions.

5. *Methodology Insights:*
- *Scalable Approach:* The paper proposes a scalable method to fine-tune large language models to follow instructions.
- *Self-Training Algorithm:* Utilizes an iterative self-training algorithm called instruction backtranslation, allowing models to improve their own performance.
- *Impact on Leaderboard:* The fine-tuned models outperform other non-distilled instruction-following models on the Alpaca leaderboard.
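
The pipeline summarized in items 2 and 5 can be sketched roughly as follows: treat each crawled passage as an answer, have a model predict an instruction for it, then have a model score the pair and keep only the high-scoring ones. This is one round of a simplified illustration, not the paper's code. The callables `generate_instruction` and `score_pair`, the threshold `min_score`, and the toy stand-ins below are all hypothetical; in a real setup both callables would be calls to the language model, and the loop would repeat after fine-tuning on the curated pairs.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def backtranslate_and_curate(
    documents: Iterable[str],
    generate_instruction: Callable[[str], str],
    score_pair: Callable[[str, str], int],
    min_score: int = 5,
) -> List[Pair]:
    """One round of instruction generation followed by self-curation."""
    # Step 1: the crawled text is the response; predict an instruction for it.
    candidates = [(generate_instruction(doc), doc) for doc in documents]
    # Step 2: keep only the pairs that the scorer rates highly enough.
    return [(ins, out) for ins, out in candidates if score_pair(ins, out) >= min_score]


# Toy stand-ins (hypothetical) so the sketch runs without a model.
def toy_generate_instruction(doc: str) -> str:
    return "Write a passage about: " + doc.split()[0]


def toy_score_pair(instruction: str, response: str) -> int:
    # Pretend that longer passages make better answers.
    return 5 if len(response.split()) >= 5 else 2


docs = ["Humpback whales migrate long distances every year.", "Click here"]
curated = backtranslate_and_curate(docs, toy_generate_instruction, toy_score_pair)
print(curated)
```

Passing the model calls in as plain callables keeps the curation logic testable on its own, independent of any particular model API.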

wolpumba

The didgeridoo makes a very nice soundcheck.

wolpumba

Is Reinforced Self-Training (ReST) by DeepMind in the same spirit?

darshank

should be called "the recipe paper"

khalilsabri