OpenAI has introduced Rule-Based Rewards (RBRs), a method for aligning models to behave safely without extensive human data collection. RBRs use clear, simple rules to evaluate whether a model's outputs meet safety standards, and this rule-based signal is integrated into the standard reinforcement learning from human feedback (RLHF) pipeline. In OpenAI's experiments, RBR-trained models achieved safety performance comparable to models trained with human feedback, reducing the need for extensive human data and making safety training faster and more cost-effective.
Source: OpenAI (July 24, 2024)
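To make the idea concrete, here is a minimal sketch of how a rule-based reward might be computed and combined with the usual reward-model score. It is an illustration only: the `Proposition` class, the example rules, the string-matching checks, and the weights are all assumptions, not OpenAI's actual implementation (OpenAI describes grading such propositions with an LLM rather than simple string matching).

```python
# Minimal sketch of a rule-based reward. All names, rules, and weights
# below are illustrative assumptions, not OpenAI's implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Proposition:
    """A simple, checkable statement about a model completion."""
    name: str
    check: Callable[[str], bool]  # True if the completion satisfies the rule
    weight: float                 # contribution to the total safety reward

# Example rules for a "hard refusal" response style: the completion
# should apologize and decline, without judgmental language.
RULES = [
    Proposition("contains_apology",
                lambda c: "sorry" in c.lower(), weight=1.0),
    Proposition("declines_request",
                lambda c: "can't help" in c.lower() or "cannot help" in c.lower(),
                weight=1.0),
    Proposition("no_judgmental_language",
                lambda c: "shame on you" not in c.lower(), weight=0.5),
]

def rule_based_reward(completion: str) -> float:
    """Score a completion against the rules. During RL fine-tuning this
    signal would be added to the standard RLHF reward-model score."""
    return sum(p.weight for p in RULES if p.check(completion))

# Hypothetical usage: the combined training reward would be roughly
#   total_reward = reward_model_score + rule_based_reward(completion)
completion = "I'm sorry, but I can't help with that request."
print(rule_based_reward(completion))  # -> 2.5 (all three rules satisfied)
```

The key design point is that each rule is individually simple and auditable, so safety behavior can be adjusted by editing rules and weights rather than collecting new human preference data.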