Improving Model Safety Behavior with Rule-Based Rewards

OpenAI has developed a method that uses Rule-Based Rewards (RBRs) to align models to behave safely without extensive human data collection. RBRs use clear, simple rules to evaluate whether a model's outputs meet safety standards, and they are integrated into the standard reinforcement learning from human feedback (RLHF) pipeline. In OpenAI's experiments, RBR-trained models achieved safety performance comparable to models trained with human feedback, reducing the need for large amounts of human data and making training faster and more cost-effective.
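As a rough illustration of the idea, here is a minimal sketch of how a rule-based reward might be computed for a completion. All rule names, trigger phrases, and weights below are hypothetical, and simple string checks stand in for the model-based graders a real system would use; this is not OpenAI's published implementation, only an instance of the general pattern of scoring outputs against explicit rules and folding the result into an RL reward.

```python
# Hedged sketch of a rule-based reward (hypothetical rules and weights).
# Each rule is a simple, checkable proposition about a completion; the
# weighted sum yields a scalar that could be added to the RLHF reward
# signal during RL fine-tuning.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    check: Callable[[str], bool]  # True if the completion satisfies the rule
    weight: float                 # contribution to the total reward


# Illustrative rules for a prompt that should be refused politely.
RULES: List[Rule] = [
    Rule("contains_apology", lambda t: "sorry" in t.lower(), 0.25),
    Rule("refuses_clearly", lambda t: "can't help with" in t.lower(), 0.5),
    Rule("not_judgmental", lambda t: "you should be ashamed" not in t.lower(), 0.25),
]


def rule_based_reward(completion: str, rules: List[Rule] = RULES) -> float:
    """Score a completion against the rules; in [0, 1] when weights sum to 1."""
    return sum(rule.weight for rule in rules if rule.check(completion))


if __name__ == "__main__":
    good = "I'm sorry, but I can't help with that request."
    bad = "Sure, here is exactly how to do it."
    print(rule_based_reward(good))  # 1.0  (apology + clear refusal + not judgmental)
    print(rule_based_reward(bad))   # 0.25 (only avoids judgmental language)
```

In practice, each `check` would be evaluated by a grader model rather than string matching, and the weights would be fit so that desired completions rank above undesired ones, but the interface stays the same: rules in, scalar reward out.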

Source: OpenAI (July 24, 2024)

