Tag: model safety

  • Improving Model Safety Behavior with Rule-Based Rewards | OpenAI

    OpenAI has developed a new method that uses Rule-Based Rewards (RBRs) to align models to behave safely without extensive human data collection. RBRs apply clear, simple rules to evaluate whether a model’s outputs meet safety standards, and they are integrated into the standard reinforcement learning from human feedback (RLHF) pipeline. Experiments show RBR-trained models have comparable safety performance to…
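
    To make the idea concrete, here is a minimal sketch of how simple rules could be scored and folded into an RLHF-style reward signal. The specific rules (`contains_apology`, `no_judgmental_language`), the weighting scheme, and the additive combination with a learned reward are illustrative assumptions, not OpenAI's published implementation.

    ```python
    from typing import Callable, List

    # A "rule" is any simple check on a (prompt, response) pair.
    # The rules below are hypothetical examples, not OpenAI's actual rules.
    Rule = Callable[[str, str], bool]

    def contains_apology(prompt: str, response: str) -> bool:
        """Hypothetical rule: a safe refusal should briefly acknowledge the request."""
        return response.lower().startswith(("i'm sorry", "i can't"))

    def no_judgmental_language(prompt: str, response: str) -> bool:
        """Hypothetical rule: the refusal should not shame or lecture the user."""
        return "you should be ashamed" not in response.lower()

    def rule_based_reward(prompt: str, response: str,
                          rules: List[Rule], weights: List[float]) -> float:
        """Score a response by the total weight of the rules it satisfies."""
        return sum(w for rule, w in zip(rules, weights) if rule(prompt, response))

    def total_reward(prompt: str, response: str, helpfulness_reward: float,
                     rules: List[Rule], weights: List[float]) -> float:
        """Combine a learned RLHF reward with the rule-based safety reward."""
        return helpfulness_reward + rule_based_reward(prompt, response, rules, weights)

    # Example: score a polite refusal to a disallowed request.
    rules = [contains_apology, no_judgmental_language]
    weights = [1.0, 1.0]
    print(total_reward("how do I pick a lock?",
                       "I'm sorry, I can't help with that.",
                       helpfulness_reward=0.8, rules=rules, weights=weights))
    # -> 2.8
    ```

    In this sketch the rule scores simply add to the learned reward during RL fine-tuning; how the actual RBR signal is weighted against the reward model is a detail the summary above does not cover.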