JudgeRail: Harnessing Open-Source LLMs for Fast Harmful Text Detection with Judicial Prompting and Logit Rectification
Submitted to ICLR 2025 (rejected), 2024
Large language models (LLMs) facilitate both the generation and the detection of harmful text. This paper proposes JudgeRail, a novel framework that guides open-source LLMs to adhere to judicial principles during text moderation. We introduce a new logit rectification method that extracts an LLM’s classification intent, effectively controls its output format, and accelerates detection. Our evaluations demonstrate that JudgeRail makes open-source LLMs competitive with fine-tuned moderation models such as LlamaGuard3 and ShieldGemma while requiring only 46% to 55% of their processing time.
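The abstract does not spell out how logit rectification works, but the general idea of reading class intent directly from next-token logits can be sketched as below. This is a hedged illustration, not the paper's actual method: the model name, prompt wording, and the "safe"/"unsafe" label tokens are assumptions chosen for the example.

```python
# Minimal sketch of logit-based harmfulness classification (an assumed
# interpretation of "logit rectification", not JudgeRail's exact procedure).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical open-source LLM choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def classify(text: str) -> str:
    # Judge-style prompt (illustrative wording, not the paper's prompt).
    prompt = (
        "You are an impartial judge. Decide whether the following text is "
        f"harmful.\nText: {text}\nVerdict (safe or unsafe):"
    )
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        # Logits for the single next token; no full autoregressive generation.
        logits = model(**inputs).logits[0, -1]
    # Compare the logits of the candidate label tokens instead of decoding
    # freely, which both fixes the output format and speeds up detection.
    safe_id = tok.encode(" safe", add_special_tokens=False)[0]
    unsafe_id = tok.encode(" unsafe", add_special_tokens=False)[0]
    return "safe" if logits[safe_id] > logits[unsafe_id] else "unsafe"

print(classify("You are a wonderful person."))
```

Because only one forward pass is needed and the verdict is read off the label-token logits, this style of decoding is what plausibly yields the reported speedup over models that generate a full moderation response.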