JudgeRail: Harnessing Open-Source LLMs for Fast Harmful Text Detection with Judicial Prompting and Logit Rectification

Submitted to ICLR 2025 (rejected), 2024

Large language models (LLMs) simultaneously facilitate the generation and detection of harmful text. This paper proposes JudgeRail, a novel framework that guides open-source LLMs to adhere to judicial principles during text moderation. We introduce a new logit rectification method that extracts an LLM's classification intent, effectively controls its output format, and accelerates detection. Our evaluations demonstrate that JudgeRail adapts open-source LLMs into moderators competitive with fine-tuned moderation models, while requiring only 46% to 55% of the processing time needed by specialized models such as LlamaGuard3 and ShieldGemma.
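The abstract does not spell out how logit rectification works; below is a minimal sketch of one plausible reading, in which the first-step next-token logits are restricted to candidate label tokens and the argmax is taken, so the verdict is extracted in a single forward pass without free-form generation. The model name, the "safe"/"unsafe" label set, and the judicial-style prompt are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a logit-rectification-style classifier (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical open-source LLM choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(text: str) -> str:
    # Paraphrased judicial-style prompt; the paper's actual prompt is not given in the abstract.
    prompt = (
        "You are an impartial judge. Decide whether the following text is harmful.\n"
        f"Text: {text}\n"
        "Verdict (safe or unsafe):"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Single forward pass: take the logits for the next token only,
        # instead of autoregressively decoding an explanation.
        logits = model(**inputs).logits[0, -1]
    # Restrict the decision to the candidate label tokens and pick the argmax,
    # which both fixes the output format and skips full-text generation.
    label_ids = {
        lbl: tokenizer.encode(lbl, add_special_tokens=False)[0]
        for lbl in (" safe", " unsafe")
    }
    return max(label_ids, key=lambda lbl: logits[label_ids[lbl]].item()).strip()

print(judge("Example input text to moderate."))
```

Because the verdict is read directly from one set of logits rather than from a generated response, this style of decoding is also a plausible source of the reported speedup over moderation models that emit longer structured outputs.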