Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.

Posts

Future Blog Post

less than 1 minute read

Published: January 01, 2199

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

less than 1 minute read

Published: August 14, 2015

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published: August 14, 2014

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published: August 14, 2013

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published: August 14, 2012

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Portfolio

Portfolio item number 1

Short description of portfolio item number 1

Portfolio item number 2

Short description of portfolio item number 2

Publications

JudgeRail: Harnessing Open-Source LLMs for Fast Harmful Text Detection with Judicial Prompting and Logit Rectification

Published in ICLR 2025 (Reject), 2024

Large language models (LLMs) simultaneously facilitate the generation and detection of harmful text. Leading LLM developers, such as OpenAI, Meta, and Google, are driving a paradigm shift in the detection of harmful text, moving from conventional detectors to fine-tuned LLMs. However, these newly released models, which require substantial computational and data resources, have not yet been thoroughly investigated for their effectiveness in this new paradigm. In this work, we propose JudgeRail, a novel and generic framework that guides open-source LLMs to adhere to judicial principles during text moderation. Additionally, we introduce a new logit rectification method that can extract an LLM’s classification intent, effectively controls its output format, and accelerates detection. By integrating several top-performing open-source LLMs into JudgeRail without any fine-tuning and evaluating them against OpenAI Moderation API, LlamaGuard3, ShieldGemma, and other conventional moderation solutions across various datasets, including those specifically designed for jailbreaking LLMs, we demonstrate that JudgeRail can adapt these LLMs to be competitive with fine-tuned moderation models and significantly outperform conventional solutions. Moreover, we evaluate all models for detection latency, a critical yet rarely examined practical aspect, and show that LLMs with JudgeRail require only 46% to 55% of the time needed by LlamaGuard3 and ShieldGemma. The generic nature and competitive performance of JudgeRail highlight its potential for promoting the practicality of LLM-based harmful text detectors. Warning: some text examples presented in this paper may be offensive to some readers.

Download Paper