Benchmarking Codethreat’s AI SAST Engine
What Actually Matters in AppSec Tooling
Application security tools are notorious for flooding teams with irrelevant alerts, missing context-rich vulnerabilities, or simply underperforming in real-world conditions. Developers end up tuning out the noise, and security teams struggle to prioritize effectively.
At Codethreat, we designed a benchmark to measure what truly matters:
Can a tool catch real security issues, not just textbook examples?
Can it reduce the burden of false positives?
Can it understand code context, not just match patterns?
This benchmark is our attempt to answer those questions with data.
Why This Benchmark Format Matters
We didn’t build the vulnerable projects from scratch. Instead, we deliberately used anonymized, de-biased versions of existing projects.
Many public benchmarks unintentionally contain cues that make detection easier:
File names like xss_example.js
Variables called unsafe_input
Comments that highlight the exact vulnerability
Such hints can inflate accuracy, especially for AI-based tools trained on large corpora of known patterns.
Codethreat’s AI agents operate differently. They infer risk based on developer intent, data flow, and structural context. To evaluate this, we removed every artificial clue, simulating how a real-world team might encounter and fix a bug in production.
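To make the difference concrete, here is a hypothetical snippet (not taken from the benchmark repo) showing the kind of "tell" we stripped out. Both functions contain the same injectable query; only the naming and comments change:

```python
# Biased version: the function name, parameter name, and comment announce the bug.
def sqli_example(unsafe_input, cursor):
    # SQL injection: user input is concatenated straight into the query
    cursor.execute("SELECT * FROM users WHERE name = '" + unsafe_input + "'")

# De-biased version: identical flaw, neutral names, no hints.
def find_user(name, cursor):
    cursor.execute("SELECT * FROM users WHERE name = '" + name + "'")
```

A tool that leans on suggestive identifiers will catch the first and miss the second; a tool that follows data flow should catch both.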
The Benchmark Setup
🧪 39 open-source projects
⚠️ 35 critical security vulnerabilities seeded
✅ Both patched and unpatched versions included
📄 No CWE hints, no explicit comments, no suggestive filenames
📊 Outputs parsed in SARIF, with recall and false-positive metrics compared across tools (see the scoring sketch below)
Tools evaluated:
Codethreat
ZeroPath
Semgrep
Snyk
Bearer
Full source and validation process: GitHub - Codethreat Benchmark Repo
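To make the scoring concrete, here is a minimal Python sketch of how SARIF output can be reduced to (file, line) findings and compared against the seeded locations. The exact matching and validation rules live in the benchmark repo; the simplified scoring below, where unmatched findings count as false positives, is an assumption for illustration only:

```python
import json
from pathlib import Path

def sarif_findings(sarif_path: Path):
    """Yield (file, line) pairs from a SARIF 2.1.0 results file."""
    data = json.loads(sarif_path.read_text())
    for run in data.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                uri = phys.get("artifactLocation", {}).get("uri", "")
                line = phys.get("region", {}).get("startLine", 0)
                yield uri, line

def score(findings, seeded):
    """Recall over seeded issues; unmatched findings treated as false positives
    (a simplified stand-in for the benchmark's validation process)."""
    found = set(findings)
    detected = [s for s in seeded if s in found]
    false_positives = found - set(seeded)
    recall = len(detected) / len(seeded)
    fp_rate = len(false_positives) / max(len(found), 1)
    return recall, fp_rate
```

Feeding each tool's SARIF export plus the list of seeded (file, line) locations into a scorer like this is how the recall and false-positive numbers below are produced.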
📊 Benchmark Results
Technical Vulnerabilities
| Tool | Detection Rate | False Positive Rate |
|---|---|---|
| Codethreat | 88.57% | 0% |
| ZeroPath | 77.14% | 5% |
| Semgrep | 54.29% | 5% |
| Snyk | 42.86% | 25% |
| Bearer | 5.71% | 0% |
Based on 35 technical vulnerabilities across 39 benchmarks (XSS, SQLi, SSTI, Command Injection, and more).
Business Logic & Authentication Vulnerabilities
| Tool | Detection Rate | False Positive Rate |
|---|---|---|
| Codethreat | 100% | 0% |
| ZeroPath | 87.5% | 0% |
| Semgrep | 12.5% | 0% |
| Snyk | 0% | 0% |
| Bearer | 0% | 0% |
Based on 8 business logic benchmarks, including broken authentication, missing authorization, and complex data validation issues.
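For readers who prefer raw counts: detection rates are shares of the stated totals (35 technical issues, 8 business logic benchmarks). The counts below are reconstructed from the published percentages, not separately reported figures:

```python
# Detection rate = detected seeded issues / total seeded issues.
# Counts inferred from the published percentages (illustrative).
print(f"{31 / 35:.2%}")  # 88.57% -> 31 of 35 technical vulnerabilities
print(f"{7 / 8:.2%}")    # 87.50% -> 7 of 8 business logic benchmarks
```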
Key Insights
Context still matters. Many tools struggle with vulnerabilities spread across files, behind framework abstractions, or involving deep control flow. Codethreat’s architecture-aware agents excelled in these scenarios.
False positives remain a major barrier. Especially in CI/CD pipelines, noisy alerts erode developer trust. Only Codethreat maintained precision without compromise.
Business logic and repo-level flaws are still hard to catch. While no tool achieves full coverage, Codethreat’s approach, combining SAST, structural mapping, and PR-level AI reviews, points in a promising direction.
Codethreat blends traditional and intelligent analysis layers:
• 🔍 Rule-based SAST with rich language support
• 🧠 AI-powered contextual analysis aligned with developer workflows
• 🔄 Control/data flow resolution across repository structures
• ❌ A false-positive elimination engine that filters noise before findings surface
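To give a feel for how such layers can compose, here is a purely illustrative sketch; it is not Codethreat’s actual implementation, and every function, field, and threshold in it is hypothetical. The idea is simply: run a rule engine, re-score its findings with context, and only surface what survives a confidence filter:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Finding:
    rule_id: str
    file: str
    line: int
    confidence: float  # refined by later layers (illustrative field)

def rule_based_scan(repo: dict) -> list:
    """Stand-in rule engine: flag string-concatenated SQL passed to execute()."""
    findings = []
    for path, source in repo.items():
        for lineno, text in enumerate(source.splitlines(), 1):
            if "execute(" in text and "+" in text:
                findings.append(Finding("sql-injection", path, lineno, 0.5))
    return findings

def contextual_rescore(findings, repo):
    """Stand-in context layer: raise confidence when request input flows nearby."""
    return [replace(f, confidence=0.9 if "request." in repo[f.file] else 0.3)
            for f in findings]

def layered_scan(repo, threshold=0.8):
    """Rules -> contextual re-scoring -> false-positive filter."""
    findings = contextual_rescore(rule_based_scan(repo), repo)
    return [f for f in findings if f.confidence >= threshold]

demo = {"app.py": 'cursor.execute("SELECT * FROM users WHERE id=" + request.args["id"])'}
print(layered_scan(demo))  # one surfaced finding; the same pattern without request input is filtered out
```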
Try It Yourself
The benchmark is open-source and reproducible. We invite the community to contribute, validate, and explore the results.



