SQUR beat humans in CTF

SQUR captures flags in a digital landscape

SQUR found more flags than human pentesters on the XBEN CTF benchmark suite. We captured 91 of 104 flags (87.5%), exceeding the best reported human score of 85% in the published XBOW results.

This benchmark is a useful signal for exploitation capability, but it is not the end goal for SQUR.

What the benchmark measures

XBEN is a suite of 104 CTF-style challenges. A CTF (Capture The Flag) is a security exercise in which the goal is to find a hidden “flag” in a target environment, typically by identifying and exploiting one or more vulnerabilities. Each benchmark has a single objective: find the flag. That is a useful indicator of exploitation capability, but it is not a full pentest. A production pentest requires broader coverage across many vulnerability classes, strict guardrails, and a remediation-ready proof of concept for every finding. We treat CTF results as a signal, not a substitute for real-world pentesting outcomes.

XBOW publishes the benchmark suite for public use here.

How we ran SQUR

SQUR’s pentesting methodology is designed to meet compliance and regulatory expectations (ISO 27001, the EU Cyber Resilience Act, SOC 2, and similar frameworks). For these CTF benchmarks, we simplified that methodology to focus on flag discovery rather than full pentest coverage.

We ran the default SQUR implementation but disabled finding verification, the step that normally validates findings and deduplicates them (see how SQUR defines duplicates), because CTFs reward a single success path rather than comprehensive proof. We supplied each benchmark’s name and description to the agents, as CTFs typically provide that hint. We do not publish precise agent steps, to avoid training future models on benchmark solution paths, and we cannot guarantee that benchmark data is absent from the models’ training data.

Even with the pentest methodology simplified for CTFs, SQUR still reflects its core strengths: disciplined workflows, safety guardrails, and clear prioritization across many vulnerability classes.

Headline results

  • Flags found: 91 of 104 (87.5%)
  • Difficulty success: 93.3% easy, 86.3% medium, 62.5% hard
  • Median time to flag: 16.8 min (easy), 37.7 min (medium), 39.9 min (hard)
  • Time range: 5 minutes to 5 hours, with a few outliers
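The headline figure is plain arithmetic; a minimal sanity check (not SQUR code):

```python
# Sanity-check the headline figure: 91 flags out of 104 benchmarks.
flags_found, total_benchmarks = 91, 104
rate_pct = 100 * flags_found / total_benchmarks
assert rate_pct == 87.5  # matches the reported 87.5%
```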
[Chart: SQUR success rate by level — flags found per difficulty: easy 93.3%, medium 86.3%, hard 62.5%]

For comparison, the public XBOW report shows that the top human pentester solved 85% of the benchmarks in 40 hours, with the other pentesters solving fewer. The principal pentester in the experiment was Federico Muttis, a highly respected security professional with multiple CVEs.

Success rate by pentester (approx., estimated from XBOW’s published chart):

  • SQUR: 87.5%
  • Principal 1: ~85%
  • Staff 1: ~59–60%
  • Senior 1: ~52–53%
  • Senior 2: ~20%
  • Junior 1: ~27–28%

Strengths by vulnerability class

SQUR performed best in the most common web vulnerability classes:

  • 100% success: IDOR (15/15), SQLi (6/6), SSRF (3/3), XXE (3/3), GraphQL (3/3), business logic (7/7)
  • Very strong: XSS (21/23), privilege escalation (13/14), command injection (10/11)

These are the categories most frequently exploited in real-world web attacks.
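To illustrate the most common class above, IDOR (Insecure Direct Object Reference) boils down to a handler that authenticates the caller but never checks object ownership. A minimal sketch with hypothetical names (not SQUR internals):

```python
# Hypothetical invoice store keyed by ID; each record has an owner.
INVOICES = {101: {"owner": "alice", "total": 40}, 102: {"owner": "bob", "total": 99}}

def get_invoice_vulnerable(session_user: str, invoice_id: int) -> dict:
    # BUG: authorization is missing -- any logged-in user can read any invoice
    return INVOICES[invoice_id]

def get_invoice_fixed(session_user: str, invoice_id: int) -> dict:
    invoice = INVOICES[invoice_id]
    if invoice["owner"] != session_user:
        raise PermissionError("not your invoice")
    return invoice
```

Finding an IDOR is largely systematic (enumerate IDs, swap sessions, compare responses), which is one reason this class lends itself well to automation.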

Where SQUR struggled

The 13 unsolved benchmarks cluster into multi-step or protocol-heavy patterns: SSTI combined with default credentials, CVE-specific exploitation, advanced XSS filter bypass, HTTP smuggling and race conditions, multi-step file upload chains, and JWT privilege escalation chains.
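As one example of why these chains are harder, a classic JWT privilege-escalation step is tampering with a role claim and downgrading the algorithm to "none" — and that is only one link in a longer chain. A minimal sketch of the forgery step, with hypothetical claims (this only works against misconfigured verifiers):

```python
import base64
import json

def b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

# Hypothetical token claims issued for a regular user
payload = {"sub": "42", "role": "user"}

# Tamper: escalate the role, downgrade alg to "none", drop the signature.
tampered_header = {"alg": "none", "typ": "JWT"}
tampered_payload = {**payload, "role": "admin"}
forged = ".".join([
    b64url(json.dumps(tampered_header).encode()),
    b64url(json.dumps(tampered_payload).encode()),
    "",  # empty signature
])
```

The full chain typically also requires discovering where the token is issued, how it is validated, and what the escalated role unlocks — the kind of multi-step reasoning these benchmarks stress.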

These are precisely the areas we are actively improving. The goal is consistent outperformance across all classes, not just the common ones.

Timing and guardrails

SQUR solved the CTF benchmarks in roughly 5 hours of parallel wall-clock time. Because SQUR runs in parallel, this is not directly comparable to a single human pentester working sequentially.

As a side note, the total execution time across all solved benchmarks was 93 hours, which is more than the 40 hours given to the human pentesters in the XBOW report. This is one reason we focus on parallel wall-clock time rather than cumulative sequential time when discussing autonomous systems.
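To make the parallel-versus-sequential point concrete, the rough arithmetic from the two figures above:

```python
# Rough arithmetic behind the parallel-vs-sequential comparison.
total_execution_hours = 93  # summed runtime across all solved benchmarks
wall_clock_hours = 5        # approximate parallel wall-clock time

avg_concurrency = total_execution_hours / wall_clock_hours
print(f"average concurrency ~ {avg_concurrency:.1f}x")  # ~18.6 runs in flight on average
```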

The slowest runs (200–300 minutes) are multi-step attacks requiring credential discovery, blind exploitation, or advanced techniques. A significant part of the runtime comes from guardrails that keep exploitation safe and non-destructive, which are mandatory for production pentesting.

What this indicates (and what it doesn’t)

The XBEN results indicate strong autonomous exploitation capability, but they are a signal, not definitive proof of full pentest performance.

That matters, but it is not our end goal. SQUR is built for full pentesting, not for CTF puzzles. We do not optimize for CTF challenges, and SQUR still performs strongly.

Conclusion

SQUR found more flags than the reported human outcomes in the XBEN benchmark suite. We are not optimizing for CTFs, yet we already outperform human results in this test. Our ambition is to consistently outperform human pentesters across all vulnerability classes in real-world pentests.

Want to learn more about how SQUR can test your environment? Book a demo.