This report presents the results of a security vulnerability detection benchmark. The experiment tested nine LLMs with four prompt variations, and each model-prompt combination was executed in three independent runs (9 × 4 × 3 = 108 runs in total) so that run-to-run variation could be measured.

| Rank | Model | Detection Rate | 95% CI (%) | Parsing Rate | Avg Response Time |
|---|---|---|---|---|---|
| 1 | gpt-4o-mini | 99.3% | [99.3, 99.4] | 100.0% | 1.24s |
| 2 | gpt-4.1-mini | 97.7% | [97.7, 97.8] | 100.0% | 3.47s |
| 3 | meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | 97.5% | [97.5, 97.5] | 93.0% | 1.81s |
| 4 | gpt-4.1 | 97.0% | [96.7, 97.3] | 99.9% | 3.44s |
| 5 | Qwen/Qwen2.5-7B-Instruct-Turbo | 96.8% | [96.6, 96.9] | 98.7% | 0.94s |
| 6 | gpt-4o | 87.7% | [86.9, 88.5] | 100.0% | 3.01s |
| 7 | meta-llama/Llama-3.2-3B-Instruct-Turbo | 79.4% | [78.2, 80.5] | 99.6% | 5.14s |
| 8 | gpt-3.5-turbo | 71.3% | [70.1, 72.6] | 100.0% | 0.89s |
| 9 | gpt-4.1-nano | 60.0% | [59.1, 60.8] | 98.9% | 0.86s |

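
The report does not state how the 95% CI column was computed. As one plausible reading, the sketch below derives a t-based interval from three per-run detection rates. The function name `ci95_three_runs` and the input values are hypothetical and are not taken from the benchmark data; the actual pipeline may use a different method (for example, a bootstrap over individual test cases).

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 2 degrees of freedom
# (n = 3 runs). Hard-coded so the sketch needs only the standard library.
T_CRIT_DF2 = 4.303


def ci95_three_runs(rates):
    """Return a (low, high) 95% t-interval for the mean of three per-run rates (in %)."""
    if len(rates) != 3:
        raise ValueError("this sketch assumes exactly three runs")
    mean = statistics.mean(rates)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(rates) / math.sqrt(len(rates))
    half_width = T_CRIT_DF2 * sem
    # Detection rates are percentages, so clamp the interval to [0, 100].
    return max(0.0, mean - half_width), min(100.0, mean + half_width)


# Hypothetical per-run detection rates in percent; NOT benchmark results.
example_runs = [97.0, 97.5, 98.0]
low, high = ci95_three_runs(example_runs)
print(f"mean={statistics.mean(example_runs):.1f}%  95% CI=[{low:.1f}, {high:.1f}]")
```

With only three runs the critical value is large, so a t-interval narrows quickly only when the runs agree closely with one another.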

| Prompt Type | Avg Detection Rate | Std Deviation | Best Performing Model |
|---|---|---|---|
| flawed | 83.0% | 16.9% | gpt-4o-mini (99.7%) |
| minimal | 83.4% | 23.1% | gpt-4.1 (100.0%) |
| standard | 92.7% | 13.1% | gpt-4.1 (100.0%) |
| standard_deception | 97.4% | 1.7% | gpt-4o-mini (100.0%) |

The following table shows detection performance for specific test cases across models:
Detailed test-by-test results are available in the raw data files.
Key patterns observed:

    {
      "description": "Full benchmark testing all models and prompt types with statistical validation",
      "execution_config": {
        "max_workers": null,
        "parallel": false,
        "parallel_prompts": true,
        "parallel_runs": true
      },
      "models": [
        "gpt-4o-mini",
        "gpt-4o",
        "gpt-4.1-nano",
        "gpt-4.1-mini",
        "gpt-4.1",
        "gpt-3.5-turbo",
        "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        "meta-llama/Llama-3.2-3B-Instruct-Turbo",
        "Qwen/Qwen2.5-7B-Instruct-Turbo"
      ],
      "name": "Comprehensive Security Benchmark Experiment",
      "prompt_types": [
        "flawed",
        "minimal",
        "standard",
        "standard_deception"
      ],
      "reporting": {
        "generate_report": true,
        "generate_visualizations": true,
        "report_formats": [
          "html",
          "pdf",
          "markdown"
        ]
      },
      "runs_per_config": 3,
      "save_intermediate": true,
      "test_levels": {
        "advanced": true,
        "basic": true,
        "ultra": true
      }
    }
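
To show how the configuration above determines the size of the experiment, the following sketch expands `models`, `prompt_types`, and `runs_per_config` into the full list of (model, prompt type, run) combinations. The helper `expand_grid` is hypothetical; the benchmark's own scheduling code is not shown in this report and may order or parallelize the work differently.

```python
import itertools
import json

# Subset of the experiment configuration, copied from the report.
CONFIG_JSON = """
{
  "models": [
    "gpt-4o-mini", "gpt-4o", "gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1",
    "gpt-3.5-turbo",
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "meta-llama/Llama-3.2-3B-Instruct-Turbo",
    "Qwen/Qwen2.5-7B-Instruct-Turbo"
  ],
  "prompt_types": ["flawed", "minimal", "standard", "standard_deception"],
  "runs_per_config": 3
}
"""


def expand_grid(config):
    """Return every (model, prompt_type, run_index) combination as a list of tuples."""
    runs = range(1, config["runs_per_config"] + 1)
    return list(itertools.product(config["models"], config["prompt_types"], runs))


config = json.loads(CONFIG_JSON)
grid = expand_grid(config)

# 9 models x 4 prompt types x 3 runs
print(f"{len(grid)} runs in total")
print("first:", grid[0])
print("last: ", grid[-1])
```

Each tuple corresponds to one run of one model with one prompt type; how many test cases a run contains depends on the enabled `test_levels`, which this sketch does not model.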