Best Open Model for Real Prompts

By Ryan Wong, October 18, 2025. Tags: AI, LLM, benchmarks, model-comparison, GPT-OSS, Qwen3, DeepSeek, research

TL;DR

Winner: gpt-oss-120b. It posted top scores (10/10) in technical tasks like bug solving and complex logic, and it ran efficiently on a single 80GB GPU, which let us test heavy prompts without heavy infrastructure overhead.

Other highlights: Qwen3 stood out for research-style prompts and live-information lookups. GPT-5 and DeepSeek V3.1 performed very well on coding and analysis. Kimi-Thinking-Preview produced solid output in several areas but failed the "needle in a haystack" test.

Introduction

Benchmarks are useful, but many published scores do not match how teams use models in practice. Public leaderboards often focus on small or curated tasks, or on single metrics that do not reflect real engineering demands. We built a test set that mixes: (a) practical engineering problems, (b) long-form writing, (c) research lookups and accuracy, (d) multi-step logic puzzles, and (e) retrieval-style "needle in a haystack" challenges. The goal: measure capability under the conditions you might actually push a model into, as of October 18, 2025.

How we conduct testing

  • Task suite: five categories (Bug Solving, Creative Writing, Research, Complex Logic, and Needle in Haystack retrieval). The raw scoring table is shown below.

  • Prompts: real examples collected from engineers, content teams, and analysts. Each prompt was run three times and scored by two human raters for correctness, completeness, and practical usefulness.

  • Latency & resources: models ran on representative GPU setups. Wherever possible we used the recommended runtime configs (for example, 80GB H100-class GPUs or equivalents for large MoE models and 20-40GB setups for smaller weights).

  • Scoring: 0-10 scale per task, averaged across prompts. Ratings focused on usable output; for example, a bug fix that compiles and is correct scored higher than a plausible but incorrect patch.

  • Reproducibility: every prompt, seed, and GPU profile is logged in the shared spreadsheet linked at the end.
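The scoring procedure above can be sketched in a few lines of Python. This is a minimal illustration, not our actual pipeline: the tuple layout, the `aggregate_scores` name, and the choice to average raters and runs per prompt before averaging across prompts are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(ratings):
    """Average 0-10 ratings per prompt, then across prompts within each task.

    `ratings` is an iterable of (task, prompt_id, run, rater, score) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    per_prompt = defaultdict(list)
    for task, prompt_id, _run, _rater, score in ratings:
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range: {score}")
        per_prompt[(task, prompt_id)].append(score)

    # One mean per prompt (over runs and raters), grouped by task.
    per_task = defaultdict(list)
    for (task, _prompt_id), scores in per_prompt.items():
        per_task[task].append(mean(scores))

    # One mean per task (over prompts).
    return {task: mean(prompt_means) for task, prompt_means in per_task.items()}


# Made-up data: two prompts, three runs each, two raters per run.
example = []
for run, (a, b) in enumerate([(10, 9), (10, 10), (9, 9)], start=1):
    example.append(("Bug Solving", "p1", run, "rater_a", a))
    example.append(("Bug Solving", "p1", run, "rater_b", b))
for run in (1, 2, 3):
    example.append(("Bug Solving", "p2", run, "rater_a", 8))
    example.append(("Bug Solving", "p2", run, "rater_b", 8))

result = aggregate_scores(example)
print(result)
```

Averaging per prompt first keeps a prompt with extra runs or raters from outweighing the others in the task score.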

Results

Below is the table of averaged scores as of October 18, 2025:

| Test | Kimi-Latest | Kimi K2 | Moonshot-V1 | GPT-OSS-20B | GPT-OSS-120B | Qwen3 | GPT-5 | Copilot | DeepSeek V3.1 | Kimi-Thinking-Preview |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1. Bug Solving | 6/10 | 8/10 | 9/10 | 9/10 | 10/10 | 9/10 | 10/10 | 9/10 | 10/10 | 10/10 |
| 2. Creative Writing | 9/10 | 9/10 | 9/10 | 9/10 | 9/10 | 8/10 | 9/10 | 9/10 | 9/10 | 9/10 |
| 3. Research | 6/10 | 4/10 | 4/10 | 4/10 | 4/10 | 9/10 | 4/10 | 4/10 | 3/10 | 4/10 |
| 4. Complex Logic | 7/10 | 6/10 | 6/10 | 9/10 | 10/10 | 10/10 | 9/10 | 9/10 | 10/10 | 10/10 |
| 5. Needle in Haystack | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 0/10 |

  • gpt-oss-120b was the most consistent high scorer across technical tasks, with 10/10 on bug solving, complex logic, and needle-in-a-haystack retrieval. That aligns with gpt-oss's positioning as a high-reasoning model designed to run on a single 80GB GPU.

  • Qwen3 was the strongest model for the research-style prompts in our suite (9/10, against 6/10 for the next-best model); it handled factual lookups and citation-style replies better than the other models in this test. That corresponds with Qwen3's public positioning and launch notes.

  • DeepSeek V3.1 and the GPT-OSS variants both performed well on analytical and coding tasks. Several community reports and docs place these models in the same performance tier for reasoning benchmarks.

  • Kimi-Thinking-Preview scored well in the other categories, including 10/10 on bug solving and complex logic, but failed the needle-in-a-haystack task (0/10) in a way that suggests a retrieval or prompt-routing bug during that run.
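For readers unfamiliar with the needle-in-a-haystack format, the sketch below shows one common way to build such a prompt: hide a single fact (the needle) at a chosen depth inside filler text, then check whether the model's answer contains it. It is an illustrative assumption, not the harness we used; the function names, the filler text, and the substring-based pass check are all hypothetical.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer using only the text above."


def passed(response, expected_answer):
    """Crude pass check: the expected answer appears somewhere in the response."""
    return expected_answer.lower() in response.lower()


# Hypothetical filler and needle, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code for the archive room is 7341."
prompt = build_haystack_prompt(
    filler, needle, depth=0.5, question="What is the access code for the archive room?"
)

# `prompt` would be sent to the model under test; a canned reply stands in here.
print(passed("The access code is 7341.", "7341"))
```

Repeating the check at several depths and context lengths can help separate a genuine retrieval weakness from a one-off pipeline problem like the one suspected above.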

Takeaways

  • If your priority is reliable, high-quality reasoning and you want a model that can run on a single 80GB GPU, gpt-oss-120b is the strongest choice from our set. The model's design and available runtime metadata match the strengths we observed in bug solving and logic tasks.

  • If you need the best results for factual lookups and research-style responses, test Qwen3 on your actual prompts. It handled our lookup suite better than the other open models in this round.

  • For teams building tooling around agent workflows or heavy code generation, GPT-5 and some of the leading research models are strong candidates; run your own representative prompts and measure both correctness and execution cost.
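Since the right pick depends on which categories matter to you, one way to reuse the table is to re-rank the models with your own weights. The sketch below is a minimal illustration: the scores are copied from the results table, while the weight profiles and function names are hypothetical and should be replaced with your own priorities.

```python
CATEGORIES = ["Bug Solving", "Creative Writing", "Research", "Complex Logic", "Needle in Haystack"]

# Scores out of 10, copied from the results table, in CATEGORIES order.
SCORES = {
    "Kimi-Latest": [6, 9, 6, 7, 10],
    "Kimi K2": [8, 9, 4, 6, 10],
    "Moonshot-V1": [9, 9, 4, 6, 10],
    "GPT-OSS-20B": [9, 9, 4, 9, 10],
    "GPT-OSS-120B": [10, 9, 4, 10, 10],
    "Qwen3": [9, 8, 9, 10, 10],
    "GPT-5": [10, 9, 4, 9, 10],
    "Copilot": [9, 9, 4, 9, 10],
    "DeepSeek V3.1": [10, 9, 3, 10, 10],
    "Kimi-Thinking-Preview": [10, 9, 4, 10, 0],
}


def weighted_score(scores, weights):
    """Weighted mean of category scores; categories missing from `weights` count as 0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights.get(c, 0) * s for c, s in zip(CATEGORIES, scores)) / total


def rank(weights):
    """Model names sorted best-first; ties keep the table's column order."""
    return sorted(SCORES, key=lambda m: weighted_score(SCORES[m], weights), reverse=True)


# Hypothetical weight profiles; adjust to your own workload.
technical = {"Bug Solving": 1, "Complex Logic": 1}
research = {"Research": 1}

for model in rank(technical)[:3]:
    print(model, weighted_score(SCORES[model], technical))
```

Ties are common with coarse 0-10 scores, so treat the ranking as a shortlist for testing on your own prompts rather than a final verdict.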

For full logs, per-prompt scores, and the exact GPU/seed settings we used, see the research spreadsheet: benchmark link
