OpenAI has just taken the wraps off a brand-new lineup of AI models, all falling under the GPT-4.1 banner – think GPT-4.1 itself, along with its smaller siblings, GPT-4.1 mini and GPT-4.1 nano.
These models are now up and running through OpenAI’s API and are built specifically to shine at coding and at following detailed instructions closely. It feels like a genuine leap forward on the path to having AI power our software development.
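For developers who want to kick the tires, access works the same way as any other API model. Here’s a minimal sketch, assuming the current openai Python SDK, an OPENAI_API_KEY in the environment, and the model identifiers as OpenAI exposes them in the API (the prompt text is just an illustration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # or "gpt-4.1-mini" / "gpt-4.1-nano" for the cheaper, faster tiers
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that parses an ISO 8601 date string."},
    ],
)

print(response.choices[0].message.content)
```

Swapping between the three models is just a matter of changing the model string, which makes it easy to trade accuracy against cost and latency.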
One of the truly impressive things about these models is their massive 1-million-token context window. To put it in perspective, that means they can chew through nearly 750,000 words at once – that’s longer than Tolstoy’s War and Peace. Interestingly though, you won’t find the GPT-4.1 models directly inside ChatGPT just yet.

This launch comes at a time when the AI scene is getting seriously competitive, with big players like Google and Anthropic pushing out their own high-performing models. Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet, both also boasting that million-token capacity, have been showing strong results on the usual coding benchmarks. Not to be left out, the Chinese startup DeepSeek is also making waves with its updated V3 model.
Looking ahead, OpenAI’s big picture goal is to create a fully capable AI software engineer – or, as the company’s CFO Sarah Friar put it, an “agentic software engineer” that can handle the whole lifecycle of building an app, from writing the initial code and squashing bugs to handling quality assurance and writing up the documentation.
GPT-4.1 marks some serious progress in that direction. According to OpenAI, they’ve really listened to developer feedback and fine-tuned the model to be more efficient and reliable in areas like:
- Frontend coding
- Keeping things formatted and structured correctly
- Using tools consistently
- Cutting down on unnecessary edits
“These improvements mean developers can build agents that are noticeably better at tackling real-world software engineering tasks,” OpenAI mentioned in a statement to TechCrunch.
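Consistent tool use is the piece agent builders lean on most. As a rough sketch, here is how a hypothetical run_tests function could be wired into the standard function-calling interface of the Chat Completions API; the tool name, its schema, and the prompt are all made up for illustration:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: lets the model ask the harness to run the project's test suite.
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_tests",
            "description": "Run the project's test suite and return any failures.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Directory containing the tests."},
                },
                "required": ["path"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Fix the failing tests in ./src and verify the fix."}],
    tools=tools,
)

# If the model decides to call the tool, the name and JSON arguments come back here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The promise of “using tools consistently” is essentially that loops like this one need fewer retries and guardrails before the model produces a well-formed call.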
Benchmark Performance
OpenAI says GPT-4.1 outperforms its predecessors, GPT-4o and GPT-4o mini, on tests like SWE-bench, which is designed to evaluate how well AI handles actual software engineering challenges. While the full GPT-4.1 model gives you the highest accuracy, the mini and nano versions are all about speed and efficiency – with nano being OpenAI’s fastest and cheapest model to date. Pricing follows that same tiering:
- GPT-4.1: $2 per million input tokens, $8 per million output tokens
- GPT-4.1 mini: $0.40 per million input, $1.60 per million output
- GPT-4.1 nano: $0.10 per million input, $0.40 per million output
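To see what those rates mean in practice, here’s a quick back-of-the-envelope calculation. The prices come straight from the list above; the example token counts are invented:

```python
# Per-million-token prices from the list above: (input, output) in USD.
PRICES = {
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

# e.g. feeding in a 200k-token chunk of a codebase and getting a 5k-token patch back:
print(f"${estimate_cost('gpt-4.1', 200_000, 5_000):.2f}")       # $0.44
print(f"${estimate_cost('gpt-4.1-nano', 200_000, 5_000):.4f}")  # $0.0220
```

The gap between the full model and nano is roughly 20x on price, which is why OpenAI pitches the smaller variants for high-volume or latency-sensitive work.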
Based on OpenAI’s own internal testing, GPT-4.1 scored between 52% and 54.6% on SWE-bench Verified. For comparison, Google’s Gemini 2.5 Pro hit 63.8%, and Claude 3.7 Sonnet scored 62.3%.
GPT-4.1 also did pretty well when it came to understanding video. In the Video-MME evaluation, it managed to get 72% accuracy on long videos without subtitles – which is the best score among all the models they tested.
Limitations and Reliability
Despite all these advancements, OpenAI is being upfront about the fact that even GPT-4.1 isn’t perfect. The model can still introduce bugs into the code or fail to fix them, and it can become less accurate when dealing with really long prompts. On their own OpenAI-MRCR test, the model’s accuracy dropped from 84% with 8,000 tokens down to just 50% with a million tokens.
Plus, GPT-4.1 tends to be a bit more literal than GPT-4o, sometimes needing more explicit, specific instructions to give you the best output.
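In practice, that mostly means spelling out the output format and constraints instead of leaving them implied. The two prompts below are purely illustrative, not taken from OpenAI’s documentation:

```python
# A vague prompt leaves the model to guess what "clean up" means and how to answer.
vague_prompt = "Clean up this function."

# An explicit prompt spells out the constraints and the expected output format,
# which plays to GPT-4.1's more literal instruction-following.
explicit_prompt = (
    "Refactor the function below. Keep its public signature unchanged, "
    "add type hints, and return only the rewritten code in a single "
    "Python code block with no commentary."
)
```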
Even with these limitations, the arrival of GPT-4.1 feels like another significant step forward in the ongoing race to create fully autonomous coding tools. OpenAI is definitely laying the groundwork for their ambitious vision of AI taking the lead in software engineering.