Cohere’s new vision model runs on two GPUs, beats top-tier VLMs on visual tasks

The rise of deep research features and other AI-powered analysis has given birth to more models and services looking to simplify that process and read more of the documents enterprises actually use.
Canadian AI company Cohere is betting on its models, including a newly released visual model, to argue that deep research features should also be optimized for enterprise use cases.
The company released Command A Vision, a visual model specifically targeting enterprise use cases, built on the back of its Command A model. The 112-billion-parameter model can “unlock valuable insights from visual data and make highly accurate, data-driven decisions through optical character recognition (OCR) and image analysis,” the company says.
“Whether it’s interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes for risk detection, Command A Vision excels at tackling the most demanding enterprise challenges,” the company said in a blog post.
This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.
Since it is built on Command A’s architecture, Command A Vision requires two GPUs or fewer, just like the text model. The vision model also retains Command A’s text capabilities for reading the words on images and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces enterprises’ total cost of ownership and is fully optimized for enterprise retrieval use cases.
How Cohere built Command A Vision
Cohere said it followed a LLaVA architecture to build its Command models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.
These tiles are passed into the Command A text tower, “a dense, 111B-parameter textual LLM,” the company said. “This way, a single image consumes up to 3,328 tokens.”
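Cohere does not publish the exact tiling scheme, but the context-window arithmetic of a LLaVA-style pipeline is easy to sketch. The snippet below is a minimal illustration, not Cohere’s implementation: the tile count and tokens-per-tile are assumptions, chosen only because 13 tiles at 256 soft tokens each happens to reproduce the 3,328-token ceiling the company cites.

```python
# Illustrative back-of-envelope for a LLaVA-style image token budget.
# MAX_TILES and TOKENS_PER_TILE are assumptions, not Cohere's published
# configuration; they are picked to match the stated 3,328-token ceiling.

MAX_TILES = 12          # hypothetical maximum number of high-resolution crops
GLOBAL_THUMBNAIL = 1    # plus one downscaled view of the whole image
TOKENS_PER_TILE = 256   # assumed soft vision tokens per tile after downsampling


def vision_token_budget(num_tiles: int) -> int:
    """Context-window tokens a single image would consume in the text tower."""
    tiles = min(num_tiles, MAX_TILES) + GLOBAL_THUMBNAIL
    return tiles * TOKENS_PER_TILE


print(vision_token_budget(12))  # 3328 -- a dense chart that needs every tile
print(vision_token_budget(3))   # 1024 -- a small scanned receipt costs far less context
```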
Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training with reinforcement learning from human feedback (RLHF).
“This approach enables the mapping of image encoder features into the language model’s embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of instruction-following multimodal tasks.”
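Cohere has not released training code, but the three-stage recipe it describes maps onto a familiar freeze-and-unfreeze pattern. The PyTorch-style sketch below is illustrative only; the component names and the stage callables are assumptions standing in for whatever Cohere actually uses.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a component."""
    for p in module.parameters():
        p.requires_grad = trainable


def three_stage_training(vision_encoder: nn.Module, adapter: nn.Module,
                         llm: nn.Module, run_stage: dict) -> None:
    # Stage 1 -- vision-language alignment: only the adapter is trained, so
    # image encoder features are mapped into the LLM's embedding space.
    set_trainable(vision_encoder, False)
    set_trainable(llm, False)
    set_trainable(adapter, True)
    run_stage["alignment"]()   # e.g. captioning-style next-token loss

    # Stage 2 -- supervised fine-tuning: vision encoder, adapter and LLM are
    # updated together on instruction-following multimodal tasks.
    for component in (vision_encoder, adapter, llm):
        set_trainable(component, True)
    run_stage["sft"]()

    # Stage 3 -- post-training with RLHF: optimize against human preference
    # signals (the exact algorithm is not disclosed by Cohere).
    run_stage["rlhf"]()
```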
Visualizing enterprise AI
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI’s GPT-4.1, Meta’s Llama 4 Maverick, Mistral’s Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it tested the model against Mistral’s OCR-focused API, Mistral OCR.
Command A Vision outperformed the other models in tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared with 78.6% for GPT-4.1, 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3.
Most large language models (LLMs) today are multimodal, meaning they can generate or understand visual media such as photos or videos. But enterprises generally work with graphics-heavy documents such as charts and PDFs, so extracting information from these unstructured data sources is often difficult.
With deep research on the rise, the importance of models capable of reading, analyzing and even downloading unstructured data has grown.
Cohere also said it is offering Command A Vision as an open-weights release, hoping that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.
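For developers curious about the open weights, a Hugging Face Transformers workflow along the following lines should be close; it is a sketch, not official sample code. The model identifier, the image URL and the exact chat-template field names are assumptions and may differ from what Cohere ships.

```python
# Minimal sketch of running the open weights locally with Hugging Face
# Transformers. The model ID below is an assumption -- check Cohere's
# Hugging Face organization for the exact name before using it.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"  # assumed identifier

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

# One chart image plus an extraction instruction, in chat-template form.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/quarterly-revenue-chart.png"},
        {"type": "text", "text": "Extract the revenue figures from this chart as a table."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```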