Nvidia’s open Nemotron-Nano-9B-v2 lets users toggle reasoning on and off

Small models are having a moment. On the heels of the release of a new AI vision model from MIT spin-off Liquid AI small enough to fit on a smartwatch, and a model from Google small enough to run on a smartphone, Nvidia is joining the party today with a new small language model (SLM) of its own, Nemotron-Nano-9B-v2, which achieved the highest performance in its class on selected benchmarks and ships with the ability for users to toggle the AI’s “reasoning” on and off, that is, self-checking before producing an answer.
While 9 billion parameters is larger than some of the multimillion-parameter small models VentureBeat has recently covered, Nvidia notes it is a meaningful reduction from the original 12-billion-parameter size, and that the model is designed to fit on a single Nvidia A10 GPU.
As Oleksii Kuchiaev, director of AI models at Nvidia, said on X in response to a question I submitted to him: “The 12B was pruned to 9B to fit specifically on the A10, which is a popular GPU choice for deployment. It is also a hybrid model, which allows it to process larger batch sizes and be up to 6x faster than similarly sized transformer models.”
For context, many leading LLMs fall in the 70+ billion parameter range. (Recall that parameters are the internal settings governing a model’s behavior; more parameters generally denote a larger and more capable, but more compute-hungry, model.)
The model handles several languages, including English, German, Spanish, French, Italian, Japanese, and, in extended descriptions, Korean, Portuguese, Russian, and Chinese. It is suitable for both instruction following and code generation.
Nemotron-Nano-9B-v2 and its pre-training datasets are available now on Hugging Face and through the company’s model catalog.
A fusion of Transformer and Mamba architectures
It is based on Nemotron-H, a set of hybrid Mamba-Transformer models that form the foundation of the company’s latest offerings.
While the most popular LLMs are pure “Transformer” models that rely entirely on attention layers, those layers can become costly in memory and compute as sequence lengths grow.
Nemotron-H models, and others built on the Mamba architecture developed by researchers at Carnegie Mellon University and Princeton University, instead weave in selective state space models (SSMs), which handle very long sequences by carrying information in and out of a compressed state.
These layers scale linearly with sequence length and can process contexts much longer than standard self-attention can, without the same memory and compute overhead.
A hybrid Mamba-Transformer reduces those costs by substituting linear state space layers for most of the attention layers, achieving up to 2 to 3× higher throughput on long contexts with comparable accuracy.
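The difference in scaling behavior can be made concrete with simple arithmetic. The sketch below is an illustrative back-of-the-envelope operation count, not a measurement of either architecture:

```python
def attention_ops(seq_len: int) -> int:
    """Pairwise token interactions in self-attention: O(n^2)."""
    return seq_len * seq_len

def ssm_ops(seq_len: int) -> int:
    """Per-token state updates in a linear state space layer: O(n)."""
    return seq_len

# The ratio grows with the sequence length itself: modest at short
# contexts, enormous at 128K tokens.
for n in (1_024, 32_768, 131_072):
    print(f"{n:>7} tokens: attention/ssm ratio = {attention_ops(n) // ssm_ops(n):,}x")
```

This is why substituting state space layers for most attention layers pays off mainly on long contexts, which matches the article's "up to 2 to 3× higher throughput on long contexts" framing (real gains are far smaller than the raw operation ratio, since the hybrid keeps some attention layers and constants differ).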
Other AI labs beyond Nvidia, such as AI2, have also released models based on the Mamba architecture.
Toggling reasoning on and off using language
Nemotron-Nano-9B-v2 is positioned as a unified chat and reasoning model trained from scratch.
By default, the model generates a reasoning trace before providing a final answer, though users can toggle this behavior through simple control tokens such as /think or /no_think.
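In practice, a caller could wire the toggle into prompt construction. The helper below is a hypothetical sketch; /think and /no_think are the tokens the article describes, but their exact placement in the chat template is an assumption, so consult the model card for the real format:

```python
def build_prompt(user_message: str, reasoning: bool = True) -> str:
    """Prefix a request with a reasoning control token.

    NOTE: where the token belongs in the chat template is an
    illustrative assumption, not Nvidia's documented format.
    """
    control = "/think" if reasoning else "/no_think"
    return f"{control}\n{user_message}"

# Latency-sensitive call: skip the reasoning trace entirely.
print(build_prompt("Classify this support ticket.", reasoning=False))
```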
The model also introduces runtime “thinking budget” management, which lets developers cap the number of tokens devoted to internal reasoning before the model finalizes an answer.
The mechanism is aimed at balancing accuracy against latency, especially in applications such as customer support or autonomous agents.
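One way such a cap could behave is sketched below. The streaming loop shape and the `</think>` delimiter are assumptions for illustration, not Nvidia's documented API; the real model enforces the budget during generation rather than trimming tokens after the fact:

```python
def cap_thinking(tokens, budget: int, end_think: str = "</think>"):
    """Truncate a streamed reasoning trace after `budget` tokens.

    `tokens` is any iterable of generated tokens; `end_think` is an
    assumed delimiter between the reasoning trace and the final answer.
    """
    out, in_trace, spent, capped = [], True, 0, False
    for tok in tokens:
        if in_trace:
            if tok == end_think:
                in_trace = False
                if not capped:
                    out.append(tok)
                continue
            spent += 1
            if spent > budget:
                if not capped:
                    out.append(end_think)  # force the trace closed
                    capped = True
                continue  # drop over-budget reasoning tokens
            out.append(tok)
        else:
            out.append(tok)  # answer tokens always pass through
    return out

# The trace is cut at 3 tokens; the final answer still comes through.
stream = ["a", "b", "c", "d", "</think>", "answer"]
print(cap_thinking(stream, budget=3))
```

The trade-off the article describes falls out directly: a smaller budget lowers latency by shortening the trace, at the cost of whatever accuracy the dropped reasoning would have bought.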
The benchmarks tell a promising story
Evaluation results highlight competitive accuracy against other small open models. Tested in “reasoning on” mode using the NeMo-Skills suite, Nemotron-Nano-9B-v2 reached 72.1% on AIME25, 97.8% on MATH500, 64.0% on GPQA, and 71.1% on LiveCodeBench.
The company also reports scores on instruction-following and long-context benchmarks: 90.3% on IFEval, 78.9% on the RULER 128K test, and smaller but measurable gains on BFCL v3 and the HLE benchmark.

Overall, Nano-9B-v2 shows higher accuracy than Qwen3-8B, a common comparison point.

Nvidia illustrates these results with accuracy-versus-budget curves that show how performance scales as the token allowance for reasoning increases. The company suggests careful budget control can help developers optimize for both quality and latency in production use.
Trained on synthetic data sets
The Nano model and the Nemotron-H family were built on a mix of curated, web-sourced, and synthetic training data.
The corpora include general text, code, mathematics, science, legal and financial documents, and alignment-style question-and-answer sets.
Nvidia confirms using synthetic reasoning traces generated by other large models to strengthen performance on complex benchmarks.
License and commercial use
The Nano-9B-v2 model is released under the Nvidia Open Model License Agreement, last updated in June 2025.
The license is designed to be permissive and enterprise-friendly. Nvidia explicitly states that the models are commercially usable out of the box, and that developers are free to create and distribute derivative models.
Importantly, Nvidia does not claim ownership of any outputs generated by the model, leaving responsibility for them, and rights to them, with the developer or organization using it.
For an enterprise developer, this means the model can be put into production immediately without negotiating a separate commercial license or paying fees tied to usage thresholds, revenue levels, or user counts. There are no clauses requiring a paid license once a company reaches a certain scale, unlike some tiered open licenses used by other providers.
That said, the agreement includes several conditions enterprises must observe:
- Guardrails: Users may not bypass or disable built-in safety mechanisms (called “guardrails”) without implementing comparable replacements suited to their deployment.
- Redistribution: Any redistribution of the model or derivatives must include the Nvidia Open Model License text and attribution (“Licensed by NVIDIA Corporation under the NVIDIA Open Model License”).
- Compliance: Users must comply with trade regulations and restrictions (for example, U.S. export laws).
- Trustworthy AI terms: Usage must align with Nvidia’s Trustworthy AI guidelines, which cover responsible deployment and ethical considerations.
- Litigation clause: If a user initiates copyright or patent litigation against another entity alleging infringement by the model, the license terminates automatically.
These conditions focus on legal and responsible use rather than commercial scale. Companies do not need to seek extra permission from, or pay royalties to, Nvidia simply for building products, monetizing them, or scaling their user base. Instead, they must ensure their deployment practices respect safety, attribution, and compliance obligations.
Positioning on the market
With Nemotron-Nano-9B-v2, Nvidia is targeting developers who need a balance of reasoning capability and deployment efficiency at smaller scales.
The runtime budget control and reasoning on/off features are meant to give system builders more flexibility in trading accuracy against response speed.
The release on Hugging Face and in Nvidia’s model catalog signals that the model is intended to be widely accessible for experimentation and integration.
Nvidia’s release of Nemotron-Nano-9B-v2 reflects a continued emphasis on efficiency and controllable reasoning in language models.
By combining a hybrid architecture with new compression and training techniques, the company is offering developers tools that seek to maintain accuracy while reducing cost and latency.