7 New, Cutting-Edge AI Chips From Nvidia And Rivals In 2025

While Nvidia remains at the top of the AI computing hierarchy with its GPUs, several other companies are vying to grab a sizeable piece of the market with AI chips that either go head-to-head with Nvidia’s fastest or go after adjacent opportunities like the edge.

These include companies big and small, ranging from Nvidia customers such as Amazon and Google to direct competitors like AMD and d-Matrix, the latter of which belongs to a stable of startups developing new chip architectures from the ground up.

[Related: AMD Calls OpenAI ‘Early Design Partner’ For MI450. Sam Altman Is ‘Extremely Excited.’]

Most recently, AMD revealed its biggest challenge to Nvidia yet with the Instinct MI350 series, which will provide 60 percent greater high-bandwidth memory capacity than its rival’s B200 GPU and GB200 Superchip. This helps the MI355X provide up to 40 percent more tokens per dollar than the B200 for inferencing models, according to the company.

As part of CRN's AI Week 2025, here are seven cutting-edge AI chips from Nvidia and rivals that have been released in the past several months or are coming to the market later this year.

Amazon Web Services Trainium2

Amazon Web Services launched its Trainium2 AI chip last December, saying that it typically offers 30 percent to 40 percent better price performance than other GPU-powered instances that were available at the time.

This is based on the Trn2 instances, which feature 16 Trainium2 chips to offer 20.8 petaflops of peak computing power for training and deploying multibillion-parameter large language models, according to AWS.

Trainium2 also powers the new Amazon EC2 Trn2 UltraServers, which feature 64 such chips connected using its NeuronLink interconnect, allowing them to scale up to 83.2 peak petaflops. This is quadruple the compute, memory and networking capabilities of a single instance.

Each Trainium2 chip consists of eight NeuronCore-v3 components that collectively enable nearly 1,300 teraflops of 8-bit floating-point compute, which is 6.7 times faster than the first-generation Trainium. Its high-bandwidth memory capacity is 96 GB, three times that of its predecessor, while its memory bandwidth is 2.9 TBps, 3.6 times faster.
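
As a rough sanity check (a back-of-the-envelope sketch, not AWS’s own math), the per-chip, per-instance and UltraServer figures quoted above line up:

```python
# Back-of-the-envelope check of AWS's published Trainium2 figures.
per_chip_pflops = 1.3                      # ~1,300 teraflops of FP8 per Trainium2 chip

trn2_instance = 16 * per_chip_pflops       # 16 chips per Trn2 instance
ultraserver = 64 * per_chip_pflops         # 64 chips per Trn2 UltraServer

print(trn2_instance)                 # 20.8 petaflops, matching the quoted Trn2 peak
print(ultraserver)                   # 83.2 petaflops, matching the quoted UltraServer peak
print(ultraserver / trn2_instance)   # 4.0 -- "quadruple" a single instance
```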

AMD Instinct MI355X

AMD said last week that its Instinct MI355X GPU comes with greater memory capacity and better or similar AI performance compared to Nvidia’s fastest Blackwell-based chips.

The MI355X features 288 GB of HBM3e memory, which is higher than the 256-GB capacity of its MI325X and roughly 60 percent higher than the capacity of Nvidia’s B200 GPU and GB200 Superchip, according to AMD. The company said this allows the GPU to support an AI model with up to 520 billion parameters on a single chip. The GPU’s memory bandwidth is 8 TBps, which it said is the same as the B200 and GB200.
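
The 520-billion-parameter claim is plausible if the model’s weights are stored in a 4-bit format; here is a rough sketch under that assumption (the data type is my assumption, not a methodology AMD has published):

```python
# Rough check: can 520B parameters fit in 288 GB of HBM3e?
# Assumes 4-bit (FP4) weights, i.e. 0.5 bytes per parameter -- an assumption,
# not a figure AMD has published.
params = 520e9
bytes_per_param = 0.5               # FP4
weights_gb = params * bytes_per_param / 1e9

print(weights_gb)                   # 260.0 GB of weights
print(weights_gb < 288)             # True -- fits within the MI355X's 288 GB,
                                    # leaving headroom for activations and KV cache
```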

The MI355X—which has a thermal design power of up to 1,400 watts and is targeted for liquid-cooled servers—can provide up to 20 petaflops of peak 6-bit floating point (FP6) and 4-bit floating point (FP4) performance.

AMD claimed that the FP6 performance is two times higher than what is possible with the GB200 and more than double that of the B200. FP4 performance, on the other hand, is the same as the GB200 and 10 percent faster than the B200, according to the company.

The MI355X can also deliver 10 petaflops of peak 8-bit floating point (FP8) performance, which AMD said is on par with the GB200 but 10 percent faster than the B200; five petaflops of peak 16-bit floating point (FP16), which it said is likewise on par with the GB200 and 10 percent faster than the B200; and 79 teraflops of 64-bit floating point (FP64), which it said is double that of the GB200 and B200.

AMD said the MI355X “delivers the highest inference throughput” for large models, with the GPU providing roughly 20 percent better performance for the DeepSeek R1 model and approximately 30 percent better performance for a 405-billion-parameter Llama 3.1 model than the B200.

Compared to the GB200, the company said the MI355X is on par when it comes to the same 405-billion-parameter Llama 3.1 model.

The MI355X’s inference advantage over the B200 allows the GPU to provide up to 40 percent more tokens per dollar, which AMD called a “key value proposition” against Nvidia.
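
Tokens per dollar is simply throughput normalized by cost, so the 40 percent figure depends on both the inference speedup AMD measured and whatever pricing it assumed for each GPU. A purely hypothetical illustration follows (the relative prices below are placeholders, not published list prices):

```python
# Illustrative only: tokens-per-dollar combines measured throughput with assumed
# hardware cost. The relative price (0.93) is a placeholder chosen to make the
# arithmetic land near AMD's ~40 percent claim, not a real quote.
b200_tokens_per_sec = 1.00          # normalize the B200 to 1.0
mi355x_tokens_per_sec = 1.30        # ~30 percent higher Llama 3.1 405B throughput

b200_price = 1.00                   # normalized B200 cost
mi355x_price = 0.93                 # hypothetical relative MI355X cost

advantage = (mi355x_tokens_per_sec / mi355x_price) / (b200_tokens_per_sec / b200_price)
print(advantage)                    # ~1.40 -- roughly 40 percent more tokens per dollar
```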

D-Matrix Corsair

Chip design startup d-Matrix last November launched its Corsair accelerator card, which it called the “world’s most efficient AI computing platform for inference in data centers.”

Compared to Nvidia’s H100 GPU, Corsair provides 10 times faster performance, three times better price-performance and three times better energy efficiency for a 70-billion-parameter Llama model using 8-bit math, according to d-Matrix.

Using a PCIe Gen 5 form factor, Corsair combines two application-specific integrated circuits (ASICs) connected via PCIe, with each ASIC containing cores based on the startup’s Digital In-Memory Compute Architecture (DIMC) alongside RISC-V cores and other elements across four interconnected chiplets.

D-Matrix said its DIMC architecture is able to deliver on-chip memory bandwidth of 150 TBps, which it said was an “order of magnitude higher than [high-bandwidth memory] available” at the time of launch. This is made possible by the DIMC’s tight integration of memory and compute.
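
For a sense of scale, here is a rough comparison against the ~8-TBps HBM bandwidth cited elsewhere in this article for the B200 and MI355X (my baseline choice, not d-Matrix’s own reference point):

```python
# Rough scale comparison: d-Matrix's claimed on-chip DIMC bandwidth versus the
# ~8 TBps of HBM bandwidth quoted for the B200/MI355X earlier in this piece.
dimc_tbps = 150.0
hbm_tbps = 8.0

print(dimc_tbps / hbm_tbps)   # ~18.8x -- consistent with the "order of
                              # magnitude higher" framing
```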

EnCharge EN100

Announced in late May, the EnCharge EN100 is what chip design startup EnCharge calls the “world’s first AI accelerator built on precise and scalable analog in-memory computing.”

Built for laptops, workstations and edge devices, the EN100 is capable of delivering more than 200 trillion operations per second (TOPS) using as little as 8.25 watts with the M.2 form factor for notebooks, the company said.

The PCIe-based workstation solution, on the other hand, packs four neural processing units (NPUs) to provide roughly 1,000 TOPS, allowing it to provide “GPU-level compute capacity at a fraction of the cost and power consumption,” according to EnCharge.
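
Those numbers imply an efficiency in the low tens of TOPS per watt for the M.2 part and roughly 250 TOPS per NPU in the workstation card (a back-of-the-envelope reading of EnCharge’s figures, not numbers the company has broken out):

```python
# Back-of-the-envelope from EnCharge's published figures.
m2_tops, m2_watts = 200, 8.25
print(m2_tops / m2_watts)        # ~24 TOPS per watt for the M.2 form factor

workstation_tops, npus = 1000, 4
print(workstation_tops / npus)   # ~250 TOPS per NPU in the PCIe card
```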

The EN100, which features up to 128 GB of LPDDR memory, is being made available to developers and OEMs in an early access program.

Google TPU v7

Google revealed in April its seventh-generation TPU, Ironwood, which it said was designed to improve performance and scalability for inferencing.

Announcing the chip at the Google Cloud Next 2025 event, Google Cloud CEO Thomas Kurian said this advancement will allow Google and its cloud customers to develop and deploy more sophisticated AI models at scale for inference.

“Our seventh-generation TPU Ironwood represents our largest and most powerful TPU to date, a more than 10x improvement from our most recent high-performance TPU,” he said.

Google is making Ironwood available in a 256-chip configuration and a 9,216-chip configuration depending on the needs of customers.

The latter configuration allows Ironwood to deliver 42.5 exaflops of AI compute in a pod, which Kurian said allows Google to meet the demand of thinking models like Gemini 2.5.

Compared to Google’s sixth-generation TPU, Ironwood delivers two times greater performance per watt, with 29.3 peak floating-point operations per second per watt, and six times the high-bandwidth memory capacity with 192 GB per chip. The TPU also has 4.5 times greater high-bandwidth memory bandwidth at 7.37 TBps and 50 percent greater inter-chip interconnect bandwidth at 1.2 TBps, according to the company.
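
Dividing the pod-level number by the chip count gives a rough per-chip figure (a back-of-the-envelope derivation from Google’s stated pod specs, not an official per-chip rating):

```python
# Rough per-chip compute implied by Google's Ironwood pod figures.
pod_exaflops = 42.5
chips_per_pod = 9216

per_chip_petaflops = pod_exaflops * 1000 / chips_per_pod
print(per_chip_petaflops)   # ~4.6 petaflops of AI compute per Ironwood chip
```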

Nvidia Blackwell Ultra GPU

Nvidia said in March that its upcoming Blackwell Ultra GPU architecture is built for AI reasoning models, claiming that it can significantly increase the revenue AI providers generate over the previous generation.

Announced at Nvidia’s GTC 2025 event in March, Blackwell Ultra increases the maximum HBM3e high-bandwidth memory by 50 percent to 288 GB and boosts 4-bit floating point (FP4) inference performance by just as much.

Blackwell Ultra-based products from technology partners are set to debut in the second half of 2025. These partners include OEMs such as Dell Technologies, Cisco, Hewlett Packard Enterprise, Lenovo and Supermicro as well as cloud service providers like Amazon Web Services, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure.

The flagship vehicle for Blackwell Ultra is the GB300 NVL72 platform, which consists of 72 Blackwell Ultra GPUs and 36 Grace CPUs, giving it 1.1 exaflops of FP4 dense computation, 20 TB of high-bandwidth memory and 40 TB of fast memory. The platform’s NVLink bandwidth can top out at 130 TBps while networking speeds reach 14.4 TBps.
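
The rack-level memory figure follows directly from the per-GPU HBM capacity Nvidia quoted for Blackwell Ultra; a quick consistency check on the numbers above:

```python
# Quick consistency check on Nvidia's GB300 NVL72 figures.
gpus = 72
hbm_per_gpu_gb = 288                      # Blackwell Ultra's HBM3e capacity

total_hbm_tb = gpus * hbm_per_gpu_gb / 1000
print(total_hbm_tb)                       # ~20.7 TB, in line with the quoted "20 TB" of HBM

rack_exaflops_fp4 = 1.1
print(rack_exaflops_fp4 * 1000 / gpus)    # ~15 petaflops of dense FP4 per GPU
```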

With data centers running DeepSeek and other kinds of AI models representing what Nvidia called a $1 trillion opportunity, the company is focusing on how its GPUs, systems and software can help AI application providers make more money, with Nvidia saying that Blackwell Ultra alone can enable a 50-fold increase in “data center revenue opportunity.”

The 50-fold increase is based on the performance improvement Nvidia said it can provide for the 671-billion-parameter DeepSeek-R1 reasoning model with the new GB300 NVL72 rack-scale platform—which updates the recently launched GB200 NVL72 with the new Blackwell Ultra-based GB300 superchip—over an HGX H100-based data center at the same power level.

Whereas the HGX H100 can deliver 100 tokens per second with a 90-second response time for the DeepSeek-R1 model, the GB300 NVL72 can increase the tokens per second by 10 times to 1,000 while shortening the response time to just 10 seconds, according to Nvidia.

SiMa.AI MLSoC Modalix

Chip design startup SiMa.AI announced in March that its MLSoC Modalix chip will power a system-on-module that it said will lower the total cost of deploying AI at the edge.

Developed in partnership with Enclustra, the MLSoC Modalix system-on-module (SoM) can “seamlessly” integrate into a “range of leading systems,” thanks to its compatibility with the “leading GPU SoM provider in the edge AI space,” according to SiMa.ai.

The startup said the MLSoC Modalix, which began sampling with customers in January, offers more than 10 times the performance per watt of alternative products for multi-modal AI applications that make use of large language models and computer vision models, among other things.

With the MLSoC Modalix capable of 50 trillion operations per second (TOPS), the product supports clustering of two or four chips to offer up to 100 TOPS or 200 TOPS, with the latter being available in a PCIe card.

The high-level functions of the MLSoC Modalix include the 50-TOPS machine learning accelerator, an application processing unit consisting of eight Arm Cortex-A65 CPUs, a video encoder and decoder, a computer vision unit, an image signal processor, a DRAM interface system and a high-speed I/O subsystem.
