Scale any workload. Data or pipeline parallelism for throughput. Tensor parallelism for single-batch latency.
Vision: high-res, real-time. Tiling splits 4K+ images across cores. Low-latency inference at full resolution.
LLMs: saturate memory bandwidth. Model parallelism keeps latency tight as models grow.
Multi-chip and chiplet ready. Bridge clusters across dies. Same toolchain. Compiler-managed.
Scale array size, add cores, bridge chiplets—same software stack.
Up to 100 TOPS per core
Up to 800 TOPS per chiplet
Up to 6,400 TOPS per system
One architecture supports spatial, data, pipeline, and task parallelism—choose the pattern that fits.
Large input partitioned into tiles, processed in parallel. Adjacent cores exchange edge data via MLS.
Same model on each core, each processing different batches. Throughput scales linearly with core count.
Model layers partitioned across cores as pipeline stages. Maximizes weight bandwidth.
Each core runs a separate model or workload independently. No synchronization overhead.
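The spatial (tiled) pattern can be sketched in a few lines: each core takes a band of image rows plus a halo of edge rows from its neighbors, which is the data that would travel core-to-core over MLS. A minimal sketch; the function name and one-row halo are illustrative assumptions, not part of the toolchain.

```python
# Sketch of spatial tiling: one 4K frame split into row bands across N cores.
# Each core computes its own band but reads a halo of neighbor rows, so
# adjacent cores exchange only edge data rather than whole tiles.

def tile_bands(height, num_cores, halo):
    """Return (compute_start, compute_end, read_start, read_end) per core."""
    band = height // num_cores
    tiles = []
    for i in range(num_cores):
        c_start = i * band
        c_end = height if i == num_cores - 1 else c_start + band
        r_start = max(0, c_start - halo)   # halo rows from the core above
        r_end = min(height, c_end + halo)  # halo rows from the core below
        tiles.append((c_start, c_end, r_start, r_end))
    return tiles

# 2160-row (4K) frame across 8 cores, 1-row halo for a 3x3 kernel
for t in tile_bands(2160, 8, 1):
    print(t)
```

The compute ranges partition the frame exactly; only the read ranges overlap, and that overlap is the inter-core traffic.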
Homogeneous clusters of 2, 4, or 8 cores with direct L2↔L2 sharing.
QC-M Cluster
Key: MLS enables direct L2↔L2 access between cores; AXI Coalescer optimizes external memory bandwidth.
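A rough back-of-envelope model of the bandwidth at stake in the tiled-vision case: the halo rows neighbors exchange per 4K frame, which direct L2↔L2 access keeps on-chip. All constants here are illustrative assumptions, not device specifications.

```python
# Illustrative accounting of per-frame halo traffic between cores.

WIDTH = 3840            # 4K frame width, pixels
HALO_ROWS = 1           # rows exchanged per shared tile edge
BYTES_PER_PIXEL = 1     # 8-bit activations
CORES = 8               # 8 row bands -> 7 shared edges

edges = CORES - 1
halo_bytes = edges * 2 * HALO_ROWS * WIDTH * BYTES_PER_PIXEL  # both directions
print(f"halo traffic per frame: {halo_bytes} bytes")
# With direct L2-to-L2 sharing this traffic stays on-chip; without it,
# each halo row would be refetched through external memory instead.
```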
For Processor Architects
For Software Architects
Scale clusters through the customer's Network-on-Chip.
Multi-Cluster Architecture
Component Overview
Per Cluster
Use Case | Configuration | Peak Performance
TinyML / IoT | 1× QC-N | 1 TOPS
High-Volume Vision | 2× QC-P | 24 TOPS
Edge LLM | 4× QC-P | 48 TOPS
ADAS L2+ | 8× QC-U (1 chiplet) | 800 TOPS
Autonomous / L4+ | 64× QC-U (8 chiplets) | 6,400 TOPS
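The configurations above imply per-core peak figures (QC-N: 1 TOPS, QC-P: 12 TOPS, QC-U: 100 TOPS), which a simple sizing sketch can reproduce. The dictionary and function are illustrative; delivered throughput also depends on clock, precision, and utilization.

```python
# Peak-TOPS sizing derived from the configuration table above.
PER_CORE_TOPS = {"QC-N": 1, "QC-P": 12, "QC-U": 100}

def peak_tops(core, count):
    """Peak TOPS for `count` cores of the given type."""
    return PER_CORE_TOPS[core] * count

print(peak_tops("QC-P", 4))    # Edge LLM
print(peak_tops("QC-U", 8))    # ADAS L2+ (1 chiplet)
print(peak_tops("QC-U", 64))   # Autonomous / L4+ (8 chiplets)
```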
8-Chiplet System Architecture
From 1 TOPS edge devices to 6,400 TOPS autonomous systems—same architecture, same software.