Software and silicon,
designed as one.

Running LLMs locally is an engineering problem, not a scaling problem. Cohicular builds the compiler, runtime, and chip microarchitecture together—so your model runs where your data lives.

10× inference efficiency
<5 ms time to first token
Zero cloud dependency
[Chip diagram: Cohicular CX-1, Rev. A, on-device LLM inference engine (co-designed, 32 TOPS): vector units, attention engine, LLM inference core, weight cache, I/O interface.]

The best optimization is the one nobody had to make.

Most teams optimize software for hardware that already exists, or they spec out silicon hoping the software stack will catch up. Each side leaves headroom on the table that only the other could unlock. Co-design is the answer—but it only works if both sides are done by the same people, at the same time.

Cohicular was founded on that premise. We build the inference compiler and the chip microarchitecture as a single problem. The quantization pass knows about the NPU’s memory bus. The datapath knows what the scheduler will throw at it. That tight feedback loop is how you get 10× efficiency, not 10% efficiency.

We focus entirely on on-device LLM inference—not because the cloud is broken, but because latency, privacy, and cost compound against you over time. Local inference done right isn’t a compromise. It’s a better product.

Hardware-software joint optimization from day one
Custom compiler passes co-designed with the microarchitecture
End-to-end inference stack: model → silicon
[Diagram: co-design architecture. Software side: model arch (Transformer / SSM), compiler (joint optimization), quantization (INT4 / INT8 / FP8), runtime (scheduling & dispatch). Hardware side: logic design (RTL / architecture), memory system (SRAM / bandwidth), NPU datapath (systolic / SIMD), power management (DVFS / clock gating). Both sides feed fully co-optimized on-device inference.]

The gap between model and metal is where performance hides.

Our compiler knows the hardware. Our hardware was designed knowing what the compiler would generate. That tight loop is what makes this level of efficiency possible.

[Diagram: Cohicular inference pipeline. Model definition (arch + weights) → joint compiler (graph + HW-aware) → kernel generation (custom ISA) → HW scheduler (datapath mapping) → runtime execution (on-device). Continuous co-optimization feedback loops: compiler ↔ memory (tiling / bandwidth), quantization ↔ datapath (precision / ops matching), scheduler ↔ power (DVFS / compute budget), runtime ↔ cache (prefetch / eviction policy).]
01

Joint compilation

Our compiler doesn't target a fixed ISA—it negotiates with the hardware it's building for. Operator fusion, memory tiling, and quantization decisions are made knowing the exact datapath constraints of the target NPU.
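The actual passes are proprietary, but the flavor of a hardware-aware decision can be sketched. The toy `pick_tile` function below (every name and parameter is illustrative, not CXC's API) picks matmul tile dimensions that fit an on-chip SRAM budget while staying aligned to an assumed datapath lane width:

```python
def pick_tile(M, N, K, sram_bytes, dtype_bytes=1, lane_width=8):
    """Toy hardware-aware tiling for an M x K by K x N matmul: pick the
    largest tile whose working set (A-tile + B-tile + INT32 accumulator)
    fits in SRAM, keeping every tile dim a multiple of the lane width."""
    best = None
    for tm in range(lane_width, M + 1, lane_width):
        for tn in range(lane_width, N + 1, lane_width):
            for tk in range(lane_width, K + 1, lane_width):
                # A tile: tm x tk, B tile: tk x tn, accumulator: tm x tn
                working_set = (tm * tk + tk * tn) * dtype_bytes + tm * tn * 4
                if working_set <= sram_bytes:
                    # bigger tiles mean fewer DRAM round trips
                    score = tm * tn * tk
                    if best is None or score > best[0]:
                        best = (score, (tm, tn, tk))
    return best[1] if best else None
```

In a real joint compiler the same search would also consult the quantizer (tile precision changes `dtype_bytes`) and the scheduler (tile shape changes dispatch order), which is exactly the coupling the text describes.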

02

Microarchitecture co-design

The NPU is designed knowing what the compiler will emit. If the compiler prefers 8-wide INT4 dot products, the datapath is built for that. No abstraction layers wasting cycles translating between what software wants and what silicon provides.

03

End-to-end weight awareness

Quantization isn't a post-training afterthought. Weight distributions, sparsity patterns, and hardware precision trade-offs are modeled together. The result is models that are smaller, faster, and more accurate on the target device.
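A minimal sketch of what weight awareness means at the lowest level, assuming plain symmetric quantization (Cohicular's actual scheme is not public): each output channel gets its own scale derived from its weight distribution, so a channel with small weights is not crushed by a neighbor with outliers.

```python
def quantize_per_channel(weights, bits=4):
    """Symmetric per-channel quantization: one scale per output channel
    (row), derived here from that channel's absolute-max weight."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    out = []
    for row in weights:
        scale = max(abs(w) for w in row) / qmax or 1.0  # guard all-zero rows
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
        out.append((q, scale))
    return out

def dequantize(quantized):
    return [[qi * scale for qi in q] for q, scale in quantized]
```

Modeling this jointly with the hardware means the choice of `bits` per layer can be traded against the datapath precision the silicon actually provides, rather than fixed after training.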

One stack. Every layer.

Use one piece, use all three, or let us integrate the stack into your existing hardware program.

Available

Cohicular Compiler (CXC)

A graph compiler that optimizes LLM models against a hardware target you define—or one we designed. It handles operator fusion, weight packing, memory tiling, and mixed-precision quantization as a single joint pass, not a sequential pipeline.

INT4/INT8/FP8 quantization
Hardware-aware graph rewriting
Custom kernel emission
MLIR-based IR
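To make "hardware-aware graph rewriting" concrete, here is a toy rewrite over a linear op list (the capability name `fused_mm_bias_relu` is hypothetical, not a CXC flag): a MatMul → Add → ReLU chain collapses into one fused node only when the target reports a fused kernel for that pattern.

```python
def fuse_ops(graph, target_caps):
    """Toy hardware-aware rewrite: graph is a list of op names in
    execution order; fusion fires only if the target supports it."""
    out, i = [], 0
    while i < len(graph):
        window = tuple(graph[i:i + 3])
        if window == ("MatMul", "Add", "ReLU") and "fused_mm_bias_relu" in target_caps:
            # one kernel launch, no intermediate tensors written to memory
            out.append("FusedMatMulBiasReLU")
            i += 3
        else:
            out.append(graph[i])
            i += 1
    return out
```

A real graph compiler pattern-matches over a dataflow graph rather than a list, but the principle is the same: the rewrite rules are parameterized by what the silicon can actually execute.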
Available

Inference SDK

A lightweight C++ runtime for deploying compiled models on edge hardware. It handles tensor dispatch, memory management, and device scheduling. Drop it into your firmware stack without pulling in a 200 MB ML framework.

< 500 KB runtime footprint
Static memory planning
Async prefetch
C / C++ / Rust APIs
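Static memory planning means every tensor's offset is fixed at compile time, so the runtime never allocates. A toy version of the idea (not the SDK's actual planner): place tensors greedily so that only lifetime-overlapping tensors exclude each other, and the total arena size falls out of the plan.

```python
def plan_memory(tensors):
    """Static memory planning sketch. tensors = [(name, size, first_use,
    last_use)]. Each tensor gets the lowest offset not overlapping any
    tensor that is live at the same time; largest tensors place first."""
    placed = []  # (offset, size, first_use, last_use)
    plan = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # byte ranges held by tensors whose lifetimes overlap this one
        busy = sorted((o, s) for o, s, f, l in placed if not (l < first or last < f))
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break  # fits in the gap below this occupied range
            offset = max(offset, o + s)
        placed.append((offset, size, first, last))
        plan[name] = offset
    arena = max((o + s for o, s, _, _ in placed), default=0)
    return plan, arena
```

Because tensors with disjoint lifetimes share the same bytes, the arena is typically far smaller than the sum of tensor sizes, which is how a sub-500 KB runtime can avoid dynamic allocation entirely.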
Early access

Reference NPU (CX-1)

Our reference neural processing unit microarchitecture, built specifically for transformer inference on edge SoCs. Every datapath decision was made knowing what CXC emits. License it as RTL and tape it into your next chip.

32 TOPS @ 1W target
Systolic + vector hybrid
On-chip weight cache
RTL deliverable
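For intuition about the systolic half of the datapath, here is a cycle-level toy model (purely illustrative, not CX-1 RTL behavior): in an output-stationary array, operands skew in diagonally and each processing element accumulates its own output cell.

```python
def systolic_matmul(A, B):
    """Cycle-level toy of an output-stationary systolic array: A streams
    in from the left, B from the top, and PE (i, j) multiply-accumulates
    into its stationary output cell C[i][j]."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    # at cycle t, PE (i, j) sees A[i][t - i - j] and B[t - i - j][j];
    # the skew means distant PEs start late and finish late
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    C[i][j] += A[i][s] * B[s][j]
    return C
```

The hybrid design pairs an array like this for dense matmuls with vector lanes for the softmax, normalization, and activation work that a systolic grid handles poorly.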
Bringing your own hardware? We also do joint development engagements: your silicon, our compiler and runtime.
DISCUSS A PROJECT →

Let’s build it right.

Whether you’re designing a chip, building an edge product, or just trying to get a large model running locally without embarrassing yourself—reach out.

Hardware programsNPU co-design, RTL licensing, and joint tape-out engagements.
Software integrationDeploying CXC or the Inference SDK into your product stack.
Research & partnershipsAcademic collaboration, benchmarking, and technical exchange.
[email protected]
San Francisco, CA

We read every message ourselves and respond within one business day.