Software and silicon,
designed as one.

Running LLMs locally is an engineering problem, not a scaling problem. Cohicular builds the compiler, runtime, and chip microarchitecture together—so your model runs where your data lives.

10× inference efficiency
<5 ms time to first token
Zero cloud dependency
[Chip diagram: Cohicular CX-1, Rev. A, on-device LLM inference engine (co-designed, 32 TOPS): vector units, attention engine, LLM inference core, weight cache, I/O interface.]

The best optimization is the one nobody had to make.

Most teams optimize software for hardware that already exists, or they spec out silicon hoping the software stack will catch up. Each side leaves headroom on the table that only the other could unlock. Co-design is the answer—but it only works if both sides are done by the same people, at the same time.

Cohicular was founded on that premise. We build the inference compiler and the chip microarchitecture as a single problem. The quantization pass knows about the NPU’s memory bus. The datapath knows what the scheduler will throw at it. That tight feedback loop is how you get 10× efficiency, not 10% efficiency.

We focus entirely on on-device LLM inference—not because the cloud is broken, but because latency, privacy, and cost compound against you over time. Local inference done right isn’t a compromise. It’s a better product.

Hardware-software joint optimization from day one
Custom compiler passes co-designed with the microarchitecture
End-to-end inference stack: model → silicon
[Diagram: co-design architecture. Software side: model arch (Transformer / SSM), compiler (joint optimization), quantization (INT4 / INT8 / FP8), runtime (scheduling & dispatch). Hardware side: logic design (RTL / architecture), memory system (SRAM / bandwidth), NPU datapath (systolic / SIMD), power management (DVFS / clock gating). Both sides feed fully co-optimized on-device inference.]

The gap between model and metal is where performance hides.

Our compiler knows the hardware. Our hardware was designed knowing what the compiler would generate. That tight loop is what makes this level of efficiency possible.

[Diagram: Cohicular inference pipeline. Model definition (arch + weights) → joint compiler (graph + HW-aware) → kernel generation (custom ISA) → HW scheduler (datapath mapping) → runtime execution (on-device). Continuous co-optimization feedback loops: compiler ↔ memory (tiling / bandwidth), quantization ↔ datapath (precision / ops matching), scheduler ↔ power (DVFS / compute budget), runtime ↔ cache (prefetch / eviction policy).]
01

Joint compilation

Our compiler doesn't target a fixed ISA—it negotiates with the hardware it's building for. Operator fusion, memory tiling, and quantization decisions are made knowing the exact datapath constraints of the target NPU.
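The actual passes are proprietary, but the flavor of a hardware-aware decision can be sketched. The toy `pick_tile` function below (every name and parameter is illustrative, not CXC's API) picks matmul tile dimensions that fit an on-chip SRAM budget while staying aligned to an assumed datapath lane width:

```python
def pick_tile(M, N, K, sram_bytes, dtype_bytes=1, lane_width=8):
    """Toy hardware-aware tiling for an M x K by K x N matmul: pick the
    largest tile whose working set (A-tile + B-tile + INT32 accumulator)
    fits in SRAM, keeping every tile dim a multiple of the lane width."""
    best = None
    for tm in range(lane_width, M + 1, lane_width):
        for tn in range(lane_width, N + 1, lane_width):
            for tk in range(lane_width, K + 1, lane_width):
                # A tile: tm x tk, B tile: tk x tn, accumulator: tm x tn
                working_set = (tm * tk + tk * tn) * dtype_bytes + tm * tn * 4
                if working_set <= sram_bytes:
                    # bigger tiles mean fewer DRAM round trips
                    score = tm * tn * tk
                    if best is None or score > best[0]:
                        best = (score, (tm, tn, tk))
    return best[1] if best else None
```

In a real joint compiler the same search would also consult the quantizer (tile precision changes `dtype_bytes`) and the scheduler (tile shape changes dispatch order), which is exactly the coupling the text describes.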

02

Microarchitecture co-design

The NPU is designed knowing what the compiler will emit. If the compiler prefers 8-wide INT4 dot products, the datapath is built for that. No abstraction layers wasting cycles translating between what software wants and what silicon provides.

03

End-to-end weight awareness

Quantization isn't a post-training afterthought. Weight distributions, sparsity patterns, and hardware precision trade-offs are modeled together. The result is models that are smaller, faster, and more accurate on the target device.
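A minimal sketch of what weight awareness means at the lowest level, assuming plain symmetric quantization (Cohicular's actual scheme is not public): each output channel gets its own scale derived from its weight distribution, so a channel with small weights is not crushed by a neighbor with outliers.

```python
def quantize_per_channel(weights, bits=4):
    """Symmetric per-channel quantization: one scale per output channel
    (row), derived here from that channel's absolute-max weight."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    out = []
    for row in weights:
        scale = max(abs(w) for w in row) / qmax or 1.0  # guard all-zero rows
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
        out.append((q, scale))
    return out

def dequantize(quantized):
    return [[qi * scale for qi in q] for q, scale in quantized]
```

Modeling this jointly with the hardware means the choice of `bits` per layer can be traded against the datapath precision the silicon actually provides, rather than fixed after training.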

One stack. Every layer.

Use one piece, use all three, or let us integrate the stack into your existing hardware program.

Available

Cohicular Compiler (CXC)

A graph compiler that optimizes LLM models against a hardware target you define—or one we designed. It handles operator fusion, weight packing, memory tiling, and mixed-precision quantization as a single joint pass, not a sequential pipeline.

INT4/INT8/FP8 quantization
Hardware-aware graph rewriting
Custom kernel emission
MLIR-based IR
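To make "hardware-aware graph rewriting" concrete, here is a toy rewrite over a linear op list (the capability name `fused_mm_bias_relu` is hypothetical, not a CXC flag): a MatMul → Add → ReLU chain collapses into one fused node only when the target reports a fused kernel for that pattern.

```python
def fuse_ops(graph, target_caps):
    """Toy hardware-aware rewrite: graph is a list of op names in
    execution order; fusion fires only if the target supports it."""
    out, i = [], 0
    while i < len(graph):
        window = tuple(graph[i:i + 3])
        if window == ("MatMul", "Add", "ReLU") and "fused_mm_bias_relu" in target_caps:
            # one kernel launch, no intermediate tensors written to memory
            out.append("FusedMatMulBiasReLU")
            i += 3
        else:
            out.append(graph[i])
            i += 1
    return out
```

A real graph compiler pattern-matches over a dataflow graph rather than a list, but the principle is the same: the rewrite rules are parameterized by what the silicon can actually execute.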
Available

Inference SDK

A lightweight C++ runtime for deploying compiled models on edge hardware. It handles tensor dispatch, memory management, and device scheduling. Drop it into your firmware stack without pulling in a 200 MB ML framework.

< 500 KB runtime footprint
Static memory planning
Async prefetch
C / C++ / Rust APIs
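Static memory planning means every tensor's offset is fixed at compile time, so the runtime never allocates. A toy version of the idea (not the SDK's actual planner): place tensors greedily so that only lifetime-overlapping tensors exclude each other, and the total arena size falls out of the plan.

```python
def plan_memory(tensors):
    """Static memory planning sketch. tensors = [(name, size, first_use,
    last_use)]. Each tensor gets the lowest offset not overlapping any
    tensor that is live at the same time; largest tensors place first."""
    placed = []  # (offset, size, first_use, last_use)
    plan = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # byte ranges held by tensors whose lifetimes overlap this one
        busy = sorted((o, s) for o, s, f, l in placed if not (l < first or last < f))
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break  # fits in the gap below this occupied range
            offset = max(offset, o + s)
        placed.append((offset, size, first, last))
        plan[name] = offset
    arena = max((o + s for o, s, _, _ in placed), default=0)
    return plan, arena
```

Because tensors with disjoint lifetimes share the same bytes, the arena is typically far smaller than the sum of tensor sizes, which is how a sub-500 KB runtime can avoid dynamic allocation entirely.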
Early access

Reference NPU (CX-1)

Our reference neural processing unit microarchitecture, built specifically for transformer inference on edge SoCs. Every datapath decision was made knowing what CXC emits. License it as RTL and tape it into your next chip.

32 TOPS @ 1W target
Systolic + vector hybrid
On-chip weight cache
RTL deliverable
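For intuition about the systolic half of the datapath, here is a cycle-level toy model (purely illustrative, not CX-1 RTL behavior): in an output-stationary array, operands skew in diagonally and each processing element accumulates its own output cell.

```python
def systolic_matmul(A, B):
    """Cycle-level toy of an output-stationary systolic array: A streams
    in from the left, B from the top, and PE (i, j) multiply-accumulates
    into its stationary output cell C[i][j]."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    # at cycle t, PE (i, j) sees A[i][t - i - j] and B[t - i - j][j];
    # the skew means distant PEs start late and finish late
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    C[i][j] += A[i][s] * B[s][j]
    return C
```

The hybrid design pairs an array like this for dense matmuls with vector lanes for the softmax, normalization, and activation work that a systolic grid handles poorly.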
Bringing your own hardware? We also do joint development engagements: your silicon, our compiler and runtime.
DISCUSS A PROJECT →

Let’s build it right.

Whether you’re designing a chip, building an edge product, or just trying to get a large model running locally without embarrassing yourself—reach out.

Hardware programsNPU co-design, RTL licensing, and joint tape-out engagements.
Software integrationDeploying CXC or the Inference SDK into your product stack.
Research & partnershipsAcademic collaboration, benchmarking, and technical exchange.
[email protected]
San Francisco, CA

We read every message ourselves and respond within one business day.