Software and silicon,
designed as one.
Running LLMs locally is an engineering problem, not a scaling problem. Cohicular builds the compiler, runtime, and chip microarchitecture together—so your model runs where your data lives.
The best optimization is the one nobody had to make.
Most teams optimize software for hardware that already exists, or they spec out silicon hoping the software stack will catch up. Each side leaves headroom on the table that only the other could unlock. Co-design is the answer, but it only works when both sides are built by the same people, at the same time.
Cohicular was founded on that premise. We build the inference compiler and the chip microarchitecture as a single problem. The quantization pass knows about the NPU’s memory bus. The datapath knows what the scheduler will throw at it. That tight feedback loop is how you get 10× efficiency, not 10% efficiency.
We focus entirely on on-device LLM inference—not because the cloud is broken, but because latency, privacy, and cost compound against you over time. Local inference done right isn’t a compromise. It’s a better product.
The gap between model and metal is where performance hides.
Our compiler knows the hardware. Our hardware was designed knowing what the compiler would generate. That closed loop is what makes this level of efficiency possible.
Joint compilation
Our compiler doesn't target a fixed ISA—it negotiates with the hardware it's building for. Operator fusion, memory tiling, and quantization decisions are made knowing the exact datapath constraints of the target NPU.
Microarchitecture co-design
The NPU is designed knowing what the compiler will emit. If the compiler prefers 8-wide INT4 dot products, the datapath is built for that. No abstraction layers wasting cycles translating between what software wants and what silicon provides.
End-to-end weight awareness
Quantization isn't a post-training afterthought. Weight distributions, sparsity patterns, and hardware precision trade-offs are modeled together. The result is models that are smaller, faster, and more accurate on the target device.
One stack. Every layer.
Use one piece, use all three, or let us integrate the stack into your existing hardware program.
Cohicular Compiler (CXC)
A graph compiler that optimizes LLMs against a hardware target you define, or one we designed. It handles operator fusion, weight packing, memory tiling, and mixed-precision quantization as a single joint pass, not a sequential pipeline.
Inference SDK
A lightweight C++ runtime for deploying compiled models on edge hardware. It handles tensor dispatch, memory management, and device scheduling. Drop it into your firmware stack without pulling in a 200 MB ML framework.
Reference NPU (CX-1)
Our reference neural processing unit microarchitecture, built specifically for transformer inference on edge SoCs. Every datapath decision was made knowing what CXC emits. License it as RTL and tape it into your next chip.
Let’s build it right.
Whether you’re designing a chip, building an edge product, or just trying to get a large model running locally without embarrassing yourself—reach out.