|

Unlocking the NPU: FastFlowLM


To learn more about Local AI topics, check out related posts in the Local AI Series 

Disclaimer: I create this content entirely on my own time, and the views expressed here are mine alone (not my employer’s). Because I love leveraging new tech, I use AI tools like Gemini, NotebookLM, Claude, Perplexity and others as a “digital team” to help research and polish these articles so I can share the best possible insights with you!

How I Bypassed Ollama and LM studio Limitations on my Ryzen AI NPU to Hit 50+ TPS

If you recently purchased a modern AI-PC, you bought into a promising vision: a dedicated, cutting-edge Neural Processing Unit (NPU) sitting right inside your silicon, designed to stream large language models (LLMs) smoothly without draining your battery or spinning up your GPU fans. But if you are like me, your initial attempt to tap into that hardware felt like hitting a brick wall.

I started where most enthusiasts do: downloading industry staples like Ollama and LM Studio. I assumed they would automatically detect the specialized NPU. Instead, they ignored it entirely. Because these tools are architected for general-purpose GPU execution (CUDA/ROCm), my state-of-the-art Ryzen AI chip was sidelined. The system defaulted to heavy CPU emulation, yielding a sluggish, disappointing 3 to 10 tokens per second (TPS). It was inefficient, and left a massive piece of my processor sitting idle.

Then, I discovered FastFlowLM (FLM).

The Breakthrough: Native Hardware Optimization

FLM is built from the ground up to execute models natively on XDNA-based architectures. By leveraging AMD’s specialized kernel drivers, it bypasses standard GPU-centric abstraction layers to communicate directly with the NPU silicon.

The difference was staggering. The moment I initiated the FLM environment, it claimed exclusive, low-overhead access to the NPU. My standard daily-driver models jumped from single-digit speeds to a consistent 22 to 37 TPS. The generation was fluid, instant, and consumed virtually no background GPU resources. When I switched to edge-optimized small language models—specifically Meta’s llama3.2:1b—the performance cracked an unbelievable 50+ TPS.

Inference EngineTarget Hardware BlockAvg. PerformanceUser Experience
Ollama / LM StudioCPU (Fallback)3 – 10 TPSSluggish, high overhead
FastFlowLM (FLM)Native Ryzen NPU22 – 37 TPSFluid, immediate
FLM (Llama 3.2 1B)Native Ryzen NPU50+ TPSBlindingly fast

Setting Up Your Environment

Configuration I used:

  • FastFlowLM in the host machine — (the only one as of now to utilize NPU)
    • unless you changed the default, Model will go into c:\users\<user>\.flm\models
    • Started with Llama3,2b:1B but later added llama3.2:3b and gemma4 and qwen3.5:4B and they workeed GREAT!
  • Docker (because I like to keep my machine clean, and can configure many items virtually)

To get started,

  1. you must first install the FLM Application directly on your Windows or Linux OS (Check out their GitHub repository). This binary acts as the bridge between your model files and the NPU silicon.
    • Start FLM from the terminal using the following command:
  2. flm serve llama3.2:1b –host 0.0.0.0
    • Note: If you use: the standard command: flm server lamma3.2.1b and do not use the –host 0.0.0.0 flag it will only bind itself to 127.0.01 (localhost) which means you will not be able to access it from other computers.
  3. You are not able to interact with llamma3.2.1b on the terminal. if you open another terminal session and type FLM list you will see all of the models available (a lot!) for your machine.
  4. While FLM runs well in a terminal, most users prefer a GUI to display the server status and to interface with the chat
    • for a Display GUI of the Server ( where you can start/stop, configure the servier and also show NPU usage: there is a tool called FLM Companion (highgjly recommended!)
    • For a front end chatbot I integrated my FLM instance into a containerized workflow using OpenWebUI and n8n. Because OpenWebUI expects an OpenAI-compliant API, you can point it directly to your local FLM serving port.

If you are using Docker, configure your docker-compose.yml like this:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - ENABLE_OLLAMA=False
      - OPENAI_API_BASE_URL=http://host.docker.internal:52625/v1
      - OPENAI_API_KEY=flm
    extra_hosts:
      - "host.docker.internal:host-gateway"

⚠️ Crucial Note: Because Docker containers operate in an isolated network, routing is key. ENABLE_OLLAMA=False prevents connection timeouts, while host.docker.internal (and the host-gateway flag) maps the container bridge back to your host machine’s FLM port.

Alternatively, if you have a different chatbot interface ( like my own: Unified Chat Hub) use

http://<your.IP.ADDRESS>:52625/v1/chat/completions

Analyzing Real-Time Execution

When a request hits your stack, the handoff is immediate. The execution logs confirm the NPU engagement:

[🔗 ] TCP connection established - Remote: 127.0.0.1:61894
[🟢 ] NPU Locked!
[FLM] Model: gemma4-it:e4b
[FLM] Start prefill... (14 tokens)
[FLM] Start generating...

The [🟢 ] NPU Locked! status indicates that FLM has successfully bypassed the CPU/GPU bottleneck and established a direct hardware lock. The model prefill phase caches your context into high-speed memory, allowing subsequent inference to leverage the NPU for near-instant response times.

The Takeaway

NPUs represent a paradigm shift for local AI, but hardware is only as good as the software stack running on it. If you are stuck using legacy, GPU-dependent tools for your Ryzen AI silicon, you are leaving massive performance gains on the table. Moving to a native engine like FLM transforms your hardware from an underutilized component into a highly responsive, ultra-fast localized AI powerhouse.

Update: AS I researched and read more about these guys I noticed that FLM is now embedded into Lemonade: Local AI

Have questions, ideas to share, or just want to connect? I’d love to hear from you! Check out my About Page to learn more about me or connect with me.