{"id":521064,"date":"2026-06-28T20:25:43","date_gmt":"2026-06-29T03:25:43","guid":{"rendered":"https:\/\/jorgep.com\/blog\/?p=521064"},"modified":"2026-06-29T07:48:29","modified_gmt":"2026-06-29T14:48:29","slug":"disaggregated-inference-future-of-llm-serving","status":"publish","type":"post","link":"https:\/\/jorgep.com\/blog\/disaggregated-inference-future-of-llm-serving\/","title":{"rendered":"Disaggregated Inference: Future of LLM Serving"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">If you\u2019ve ever wondered why your AI chatbot suddenly slows down when you feed it a massive 50-page PDF, you\u2019ve encountered a fundamental bottleneck in modern AI infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For years, we\u2019ve served LLMs like a one-person kitchen: the same chef (GPU) does all the prep work&nbsp;<em>and<\/em>&nbsp;all the cooking. But as companies start deploying models at massive scale, we\u2019re moving to a restaurant model: the&nbsp;<strong>Disaggregated Inference<\/strong>&nbsp;model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Disaggregated inference<\/strong> is an AI serving architecture in which different stages of model inference, such as request routing, prompt processing (prefill), KV-cache storage, and token generation (decode), are executed on separate hardware resources or services rather than on the same accelerator or server.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Disaggregated inference is primarily a datacenter-scale architecture. While it can be implemented on multi-GPU workstations or specialized local systems, the performance and operational benefits are usually too small to justify the added complexity for a single-user PC. 
Its biggest advantages emerge when serving many concurrent users, where separating prefill and decode workloads improves utilization, fairness, and throughput.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Anatomy of an LLM: Prefill vs. Decode<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To understand disaggregation, you have to realize that generating text happens in two very different stages:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>The Prefill (The &#8220;Reading&#8221;):<\/strong>&nbsp;The model reads your entire prompt at once to build a &#8220;KV Cache&#8221; (the internal map of context). This is a&nbsp;<strong>math-heavy<\/strong>&nbsp;sprint: every prompt token can be processed in parallel, so it rewards raw computing power.<\/li>\n\n\n\n<li><strong>The Decode (The &#8220;Writing&#8221;):<\/strong>&nbsp;The model spits out tokens one by one, constantly looking at the context it built. This is&nbsp;<strong>memory-bandwidth heavy<\/strong>. The model is constantly &#8220;reaching&#8221; into its memory to see what to say next, and because each token depends on the ones before it, the tokens of a single response can\u2019t be generated in parallel.<\/li>\n<\/ol>\n\n\n<style>.kb-row-layout-id521064_addfc9-37 > .kt-row-column-wrap{align-content:start;}:where(.kb-row-layout-id521064_addfc9-37 > .kt-row-column-wrap) > .wp-block-kadence-column{justify-content:start;}.kb-row-layout-id521064_addfc9-37 > .kt-row-column-wrap{column-gap:var(--global-kb-gap-md, 2rem);row-gap:var(--global-kb-gap-md, 2rem);padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);grid-template-columns:repeat(2, minmax(0, 1fr));}.kb-row-layout-id521064_addfc9-37 > .kt-row-layout-overlay{opacity:0.30;}@media all and (max-width: 1024px){.kb-row-layout-id521064_addfc9-37 > .kt-row-column-wrap{grid-template-columns:repeat(2, minmax(0, 1fr));}}@media all and (max-width: 767px){.kb-row-layout-id521064_addfc9-37 > .kt-row-column-wrap{grid-template-columns:minmax(0, 1fr);}}<\/style><div class=\"kb-row-layout-wrap 
kb-row-layout-id521064_addfc9-37 alignnone wp-block-kadence-rowlayout\"><div class=\"kt-row-column-wrap kt-has-2-columns kt-row-layout-equal kt-tab-layout-inherit kt-mobile-layout-row kt-row-valign-top\">\n<style>.kadence-column521064_9f5215-b5 > .kt-inside-inner-col,.kadence-column521064_9f5215-b5 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column521064_9f5215-b5 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column521064_9f5215-b5 > .kt-inside-inner-col{flex-direction:column;}.kadence-column521064_9f5215-b5 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column521064_9f5215-b5 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column521064_9f5215-b5{position:relative;}@media all and (max-width: 1024px){.kadence-column521064_9f5215-b5 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column521064_9f5215-b5 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column521064_9f5215-b5\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">Traditional Inference<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A typical inference server does everything:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Receives the request<\/li>\n\n\n\n<li>Loads model weights<\/li>\n\n\n\n<li>Processes the prompt (prefill phase)<\/li>\n\n\n\n<li>Generates tokens (decode phase)<\/li>\n\n\n\n<li>Returns the response<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>All of this happens on the same set of GPUs.<\/strong><\/p>\n<\/div><\/div>\n\n\n<style>.kadence-column521064_f48869-2d > .kt-inside-inner-col,.kadence-column521064_f48869-2d > 
.kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column521064_f48869-2d > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column521064_f48869-2d > .kt-inside-inner-col{flex-direction:column;}.kadence-column521064_f48869-2d > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column521064_f48869-2d > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column521064_f48869-2d{position:relative;}@media all and (max-width: 1024px){.kadence-column521064_f48869-2d > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column521064_f48869-2d > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column521064_f48869-2d\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">Disaggregated Inference<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">With disaggregation, different components are separated:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Client<br>  \u2502<br>  \u25bc<br>Router \/ Scheduler<br>  \u2502<br>  \u251c\u2500\u2500 Prefill Cluster<br>  \u2502      (prompt processing)<br>  \u2502<br>  \u2514\u2500\u2500 Decode Cluster<br>         (token generation)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>This allows each stage to be optimized independently.<\/strong><\/p>\n<\/div><\/div>\n\n<\/div><\/div>\n\n\n<h3 class=\"wp-block-heading\">Restaurant Analogy<\/h3>\n\n\n<style>.kb-row-layout-id521064_915ab5-4f > .kt-row-column-wrap{align-content:start;}:where(.kb-row-layout-id521064_915ab5-4f > .kt-row-column-wrap) > .wp-block-kadence-column{justify-content:start;}.kb-row-layout-id521064_915ab5-4f > .kt-row-column-wrap{column-gap:var(--global-kb-gap-md, 2rem);row-gap:var(--global-kb-gap-md, 2rem);padding-top:var(--global-kb-spacing-sm, 
1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);grid-template-columns:repeat(2, minmax(0, 1fr));}.kb-row-layout-id521064_915ab5-4f > .kt-row-layout-overlay{opacity:0.30;}@media all and (max-width: 1024px){.kb-row-layout-id521064_915ab5-4f > .kt-row-column-wrap{grid-template-columns:repeat(2, minmax(0, 1fr));}}@media all and (max-width: 767px){.kb-row-layout-id521064_915ab5-4f > .kt-row-column-wrap{grid-template-columns:minmax(0, 1fr);}}<\/style><div class=\"kb-row-layout-wrap kb-row-layout-id521064_915ab5-4f alignnone wp-block-kadence-rowlayout\"><div class=\"kt-row-column-wrap kt-has-2-columns kt-row-layout-equal kt-tab-layout-inherit kt-mobile-layout-row kt-row-valign-top\">\n<style>.kadence-column521064_082d02-d8 > .kt-inside-inner-col,.kadence-column521064_082d02-d8 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column521064_082d02-d8 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column521064_082d02-d8 > .kt-inside-inner-col{flex-direction:column;}.kadence-column521064_082d02-d8 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column521064_082d02-d8 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column521064_082d02-d8{position:relative;}@media all and (max-width: 1024px){.kadence-column521064_082d02-d8 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column521064_082d02-d8 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column521064_082d02-d8\"><div class=\"kt-inside-inner-col\">\n<h5 class=\"wp-block-heading\">Traditional inference server<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">One chef:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Takes the order<\/li>\n\n\n\n<li>Prepares ingredients<\/li>\n\n\n\n<li>Cooks 
the meal<\/li>\n\n\n\n<li>Plates the food<\/li>\n<\/ol>\n<\/div><\/div>\n\n\n<style>.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col,.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col{flex-direction:column;}.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column521064_3ba3cd-96{position:relative;}@media all and (max-width: 1024px){.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column521064_3ba3cd-96 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column521064_3ba3cd-96\"><div class=\"kt-inside-inner-col\">\n<h5 class=\"wp-block-heading\">Disaggregated inference <\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">An assembly line:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Order station<\/li>\n\n\n\n<li>Prep station<\/li>\n\n\n\n<li>Cooking station<\/li>\n\n\n\n<li>Plating station<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Each station specializes in a specific task, increasing overall throughput and efficiency when serving large numbers of customers.<\/p>\n<\/div><\/div>\n\n<\/div><\/div>\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Conceptual Diagram<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91; REQUEST ] \n      \u2502\n      \u25bc\n+---------------------+      +---------------------+\n|   PREFILL CLUSTER   | ---&gt; |    DECODE CLUSTER   |\n| (Compute Optimized) |      | (Memory Optimized)  |\n+---------------------+      
+---------------------+\n      \u2502                               \u2502\n      \u25bc                               \u25bc\n&#91;   Fast Math      ]         &#91; Fast Memory Access  ]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Why It Exists<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Modern LLM inference contains fundamentally different workloads:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Phase<\/th><th>Characteristics<\/th><\/tr><\/thead><tbody><tr><td>Prefill<\/td><td>Compute-intensive, processes many tokens simultaneously<\/td><\/tr><tr><td>Decode<\/td><td>Memory-bandwidth-intensive, generates tokens one at a time<\/td><\/tr><tr><td>KV Cache Storage<\/td><td>Capacity-intensive, stores conversation state<\/td><\/tr><tr><td>Routing\/Scheduling<\/td><td>Network- and orchestration-intensive<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Disaggregated inference separates these workloads so they can scale independently.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When Should You Disaggregate?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Disaggregation helps most when:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Variable Workloads:<\/strong>&nbsp;You handle a mix of &#8220;summarization&#8221; tasks (long inputs, short outputs) and &#8220;chat&#8221; tasks (short inputs, long outputs).<\/li>\n\n\n\n<li><strong>Cost Efficiency:<\/strong>&nbsp;You can use cheaper, high-compute GPUs for the &#8220;Prefill&#8221; stage and save your expensive, high-bandwidth memory GPUs for the &#8220;Decode&#8221; stage.<\/li>\n\n\n\n<li><strong>Preventing &#8220;Head-of-Line Blocking&#8221;:<\/strong>&nbsp;If a user sends a 100,000-token prompt, it shouldn&#8217;t stall everyone else&#8217;s 5-word chat responses. 
Disaggregation isolates these heavy lifts.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>When does it NOT help?<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Low Traffic:<\/strong>&nbsp;If you are only serving a few requests an hour, the overhead of managing a network connection between two clusters is just extra complexity you don&#8217;t need.<\/li>\n\n\n\n<li><strong>Small Models:<\/strong>&nbsp;With smaller models that fit entirely on one GPU, the latency added by &#8220;handing off&#8221; data across the network often outweighs the time gained by splitting the work.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Disaggregation vs. The Alternatives<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You might have heard of other ways to make inference faster, like&nbsp;<strong>Continuous Batching<\/strong>&nbsp;or&nbsp;<strong>Speculative Decoding<\/strong>. How do they compare?<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">1. Continuous Batching (The &#8220;Manager&#8221;)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>How it works:<\/strong>&nbsp;Instead of waiting for a whole batch of requests to finish before starting the next, the server &#8220;slots&#8221; new requests into the running batch as soon as earlier ones complete.<\/li>\n\n\n\n<li><strong>Vs. Disaggregation:<\/strong>&nbsp;Think of this as&nbsp;<strong>optimizing the kitchen staff<\/strong>. It makes the existing setup more efficient. Disaggregation is&nbsp;<strong>changing the kitchen layout entirely<\/strong>. They often work best together.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">2. Speculative Decoding (The &#8220;Guessing Game&#8221;)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>How it works:<\/strong>&nbsp;A small, fast &#8220;draft&#8221; model guesses the next few tokens, and the big, slow &#8220;heavyweight&#8221; model verifies them all in a single pass, keeping the guesses that match.<\/li>\n\n\n\n<li><strong>Vs. 
Disaggregation:<\/strong>&nbsp;This is about&nbsp;<strong>speeding up the cook<\/strong>. It reduces the number of sequential steps the &#8220;Decode&#8221; stage has to take. Disaggregation is about&nbsp;<strong>distributing the tasks<\/strong>&nbsp;to better specialists.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">The Verdict<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Disaggregated inference is the &#8220;Enterprise Architecture&#8221; of the AI world. If you are building a small app or an internal tool, keep it simple with standard serving (like vLLM on a single node).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Future systems may use heterogeneous inference, where different phases run on different hardware types:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CPU  \u2192 orchestration<br>NPU  \u2192 lightweight\/local inference<br>GPU  \u2192 large-context prefill<br>GPU  \u2192 high-throughput decode<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In that world, the key distinction isn&#8217;t GPU vs. NPU. It&#8217;s whether the inference pipeline is <strong>monolithic<\/strong> (everything on the same accelerator pool) or <strong>disaggregated<\/strong> (different stages on different resources).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But if you are building the next big platform where latency is your product and costs are ballooning, disaggregation is the secret sauce that allows you to treat your expensive GPU fleet like a finely tuned, modular production line.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019ve ever wondered why your AI chatbot suddenly slows down when you feed it a massive 50-page PDF, you\u2019ve encountered a fundamental bottleneck in modern AI infrastructure. For years, we\u2019ve served LLMs like a one-person kitchen: the same chef (GPU) does all the prep work&nbsp;and&nbsp;all the cooking. 
But as companies start deploying models at&#8230;<\/p>\n","protected":false},"author":2,"featured_media":427864,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_blocks_custom_css":"","_kad_blocks_head_custom_js":"","_kad_blocks_body_custom_js":"","_kad_blocks_footer_custom_js":"","ngg_post_thumbnail":0,"episode_type":"","audio_file":"","podmotor_file_id":"","podmotor_episode_id":"","cover_image":"","cover_image_id":"","duration":"","filesize":"","filesize_raw":"","date_recorded":"","explicit":"","block":"","itunes_episode_number":"","itunes_title":"","itunes_season_number":"","itunes_episode_type":"","_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[441],"tags":[930,894],"class_list":["post-521064","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-talk","tag-ai-series","tag-artificial-intelligence"],"taxonomy_info":{"category":[{"value":441,"label":"Tech Talk"}],"post_tag":[{"value":930,"label":"AI Series"},{"value":894,"label":"artificial intelligence"}]},"featured_image_src_large":["https:\/\/jorgep.com\/blog\/wp-content\/uploads\/FeaturedImage-Topic-AI-1024x512.png",1024,512,true],"author_info":{"display_name":"Jorge Pereira","author_link":"https:\/\/jorgep.com\/blog\/author\/jorge\/"},"comment_info":0,"category_info":[{"term_id":441,"name":"Tech Talk","slug":"tech-talk","term_group":0,"term_taxonomy_id":451,"taxonomy":"category","description":"","parent":0,"count":734,"filter":"raw","cat_ID":441,"category_count":734,"category_description":"","cat_name":"Tech Talk","category_nicename":"tech-talk","category_parent":0}],"tag_info":[{"term_id":930,"name":"AI 
Series","slug":"ai-series","term_group":0,"term_taxonomy_id":940,"taxonomy":"post_tag","description":"","parent":0,"count":217,"filter":"raw"},{"term_id":894,"name":"artificial intelligence","slug":"artificial-intelligence","term_group":0,"term_taxonomy_id":904,"taxonomy":"post_tag","description":"","parent":0,"count":194,"filter":"raw"}],"_links":{"self":[{"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/posts\/521064","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/comments?post=521064"}],"version-history":[{"count":3,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/posts\/521064\/revisions"}],"predecessor-version":[{"id":521327,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/posts\/521064\/revisions\/521327"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/media\/427864"}],"wp:attachment":[{"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/media?parent=521064"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/categories?post=521064"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jorgep.com\/blog\/wp-json\/wp\/v2\/tags?post=521064"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}