Yuhan Liu

University of Chicago

Position: PhD Candidate
Rising Stars year of participation: 2025
Bio

Yuhan Liu is a final-year PhD student at the University of Chicago, advised by Junchen Jiang and Shan Lu. Her research interests lie in building efficient systems for large-scale LLM inference. In particular, she focuses on designing an efficient KV caching layer for LLM inference, including KV cache compression, dynamic blending of KV caches, and cross-model KV cache sharing. She has received the EuroSys Best Paper Award, the UU Fellowship, and the Neubauer Fellowship. She earned her bachelor’s degree in Computer Science from the University of Wisconsin-Madison. Her research has been integrated into LMCache, the first and most efficient open-source KV caching layer to date, which has been widely adopted and tested in enterprise settings. She is also a key contributor and community leader of the open-source projects vLLM Production Stack and LMCache.

Areas of Research
  • Efficient systems for large-scale LLM inference

DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much prior work enables the reuse of prefix KV caches across inputs for a single model, how to let one model reuse the prefix KV caches produced by a different model remains an open question.
We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference for different LLMs, as long as the LLMs share the same architecture. We present the first study of the impact of sharing KV caches across different LLMs and of when such sharing affects quality. Informed by these findings, DroidSpeak selectively recomputes a few layers of the KV cache produced by another LLM and reuses the remaining layers, with negligible quality loss. Carefully pipelining the layer-wise recomputation with the loading of the reused KV cache further improves inference performance. Experiments on diverse datasets and model pairs show that DroidSpeak achieves up to 4x higher throughput and about 3.1x faster prefill (time to first token), with negligible loss of quality in F1 score, ROUGE-L, or code similarity score, compared to a baseline that does not allow any KV cache sharing across models.
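The core idea in the abstract is to reuse most layers of another model's prefix KV cache while recomputing only the quality-sensitive layers, overlapping that recomputation with the loading of the reused layers. The Python sketch below is a minimal, hypothetical illustration of such a layer-wise pipeline; the function names, the critical-layer set, and the thread-pool scheduling are illustrative assumptions, not DroidSpeak's actual implementation.

# Hypothetical sketch of DroidSpeak-style selective KV cache reuse between two
# same-architecture LLMs. All names (load_layer_cache, recompute_layer,
# CRITICAL_LAYERS) are illustrative placeholders, not the paper's API.
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 32
# Layers whose cached values from the sender model are assumed to hurt quality
# too much, so the receiver model recomputes them (set chosen for illustration).
CRITICAL_LAYERS = {0, 1, 14, 15}

def load_layer_cache(layer: int) -> str:
    """Placeholder: fetch the sender model's KV cache for one layer."""
    return f"sender_kv[layer={layer}]"

def recompute_layer(layer: int, prefix_tokens: list[int]) -> str:
    """Placeholder: rerun the receiver model's prefill for one layer."""
    return f"receiver_kv[layer={layer}, tokens={len(prefix_tokens)}]"

def build_prefix_cache(prefix_tokens: list[int]) -> dict[int, str]:
    """Reuse most layers from the sender model; recompute only critical ones.
    Loads of reusable layers run on a thread pool so they overlap with the
    recomputation, mirroring the pipelining described in the abstract."""
    kv_cache: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Kick off asynchronous loads for every reusable layer first ...
        loads = {
            layer: pool.submit(load_layer_cache, layer)
            for layer in range(NUM_LAYERS)
            if layer not in CRITICAL_LAYERS
        }
        # ... while the critical layers are recomputed on the main thread.
        for layer in sorted(CRITICAL_LAYERS):
            kv_cache[layer] = recompute_layer(layer, prefix_tokens)
        # Collect the loaded layers once recomputation is done.
        for layer, fut in loads.items():
            kv_cache[layer] = fut.result()
    return kv_cache

if __name__ == "__main__":
    cache = build_prefix_cache(prefix_tokens=list(range(1024)))
    print(f"built KV cache for {len(cache)} layers: "
          f"{len(CRITICAL_LAYERS)} recomputed, "
          f"{NUM_LAYERS - len(CRITICAL_LAYERS)} reused")

In the real system the reused layers would presumably be fetched from another node's KV cache store, and which layers need recomputation would be determined empirically per model pair, as suggested by the cross-LLM sharing study in the abstract.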