Published: January 13, 2024
This is the second in a three-part series on LLMs and chatbots. The previous article discussed the benefits and drawbacks of on-device and in-browser LLMs.
Now that you better understand client-side AI, you're ready to add WebLLM to a to-do list web application. You can find the code in the web-llm branch of the GitHub repository.
WebLLM is a web-based runtime for LLMs provided by Machine Learning Compilation. You can try out WebLLM as a standalone application. The application is inspired by cloud-backed chat applications, such as Gemini, but the LLM inference is run on your device instead of the cloud. Your prompts and data never leave your device, and you can be sure that they aren't used to train models.
To perform model inference on the device, WebLLM combines WebAssembly and WebGPU. While WebAssembly allows for efficient calculations on the central processing unit (CPU), WebGPU gives developers low-level access to the device's graphics processing unit (GPU).
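Because WebLLM depends on WebGPU, it can help to check for it before offering the chatbot at all. The following is a minimal sketch, not part of the to-do list app: hasWebGPU is a hypothetical helper name, and it relies only on the standard navigator.gpu.requestAdapter() call.

```javascript
// Returns true when the browser exposes WebGPU and can provide a GPU adapter.
// `nav` defaults to the global navigator; it is a parameter so the check can
// also be exercised with a stand-in object outside the browser.
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false; // WebGPU is not exposed at all
  try {
    // Resolves to null when no suitable GPU adapter is available.
    const adapter = await nav.gpu.requestAdapter();
    return Boolean(adapter);
  } catch {
    return false;
  }
}
```

If the check returns false, the application could hide the chat UI or fall back to a server-side model.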
Install WebLLM
WebLLM is available as an npm package. You can add this package to your to-do list application by running npm install @mlc-ai/web-llm.
Select a model
Next, you need to decide on an LLM to execute locally. Various models are available.
To decide, you should know the following key terms and figures:
- Token: The smallest unit of text an LLM can process.
- Context window: The maximum number of tokens the model can process.
- Parameters or weights: The internal variables learned during training, counted in billions.
- Quantization: The number of bits representing the weights. More bits mean higher precision, but also higher memory usage.
- Floating-point number formats: 32-bit floating-point numbers (full-precision, F32) offer better accuracy, while 16-bit floating-point numbers (half-precision, F16) are faster and use less memory but require compatible hardware.
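The "compatible hardware" requirement for F16 can be checked at runtime. In WebGPU, half-precision support in shaders is an optional feature named shader-f16, which a GPU adapter lists in its features set. The sketch below assumes you already obtained an adapter from navigator.gpu.requestAdapter(); chooseFloatFormat is a hypothetical helper, not part of WebLLM.

```javascript
// Hypothetical helper: picks a float format based on what the GPU adapter
// reports. `adapter` is expected to look like a WebGPU GPUAdapter, whose
// `features` set lists the optional capabilities it supports.
function chooseFloatFormat(adapter) {
  const hasF16 = Boolean(adapter?.features?.has('shader-f16'));
  // Fall back to full precision when half precision is unavailable.
  return hasF16 ? 'f16' : 'f32';
}
```

You could use the result to choose between two variants of the same model, assuming both variants are actually published in the model list.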
These key terms tend to be part of the model name. For example, Llama-3.2-3B-Instruct-q4f32_1-MLC contains the following information:
- The model is Llama 3.2.
- The model has 3 billion parameters.
- It's fine-tuned for instruction and prompt-style assistants (Instruct).
- It uses 4-bit (q4) uniform (_1) quantization.
- It has full-precision, 32-bit floating-point numbers.
- It's a special version created by Machine Learning Compilation.
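The naming scheme described above can be read mechanically. The following sketch is an illustration only: parseModelId is a hypothetical helper, and its regular expression is derived solely from the one example name shown here, so other model names may not match it.

```javascript
// Hypothetical helper: splits a model ID such as
// "Llama-3.2-3B-Instruct-q4f32_1-MLC" into its parts.
// Returns null when the ID does not follow this pattern.
function parseModelId(id) {
  const pattern = /^(.+?)-(\d+(?:\.\d+)?)B(?:-(.+))?-q(\d+)f(\d+)_(\d+)-MLC$/;
  const match = pattern.exec(id);
  if (!match) return null;
  const [, family, size, variant, quantBits, floatBits, scheme] = match;
  return {
    family,                       // for example "Llama-3.2"
    paramsBillions: Number(size), // parameter count in billions
    variant: variant ?? null,     // for example "Instruct", if present
    quantBits: Number(quantBits), // bits per quantized weight
    floatBits: Number(floatBits), // 32 (F32) or 16 (F16)
    scheme: `_${scheme}`,         // quantization scheme suffix
  };
}
```

For the example name, this yields the family Llama-3.2, 3 billion parameters, the Instruct variant, 4-bit quantization, and 32-bit floats.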
You may need to test different models to determine which suits your use case.
At the time of this writing, a model with 3 billion parameters and 4 bits per parameter can already have a file size as large as 1.4 GB, which the application needs to download to the user's device before first use. It's possible to work with 3B models, but when it comes to translation capabilities or trivia knowledge, 7B models deliver better results. At 3.3 GB and up, though, they are significantly larger.
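To get a feeling for these sizes, you can multiply the parameter count by the bits per weight. This is a rough back-of-the-envelope sketch with hypothetical helper names. It ignores quantization metadata, tokenizer files, and any layers stored at a different precision, so real downloads will differ somewhat.

```javascript
// Rough lower bound for the size of the weights, in bytes.
function estimateWeightBytes(parameterCount, bitsPerWeight) {
  return (parameterCount * bitsPerWeight) / 8;
}

// Converts bytes to gibibytes (2^30 bytes).
function toGiB(bytes) {
  return bytes / 2 ** 30;
}

// 3 billion parameters at 4 bits each: 1.5e9 bytes, about 1.4 GiB.
console.log(toGiB(estimateWeightBytes(3e9, 4)).toFixed(1));
// 7 billion parameters at 4 bits each: 3.5e9 bytes, about 3.3 GiB.
console.log(toGiB(estimateWeightBytes(7e9, 4)).toFixed(1));
```

The estimate lands close to the figures mentioned above, which suggests that the weights dominate the download size.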
To create the WebLLM engine and initiate the model download for your to-do list chatbot, add the following code to your application:
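import {CreateMLCEngine} from '@mlc-ai/web-llm';

// Creating the engine starts the model download (or loads the model from cache).
const engine = await CreateMLCEngine('Llama-3.2-3B-Instruct-q4f32_1-MLC', {
  // Called repeatedly while the model loads; logs the reported progress.
  initProgressCallback: ({progress}) => console.log(progress),
});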
The CreateMLCEngine function takes the model string and an optional configuration object. Using the initProgressCallback option, you can track the model's download progress and present it to users while they wait.
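Instead of logging the progress, you may want to show it in the UI. The sketch below builds a callback you could pass as initProgressCallback. It assumes the progress report carries a progress value between 0 and 1 and an optional text description. The bar and label arguments are hypothetical stand-ins for a progress element and a text element in your page.

```javascript
// Hypothetical helper: returns a callback that mirrors the load progress
// into a <progress>-like element (`bar`) and a text element (`label`).
function makeProgressCallback(bar, label) {
  return ({progress, text}) => {
    // Clamp to [0, 1] in case the value is missing or out of range.
    const fraction = Math.min(Math.max(progress ?? 0, 0), 1);
    bar.value = fraction;
    const percent = `${Math.round(fraction * 100)}%`;
    label.textContent = text ? `${percent} ${text}` : percent;
  };
}
```

In the browser, you might call makeProgressCallback(document.querySelector('progress'), document.querySelector('output')) and pass the result in the configuration object.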
Cache API: Make your LLM run offline
The model is downloaded into your website's cache storage. The Cache API was introduced along with Service Workers to make your website or web application run offline. It's the best storage mechanism to cache AI models. In contrast to HTTP caching, the Cache API is a programmable cache that is fully under the developer's control.
Once downloaded, WebLLM reads the model files from the Cache API instead of requesting them over the network, making WebLLM fully offline-capable.
As with all website storage, the cache is isolated per origin. This means that two origins, example.com and example.net, cannot share the same storage. If those two websites wanted to use the same model, they would have to download the model separately.
You can inspect the cache using DevTools by navigating to Application > Storage and opening Cache storage.
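You can also inspect the cache from code. The following sketch uses only the standard Cache API (caches.keys(), caches.open(), and cache.keys()) to list what is stored for the current origin. listCachedUrls is a hypothetical helper, and the cache names WebLLM uses are not assumed here.

```javascript
// Lists the URLs stored in every cache of the current origin.
// `cacheStorage` defaults to the global `caches` object; it is a parameter
// so the function can be tried with a stand-in outside the browser.
async function listCachedUrls(cacheStorage = globalThis.caches) {
  const result = {};
  for (const name of await cacheStorage.keys()) {
    const cache = await cacheStorage.open(name);
    const requests = await cache.keys(); // array of Request objects
    result[name] = requests.map((request) => request.url);
  }
  return result;
}
```

Running this after the first model download should show the model files, and an empty result indicates that nothing has been cached yet.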
Set up the conversation
The model can be initialized with a set of initial prompts. Commonly, there are three message roles:
- System prompt: This prompt defines the model's behavior, role, and character. It can also be used for grounding, that is, feeding custom data into the model that is not part of its training set (such as your domain-specific data). You can only specify one system prompt.
- User prompt: Prompts entered by the user.
- Assistant prompt: Answers from the assistant, optional.
User and assistant prompts can be used for N-shot prompting by providing natural language examples to the LLM on how it should behave or respond.
Here is a minimal example for setting up the conversation for the to-do list app:
const messages = [
  {
    role: "system",
    content: `You are a helpful assistant. You will answer questions related to
    the user's to-do list. Decline all other requests not related to the user's
    todos. This is the to-do list in JSON: ${JSON.stringify(todos)}`
  },
  {role: "user", content: "How many open todos do I have?"}
];
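To use N-shot prompting, you can place example user and assistant pairs between the system prompt and the actual question. The sketch below is one possible way to assemble such a conversation. buildMessages and the shape of the examples array are assumptions made for illustration.

```javascript
// Hypothetical helper: builds the message list from the to-do data, a
// question, and optional example exchanges ({user, assistant} pairs).
function buildMessages(todos, question, examples = []) {
  const system = {
    role: 'system',
    content:
      'You are a helpful assistant. You will answer questions related to ' +
      "the user's to-do list. Decline all other requests not related to the " +
      `user's todos. This is the to-do list in JSON: ${JSON.stringify(todos)}`,
  };
  // Each example becomes a user message followed by the expected answer.
  const shots = examples.flatMap(({user, assistant}) => [
    {role: 'user', content: user},
    {role: 'assistant', content: assistant},
  ]);
  return [system, ...shots, {role: 'user', content: question}];
}
```

With one example pair, the resulting roles are system, user, assistant, and user, in that order.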
Answer your first question
The chat completion capability is exposed as a property on the WebLLM engine created before (engine.chat.completions). After the model is downloaded, you can run model inference by calling the create() method on this property. For your use case, you want to stream responses so the user can start reading while the response is still being generated, reducing the perceived waiting time:
const chunks = await engine.chat.completions.create({ messages, stream: true, });
This method returns an AsyncGenerator, a subclass of the hidden AsyncIterator class. Use a for await...of loop to wait for the chunks as they come in. Each chunk only contains the new tokens (delta), so you must assemble the full reply yourself.
let reply = '';
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content ?? '';
  console.log(reply);
}
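If you want to try the assembly logic without downloading a model, you can feed it a stand-in stream. In the sketch below, collectReply wraps the loop from above, and fakeStream is a hypothetical async generator that mimics the chunk shape used in this article (choices[0].delta.content). It is not part of WebLLM.

```javascript
// Assembles the full reply from streamed chunks and reports every
// intermediate state through the optional `onUpdate` callback.
async function collectReply(chunks, onUpdate = () => {}) {
  let reply = '';
  for await (const chunk of chunks) {
    reply += chunk.choices[0]?.delta?.content ?? '';
    onUpdate(reply);
  }
  return reply;
}

// Hypothetical stand-in for the engine's stream, for local experiments.
async function* fakeStream(parts) {
  for (const content of parts) {
    yield {choices: [{delta: {content}}]};
  }
  yield {choices: [{delta: {}}]}; // a final chunk without content
}
```

In the application, you would pass the real chunks object and use onUpdate to write the partial reply into the page.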
It turns out that the web has always had to deal with streaming responses. You can use APIs such as DOMImplementation to work with these streaming responses and efficiently update your HTML.
The results are purely string-based. You must parse them first if you want to interpret them as JSON or as other file formats.
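Models often wrap JSON in extra text, so a tolerant parser can be useful. The following is a heuristic sketch with a hypothetical helper name: it looks for the outermost braces or brackets in the reply and returns null if that span is not valid JSON.

```javascript
// Hypothetical helper: tries to extract a JSON value from a model reply.
// Returns the parsed value, or null when no valid JSON is found.
function parseJsonReply(text) {
  const start = text.search(/[\[{]/); // first "{" or "["
  const end = Math.max(text.lastIndexOf('}'), text.lastIndexOf(']'));
  if (start === -1 || end < start) return null;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return null; // the span looked like JSON but was not valid
  }
}
```

Because the model's output is not guaranteed to be well-formed, treat a null result as a normal case and validate the parsed value before using it.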
WebLLM, however, has some restrictions: The application needs to download a huge model before first use, which cannot be shared across origins, so another web app may have to download the same model again. While WebGPU achieves near-native inference performance, it doesn't reach the full native speed.
These drawbacks are addressed by the Prompt API, an exploratory API proposed by Google that also runs client-side, but uses a central model downloaded into Chrome. This means multiple applications can use the same model at full execution speed.
Read more about adding chatbot capabilities using the Prompt API in the next article.