Running Small AI Models Locally with BitNet (A Beginner's Guide)

AIU AI Academy (AIU人工智能学院)

2026-03-18

Introduction

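BitNet b1.58 is a natively low-bit language model developed by Microsoft researchers. It is trained from scratch with ternary weights that take the values -1, 0, and +1. Rather than compressing a large pretrained model after the fact, BitNet is designed from the start to run efficiently at very low precision. This reduces memory footprint and compute requirements while still maintaining strong performance.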
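To make the idea of ternary weights concrete, here is a minimal sketch of the absmean rounding scheme described in the BitNet b1.58 paper: divide by the mean absolute value, round, and clip to [-1, 1]. The weight values are made up for illustration, and the function name is hypothetical. The released model is trained with this constraint from the start, so this only shows what the weight representation looks like, not how bitnet.cpp stores or computes with it.

```python
import math

def absmean_ternary(weights, eps=1e-8):
    """Map a list of floats to values in {-1, 0, +1} plus one shared scale."""
    # Scale is the mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

weights = [0.9, -0.05, -1.4, 0.3, 0.02, -0.7]  # made-up example values
ternary, scale = absmean_ternary(weights)
print(ternary)  # [1, 0, -1, 1, 0, -1]

# Three possible values carry log2(3) bits of information, hence "1.58".
print(f"bits per weight: {math.log2(3):.2f}")  # bits per weight: 1.58
```

Small weights collapse to 0 and large ones saturate at -1 or +1, which is why a matrix multiply over such weights needs only additions and subtractions.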

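One important detail: if you load BitNet with the standard Transformers library, you will not automatically get the speed and efficiency benefits. To take full advantage of the design, you need the dedicated C++ implementation, bitnet.cpp, which is optimized specifically for these models.

In this tutorial, you will learn how to run BitNet locally. We start by installing the required Linux packages, then clone and build bitnet.cpp from source, download the 2-billion-parameter BitNet model, run BitNet as an interactive chat tool, start an inference server, and connect to it with the OpenAI Python SDK.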

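Step 1: Install the Required Tools on Linux

Before building BitNet from source, we need to install the basic development tools required to compile a C++ project.

Clang: the C++ compiler we will use;
CMake: the build system used to configure and compile the project;
Git: used to clone the BitNet repository from GitHub.

First, install LLVM (which includes Clang):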

bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

Then update the package lists and install the required tools:

sudo apt update
sudo apt install clang cmake git

After this step, your system is ready to build bitnet.cpp from source.
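Before moving on, it can help to confirm that the three tools are actually on your PATH. The following is a small sketch using only the Python standard library; the helper name find_missing is made up for this example.

```python
import shutil

REQUIRED_TOOLS = ["clang", "cmake", "git"]

def find_missing(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All build tools found.")
```

If anything is reported missing, rerun the install commands above before attempting the build.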

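Step 2: Clone and Build BitNet from Source

With the required tools installed, we will clone the BitNet repository and build it locally.

First, clone the official repository and move into the project folder: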

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

Next, create a Python virtual environment (this isolates the dependencies from the system Python):

python -m venv venv
source venv/bin/activate

Install the required Python dependencies:

pip install -r requirements.txt

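Now we compile the project and prepare the 2-billion-parameter model. The following command builds the C++ backend with CMake and sets up the BitNet-b1.58-2B-4T model: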

python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

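If you hit a compilation problem related to int8_t * y_col, you can apply the following quick fix, which turns the declaration into a pointer to const on every line that begins with it: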

sed -i 's/^\([[:space:]]*\)int8_t \* y_col/\1const int8_t * y_col/' src/ggml-bitnet-mad.cpp
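To see what that substitution does before running it on the real file, here is a rough Python equivalent of the sed expression. The sample lines are hypothetical and only illustrate the pattern; they are not copied from src/ggml-bitnet-mad.cpp.

```python
import re

# Python equivalent of the sed expression: match the declaration at the start
# of a line (after optional indentation) and keep that indentation.
PATTERN = re.compile(r"^([ \t]*)int8_t \* y_col", flags=re.MULTILINE)

def add_const(source):
    """Return source with matching declarations rewritten to use const."""
    return PATTERN.sub(r"\1const int8_t * y_col", source)

# Hypothetical input lines, made up for illustration only.
sample = (
    "    int8_t * y_col = y + col * by;\n"
    "    const int8_t * y_col = y + col * by;\n"
)
print(add_const(sample))
```

Because the pattern is anchored to the start of the line, a declaration that already begins with const is left alone, so applying the fix twice is harmless.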

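Once this step completes successfully, BitNet is built and ready to run locally.

Step 3: Download the Lightweight BitNet Model

Now we will download the lightweight 2-billion-parameter BitNet model in GGUF format, the format bitnet.cpp uses for local inference.

The BitNet repository points to its supported models on Hugging Face, which can be fetched with the Hugging Face CLI.

Run the following command: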

hf download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T

This downloads the required model files into the models/BitNet-b1.58-2B-4T directory.

During the download, you may see output like the following:

data_summary_card.md: 3.86kB [00:00, 8.06MB/s]
Download complete. Moving file to models/BitNet-b1.58-2B-4T/data_summary_card.md

ggml-model-i2_s.gguf: 100%|████████████████████████████████████████████████| 1.19G/1.19G [00:11<00:00, 106MB/s]
Download complete. Moving file to models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf

Fetching 4 files: 100%|████████████████████████████████████████████████| 4/4 [00:11<00:00, 2.89s/it]

Once the download finishes, your model directory should look like this:

BitNet/models/BitNet-b1.58-2B-4T

You now have the 2-billion-parameter BitNet model available for local inference.
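A quick sanity check on the download is to confirm that the model file exists and starts with the four-byte GGUF magic. The sketch below does only that; it does not validate the rest of the file, and the helper name looks_like_gguf is made up for this example.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file

def looks_like_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC

# Path used in this tutorial, relative to the BitNet directory.
model_path = "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf"
status = "ok" if looks_like_gguf(model_path) else "missing or not GGUF"
print(model_path, "->", status)
```

A "missing or not GGUF" result usually means the script was run from a different directory or the download was interrupted.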

Step 4: Run BitNet in Interactive Chat Mode on the CPU

Now it is time to run BitNet locally on the CPU in interactive chat mode.

Use the following command:

python run_inference.py \
 -m "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf" \
 -p "You are a helpful assistant." \
 -cnv

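What each argument does:

-m: loads the GGUF model file;
-p: sets the system prompt;
-cnv: enables conversation mode.

You can also control performance with these optional arguments:

-t 8: sets the number of CPU threads to 8;
-n 128: sets the maximum number of newly generated tokens to 128.

Example with the optional arguments: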

python run_inference.py \
 -m "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf" \
 -p "You are a helpful assistant." \
 -cnv -t 8 -n 128
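If you prefer to launch the chat from a script, the same command can be assembled in Python. This is a sketch that only uses the flags shown above; the helper name build_chat_command is made up, and defaulting the thread count to os.cpu_count() is an assumption (that value counts logical cores, so a lower number may perform better on some machines).

```python
import os
import shlex

def build_chat_command(model_path, system_prompt, threads=None, max_new_tokens=128):
    """Assemble the run_inference.py argument list used in this step."""
    if threads is None:
        threads = os.cpu_count() or 1  # fall back to 1 if the count is unknown
    return [
        "python", "run_inference.py",
        "-m", model_path,
        "-p", system_prompt,
        "-cnv",
        "-t", str(threads),
        "-n", str(max_new_tokens),
    ]

cmd = build_chat_command(
    "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
    "You are a helpful assistant.",
)
print(shlex.join(cmd))  # shell-quoted form, handy for copy and paste
```

To actually start the chat, pass the list to subprocess.run(cmd, check=True) from inside the BitNet directory with the virtual environment active.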

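After it starts, you will see a simple command-line chat interface. You can type a question and the model will respond directly in the terminal.

For example, we asked "Who is the richest person in the world?" and the model gave a clear, readable answer based on its knowledge cutoff. Even though this is a small 2-billion-parameter model running on a CPU, the output was coherent and useful.

At this point, you have a fully working local AI chat tool running on your machine.

Step 5: Start a Local BitNet Inference Server

Now we will start BitNet as a local inference server so that you can access the model from a browser or connect it to other applications.

Run the following command: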

python run_inference_server.py \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -t 8 \
  -c 2048 \
  --temperature 0.7

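What each argument does:

-m: loads the model file;
--host 0.0.0.0: binds the server to all network interfaces, so it is reachable from this machine and from other hosts on the network (use 127.0.0.1 to restrict it to the local machine);
--port 8080: runs the server on port 8080;
-t 8: sets the number of CPU threads to 8;
-c 2048: sets the context length to 2048 tokens;
--temperature 0.7: controls sampling randomness; higher values give more varied responses.

Once the server starts, it is available on port 8080.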
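When starting the server from a script, it is useful to wait until the port accepts connections before sending requests. The sketch below polls with a plain TCP connect; the helper name wait_for_port is made up, and a successful connect only shows that something is listening, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host="127.0.0.1", port=8080, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; wait and retry
    return False
```

Calling wait_for_port() with the defaults matches the host and port used in this step; it returns False if nothing is listening after 60 seconds.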

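Open a browser and go to http://127.0.0.1:8080, where you will see a simple web interface for chatting with BitNet.

Even though the model runs on the local CPU, the chat interface is responsive and smooth. At this point, you have a fully working local AI server running on your machine.

Step 6: Connect to the BitNet Server with the OpenAI Python SDK

Now that your BitNet server is running locally, you can connect to it with the OpenAI Python SDK. This lets you use the local model the same way you would use a cloud API.

First, install the OpenAI package: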

pip install openai

Next, create a simple Python script:

from openai import OpenAI

# Point the client at the local server instead of the cloud API.
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="not-needed",  # many local servers ignore this value
)

resp = client.chat.completions.create(
    model="bitnet1b",  # should match the model name the server exposes
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Neural Networks in simple terms."},
    ],
    temperature=0.7,
    max_tokens=200,
)

# Print the text of the first (and only) returned choice.
print(resp.choices[0].message.content)

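Code notes:

base_url: points to your local BitNet server;
api_key: required by the SDK, but usually ignored by local servers;
model: should match the model name the server exposes;
messages: defines the system prompt and the user prompt.

Output: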

Neural networks are a type of machine learning model inspired by the human brain. They are used to recognize patterns in data. Think of them as a group of neurons (like tiny brain cells) that work together to solve a problem or make a prediction.

Imagine you are trying to recognize whether a picture shows a cat or a dog. A neural network would take the picture as input and process it. Each neuron in the network would analyze a small part of the picture, like a whisker or a tail. They would then pass this information to other neurons, which would analyze the whole picture.

By sharing and combining the information, the network can make a decision about whether the picture shows a cat or a dog.

In summary, neural networks are a way for computers to learn from data by mimicking how our brains work. They can recognize patterns and make decisions based on that recognition.
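For readers who want to see what the SDK call in Step 6 sends over the wire, here is a sketch of the same request using only the Python standard library. It assumes the server accepts the OpenAI-style chat completions route under the base_url from the script above and does not require an Authorization header; the helper names build_chat_request and ask are made up for this example.

```python
import json
import urllib.request

def build_chat_request(base_url, user_message,
                       system_prompt="You are a helpful assistant.",
                       model="bitnet1b", temperature=0.7, max_tokens=200):
    """Build the HTTP request that corresponds to the SDK call in Step 6."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(base_url, user_message):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, user_message)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from Step 5 running, ask("http://127.0.0.1:8080/v1", "Explain Neural Networks in simple terms.") should return a reply similar to the one above.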