Deploy a local, private ChatGPT based on ChatGLM-6B

Tips & Tricks · Published 9 months ago by Youzhizhan

ChatGPT has been very popular recently, but using it from China requires getting past the Great Firewall. There are also many domestic models, such as Baidu's Wenxin Yiyan (ERNIE Bot), Huawei's Pangu, and iFlytek's model. So today we will introduce how to deploy our own chat model locally, and you can learn a lot along the way.

1. The open-source model

1. Introduction to ChatGLM-6B

  • A language model jointly trained by the Knowledge Engineering Group (KEG) Laboratory of Tsinghua University and Zhipu AI in 2023. ChatGLM-6B draws on the design ideas of ChatGPT, infusing code pre-training into the 130-billion-parameter base model GLM-130B, and uses supervised fine-tuning and other techniques to achieve alignment with human intent (that is, to make the machine's answers meet human expectations and values);
  • ChatGLM-6B is an open-source dialogue language model that supports both Chinese and English. It is based on the General Language Model (GLM) architecture and has 6.2 billion parameters;
  • Combined with model quantization, users can deploy it locally on consumer-grade graphics cards (only 6GB of video memory is required at the INT4 quantization level);
  • ChatGLM-6B uses technology similar to ChatGPT and is optimized for Chinese Q&A and dialogue. After bilingual Chinese-English training on about 1T tokens, supplemented by supervised fine-tuning, feedback bootstrapping, and reinforcement learning from human feedback, the 6.2-billion-parameter ChatGLM-6B can generate answers that are quite consistent with human preferences.

2. ChatGLM-6B has the following characteristics

  • Sufficient bilingual Chinese-English pre-training: ChatGLM-6B was trained on 1T tokens with a 1:1 ratio of Chinese and English material, giving it ability in both languages;
  • Optimized model architecture and size: Drawing on the training experience of GLM-130B, the implementation of two-dimensional RoPE positional encoding was corrected, and a conventional FFN structure is used. The 6B (6.2 billion) parameter size also makes it possible for researchers and individual developers to fine-tune and deploy ChatGLM-6B themselves;
  • Lower deployment threshold: At FP16 half precision, ChatGLM-6B requires at least 13GB of video memory for inference. Combined with model quantization, this requirement can be reduced further to 10GB (INT8) or 6GB (INT4), so that ChatGLM-6B can be deployed on consumer-grade graphics cards;
  • Longer sequence length: Compared with GLM-10B (sequence length 1024), ChatGLM-6B has a sequence length of 2048, supporting longer conversations and applications;
  • Human intent alignment training: Supervised fine-tuning, feedback bootstrapping, reinforcement learning from human feedback, and other techniques give the model an initial ability to understand human instructions. The output format is markdown, which is convenient for display. ChatGLM-6B therefore has good dialogue and question-answering ability under certain conditions.

3. ChatGLM-6B also has quite a few known limitations and shortcomings

  • Small model capacity: The small 6B capacity means relatively weak model memory and language ability; when faced with many factual knowledge tasks, ChatGLM-6B may generate incorrect information;
  • It is also not good at logic problems (such as arithmetic and programming);
  • May produce harmful or biased content: ChatGLM-6B is only a language model preliminarily aligned with human intent, and may generate harmful or biased content;
  • Weak multi-turn dialogue ability: ChatGLM-6B's contextual understanding is not yet sufficient; when generating long answers or in multi-turn dialogue scenarios, context loss and misunderstanding may occur;
  • Insufficient English proficiency: Most of the instructions used in training are in Chinese, with only a small portion in English. Therefore, answers to English instructions may not be as good as those to Chinese instructions, and may even contradict them;
  • Easily misled: ChatGLM-6B's "self-perception" can be problematic, and it is easily misled into making incorrect statements. For example, if the current version of the model is misled, its self-perception will deviate. Although the model has undergone bilingual pre-training on about one trillion tokens, plus instruction fine-tuning and reinforcement learning from human feedback (RLHF), because of its small capacity it may still produce misleading content under certain instructions.


2. System deployment

1. Hardware requirements

Minimum GPU memory required for inference:

  • FP16 (no quantization): 13GB
  • INT8: 10GB
  • INT4: 6GB

2. System environment

Operating system: CentOS 7.6 / Ubuntu (memory: 32GB)

Graphics cards: 2x NVIDIA GeForce 3070 Ti 8GB (16GB of video memory in total)

Python 3.8.13 (the version should not be higher than 3.10, otherwise some dependencies cannot be installed, such as paddlepaddle 2.4.2, which does not yet support newer Python versions)
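This version constraint can be expressed as a quick check (a minimal sketch using only the standard library; the `supported` helper is illustrative, not part of any project here):

```python
def supported(version_info):
    # paddlepaddle 2.4.2 needs Python no newer than 3.10,
    # and this guide assumes at least 3.8.
    return (3, 8) <= version_info[:2] <= (3, 10)

print(supported((3, 8, 13)))   # the version used in this guide -> True
print(supported((3, 11, 0)))   # too new for paddlepaddle 2.4.2 -> False
```

Pass `sys.version_info` to check the running interpreter.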

# Install the dependencies needed to build Python 3.8
sudo yum -y install gcc zlib zlib-devel openssl-devel
# Download the source code
# Unpack it
tar -zxvf Python-3.8.13.tgz
# Configure the build. Note: do not add the --enable-optimizations flag
./configure --prefix=/usr/local/python3
# Build and install
make && make install

3. Deploy ChatGLM-6B

3.1 Download the source code

Download ChatGLM-6B directly:

git clone https://github.com/THUDM/ChatGLM-6B.git

3.2 Install dependencies

Enter the ChatGLM-6B directory.

Use pip to install the dependencies: pip install -r requirements.txt. The transformers library version 4.27.1 is recommended, but in principle any version not lower than 4.23.1 will work.
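The version bound can be checked with a simple comparison (a sketch; `parse_version` and `transformers_ok` are illustrative helpers, not part of transformers):

```python
def parse_version(v):
    # "4.27.1" -> (4, 27, 1), so tuples compare numerically
    return tuple(int(part) for part in v.split("."))

def transformers_ok(installed, minimum="4.23.1"):
    # True when the installed version is at least the minimum supported one
    return parse_version(installed) >= parse_version(minimum)

print(transformers_ok("4.27.1"))  # the recommended version -> True
print(transformers_ok("4.22.0"))  # too old -> False
```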

In addition, if you need to run the quantized model on the CPU, you also need to install gcc and openmp. Most Linux distributions have them installed by default. On Windows, you can tick openmp when installing TDM-GCC. The tested gcc versions are TDM-GCC 10.3.0 on Windows and gcc 11.3.0 on Linux.

3.3 Download the model

Download it from the Hugging Face Hub.

You can also download it manually:

git clone https://huggingface.co/THUDM/chatglm-6b

After downloading the model locally, replace THUDM/chatglm-6b in the code below with the path of your local chatglm-6b folder to load the model from disk.

Create a model folder under the chatglm-6b directory to hold the model files.
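The swap between the local path and the Hub id can be automated with a small helper (a sketch; `resolve_model_ref` and the example path are illustrative, not part of the project):

```python
import os

def resolve_model_ref(local_dir, hub_id="THUDM/chatglm-6b"):
    # Prefer a local checkout of the weights; otherwise let
    # transformers fetch the model from the Hugging Face Hub.
    return local_dir if os.path.isdir(local_dir) else hub_id

# Prints the Hub id unless that directory actually exists on your machine.
print(resolve_model_ref("/data/models/chatglm-6b"))
```

The returned string can be passed straight to from_pretrained.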

3.4 Calling the model from code

You can call the ChatGLM-6B model to generate a conversation with the following code:

The implementation of the model is still in flux. If you want to pin the model implementation to ensure compatibility, you can add the revision="v1.1.0" parameter to the from_pretrained call.

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
>>> print(response)

Not being able to sleep at night may make you feel anxious or uncomfortable, but here are some methods to help you fall asleep:

1. Keep a regular sleep schedule: Maintaining a regular sleep schedule can help you establish healthy sleep habits and make it easier to fall asleep. Try to go to bed at the same time every day and get up at the same time.

2. Create a comfortable sleeping environment: Make sure the sleeping environment is comfortable, quiet, dark, and at a suitable temperature. You can use comfortable bedding and keep the room ventilated.

3. Relax your body and mind: Do some relaxing activities before going to bed, such as taking a hot bath, listening to soft music, or reading an interesting book, which can help relieve tension and anxiety and make it easier to fall asleep.

4. Avoid drinks containing caffeine: Caffeine is a stimulant that can affect your sleep quality. Try to avoid caffeinated drinks such as coffee, tea, and cola before going to bed.

5. Avoid doing things unrelated to sleep in bed: Doing things unrelated to sleep in bed, such as watching movies, playing games, or working, may interfere with your sleep.

6. Try breathing techniques: Deep breathing is a relaxation technique that can help relieve tension and anxiety and make it easier to fall asleep. Try inhaling slowly, holding for a few seconds, and then exhaling slowly.

If these methods don't help you fall asleep, consider consulting a doctor or sleep specialist for further advice.
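Note how each call returns both the reply and the updated history, which is passed back in on the next turn. That pattern can be sketched independently of the model (a toy stand-in; `chat_fn` plays the role of `model.chat(tokenizer, ...)` and `echo` is a fake model for illustration):

```python
def chat_round(chat_fn, query, history):
    # chat_fn(query, history) -> response; the running history is a
    # list of (query, response) pairs, extended after every turn.
    response = chat_fn(query, history)
    return response, history + [(query, response)]

# Toy "model" that just echoes the query; the real call would be
# model.chat(tokenizer, query, history=history).
echo = lambda query, history: f"echo: {query}"
response, history = chat_round(echo, "你好", [])
response, history = chat_round(echo, "晚上睡不着应该怎么办", history)
print(len(history))  # 2 -> two rounds recorded
```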

3.5 Low-cost deployment

Model quantization

By default, the model is loaded at FP16 precision, and running the code above requires roughly 13GB of video memory. If your GPU memory is limited, you can try loading the model quantized, as follows:

# Modify as needed; currently only 4-bit and 8-bit quantization are supported

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).quantize(8).half().cuda()

After 2 to 3 rounds of dialogue, GPU memory usage is about 10GB with 8-bit quantization and only 6GB with 4-bit quantization. As the number of dialogue rounds grows, memory consumption grows accordingly. Because relative position encoding is used, ChatGLM-6B theoretically supports an unlimited context length, but performance gradually degrades once the total length exceeds 2048 (the training length).
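The 13/10/6GB figures are roughly what the weights alone would suggest, plus activation and runtime overhead. A back-of-the-envelope estimate (a sketch; the 6.2B parameter count comes from the model description above, and the helper is illustrative):

```python
def weight_memory_gb(params, bits):
    # Memory for the weights alone: params * bits-per-weight / 8 bytes each.
    return params * bits / 8 / 1e9

PARAMS = 6.2e9  # ChatGLM-6B parameter count
for bits, label in [(16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.1f} GB for weights")
```

This yields about 12.4 / 6.2 / 3.1 GB for the weights; the gap up to the observed 13 / 10 / 6 GB is activations, KV cache, and framework overhead.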

Model quantization brings some performance loss. Testing shows ChatGLM-6B can still generate naturally and fluently under 4-bit quantization. Quantization schemes such as GPT-Q can compress the precision further, or improve model performance at the same precision; everyone is welcome to submit corresponding pull requests.

The quantization process first loads the FP16 model into memory, consuming about 13GB of RAM. If you don't have enough memory, you can directly load a pre-quantized model; the INT4 quantized model only needs about 5.2GB of RAM:

# For the INT8 quantized model, replace "THUDM/chatglm-6b-int4" with "THUDM/chatglm-6b-int8"

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()

The parameter files of the quantized models can also be downloaded manually.

3.6 CPU deployment

If you have no GPU hardware, you can also run inference on the CPU, but it will be slower. Usage is as follows (roughly 32GB of RAM is required):

model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()

If you run out of memory, you can directly load the quantized model:

# For the INT8 quantized model, replace "THUDM/chatglm-6b-int4" with "THUDM/chatglm-6b-int8"

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float()

If you encounter the error Could not find module 'nvcuda.dll' or RuntimeError: Unknown platform: darwin (macOS), please load the model from a local path.

3.7 Multi-GPU deployment

If you have multiple GPUs but no single GPU has enough video memory to hold the whole model, you can split the model across several GPUs. First install accelerate: pip install accelerate, then load the model as follows:

from utils import load_model_on_gpus

model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2)

This deploys the model onto two GPUs for inference. You can change num_gpus to the number of GPUs you want to use. The split is even by default; you can also pass the device_map parameter to specify the placement yourself.
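An even split simply assigns consecutive transformer layers to each card. A hand-rolled device_map could be built along these lines (a sketch; the integer-keyed layout and the helper name are illustrative, so check the checkpoint's actual module names before using a custom map):

```python
def even_device_map(num_layers, num_gpus):
    # Map layer index -> GPU id, giving each GPU one contiguous chunk.
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    return {layer: layer // per_gpu for layer in range(num_layers)}

device_map = even_device_map(num_layers=28, num_gpus=2)  # ChatGLM-6B has 28 transformer layers
print(device_map[0], device_map[27])  # 0 1 -> first layer on GPU 0, last on GPU 1
```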

4. System startup

4.1 Web demo

First install Gradio: pip install gradio, then run web_demo.py in the repository:



The program runs a web server and prints its address. Open the address in a browser to use it. The latest version of the demo implements a typewriter effect, which greatly improves the experience. Note that because Gradio's network access is slow in China, enabling demo.queue().launch(share=True, inbrowser=True) forwards all traffic through the Gradio server, severely degrading the typewriter experience. The default startup has therefore been changed to share=False; if you need public network access, change it back to share=True.

4.2 Command-line demo

Run cli_demo.py in the repository:



The program holds an interactive dialogue on the command line: type an instruction and press Enter to generate a reply, type clear to clear the conversation history, and type stop to terminate the program.

4.3 API deployment

First install the extra dependencies: pip install fastapi uvicorn, then run api.py in the repository:


By default, it is deployed on local port 8000 and called via the POST method:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

The response looks like:

{
  "response":"你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。",
  "history":[["你好","你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。"]],
  "time":"2023-03-23 21:38:40"
}
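The same call can be made from Python using only the standard library (a sketch against the endpoint described above; it assumes the api.py server is running on local port 8000, and `build_payload`/`ask` are illustrative helpers):

```python
import json
import urllib.request

def build_payload(prompt, history):
    # The API body: the prompt plus the running history of (query, reply) pairs.
    return json.dumps({"prompt": prompt, "history": history}).encode("utf-8")

def ask(prompt, history=None, url="http://127.0.0.1:8000"):
    # POST to the locally deployed endpoint and unpack its JSON reply.
    req = urllib.request.Request(
        url,
        data=build_payload(prompt, history or []),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"], body["history"]

# Example (requires the server to be running):
# reply, history = ask("你好")
```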

4.4 Common problems in deployment

Problem 1: torch.cuda.OutOfMemoryError: CUDA out of memory

The video memory is clearly insufficient; switching to chatglm-6b-int8 or chatglm-6b-int4 is recommended.

Problem 2: "RuntimeError: Library cudart is not initialized"

This error is usually caused by a missing or corrupted CUDA library file. To resolve it, install the CUDA Toolkit:

# Install the CUDA Toolkit and NVIDIA driver
sudo yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms
sudo yum -y install cuda