
2023-10-18 14:20



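  This article deploys the LLaMA-2 70B model so that it can be called in the same style as the OpenAI API. The Dockerfile used for the deployment is as follows: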

FROM nvidia/cuda:11.7.1-runtime-ubuntu20.04
RUN apt-get update -y && apt-get install -y python3.9 python3.9-distutils curl
RUN curl -o
RUN python3.9
RUN pip3 install fschat


version: "3.9"

services:
  fastchat-controller:
    build:
      context: .
      dockerfile: Dockerfile
    image: fastchat:latest
    ports:
      - "21001:21001"
    entrypoint: ["python3.9", "-m", "fastchat.serve.controller", "--host", "", "--port", "21001"]
  fastchat-model-worker:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - ./model:/root/model
    image: fastchat:latest
    ports:
      - "21002:21002"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    entrypoint: ["python3.9", "-m", "fastchat.serve.model_worker", "--model-names", "llama2-70b-chat", "--model-path", "/root/model/llama2/Llama-2-70b-chat-hf", "--num-gpus", "2", "--gpus", "0,1", "--worker-address", "http://fastchat-model-worker:21002", "--controller-address", "http://fastchat-controller:21001", "--host", "", "--port", "21002"]
  fastchat-api-server:
    build:
      context: .
      dockerfile: Dockerfile
    image: fastchat:latest
    ports:
      - "8000:8000"
    entrypoint: ["python3.9", "-m", "fastchat.serve.openai_api_server", "--controller-address", "http://fastchat-controller:21001", "--host", "", "--port", "8000"]


curl http://localhost:8000/v1/models


{
  "object": "list",
  "data": [
    {
      "id": "llama2-70b-chat",
      "object": "model",
      "created": 1691504717,
      "owned_by": "fastchat",
      "root": "llama2-70b-chat",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-3XG6nzMAqfEkwfNqQ52fdv",
          "object": "model_permission",
          "created": 1691504717,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": true,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

The LLaMA-2 70B model has been deployed successfully!
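
The same check can be scripted. The snippet below is a minimal sketch that parses a /v1/models response body and reports which model names are being served. The helper names list_model_ids and is_model_served are made up for this illustration, and the sample body is a trimmed copy of the response shown above.

```python
import json

# Trimmed copy of the /v1/models response shown above (permission list omitted).
SAMPLE_RESPONSE = """
{"object": "list",
 "data": [{"id": "llama2-70b-chat", "object": "model",
           "created": 1691504717, "owned_by": "fastchat",
           "root": "llama2-70b-chat", "parent": null}]}
"""


def list_model_ids(response_text: str) -> list[str]:
    """Return the id of every model listed in a /v1/models response body."""
    payload = json.loads(response_text)
    return [model["id"] for model in payload.get("data", [])]


def is_model_served(response_text: str, model_name: str) -> bool:
    """Return True if model_name appears in the /v1/models response body."""
    return model_name in list_model_ids(response_text)


print(list_model_ids(SAMPLE_RESPONSE))
```

In practice the response text would come from the API server on port 8000 rather than from a hard-coded string.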

Prompt token length calculation


curl --location 'localhost:21002/count_token' \
--header 'Content-Type: application/json' \
--data '{"prompt": "What is your name?"}'


{"count": 6, "error_code": 0}
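
For use from a script, the same request can be issued with the Python standard library. This is a sketch only: it assumes the model worker is reachable at http://localhost:21002 as in the curl call above, and the function names build_count_token_request and count_tokens are made up for this example. Nothing is sent until count_tokens is called.

```python
import json
import urllib.request

# Assumption: the model worker is exposed on this address, as in the curl call.
WORKER_URL = "http://localhost:21002"


def build_count_token_request(prompt: str, worker_url: str = WORKER_URL) -> urllib.request.Request:
    """Build the POST request for the worker's /count_token endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{worker_url}/count_token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def count_tokens(prompt: str, worker_url: str = WORKER_URL, timeout: float = 30.0) -> int:
    """Send the request and return the token count reported by the worker."""
    request = build_count_token_request(prompt, worker_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    # The worker reports error_code 0 on success, as in the response above.
    if payload.get("error_code", 0) != 0:
        raise RuntimeError(f"count_token failed: {payload}")
    return payload["count"]
```

With the worker running, count_tokens("What is your name?") should return the same count as the curl call.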

Conversation token length calculation

  First, we need to fetch the conversation template of the LLaMA-2 70B model by calling the following API:

curl --location --request POST 'http://localhost:21002/worker_get_conv_template'


{'conv': {'messages': [],
          'name': 'llama-2',
          'offset': 0,
          'roles': ['[INST]', '[/INST]'],
          'sep': ' ',
          'sep2': ' </s><s>',
          'sep_style': 7,
          'stop_str': None,
          'stop_token_ids': [2],
          'system_message': 'You are a helpful, respectful and honest '
                            'assistant. Always answer as helpfully as '
                            'possible, while being safe. Your answers should '
                            'not include any harmful, unethical, racist, '
                            'sexist, toxic, dangerous, or illegal content. '
                            'Please ensure that your responses are socially '
                            'unbiased and positive in nature.\n'
                            '\n'
                            'If a question does not make any sense, or is not '
                            'factually coherent, explain why instead of '
                            "answering something not correct. If you don't "
                            "know the answer to a question, please don't share "
                            'false information.',
          'system_template': '[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n'}}


messages = [
    {"role": "system", "content": "You are Jack, you are 20 years old, answer questions with humor."},
    {"role": "user", "content": "What is your name?"},
    {"role": "assistant", "content": " Well, well, well! Look who's asking the questions now! My name is Jack, but you can call me the king of the castle, the lord of the rings, or the prince of the pizza party. Whatever floats your boat, my friend!"},
    {"role": "user", "content": "How old are you?"},
    {"role": "assistant", "content": " Oh, you want to know my age? Well, let's just say I'm older than a bottle of wine but younger than a bottle of whiskey. I'm like a fine cheese, getting better with age, but still young enough to party like it's 1999!"},
    {"role": "user", "content": "Where is your hometown?"},
]


# -*- coding: utf-8 -*-
# @place: Pudong, Shanghai
# @file:
# @time: 2023/8/8 19:24
from conversation import Conversation, SeparatorStyle

messages = [
    {"role": "system", "content": "You are Jack, you are 20 years old, answer questions with humor."},
    {"role": "user", "content": "What is your name?"},
    {"role": "assistant", "content": " Well, well, well! Look who's asking the questions now! My name is Jack, but you can call me the king of the castle, the lord of the rings, or the prince of the pizza party. Whatever floats your boat, my friend!"},
    {"role": "user", "content": "How old are you?"},
    {"role": "assistant", "content": " Oh, you want to know my age? Well, let's just say I'm older than a bottle of wine but younger than a bottle of whiskey. I'm like a fine cheese, getting better with age, but still young enough to party like it's 1999!"},
    {"role": "user", "content": "Where is your hometown?"},
]

# Conversation template returned by the worker_get_conv_template API.
llama2_conv = {
    "conv": {
        "name": "llama-2",
        "system_template": "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n",
        "system_message": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.",
        "roles": ["[INST]", "[/INST]"],
        "messages": [],
        "offset": 0,
        "sep_style": 7,
        "sep": " ",
        "sep2": " </s><s>",
        "stop_str": None,
        "stop_token_ids": [2],
    }
}

conv = llama2_conv['conv']

conv = Conversation(
    name=conv["name"],
    system_template=conv["system_template"],
    system_message=conv["system_message"],
    roles=conv["roles"],
    messages=list(conv["messages"]),  # prevent in-place modification
    offset=conv["offset"],
    sep_style=SeparatorStyle(conv["sep_style"]),
    sep=conv["sep"],
    sep2=conv["sep2"],
    stop_str=conv["stop_str"],
    stop_token_ids=conv["stop_token_ids"],
)

if isinstance(messages, str):
    prompt = messages
else:
    for message in messages:
        msg_role = message["role"]
        if msg_role == "system":
            conv.set_system_message(message["content"])
        elif msg_role == "user":
            conv.append_message(conv.roles[0], message["content"])
        elif msg_role == "assistant":
            conv.append_message(conv.roles[1], message["content"])
        else:
            raise ValueError(f"Unknown role: {msg_role}")
    # Add a blank message for the assistant.
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

print(repr(prompt))


"[INST] <<SYS>>\nYou are Jack, you are 20 years old, answer questions with humor.\n<</SYS>>\n\nWhat is your name?[/INST]  Well, well, well! Look who's asking the questions now! My name is Jack, but you can call me the king of the castle, the lord of the rings, or the prince of the pizza party. Whatever floats your boat, my friend! </s><s>[INST] How old are you? [/INST]  Oh, you want to know my age? Well, let's just say I'm older than a bottle of wine but younger than a bottle of whiskey. I'm like a fine cheese, getting better with age, but still young enough to party like it's 1999! </s><s>[INST] Where is your hometown? [/INST]"

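  Finally, call the prompt token counting API again (see the previous section, Prompt token length calculation) with this prompt. The token length of the conversation is 199.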
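
To make the prompt assembly easier to follow without the Conversation class, here is a simplified, dependency-free sketch. It is inferred only from the template fields (system_template, roles, sep, sep2) and the printed prompt above, so it is not FastChat's actual implementation. The function name build_llama2_prompt is made up for this example, and the sketch assumes the messages include a system message instead of falling back to the template's default one.

```python
from typing import Optional

# Values copied from the conversation template returned by the worker.
SYSTEM_TEMPLATE = "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
ROLES = ("[INST]", "[/INST]")
SEP, SEP2 = " ", " </s><s>"


def build_llama2_prompt(messages: list[dict]) -> str:
    """Assemble a LLaMA-2 chat prompt from OpenAI-style messages (simplified sketch)."""
    system_message = ""
    turns: list[tuple[str, Optional[str]]] = []
    for message in messages:
        role = message["role"]
        if role == "system":
            system_message = message["content"]
        elif role == "user":
            turns.append((ROLES[0], message["content"]))
        elif role == "assistant":
            turns.append((ROLES[1], message["content"]))
        else:
            raise ValueError(f"Unknown role: {role}")
    # Blank assistant turn: the model continues from the final [/INST] tag.
    turns.append((ROLES[1], None))

    prompt = SYSTEM_TEMPLATE.format(system_message=system_message)
    seps = (SEP, SEP2)
    for i, (tag, content) in enumerate(turns):
        if content is None:
            prompt += tag
        elif i == 0:
            # The first user message directly follows the system block.
            prompt += content
        else:
            # User turns end with sep, assistant turns end with sep2.
            prompt += tag + " " + content + seps[i % 2]
    return prompt
```

Passing the resulting string to the count_token API gives the token length of the whole conversation, which is the quantity discussed in this section.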

curl --location 'http://localhost:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "llama2-70b-chat",
  "messages": [
    {"role": "system", "content": "You are Jack, you are 20 years old, answer questions with humor."},
    {"role": "user", "content": "What is your name?"},
    {"role": "assistant", "content": " Well, well, well! Look who'\''s asking the questions now! My name is Jack, but you can call me the king of the castle, the lord of the rings, or the prince of the pizza party. Whatever floats your boat, my friend!"},
    {"role": "user", "content": "How old are you?"},
    {"role": "assistant", "content": " Oh, you want to know my age? Well, let'\''s just say I'\''m older than a bottle of wine but younger than a bottle of whiskey. I'\''m like a fine cheese, getting better with age, but still young enough to party like it'\''s 1999!"},
    {"role": "user", "content": "Where is your hometown?"}
  ]
}'


{
  "id": "chatcmpl-mQxcaQcNSNMFahyHS7pamA",
  "object": "chat.completion",
  "created": 1691506768,
  "model": "llama2-70b-chat",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Ha! My hometown? Well, that's a tough one. I'm like a bird, I don't have a nest, I just fly around and land wherever the wind takes me. But if you really want to know, I'm from a place called \"The Internet\". It's a magical land where memes and cat videos roam free, and the Wi-Fi is always strong. It's a beautiful place, you should visit sometime!"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 199,
    "total_tokens": 302,
    "completion_tokens": 103
  }
}
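
The usage block in this response gives a quick cross-check of the manual calculation: prompt_tokens should equal the 199 tokens counted earlier, and prompt_tokens plus completion_tokens should equal total_tokens. A small sketch of that check follows, where usage_is_consistent is a helper name made up for this example:

```python
# Usage block copied from the chat completion response above.
USAGE = {"prompt_tokens": 199, "total_tokens": 302, "completion_tokens": 103}


def usage_is_consistent(usage: dict, expected_prompt_tokens: int) -> bool:
    """Check the reported usage against a manually computed prompt length."""
    prompt_matches = usage["prompt_tokens"] == expected_prompt_tokens
    total_matches = (
        usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
    )
    return prompt_matches and total_matches


print(usage_is_consistent(USAGE, expected_prompt_tokens=199))  # True: 199 + 103 == 302
```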



  This article showed how to deploy the LLaMA-2 70B model with FastChat, and explained in detail how to calculate the token length of a prompt and of a conversation. I hope it is helpful to readers.










