(Section 3.1) How slow is running AI locally? LLM inference speed test (llama.cpp, Intel GPU A770)
Because this article is long, it is published in separate parts for easier reading.
3.1 CPU (i5-6200U, 2C/4T/2.8GHz) x86_64 AVX2
Run on PC No. 4 (physical machine). Version:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli --version
version: 3617 (a07c32ea)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Running model llama2-7B.q4, generation length 100:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724500181
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story. nobody wants to read this much. just tell me what happened.
(part of the output omitted here)
llama_print_timings: load time = 2666.87 ms
llama_print_timings: sample time = 5.38 ms / 100 runs ( 0.05 ms per token, 18580.45 tokens per second)
llama_print_timings: prompt eval time = 1898.40 ms / 10 tokens ( 189.84 ms per token, 5.27 tokens per second)
llama_print_timings: eval time = 28113.06 ms / 99 runs ( 283.97 ms per token, 3.52 tokens per second)
llama_print_timings: total time = 30034.85 ms / 109 tokens
Log end
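The `llama_print_timings` lines are the numbers this test is after, so it can help to pull them out of a saved log instead of reading them by eye. The following is a minimal sketch, assuming the log was saved as text in the line format shown above; the names `parse_timings` and `tokens_per_second` are hypothetical helpers written for this illustration and are not part of llama.cpp.

```python
import re

# Matches one llama_print_timings line in the format shown in the log above.
# The "/ N runs" or "/ N tokens" part is optional ("load time" has no count).
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)

def parse_timings(log_text):
    """Return {phase: (milliseconds, count or None)} for each timing line."""
    result = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        result[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return result

def tokens_per_second(ms, count):
    """Convert a (milliseconds, token count) pair into tokens per second."""
    return count / (ms / 1000.0)

# Timing lines copied from the llama2-7B.q4 run with generation length 100.
sample_log = """\
llama_print_timings: load time = 2666.87 ms
llama_print_timings: sample time = 5.38 ms / 100 runs ( 0.05 ms per token, 18580.45 tokens per second)
llama_print_timings: prompt eval time = 1898.40 ms / 10 tokens ( 189.84 ms per token, 5.27 tokens per second)
llama_print_timings: eval time = 28113.06 ms / 99 runs ( 283.97 ms per token, 3.52 tokens per second)
llama_print_timings: total time = 30034.85 ms / 109 tokens
"""

timings = parse_timings(sample_log)
eval_ms, eval_runs = timings["eval"]
print(f"eval speed: {tokens_per_second(eval_ms, eval_runs):.2f} tokens/s")
```

The `eval` entry is the generation speed, and `prompt eval` is the speed of processing the 10 prompt tokens; the two are worth keeping apart because they differ noticeably in every run.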
Running model llama2-7B.q4, generation length 200:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
(part of the output omitted here)
llama_print_timings: load time = 2703.62 ms
llama_print_timings: sample time = 12.85 ms / 200 runs ( 0.06 ms per token, 15560.57 tokens per second)
llama_print_timings: prompt eval time = 1873.80 ms / 10 tokens ( 187.38 ms per token, 5.34 tokens per second)
llama_print_timings: eval time = 59352.84 ms / 199 runs ( 298.26 ms per token, 3.35 tokens per second)
llama_print_timings: total time = 61281.14 ms / 209 tokens
Running model llama2-7B.q4, generation length 500:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
(part of the output omitted here)
llama_print_timings: load time = 2706.04 ms
llama_print_timings: sample time = 33.77 ms / 500 runs ( 0.07 ms per token, 14808.23 tokens per second)
llama_print_timings: prompt eval time = 1866.60 ms / 10 tokens ( 186.66 ms per token, 5.36 tokens per second)
llama_print_timings: eval time = 154145.54 ms / 499 runs ( 308.91 ms per token, 3.24 tokens per second)
llama_print_timings: total time = 156146.19 ms / 509 tokens
Running model llama2-7B.q4, generation length 1000:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
(part of the output omitted here)
llama_print_timings: load time = 2912.39 ms
llama_print_timings: sample time = 60.76 ms / 1000 runs ( 0.06 ms per token, 16457.65 tokens per second)
llama_print_timings: prompt eval time = 1870.87 ms / 10 tokens ( 187.09 ms per token, 5.35 tokens per second)
llama_print_timings: eval time = 335019.17 ms / 999 runs ( 335.35 ms per token, 2.98 tokens per second)
llama_print_timings: total time = 337155.40 ms / 1009 tokens
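Across the four llama2-7B.q4 runs above, the eval speed drops as the generation length grows (3.52, 3.35, 3.24, 2.98 tokens per second). A small sketch to quantify that drop relative to the shortest run; the numbers are copied from the logs, and `relative_slowdown` is a hypothetical helper used only for this illustration.

```python
# (n_predict, eval tokens per second) copied from the four runs above.
runs = [(100, 3.52), (200, 3.35), (500, 3.24), (1000, 2.98)]

def relative_slowdown(runs):
    """Return [(n_predict, percent slower than the first run)]."""
    baseline = runs[0][1]
    return [(n, (1 - tps / baseline) * 100) for n, tps in runs]

slowdowns = relative_slowdown(runs)
for n, pct in slowdowns:
    print(f"n_predict={n:5d}: {pct:5.1f}% slower than n_predict={runs[0][0]}")
```

On this machine the 1000-token run is roughly 15% slower per token than the 100-token run. A plausible reason is that each new token attends over a longer context, but the logs alone do not prove that; other factors such as thermal throttling on a laptop CPU cannot be ruled out from this data.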
Running model qwen2-7B.q8, generation length 100:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724501237
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = qwen2-7b-instruct
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: quantize.imatrix.file str = ../Qwen2/gguf/qwen2-7b-imatrix/imatri...
llama_model_loader: - kv 23: quantize.imatrix.dataset str = ../sft_2406.txt
llama_model_loader: - kv 24: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv 25: quantize.imatrix.chunks_count i32 = 1937
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 28
llm_load_print_meta: n_head = 28
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 7
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18944
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 7.62 B
llm_load_print_meta: model size = 7.54 GiB (8.50 BPW)
llm_load_print_meta: general.name = qwen2-7b-instruct
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU buffer size = 7717.68 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 1884.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story and it is very complicated.
(part of the output omitted here)
llama_print_timings: load time = 5355.79 ms
llama_print_timings: sample time = 16.50 ms / 100 runs ( 0.17 ms per token, 6059.14 tokens per second)
llama_print_timings: prompt eval time = 1727.39 ms / 9 tokens ( 191.93 ms per token, 5.21 tokens per second)
llama_print_timings: eval time = 41066.65 ms / 99 runs ( 414.81 ms per token, 2.41 tokens per second)
llama_print_timings: total time = 42914.72 ms / 108 tokens
Log end
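The `KV self size` values in the two logs above (2048.00 MiB for llama2-7B, 1792.00 MiB for qwen2-7B) can be reproduced from the printed metadata. The sketch below assumes the cache holds one f16 value (2 bytes) per layer, per context position, per K or V embedding dimension; that formula is an assumption checked only against these two logs, not taken from the llama.cpp source. `kv_cache_mib` is a hypothetical helper name.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd_k_gqa, n_embd_v_gqa, bytes_per_elem=2):
    """Return (K size, V size) in MiB, assuming f16 (2 bytes) per element."""
    k_bytes = n_layer * n_ctx * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_layer * n_ctx * n_embd_v_gqa * bytes_per_elem
    return k_bytes / 2**20, v_bytes / 2**20

# Values copied from the llm_load_print_meta / llama_new_context_with_model lines.
llama2_k, llama2_v = kv_cache_mib(n_layer=32, n_ctx=4096,
                                  n_embd_k_gqa=4096, n_embd_v_gqa=4096)
qwen2_k, qwen2_v = kv_cache_mib(n_layer=28, n_ctx=32768,
                                n_embd_k_gqa=512, n_embd_v_gqa=512)

print(f"llama2-7B: K={llama2_k:.2f} MiB, V={llama2_v:.2f} MiB, "
      f"total={llama2_k + llama2_v:.2f} MiB")
print(f"qwen2-7B:  K={qwen2_k:.2f} MiB, V={qwen2_v:.2f} MiB, "
      f"total={qwen2_k + qwen2_v:.2f} MiB")
```

Under this assumption, qwen2-7B ends up with a smaller KV cache than llama2-7B even though its default context is 8 times longer (32768 vs 4096), because its `n_embd_k_gqa` is 512 instead of 4096 (`n_gqa = 7` vs `n_gqa = 1`).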
Running model qwen2-7B.q8, generation length 200:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
(part of the output omitted here)
llama_print_timings: load time = 4641.45 ms
llama_print_timings: sample time = 34.69 ms / 200 runs ( 0.17 ms per token, 5765.85 tokens per second)
llama_print_timings: prompt eval time = 1735.51 ms / 9 tokens ( 192.83 ms per token, 5.19 tokens per second)
llama_print_timings: eval time = 84374.46 ms / 199 runs ( 423.99 ms per token, 2.36 tokens per second)
llama_print_timings: total time = 86360.14 ms / 208 tokens
Running model qwen2-7B.q8, generation length 500:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
(part of the output omitted here)
llama_print_timings: load time = 5026.41 ms
llama_print_timings: sample time = 91.64 ms / 500 runs ( 0.18 ms per token, 5456.37 tokens per second)
llama_print_timings: prompt eval time = 1713.90 ms / 9 tokens ( 190.43 ms per token, 5.25 tokens per second)
llama_print_timings: eval time = 214729.88 ms / 499 runs ( 430.32 ms per token, 2.32 tokens per second)
llama_print_timings: total time = 217097.31 ms / 508 tokens
Running model qwen2-7B.q8, generation length 1000:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
(part of the output omitted here)
llama_print_timings: load time = 4939.31 ms
llama_print_timings: sample time = 194.02 ms / 1000 runs ( 0.19 ms per token, 5154.00 tokens per second)
llama_print_timings: prompt eval time = 1879.29 ms / 9 tokens ( 208.81 ms per token, 4.79 tokens per second)
llama_print_timings: eval time = 440575.12 ms / 999 runs ( 441.02 ms per token, 2.27 tokens per second)
llama_print_timings: total time = 443841.74 ms / 1008 tokens
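The two models in this section differ in quantization, which the loader reports as bits per weight (`4.84 BPW` for llama2-7B Q4_K_M, `8.50 BPW` for qwen2-7B Q8_0). As a sanity check, the sketch below recomputes BPW from the printed `model size` and `model params`, assuming "GiB" means 2^30 bytes and "B" means 10^9 parameters. `bits_per_weight` is a hypothetical helper, and the inputs are the rounded values from the logs, so only about two decimals of agreement can be expected.

```python
GIB = 2**30  # bytes per GiB (assumed meaning of "GiB" in the log)

def bits_per_weight(size_gib, params_billion):
    """Bits per weight from model size in GiB and parameter count in billions."""
    return size_gib * GIB * 8 / (params_billion * 1e9)

# model size / model params copied from the llm_load_print_meta lines.
llama2_bpw = bits_per_weight(3.80, 6.74)
qwen2_bpw = bits_per_weight(7.54, 7.62)

print(f"llama2-7B Q4_K_M: {llama2_bpw:.2f} BPW")
print(f"qwen2-7B Q8_0:    {qwen2_bpw:.2f} BPW")
```

The q8 model is about twice the size of the q4 model, which is worth keeping in mind when comparing their tokens-per-second figures: the two sets of runs differ in both model architecture and quantization.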
3.2 CPU (E5-2650v3, 10C/10T/3.0GHz) x86_64 AVX2
Run on machine No. 5 (physical machine). Version:
fc-test@MiWiFi-RA74-srv:~/llama-cpp$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli --version
./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli: /lib64/libcurl.so.4: no version information available (required by ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli)
version: 3617 (a07c32ea)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Running model llama2-7B.q4, generation length 100:
fc-test@MiWiFi-RA74-srv:~/llama-cpp$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli: /lib64/libcurl.so.4: no version information available (required by ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli)
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724498199
(part of the output omitted here)
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, but this is the only way I could explain what I did to solve this problem. everyone here said it cannot be done, but I did it. I don't know why I can solve it, but I did.
(part of the output omitted here)
llama_print_timings: load time = 1542.10 ms
llama_print_timings: sample time = 4.82 ms / 100 runs ( 0.05 ms per token, 20768.43 tokens per second)
llama_print_timings: prompt eval time = 493.57 ms / 10 tokens ( 49.36 ms per token, 20.26 tokens per second)
llama_print_timings: eval time = 10175.47 ms / 99 runs ( 102.78 ms per token, 9.73 tokens per second)
llama_print_timings: total time = 10693.97 ms / 109 tokens
Log end
Running model llama2-7B.q4, generation length 200:
$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
(part of the output omitted here)
llama_print_timings: load time = 1607.02 ms
llama_print_timings: sample time = 9.29 ms / 200 runs ( 0.05 ms per token, 21528.53 tokens per second)
llama_print_timings: prompt eval time = 494.35 ms / 10 tokens ( 49.44 ms per token, 20.23 tokens per second)
llama_print_timings: eval time = 20434.74 ms / 199 runs ( 102.69 ms per token, 9.74 tokens per second)
llama_print_timings: total time = 20978.91 ms / 209 tokens
Running model llama2-7B.q4, generation length 500:
$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
(part of the output omitted here)
llama_print_timings: load time = 1583.59 ms
llama_print_timings: sample time = 23.55 ms / 500 runs ( 0.05 ms per token, 21226.92 tokens per second)
llama_print_timings: prompt eval time = 499.12 ms / 10 tokens ( 49.91 ms per token, 20.04 tokens per second)
llama_print_timings: eval time = 52358.53 ms / 499 runs ( 104.93 ms per token, 9.53 tokens per second)
llama_print_timings: total time = 52987.01 ms / 509 tokens
Running model llama2-7B.q4, generation length 1000:
$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
(part of the output omitted here)
llama_print_timings: load time = 3247.78 ms
llama_print_timings: sample time = 47.13 ms / 1000 runs ( 0.05 ms per token, 21218.81 tokens per second)
llama_print_timings: prompt eval time = 2596.30 ms / 10 tokens ( 259.63 ms per token, 3.85 tokens per second)
llama_print_timings: eval time = 118042.47 ms / 999 runs ( 118.16 ms per token, 8.46 tokens per second)
llama_print_timings: total time = 120896.74 ms / 1009 tokens
Running model qwen2-7B.q8, generation length 100:
fc-test@MiWiFi-RA74-srv:~/llama-cpp$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli: /lib64/libcurl.so.4: no version information available (required by ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli)
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724498632
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted here)
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU buffer size = 7717.68 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 1884.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story, so i will split it into parts.
(part of the output omitted here)
llama_print_timings: load time = 1626.44 ms
llama_print_timings: sample time = 14.31 ms / 100 runs ( 0.14 ms per token, 6987.63 tokens per second)
llama_print_timings: prompt eval time = 507.61 ms / 9 tokens ( 56.40 ms per token, 17.73 tokens per second)
llama_print_timings: eval time = 14615.79 ms / 99 runs ( 147.63 ms per token, 6.77 tokens per second)
llama_print_timings: total time = 15238.41 ms / 108 tokens
Log end
Running model qwen2-7B.q8, generation length 200:
$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
(part of the output omitted here)
llama_print_timings: load time = 1577.00 ms
llama_print_timings: sample time = 28.41 ms / 200 runs ( 0.14 ms per token, 7039.03 tokens per second)
llama_print_timings: prompt eval time = 503.02 ms / 9 tokens ( 55.89 ms per token, 17.89 tokens per second)
llama_print_timings: eval time = 28940.41 ms / 199 runs ( 145.43 ms per token, 6.88 tokens per second)
llama_print_timings: total time = 29668.90 ms / 208 tokens
Running model qwen2-7B.q8, generation length 500:
$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
(part of the output omitted here)
llama_print_timings: load time = 1598.72 ms
llama_print_timings: sample time = 72.10 ms / 500 runs ( 0.14 ms per token, 6935.01 tokens per second)
llama_print_timings: prompt eval time = 502.73 ms / 9 tokens ( 55.86 ms per token, 17.90 tokens per second)
llama_print_timings: eval time = 72983.23 ms / 499 runs ( 146.26 ms per token, 6.84 tokens per second)
llama_print_timings: total time = 74061.66 ms / 508 tokens
Running model qwen2-7B.q8, generation length 1000:
$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
(part of the output omitted here)
llama_print_timings: load time = 1602.06 ms
llama_print_timings: sample time = 144.15 ms / 1000 runs ( 0.14 ms per token, 6937.31 tokens per second)
llama_print_timings: prompt eval time = 509.66 ms / 9 tokens ( 56.63 ms per token, 17.66 tokens per second)
llama_print_timings: eval time = 149336.77 ms / 999 runs ( 149.49 ms per token, 6.69 tokens per second)
llama_print_timings: total time = 150983.01 ms / 1008 tokens
3.3 CPU (r5-5600g, 6C/12T/4.4GHz) x86_64 AVX2
Run on PC No. 6 (physical machine). Version:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli --version
version: 3617 (a07c32ea)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Running model llama2-7B.q4, generation length 100:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724488187
(part of the output omitted here)
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, but i think it's important to read.
(part of the output omitted here)
llama_print_timings: load time = 649.76 ms
llama_print_timings: sample time = 2.40 ms / 100 runs ( 0.02 ms per token, 41701.42 tokens per second)
llama_print_timings: prompt eval time = 311.37 ms / 10 tokens ( 31.14 ms per token, 32.12 tokens per second)
llama_print_timings: eval time = 9771.88 ms / 99 runs ( 98.71 ms per token, 10.13 tokens per second)
llama_print_timings: total time = 10092.46 ms / 109 tokens
Log end
Running model llama2-7B.q4, generation length 200:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
(part of the output omitted here)
llama_print_timings: load time = 650.76 ms
llama_print_timings: sample time = 5.08 ms / 200 runs ( 0.03 ms per token, 39331.37 tokens per second)
llama_print_timings: prompt eval time = 308.01 ms / 10 tokens ( 30.80 ms per token, 32.47 tokens per second)
llama_print_timings: eval time = 19887.24 ms / 199 runs ( 99.94 ms per token, 10.01 tokens per second)
llama_print_timings: total time = 20214.70 ms / 209 tokens
Running model llama2-7B.q4, generation length 500:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
(part of the output omitted here)
llama_print_timings: load time = 648.51 ms
llama_print_timings: sample time = 12.16 ms / 500 runs ( 0.02 ms per token, 41128.57 tokens per second)
llama_print_timings: prompt eval time = 308.95 ms / 10 tokens ( 30.89 ms per token, 32.37 tokens per second)
llama_print_timings: eval time = 51687.76 ms / 499 runs ( 103.58 ms per token, 9.65 tokens per second)
llama_print_timings: total time = 52043.21 ms / 509 tokens
Running model llama2-7B.q4, generation length 1000:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
(part of the output omitted here)
llama_print_timings: load time = 648.60 ms
llama_print_timings: sample time = 24.13 ms / 1000 runs ( 0.02 ms per token, 41438.75 tokens per second)
llama_print_timings: prompt eval time = 311.58 ms / 10 tokens ( 31.16 ms per token, 32.09 tokens per second)
llama_print_timings: eval time = 107409.32 ms / 999 runs ( 107.52 ms per token, 9.30 tokens per second)
llama_print_timings: total time = 107815.70 ms / 1009 tokens
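With the llama2-7B.q4 runs now available on all three CPUs (sections 3.1, 3.2 and 3.3), the eval speeds can be put side by side. The sketch below copies the `eval time` tokens-per-second figures from the logs and computes the speedup over the i5-6200U; `speedup_table` is a hypothetical helper for this illustration. Note that the thread counts differ between machines (2, 10 and 6, per the `system_info` lines), so this compares the setups as run here, not the CPUs in general.

```python
# eval tokens per second for llama2-7B.q4, keyed by CPU and generation length,
# copied from the llama_print_timings lines in sections 3.1 to 3.3.
eval_tps = {
    "i5-6200U":  {100: 3.52,  200: 3.35,  500: 3.24, 1000: 2.98},
    "E5-2650v3": {100: 9.73,  200: 9.74,  500: 9.53, 1000: 8.46},
    "r5-5600g":  {100: 10.13, 200: 10.01, 500: 9.65, 1000: 9.30},
}

def speedup_table(eval_tps, baseline):
    """Return {cpu: {n_predict: speed relative to the baseline CPU}}."""
    base = eval_tps[baseline]
    return {
        cpu: {n: tps / base[n] for n, tps in by_len.items()}
        for cpu, by_len in eval_tps.items()
    }

table = speedup_table(eval_tps, baseline="i5-6200U")
for cpu, by_len in table.items():
    cells = "  ".join(f"n={n}: {ratio:.2f}x" for n, ratio in by_len.items())
    print(f"{cpu:10s} {cells}")
```

Both the E5-2650v3 and the r5-5600g land at roughly 3 times the i5-6200U for this model, and they stay close to each other even though the E5 uses 10 threads and the r5 only 6.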
Running model qwen2-7B.q8, generation length 100:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724489633
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted here)
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: CPU buffer size = 7717.68 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
llama_new_context_with_model: CPU compute buffer size = 1884.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story about my friend and her husband, so please bear with me.
(part of the output omitted)
llama_print_timings: load time = 1158.78 ms
llama_print_timings: sample time = 8.32 ms / 100 runs ( 0.08 ms per token, 12025.01 tokens per second)
llama_print_timings: prompt eval time = 457.69 ms / 9 tokens ( 50.85 ms per token, 19.66 tokens per second)
llama_print_timings: eval time = 17878.08 ms / 99 runs ( 180.59 ms per token, 5.54 tokens per second)
llama_print_timings: total time = 18402.49 ms / 108 tokens
Log end
Run model qwen2-7B.q8, generation length 200:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
(part of the output omitted)
llama_print_timings: load time = 1109.41 ms
llama_print_timings: sample time = 13.17 ms / 200 runs ( 0.07 ms per token, 15181.42 tokens per second)
llama_print_timings: prompt eval time = 496.57 ms / 9 tokens ( 55.17 ms per token, 18.12 tokens per second)
llama_print_timings: eval time = 35791.00 ms / 199 runs ( 179.85 ms per token, 5.56 tokens per second)
llama_print_timings: total time = 36411.02 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
(part of the output omitted)
llama_print_timings: load time = 1061.77 ms
llama_print_timings: sample time = 40.61 ms / 500 runs ( 0.08 ms per token, 12311.03 tokens per second)
llama_print_timings: prompt eval time = 409.44 ms / 9 tokens ( 45.49 ms per token, 21.98 tokens per second)
llama_print_timings: eval time = 90250.99 ms / 499 runs ( 180.86 ms per token, 5.53 tokens per second)
llama_print_timings: total time = 90991.53 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000:
> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
(part of the output omitted)
llama_print_timings: load time = 977.25 ms
llama_print_timings: sample time = 60.87 ms / 1000 runs ( 0.06 ms per token, 16428.99 tokens per second)
llama_print_timings: prompt eval time = 479.25 ms / 9 tokens ( 53.25 ms per token, 18.78 tokens per second)
llama_print_timings: eval time = 182514.10 ms / 999 runs ( 182.70 ms per token, 5.47 tokens per second)
llama_print_timings: total time = 183593.03 ms / 1008 tokens
3.4 iGPU (Intel HD520, i5-6200U) vulkan
Running on PC No. 4 (physical machine). Version:
> ./llama-cli-vulkan-b3617 --version
version: 1 (a07c32e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Run model llama2-7B.q4, generation length 100:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 100
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724502840
(part of the output omitted)
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 520 (SKL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Intel(R) HD Graphics 520 (SKL GT2) buffer size = 3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) HD Graphics 520 (SKL GT2) KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.12 MiB
llama_new_context_with_model: Intel(R) HD Graphics 520 (SKL GT2) compute buffer size = 296.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story but i will try and make it short
(part of the output omitted)
llama_print_timings: load time = 27305.92 ms
llama_print_timings: sample time = 20.64 ms / 100 runs ( 0.21 ms per token, 4844.49 tokens per second)
llama_print_timings: prompt eval time = 10725.27 ms / 10 tokens ( 1072.53 ms per token, 0.93 tokens per second)
llama_print_timings: eval time = 104246.69 ms / 99 runs ( 1053.00 ms per token, 0.95 tokens per second)
llama_print_timings: total time = 115065.04 ms / 109 tokens
Log end
Run model llama2-7B.q4, generation length 200:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 200
(part of the output omitted)
llama_print_timings: load time = 26358.11 ms
llama_print_timings: sample time = 43.34 ms / 200 runs ( 0.22 ms per token, 4615.21 tokens per second)
llama_print_timings: prompt eval time = 10579.07 ms / 10 tokens ( 1057.91 ms per token, 0.95 tokens per second)
llama_print_timings: eval time = 209900.70 ms / 199 runs ( 1054.78 ms per token, 0.95 tokens per second)
llama_print_timings: total time = 220666.27 ms / 209 tokens
Run model llama2-7B.q4, generation length 500:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 500
(part of the output omitted)
llama_print_timings: load time = 27769.47 ms
llama_print_timings: sample time = 100.38 ms / 500 runs ( 0.20 ms per token, 4981.17 tokens per second)
llama_print_timings: prompt eval time = 10573.54 ms / 10 tokens ( 1057.35 ms per token, 0.95 tokens per second)
llama_print_timings: eval time = 532338.80 ms / 499 runs ( 1066.81 ms per token, 0.94 tokens per second)
llama_print_timings: total time = 543350.42 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 1000
(part of the output omitted)
llama_print_timings: load time = 29646.65 ms
llama_print_timings: sample time = 179.74 ms / 1000 runs ( 0.18 ms per token, 5563.62 tokens per second)
llama_print_timings: prompt eval time = 10538.36 ms / 10 tokens ( 1053.84 ms per token, 0.95 tokens per second)
llama_print_timings: eval time = 1089916.74 ms / 999 runs ( 1091.01 ms per token, 0.92 tokens per second)
llama_print_timings: total time = 1101057.43 ms / 1009 tokens
Run model qwen2-7B.q8. This fails: the model cannot run, and the log reports that the device is out of memory:
> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -n 100
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724508115
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted)
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 520 (SKL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 552.23 MiB
llm_load_tensors: Intel(R) HD Graphics 520 (SKL GT2) buffer size = 7165.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_vulkan: Device memory allocation of size 1879048192 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'qwen2-7b-instruct-q8_0.gguf'
main: error: unable to load model
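The size of the failed allocation (1879048192 bytes) is exactly the KV cache that llama.cpp tries to reserve for the model's default context of 32768 tokens. The arithmetic can be checked with a small sketch. `kv_cache_bytes` is a hypothetical helper, and the Qwen2-7B shape used here (28 layers, 4 KV heads, head dimension 128) is an assumption taken from the published model configuration, since that part of the log is omitted above. The Llama-2-7B shape comes from the metadata printed in section 3.1.

```python
MIB = 1024 * 1024


def kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim, bytes_per_elem=2):
    """Bytes needed for the K and V caches of one sequence (f16 by default)."""
    per_tensor = n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem
    return 2 * per_tensor  # one tensor for K, one for V


# Assumed Qwen2-7B shape: 28 layers, 4 KV heads (GQA), head dimension 128.
qwen2_full = kv_cache_bytes(n_ctx=32768, n_layer=28, n_head_kv=4, head_dim=128)
print(qwen2_full, qwen2_full / MIB)  # 1879048192 1792.0

# Llama-2-7B: 32 layers, 32 KV heads, head dimension 128, context 4096.
llama2_full = kv_cache_bytes(n_ctx=4096, n_layer=32, n_head_kv=32, head_dim=128)
print(llama2_full / MIB)  # 2048.0

# The same Qwen2-7B shape with the context limited to 4096 tokens.
qwen2_small = kv_cache_bytes(n_ctx=4096, n_layer=28, n_head_kv=4, head_dim=128)
print(qwen2_small / MIB)  # 224.0
```

Both full-context results match the "KV self size" values in the logs (1792.00 MiB and 2048.00 MiB). Passing a smaller context with llama-cli's `-c` option (for example `-c 4096`) would shrink the cache to about 224 MiB. Whether the 7165.44 MiB of q8 weights would then fit on the HD 520 was not tested here.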
3.5 iGPU (AMD Radeon Vega 7, r5-5600g) vulkan
Running on PC No. 6 (physical machine). Version:
> ./llama-cli-vulkan-b3617 --version
version: 1 (a07c32e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Run model llama2-7B.q4, generation length 100:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724488777
(part of the output omitted)
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: AMD Radeon Graphics (RADV RENOIR) buffer size = 3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon Graphics (RADV RENOIR) KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.12 MiB
llama_new_context_with_model: AMD Radeon Graphics (RADV RENOIR) compute buffer size = 296.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story and it's only the first episode.
(part of the output omitted)
llama_print_timings: load time = 3300.29 ms
llama_print_timings: sample time = 4.41 ms / 100 runs ( 0.04 ms per token, 22686.03 tokens per second)
llama_print_timings: prompt eval time = 1028.22 ms / 10 tokens ( 102.82 ms per token, 9.73 tokens per second)
llama_print_timings: eval time = 23080.64 ms / 99 runs ( 233.14 ms per token, 4.29 tokens per second)
llama_print_timings: total time = 24122.46 ms / 109 tokens
Log end
Run model llama2-7B.q4, generation length 200:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 3410.94 ms
llama_print_timings: sample time = 8.64 ms / 200 runs ( 0.04 ms per token, 23153.51 tokens per second)
llama_print_timings: prompt eval time = 1027.37 ms / 10 tokens ( 102.74 ms per token, 9.73 tokens per second)
llama_print_timings: eval time = 46620.34 ms / 199 runs ( 234.27 ms per token, 4.27 tokens per second)
llama_print_timings: total time = 47674.32 ms / 209 tokens
Run model llama2-7B.q4, generation length 500:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 3389.70 ms
llama_print_timings: sample time = 21.42 ms / 500 runs ( 0.04 ms per token, 23339.40 tokens per second)
llama_print_timings: prompt eval time = 1026.09 ms / 10 tokens ( 102.61 ms per token, 9.75 tokens per second)
llama_print_timings: eval time = 118409.44 ms / 499 runs ( 237.29 ms per token, 4.21 tokens per second)
llama_print_timings: total time = 119502.95 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000:
> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 3362.42 ms
llama_print_timings: sample time = 43.25 ms / 1000 runs ( 0.04 ms per token, 23120.85 tokens per second)
llama_print_timings: prompt eval time = 1027.78 ms / 10 tokens ( 102.78 ms per token, 9.73 tokens per second)
llama_print_timings: eval time = 242531.02 ms / 999 runs ( 242.77 ms per token, 4.12 tokens per second)
llama_print_timings: total time = 243694.80 ms / 1009 tokens
Run model qwen2-7B.q8, generation length 100:
> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724490279
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted)
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 552.23 MiB
llm_load_tensors: AMD Radeon Graphics (RADV RENOIR) buffer size = 7165.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon Graphics (RADV RENOIR) KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.58 MiB
llama_new_context_with_model: AMD Radeon Graphics (RADV RENOIR) compute buffer size = 1884.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 71.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story, but I'm going to do my best to explain in a concise manner:
(part of the output omitted)
llama_print_timings: load time = 8781.85 ms
llama_print_timings: sample time = 9.19 ms / 100 runs ( 0.09 ms per token, 10880.21 tokens per second)
llama_print_timings: prompt eval time = 913.76 ms / 9 tokens ( 101.53 ms per token, 9.85 tokens per second)
llama_print_timings: eval time = 34897.82 ms / 99 runs ( 352.50 ms per token, 2.84 tokens per second)
llama_print_timings: total time = 35889.02 ms / 108 tokens
Log end
Run model qwen2-7B.q8, generation length 200:
> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 8249.67 ms
llama_print_timings: sample time = 17.88 ms / 200 runs ( 0.09 ms per token, 11185.68 tokens per second)
llama_print_timings: prompt eval time = 909.22 ms / 9 tokens ( 101.02 ms per token, 9.90 tokens per second)
llama_print_timings: eval time = 70426.63 ms / 199 runs ( 353.90 ms per token, 2.83 tokens per second)
llama_print_timings: total time = 71489.45 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500:
> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 6014.76 ms
llama_print_timings: sample time = 46.23 ms / 500 runs ( 0.09 ms per token, 10815.96 tokens per second)
llama_print_timings: prompt eval time = 916.14 ms / 9 tokens ( 101.79 ms per token, 9.82 tokens per second)
llama_print_timings: eval time = 177508.81 ms / 499 runs ( 355.73 ms per token, 2.81 tokens per second)
llama_print_timings: total time = 178809.12 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000:
> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 6662.38 ms
llama_print_timings: sample time = 89.55 ms / 1000 runs ( 0.09 ms per token, 11167.57 tokens per second)
llama_print_timings: prompt eval time = 916.79 ms / 9 tokens ( 101.87 ms per token, 9.82 tokens per second)
llama_print_timings: eval time = 358831.15 ms / 999 runs ( 359.19 ms per token, 2.78 tokens per second)
llama_print_timings: total time = 360504.90 ms / 1008 tokens
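In the Vega 7 runs above, the eval latency per token creeps up as the generation gets longer. The sketch below quantifies that drift from the logged "ms per token" values. `slowdown_percent` is a hypothetical helper. The logs do not show the cause; a likely explanation, stated here as an assumption, is that attention has to read a larger KV cache as the context fills up.

```python
# Eval latency in ms per token, keyed by generation length (-n),
# copied from the Vega 7 (Vulkan) timing lines above.
ms_per_token = {
    "llama2-7B.q4": {100: 233.14, 200: 234.27, 500: 237.29, 1000: 242.77},
    "qwen2-7B.q8": {100: 352.50, 200: 353.90, 500: 355.73, 1000: 359.19},
}


def slowdown_percent(series):
    """Relative latency increase from the shortest to the longest run, in percent."""
    lengths = sorted(series)
    first = series[lengths[0]]
    last = series[lengths[-1]]
    return (last / first - 1.0) * 100.0


for model, series in ms_per_token.items():
    print(f"{model}: {slowdown_percent(series):.1f}% slower per token at -n 1000 than at -n 100")
```

For both models the increase stays in the low single digits, so within this range the generation length has only a small effect on the measured speed.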
3.6 dGPU (A770) vulkan
Running on No. 6 (virtual machine). Version:
a2@a2s:~$ ./llama-cli-vulkan-b3617 --version
version: 1 (a07c32e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Run model llama2-7B.q4, generation length 100:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724492722
(part of the output omitted)
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: Intel(R) Arc(tm) A770 Graphics (DG2) buffer size = 3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Arc(tm) A770 Graphics (DG2) KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.12 MiB
llama_new_context_with_model: Intel(R) Arc(tm) A770 Graphics (DG2) compute buffer size = 296.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, and you can ignore most of it, i'm just putting it here for reference in case anyone has a similar issue.
(part of the output omitted)
llama_print_timings: load time = 2274.09 ms
llama_print_timings: sample time = 4.14 ms / 100 runs ( 0.04 ms per token, 24148.76 tokens per second)
llama_print_timings: prompt eval time = 440.70 ms / 10 tokens ( 44.07 ms per token, 22.69 tokens per second)
llama_print_timings: eval time = 3809.51 ms / 99 runs ( 38.48 ms per token, 25.99 tokens per second)
llama_print_timings: total time = 4262.46 ms / 109 tokens
Log end
Run model llama2-7B.q4, generation length 200:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 2308.60 ms
llama_print_timings: sample time = 8.50 ms / 200 runs ( 0.04 ms per token, 23518.34 tokens per second)
llama_print_timings: prompt eval time = 441.26 ms / 10 tokens ( 44.13 ms per token, 22.66 tokens per second)
llama_print_timings: eval time = 7704.86 ms / 199 runs ( 38.72 ms per token, 25.83 tokens per second)
llama_print_timings: total time = 8171.87 ms / 209 tokens
Run model llama2-7B.q4, generation length 500:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 2296.68 ms
llama_print_timings: sample time = 21.31 ms / 500 runs ( 0.04 ms per token, 23460.96 tokens per second)
llama_print_timings: prompt eval time = 440.77 ms / 10 tokens ( 44.08 ms per token, 22.69 tokens per second)
llama_print_timings: eval time = 19597.74 ms / 499 runs ( 39.27 ms per token, 25.46 tokens per second)
llama_print_timings: total time = 20102.66 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 2273.46 ms
llama_print_timings: sample time = 42.10 ms / 1000 runs ( 0.04 ms per token, 23751.84 tokens per second)
llama_print_timings: prompt eval time = 441.47 ms / 10 tokens ( 44.15 ms per token, 22.65 tokens per second)
llama_print_timings: eval time = 40262.07 ms / 999 runs ( 40.30 ms per token, 24.81 tokens per second)
llama_print_timings: total time = 40827.46 ms / 1009 tokens
Run model qwen2-7B.q8, generation length 100:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1724493121
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted)
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 552.23 MiB
llm_load_tensors: Intel(R) Arc(tm) A770 Graphics (DG2) buffer size = 7165.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Arc(tm) A770 Graphics (DG2) KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.58 MiB
llama_new_context_with_model: Intel(R) Arc(tm) A770 Graphics (DG2) compute buffer size = 1884.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 71.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story with multiple characters, but i will try to write it in a way that makes it easy to follow.
(part of the output omitted)
llama_print_timings: load time = 8202.05 ms
llama_print_timings: sample time = 10.16 ms / 100 runs ( 0.10 ms per token, 9839.61 tokens per second)
llama_print_timings: prompt eval time = 587.73 ms / 9 tokens ( 65.30 ms per token, 15.31 tokens per second)
llama_print_timings: eval time = 4755.44 ms / 99 runs ( 48.03 ms per token, 20.82 tokens per second)
llama_print_timings: total time = 5460.46 ms / 108 tokens
Log end
Run model qwen2-7B.q8, generation length 200:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 6642.05 ms
llama_print_timings: sample time = 19.91 ms / 200 runs ( 0.10 ms per token, 10043.19 tokens per second)
llama_print_timings: prompt eval time = 587.07 ms / 9 tokens ( 65.23 ms per token, 15.33 tokens per second)
llama_print_timings: eval time = 9581.81 ms / 199 runs ( 48.15 ms per token, 20.77 tokens per second)
llama_print_timings: total time = 10348.91 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 6756.91 ms
llama_print_timings: sample time = 51.43 ms / 500 runs ( 0.10 ms per token, 9722.33 tokens per second)
llama_print_timings: prompt eval time = 588.10 ms / 9 tokens ( 65.34 ms per token, 15.30 tokens per second)
llama_print_timings: eval time = 24196.44 ms / 499 runs ( 48.49 ms per token, 20.62 tokens per second)
llama_print_timings: total time = 25212.38 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000:
a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part of the output omitted)
llama_print_timings: load time = 6664.69 ms
llama_print_timings: sample time = 92.37 ms / 1000 runs ( 0.09 ms per token, 10825.91 tokens per second)
llama_print_timings: prompt eval time = 586.92 ms / 9 tokens ( 65.21 ms per token, 15.33 tokens per second)
llama_print_timings: eval time = 48610.18 ms / 999 runs ( 48.66 ms per token, 20.55 tokens per second)
llama_print_timings: total time = 49939.72 ms / 1008 tokens
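To put the three Vulkan devices from sections 3.4 to 3.6 side by side, the sketch below collects the eval speeds at generation length 100 and computes speedup ratios. The numbers are copied from the logs above. The dictionary layout and the `speedup` helper are only illustrative.

```python
# Eval speed in tokens per second at generation length 100 (Vulkan backend),
# copied from the logs in sections 3.4 to 3.6.
# None marks the run that failed with ErrorOutOfDeviceMemory.
eval_tps = {
    "HD520": {"llama2-7B.q4": 0.95, "qwen2-7B.q8": None},
    "Vega 7": {"llama2-7B.q4": 4.29, "qwen2-7B.q8": 2.84},
    "A770": {"llama2-7B.q4": 25.99, "qwen2-7B.q8": 20.82},
}


def speedup(model, device, baseline):
    """How many times faster `device` is than `baseline`; None if either run is missing."""
    fast = eval_tps[device][model]
    slow = eval_tps[baseline][model]
    if fast is None or slow is None:
        return None
    return fast / slow


for model in ("llama2-7B.q4", "qwen2-7B.q8"):
    for baseline in ("HD520", "Vega 7"):
        ratio = speedup(model, "A770", baseline)
        text = "n/a (run failed)" if ratio is None else f"{ratio:.1f}x"
        print(f"{model}: A770 vs {baseline}: {text}")
```

With these numbers the A770 is roughly 6 to 7 times faster than the Vega 7 iGPU, and about 27 times faster than the HD 520 on llama2-7B.q4. Note that the A770 runs were made inside a virtual machine, so the comparison is not perfectly like for like.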
3.7 dGPU (A770) SYCL
Running on No. 6 (virtual machine). Preparation:
a2@a2s:~$ source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
   -bash: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments:
:: ccl -- latest
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
a2@a2s:~$ export ZES_ENABLE_SYSMAN=1
a2@a2s:~$ export USE_XETLA=OFF
a2@a2s:~$ export SYCL_CACHE_PERSISTENT=1
a2@a2s:~$ export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
a2@a2s:~$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 5 5600G with Radeon Graphics OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.22.29735.27]
[level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.29735]
Version:
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 --version
version: 1 (a07c32e)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 --version
version: 1 (a07c32e)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
Run model llama2-7B.q4, generation length 100 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed = 1724493845
(part of the output omitted)
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 3820.94 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.29735|
llama_kv_cache_init: SYCL0 KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.12 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 296.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, but i promise it's worth it.
(part of the output omitted here)
llama_print_timings: load time = 2066.71 ms
llama_print_timings: sample time = 2.90 ms / 100 runs ( 0.03 ms per token, 34542.31 tokens per second)
llama_print_timings: prompt eval time = 180.84 ms / 10 tokens ( 18.08 ms per token, 55.30 tokens per second)
llama_print_timings: eval time = 2852.87 ms / 99 runs ( 28.82 ms per token, 34.70 tokens per second)
llama_print_timings: total time = 3044.63 ms / 109 tokens
Log end
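The per-token and tokens-per-second figures in the timing block can be recomputed from the elapsed time and the run count on the same line. The following is a minimal sketch of that arithmetic; the helper names `tokens_per_second` and `ms_per_token` are made up for illustration and are not part of llama.cpp.

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput in tokens per second for n_tokens processed in elapsed_ms."""
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms: float, n_tokens: int) -> float:
    """Average latency per token in milliseconds."""
    return elapsed_ms / n_tokens


# Values copied from the f32 run above (llama2-7B.q4, -n 100).
prompt_tps = tokens_per_second(180.84, 10)   # prompt eval: 10 tokens
eval_tps = tokens_per_second(2852.87, 99)    # eval: 99 runs
eval_latency = ms_per_token(2852.87, 99)

print(f"prompt eval: {prompt_tps:.2f} tokens/s")
print(f"eval:        {eval_tps:.2f} tokens/s ({eval_latency:.2f} ms per token)")
```

The "eval" line is the one that reflects generation speed; "prompt eval" covers only the 10 prompt tokens, so it is a small and noisy sample.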
Run model llama2-7B.q4, generation length 200 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings: load time = 2040.98 ms
llama_print_timings: sample time = 5.98 ms / 200 runs ( 0.03 ms per token, 33450.41 tokens per second)
llama_print_timings: prompt eval time = 179.29 ms / 10 tokens ( 17.93 ms per token, 55.78 tokens per second)
llama_print_timings: eval time = 5765.54 ms / 199 runs ( 28.97 ms per token, 34.52 tokens per second)
llama_print_timings: total time = 5968.10 ms / 209 tokens
Run model llama2-7B.q4, generation length 500 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings: load time = 1994.74 ms
llama_print_timings: sample time = 15.04 ms / 500 runs ( 0.03 ms per token, 33246.89 tokens per second)
llama_print_timings: prompt eval time = 177.09 ms / 10 tokens ( 17.71 ms per token, 56.47 tokens per second)
llama_print_timings: eval time = 14675.46 ms / 499 runs ( 29.41 ms per token, 34.00 tokens per second)
llama_print_timings: total time = 14911.41 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings: load time = 2071.28 ms
llama_print_timings: sample time = 28.05 ms / 1000 runs ( 0.03 ms per token, 35646.81 tokens per second)
llama_print_timings: prompt eval time = 178.45 ms / 10 tokens ( 17.85 ms per token, 56.04 tokens per second)
llama_print_timings: eval time = 30044.60 ms / 999 runs ( 30.07 ms per token, 33.25 tokens per second)
llama_print_timings: total time = 30329.49 ms / 1009 tokens
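The four f32 runs above show generation speed dropping slightly as the output gets longer. A small sketch that quantifies the trend from the raw eval times (the `runs` list and the `eval_tps` helper are illustrative names, with values copied from the logs):

```python
# (requested -n, eval time in ms, eval runs), copied from the four f32 runs above.
runs = [
    (100, 2852.87, 99),
    (200, 5765.54, 199),
    (500, 14675.46, 499),
    (1000, 30044.60, 999),
]


def eval_tps(elapsed_ms: float, n_runs: int) -> float:
    """Generation throughput in tokens per second."""
    return n_runs / (elapsed_ms / 1000.0)


# Map each requested length to its measured generation speed.
speeds = {n: eval_tps(ms, r) for n, ms, r in runs}
baseline = speeds[100]

for n, tps in speeds.items():
    change_pct = (tps / baseline - 1.0) * 100.0
    print(f"-n {n:>4}: {tps:6.2f} tokens/s ({change_pct:+.1f}% vs -n 100)")

# Relative slowdown between the shortest and the longest run, in percent.
slowdown_pct = (1.0 - speeds[1000] / speeds[100]) * 100.0
```

On these numbers the slowdown from 100 to 1000 generated tokens is about 4%, which fits the idea that each new token attends over a longer context. That explanation is an interpretation, not something the log states.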
Run model qwen2-7B.q8, generation length 100 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed = 1724494148
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted here)
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: SYCL0 buffer size = 7165.44 MiB
llm_load_tensors: CPU buffer size = 552.23 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.29735|
llama_kv_cache_init: SYCL0 KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.58 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 1884.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 71.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story and I would like to ask for help.
(part of the output omitted here)
llama_print_timings: load time = 9055.51 ms
llama_print_timings: sample time = 8.22 ms / 100 runs ( 0.08 ms per token, 12158.05 tokens per second)
llama_print_timings: prompt eval time = 395.27 ms / 9 tokens ( 43.92 ms per token, 22.77 tokens per second)
llama_print_timings: eval time = 5195.18 ms / 99 runs ( 52.48 ms per token, 19.06 tokens per second)
llama_print_timings: total time = 5679.84 ms / 108 tokens
Log end
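The "KV self size" lines in the two logs (2048.00 MiB for llama2-7B, 1792.00 MiB for qwen2-7B) can be reproduced with the usual KV cache size formula: two tensors (K and V), each holding n_ctx × n_layer × n_kv_head × head_dim f16 values. The sketch below is an illustration under stated assumptions. The Llama 2 figures come from the model metadata printed earlier in this article (32 blocks, 32 KV heads, 4096 / 32 = 128 per head). For Qwen2-7B, the 28 layers appear in the log above, but 4 KV heads and a head dimension of 128 are assumed here because that part of the log was omitted.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_ctx: int, n_layer: int, n_kv_head: int, head_dim: int,
                 bytes_per_elem: int = 2) -> float:
    """Size of the K and V caches together, in MiB (f16 = 2 bytes per element)."""
    per_tensor_bytes = n_ctx * n_layer * n_kv_head * head_dim * bytes_per_elem
    return 2 * per_tensor_bytes / MIB  # one tensor for K, one for V


# Llama 2 7B: values from the GGUF metadata and n_ctx = 4096 in the log.
llama2_mib = kv_cache_mib(n_ctx=4096, n_layer=32, n_kv_head=32, head_dim=128)

# Qwen2 7B: n_ctx = 32768 and 28 layers from the log; n_kv_head and head_dim assumed.
qwen2_mib = kv_cache_mib(n_ctx=32768, n_layer=28, n_kv_head=4, head_dim=128)

print(f"llama2-7B KV cache: {llama2_mib:.2f} MiB")
print(f"qwen2-7B  KV cache: {qwen2_mib:.2f} MiB")
```

Both results match the logged values. Under these assumptions, Qwen2 runs with an 8 times larger context yet needs a slightly smaller cache, because it keeps far fewer KV heads per layer.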
Run model qwen2-7B.q8, generation length 200 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings: load time = 8413.38 ms
llama_print_timings: sample time = 16.47 ms / 200 runs ( 0.08 ms per token, 12141.08 tokens per second)
llama_print_timings: prompt eval time = 405.85 ms / 9 tokens ( 45.09 ms per token, 22.18 tokens per second)
llama_print_timings: eval time = 10455.78 ms / 199 runs ( 52.54 ms per token, 19.03 tokens per second)
llama_print_timings: total time = 11017.44 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings: load time = 9179.45 ms
llama_print_timings: sample time = 47.42 ms / 500 runs ( 0.09 ms per token, 10544.74 tokens per second)
llama_print_timings: prompt eval time = 402.42 ms / 9 tokens ( 44.71 ms per token, 22.36 tokens per second)
llama_print_timings: eval time = 26367.77 ms / 499 runs ( 52.84 ms per token, 18.92 tokens per second)
llama_print_timings: total time = 27130.93 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000 (f32):
a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings: load time = 9531.60 ms
llama_print_timings: sample time = 96.63 ms / 1000 runs ( 0.10 ms per token, 10348.86 tokens per second)
llama_print_timings: prompt eval time = 401.50 ms / 9 tokens ( 44.61 ms per token, 22.42 tokens per second)
llama_print_timings: eval time = 53212.71 ms / 999 runs ( 53.27 ms per token, 18.77 tokens per second)
llama_print_timings: total time = 54321.34 ms / 1008 tokens
Run model llama2-7B.q4, generation length 100 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed = 1724494475
(part of the output omitted here)
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 3820.94 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.29735|
llama_kv_cache_init: SYCL0 KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.12 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 296.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, and I hope you will read it, because I need to tell you something.
(part of the output omitted here)
llama_print_timings: load time = 1866.40 ms
llama_print_timings: sample time = 3.23 ms / 100 runs ( 0.03 ms per token, 30998.14 tokens per second)
llama_print_timings: prompt eval time = 187.70 ms / 10 tokens ( 18.77 ms per token, 53.28 tokens per second)
llama_print_timings: eval time = 2873.84 ms / 99 runs ( 29.03 ms per token, 34.45 tokens per second)
llama_print_timings: total time = 3074.08 ms / 109 tokens
Log end
Run model llama2-7B.q4, generation length 200 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings: load time = 1867.46 ms
llama_print_timings: sample time = 5.99 ms / 200 runs ( 0.03 ms per token, 33411.29 tokens per second)
llama_print_timings: prompt eval time = 194.39 ms / 10 tokens ( 19.44 ms per token, 51.44 tokens per second)
llama_print_timings: eval time = 5783.95 ms / 199 runs ( 29.07 ms per token, 34.41 tokens per second)
llama_print_timings: total time = 6003.07 ms / 209 tokens
Run model llama2-7B.q4, generation length 500 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings: load time = 1909.92 ms
llama_print_timings: sample time = 15.56 ms / 500 runs ( 0.03 ms per token, 32123.35 tokens per second)
llama_print_timings: prompt eval time = 186.10 ms / 10 tokens ( 18.61 ms per token, 53.73 tokens per second)
llama_print_timings: eval time = 14680.81 ms / 499 runs ( 29.42 ms per token, 33.99 tokens per second)
llama_print_timings: total time = 14925.64 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings: load time = 2017.43 ms
llama_print_timings: sample time = 13.53 ms / 461 runs ( 0.03 ms per token, 34067.40 tokens per second)
llama_print_timings: prompt eval time = 189.74 ms / 10 tokens ( 18.97 ms per token, 52.70 tokens per second)
llama_print_timings: eval time = 13480.19 ms / 460 runs ( 29.30 ms per token, 34.12 tokens per second)
llama_print_timings: total time = 13722.36 ms / 470 tokens
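The f16 run above was started with -n 1000, yet its timing block reports only 461 sample runs, so this data point is not directly comparable with the other 1000-token runs. The excerpt does not show why generation stopped; an end-of-sequence token is one plausible cause, stated here as an assumption. The sketch below parses a timing block with a regular expression and flags such short runs. The pattern is written against the log lines shown in this article, and `parse_timings` is a hypothetical helper, not a llama.cpp API.

```python
import re

# Matches lines such as:
#   llama_print_timings: eval time = 13480.19 ms / 460 runs ( ... )
# The "/ N runs|tokens" part is optional because the "load time" line has no count.
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Return {metric name: {"ms": float, "count": int or None}} for one log."""
    result = {}
    for match in TIMING_RE.finditer(log):
        count = match.group("count")
        result[match.group("name")] = {
            "ms": float(match.group("ms")),
            "count": int(count) if count is not None else None,
        }
    return result


# Timing block copied from the f16 run above (requested with -n 1000).
log = """\
llama_print_timings: load time = 2017.43 ms
llama_print_timings: sample time = 13.53 ms / 461 runs ( 0.03 ms per token, 34067.40 tokens per second)
llama_print_timings: prompt eval time = 189.74 ms / 10 tokens ( 18.97 ms per token, 52.70 tokens per second)
llama_print_timings: eval time = 13480.19 ms / 460 runs ( 29.30 ms per token, 34.12 tokens per second)
llama_print_timings: total time = 13722.36 ms / 470 tokens
"""

requested = 1000
timings = parse_timings(log)
generated = timings["sample"]["count"]
stopped_early = generated < requested

print(f"generated {generated} of {requested} requested tokens; "
      f"stopped early: {stopped_early}")
```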
Run model qwen2-7B.q8, generation length 100 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed = 1724494717
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted here)
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: SYCL0 buffer size = 7165.44 MiB
llm_load_tensors: CPU buffer size = 552.23 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Arc A770 Graphics| 1.3| 512| 1024| 32| 16225M| 1.3.29735|
llama_kv_cache_init: SYCL0 KV buffer size = 1792.00 MiB
llama_new_context_with_model: KV self size = 1792.00 MiB, K (f16): 896.00 MiB, V (f16): 896.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.58 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 1884.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 71.01 MiB
llama_new_context_with_model: graph nodes = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story but I will try to make it as short as possible.
(part of the output omitted here)
llama_print_timings: load time = 8893.71 ms
llama_print_timings: sample time = 9.80 ms / 100 runs ( 0.10 ms per token, 10204.08 tokens per second)
llama_print_timings: prompt eval time = 295.81 ms / 9 tokens ( 32.87 ms per token, 30.42 tokens per second)
llama_print_timings: eval time = 5931.59 ms / 99 runs ( 59.91 ms per token, 16.69 tokens per second)
llama_print_timings: total time = 6305.29 ms / 108 tokens
Log end
Run model qwen2-7B.q8, generation length 200 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings: load time = 8474.98 ms
llama_print_timings: sample time = 18.22 ms / 200 runs ( 0.09 ms per token, 10978.15 tokens per second)
llama_print_timings: prompt eval time = 298.85 ms / 9 tokens ( 33.21 ms per token, 30.12 tokens per second)
llama_print_timings: eval time = 11935.47 ms / 199 runs ( 59.98 ms per token, 16.67 tokens per second)
llama_print_timings: total time = 12379.13 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings: load time = 8836.66 ms
llama_print_timings: sample time = 41.76 ms / 500 runs ( 0.08 ms per token, 11972.32 tokens per second)
llama_print_timings: prompt eval time = 304.28 ms / 9 tokens ( 33.81 ms per token, 29.58 tokens per second)
llama_print_timings: eval time = 30052.85 ms / 499 runs ( 60.23 ms per token, 16.60 tokens per second)
llama_print_timings: total time = 30722.98 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000 (f16):
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings: load time = 8206.19 ms
llama_print_timings: sample time = 96.05 ms / 1000 runs ( 0.10 ms per token, 10411.24 tokens per second)
llama_print_timings: prompt eval time = 312.47 ms / 9 tokens ( 34.72 ms per token, 28.80 tokens per second)
llama_print_timings: eval time = 60716.89 ms / 999 runs ( 60.78 ms per token, 16.45 tokens per second)
llama_print_timings: total time = 61768.29 ms / 1008 tokens
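For qwen2-7B.q8 the f16 build and the f32 build pull in opposite directions: prompt processing is faster with f16, while token generation is slower. A short sketch comparing the two -n 100 runs, with times copied from the logs above (the dictionary layout and variable names are only illustrative):

```python
# (prompt eval ms, prompt tokens, eval ms, eval runs) for qwen2-7B.q8, -n 100.
builds = {
    "f32": (395.27, 9, 5195.18, 99),
    "f16": (295.81, 9, 5931.59, 99),
}


def tps(elapsed_ms: float, n_tokens: int) -> float:
    """Tokens per second for n_tokens processed in elapsed_ms."""
    return n_tokens / (elapsed_ms / 1000.0)


# Convert each build's raw timings into prompt and generation throughput.
speed = {
    name: {"prompt": tps(p_ms, p_n), "eval": tps(e_ms, e_n)}
    for name, (p_ms, p_n, e_ms, e_n) in builds.items()
}

# Ratios above 1.0 mean the f16 build is faster than the f32 build.
prompt_ratio = speed["f16"]["prompt"] / speed["f32"]["prompt"]
eval_ratio = speed["f16"]["eval"] / speed["f32"]["eval"]

print(f"prompt eval, f16 / f32: {prompt_ratio:.2f}x")
print(f"generation,  f16 / f32: {eval_ratio:.2f}x")
```

With these numbers f16 is roughly a third faster on the prompt and roughly 12% slower on generation. The prompt here is only 9 tokens, so the prompt-side figure should be treated with caution.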
3.8 Windows (CPU) r5-5600g AVX2
Run on PC No. 6 (physical machine). Version:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64
Run model llama2-7B.q4, generation length 100:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1724480697
llama_print_timings: load time = 1005.41 ms
llama_print_timings: sample time = 4.11 ms / 100 runs ( 0.04 ms per token, 24354.60 tokens per second)
llama_print_timings: prompt eval time = 399.08 ms / 10 tokens ( 39.91 ms per token, 25.06 tokens per second)
llama_print_timings: eval time = 9688.39 ms / 99 runs ( 97.86 ms per token, 10.22 tokens per second)
llama_print_timings: total time = 10110.42 ms / 109 tokens
Run model llama2-7B.q4, generation length 200:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
llama_print_timings: load time = 1045.93 ms
llama_print_timings: sample time = 8.82 ms / 200 runs ( 0.04 ms per token, 22673.17 tokens per second)
llama_print_timings: prompt eval time = 436.84 ms / 10 tokens ( 43.68 ms per token, 22.89 tokens per second)
llama_print_timings: eval time = 19960.35 ms / 199 runs ( 100.30 ms per token, 9.97 tokens per second)
llama_print_timings: total time = 20439.79 ms / 209 tokens
Run model llama2-7B.q4, generation length 500:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
llama_print_timings: load time = 1028.02 ms
llama_print_timings: sample time = 18.32 ms / 500 runs ( 0.04 ms per token, 27300.03 tokens per second)
llama_print_timings: prompt eval time = 382.15 ms / 10 tokens ( 38.22 ms per token, 26.17 tokens per second)
llama_print_timings: eval time = 51622.99 ms / 499 runs ( 103.45 ms per token, 9.67 tokens per second)
llama_print_timings: total time = 52107.10 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
llama_print_timings: load time = 1241.78 ms
llama_print_timings: sample time = 41.52 ms / 1000 runs ( 0.04 ms per token, 24084.78 tokens per second)
llama_print_timings: prompt eval time = 484.10 ms / 10 tokens ( 48.41 ms per token, 20.66 tokens per second)
llama_print_timings: eval time = 114393.05 ms / 999 runs ( 114.51 ms per token, 8.73 tokens per second)
llama_print_timings: total time = 115084.29 ms / 1009 tokens
Run model qwen2-7B.q8, generation length 100:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
llama_print_timings: load time = 1429.29 ms
llama_print_timings: sample time = 15.21 ms / 100 runs ( 0.15 ms per token, 6572.89 tokens per second)
llama_print_timings: prompt eval time = 523.07 ms / 9 tokens ( 58.12 ms per token, 17.21 tokens per second)
llama_print_timings: eval time = 17786.69 ms / 99 runs ( 179.66 ms per token, 5.57 tokens per second)
llama_print_timings: total time = 18409.82 ms / 108 tokens
Run model qwen2-7B.q8, generation length 200:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
llama_print_timings: load time = 1424.62 ms
llama_print_timings: sample time = 31.78 ms / 200 runs ( 0.16 ms per token, 6292.47 tokens per second)
llama_print_timings: prompt eval time = 564.79 ms / 9 tokens ( 62.75 ms per token, 15.93 tokens per second)
llama_print_timings: eval time = 36148.33 ms / 199 runs ( 181.65 ms per token, 5.51 tokens per second)
llama_print_timings: total time = 36919.37 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
llama_print_timings: load time = 1462.26 ms
llama_print_timings: sample time = 80.31 ms / 500 runs ( 0.16 ms per token, 6225.64 tokens per second)
llama_print_timings: prompt eval time = 720.86 ms / 9 tokens ( 80.10 ms per token, 12.49 tokens per second)
llama_print_timings: eval time = 90566.92 ms / 499 runs ( 181.50 ms per token, 5.51 tokens per second)
llama_print_timings: total time = 91801.55 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
llama_print_timings: load time = 1439.21 ms
llama_print_timings: sample time = 165.06 ms / 1000 runs ( 0.17 ms per token, 6058.48 tokens per second)
llama_print_timings: prompt eval time = 555.15 ms / 9 tokens ( 61.68 ms per token, 16.21 tokens per second)
llama_print_timings: eval time = 184706.64 ms / 999 runs ( 184.89 ms per token, 5.41 tokens per second)
llama_print_timings: total time = 186313.82 ms / 1008 tokens
3.9 Windows (GPU) A770 vulkan
Run on PC No. 6 (physical machine). Version:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64
Run model llama2-7B.q4, generation length 100:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1724482103
llama_print_timings: load time = 3375.14 ms
llama_print_timings: sample time = 4.04 ms / 100 runs ( 0.04 ms per token, 24764.74 tokens per second)
llama_print_timings: prompt eval time = 471.87 ms / 10 tokens ( 47.19 ms per token, 21.19 tokens per second)
llama_print_timings: eval time = 5913.11 ms / 99 runs ( 59.73 ms per token, 16.74 tokens per second)
llama_print_timings: total time = 6408.49 ms / 109 tokens
Run model llama2-7B.q4, generation length 200:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
llama_print_timings: load time = 2932.55 ms
llama_print_timings: sample time = 8.03 ms / 200 runs ( 0.04 ms per token, 24915.91 tokens per second)
llama_print_timings: prompt eval time = 471.34 ms / 10 tokens ( 47.13 ms per token, 21.22 tokens per second)
llama_print_timings: eval time = 11931.98 ms / 199 runs ( 59.96 ms per token, 16.68 tokens per second)
llama_print_timings: total time = 12452.04 ms / 209 tokens
Run model llama2-7B.q4, generation length 500:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
llama_print_timings: load time = 2913.84 ms
llama_print_timings: sample time = 19.84 ms / 500 runs ( 0.04 ms per token, 25204.15 tokens per second)
llama_print_timings: prompt eval time = 471.64 ms / 10 tokens ( 47.16 ms per token, 21.20 tokens per second)
llama_print_timings: eval time = 30253.41 ms / 499 runs ( 60.63 ms per token, 16.49 tokens per second)
llama_print_timings: total time = 30844.12 ms / 509 tokens
Run model llama2-7B.q4, generation length 1000:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
llama_print_timings: load time = 2909.30 ms
llama_print_timings: sample time = 40.91 ms / 1000 runs ( 0.04 ms per token, 24443.90 tokens per second)
llama_print_timings: prompt eval time = 471.58 ms / 10 tokens ( 47.16 ms per token, 21.21 tokens per second)
llama_print_timings: eval time = 61725.41 ms / 999 runs ( 61.79 ms per token, 16.18 tokens per second)
llama_print_timings: total time = 62433.39 ms / 1009 tokens
Run model qwen2-7B.q8, generation length 100:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
llama_print_timings: load time = 4785.92 ms
llama_print_timings: sample time = 9.08 ms / 100 runs ( 0.09 ms per token, 11016.86 tokens per second)
llama_print_timings: prompt eval time = 609.77 ms / 9 tokens ( 67.75 ms per token, 14.76 tokens per second)
llama_print_timings: eval time = 6401.98 ms / 99 runs ( 64.67 ms per token, 15.46 tokens per second)
llama_print_timings: total time = 7100.18 ms / 108 tokens
Run model qwen2-7B.q8, generation length 200:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
llama_print_timings: load time = 4783.54 ms
llama_print_timings: sample time = 18.63 ms / 200 runs ( 0.09 ms per token, 10735.37 tokens per second)
llama_print_timings: prompt eval time = 610.60 ms / 9 tokens ( 67.84 ms per token, 14.74 tokens per second)
llama_print_timings: eval time = 12910.01 ms / 199 runs ( 64.87 ms per token, 15.41 tokens per second)
llama_print_timings: total time = 13698.94 ms / 208 tokens
Run model qwen2-7B.q8, generation length 500:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
llama_print_timings: load time = 4798.07 ms
llama_print_timings: sample time = 46.32 ms / 500 runs ( 0.09 ms per token, 10794.47 tokens per second)
llama_print_timings: prompt eval time = 610.28 ms / 9 tokens ( 67.81 ms per token, 14.75 tokens per second)
llama_print_timings: eval time = 32517.07 ms / 499 runs ( 65.16 ms per token, 15.35 tokens per second)
llama_print_timings: total time = 33565.60 ms / 508 tokens
Run model qwen2-7B.q8, generation length 1000:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
llama_print_timings: load time = 4802.01 ms
llama_print_timings: sample time = 93.21 ms / 989 runs ( 0.09 ms per token, 10610.22 tokens per second)
llama_print_timings: prompt eval time = 610.76 ms / 9 tokens ( 67.86 ms per token, 14.74 tokens per second)
llama_print_timings: eval time = 64868.89 ms / 988 runs ( 65.66 ms per token, 15.23 tokens per second)
llama_print_timings: total time = 66351.20 ms / 997 tokens
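To put the backends in this part side by side, the sketch below ranks the llama2-7B.q4, -n 100 runs by generation speed and expresses each as a speedup over the Windows CPU run. The eval times are copied from the logs above and the labels are informal. Note that the SYCL runs were made on the Linux host a2s while the CPU and Vulkan runs were made on Windows, so the comparison is indicative rather than strictly like for like.

```python
# Eval time in ms for 99 generated tokens (llama2-7B.q4, -n 100), from the logs above.
EVAL_RUNS = 99
eval_ms = {
    "SYCL f32 (A770, Linux)": 2852.87,
    "SYCL f16 (A770, Linux)": 2873.84,
    "Vulkan (A770, Windows)": 5913.11,
    "CPU AVX2 (r5-5600g, Windows)": 9688.39,
}

# Generation throughput per backend, in tokens per second.
tps = {name: EVAL_RUNS / (ms / 1000.0) for name, ms in eval_ms.items()}
cpu_tps = tps["CPU AVX2 (r5-5600g, Windows)"]

# Speedup of each backend relative to the CPU-only run.
speedup = {name: value / cpu_tps for name, value in tps.items()}

# Print the backends from fastest to slowest.
for name in sorted(tps, key=tps.get, reverse=True):
    print(f"{name:<30} {tps[name]:6.2f} tokens/s  {speedup[name]:.2f}x vs CPU")
```

On these figures the SYCL builds generate a little over three times faster than the CPU, and roughly twice as fast as the Vulkan build on the same GPU.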
(To be continued)