(Section 3.1) How slow is running AI locally? LLM inference speed test (llama.cpp, Intel GPU A770)

Because this article is very long, it is published in separate parts for easier reading.


3.1 CPU (i5-6200U, 2C/4T/2.8GHz) x86_64 AVX2

Run on PC No. 4 (physical machine). Version:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli --version
version: 3617 (a07c32ea)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Running model llama2-7B.q4, generation length 100:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724500181
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story. nobody wants to read this much. just tell me what happened.
(part omitted)
llama_print_timings:        load time =    2666.87 ms
llama_print_timings:      sample time =       5.38 ms /   100 runs   (    0.05 ms per token, 18580.45 tokens per second)
llama_print_timings: prompt eval time =    1898.40 ms /    10 tokens (  189.84 ms per token,     5.27 tokens per second)
llama_print_timings:        eval time =   28113.06 ms /    99 runs   (  283.97 ms per token,     3.52 tokens per second)
llama_print_timings:       total time =   30034.85 ms /   109 tokens
Log end
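
The "tokens per second" figure at the end of each llama_print_timings line can be recomputed from the first two numbers on that line, which is handy when comparing many runs. Below is a minimal Python sketch, assuming the log format shown above. The regular expression and the helper names parse_timings and tokens_per_second are made up for illustration and are not part of llama.cpp.

```python
import re

# Two timing lines copied from the run above.
LOG = """\
llama_print_timings: prompt eval time =    1898.40 ms /    10 tokens (  189.84 ms per token,     5.27 tokens per second)
llama_print_timings:        eval time =   28113.06 ms /    99 runs   (  283.97 ms per token,     3.52 tokens per second)
"""

# Matches "<name> time = <ms> ms / <count> tokens|runs".
# Lines without a count (such as "load time") are skipped.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def parse_timings(log: str) -> dict[str, tuple[float, int]]:
    """Return {name: (milliseconds, count)} for every timing line with a count."""
    return {
        m["name"]: (float(m["ms"]), int(m["count"]))
        for m in TIMING_RE.finditer(log)
    }


def tokens_per_second(ms: float, count: int) -> float:
    """Convert a total time in milliseconds and a token count to tokens/s."""
    return count / (ms / 1000.0)


timings = parse_timings(LOG)
for name, (ms, count) in timings.items():
    print(f"{name}: {tokens_per_second(ms, count):.2f} tokens/s")
# prompt eval: 5.27 tokens/s
# eval: 3.52 tokens/s
```

Here "prompt eval" is the speed of processing the input prompt, and "eval" is the speed of generating new tokens.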

Running model llama2-7B.q4, generation length 200:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
(part omitted)
llama_print_timings:        load time =    2703.62 ms
llama_print_timings:      sample time =      12.85 ms /   200 runs   (    0.06 ms per token, 15560.57 tokens per second)
llama_print_timings: prompt eval time =    1873.80 ms /    10 tokens (  187.38 ms per token,     5.34 tokens per second)
llama_print_timings:        eval time =   59352.84 ms /   199 runs   (  298.26 ms per token,     3.35 tokens per second)
llama_print_timings:       total time =   61281.14 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
(part omitted)
llama_print_timings:        load time =    2706.04 ms
llama_print_timings:      sample time =      33.77 ms /   500 runs   (    0.07 ms per token, 14808.23 tokens per second)
llama_print_timings: prompt eval time =    1866.60 ms /    10 tokens (  186.66 ms per token,     5.36 tokens per second)
llama_print_timings:        eval time =  154145.54 ms /   499 runs   (  308.91 ms per token,     3.24 tokens per second)
llama_print_timings:       total time =  156146.19 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
(part omitted)
llama_print_timings:        load time =    2912.39 ms
llama_print_timings:      sample time =      60.76 ms /  1000 runs   (    0.06 ms per token, 16457.65 tokens per second)
llama_print_timings: prompt eval time =    1870.87 ms /    10 tokens (  187.09 ms per token,     5.35 tokens per second)
llama_print_timings:        eval time =  335019.17 ms /   999 runs   (  335.35 ms per token,     2.98 tokens per second)
llama_print_timings:       total time =  337155.40 ms /  1009 tokens
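
The four runs above suggest that, on this machine, the time per generated token grows as the output gets longer. Here is a small Python sketch that puts the numbers side by side. The values are copied from the "eval time" lines above, and the helper name ms_per_token is made up for illustration.

```python
# (n_predict, eval time in ms, eval runs), copied from the four
# llama2-7B.q4 runs on the i5-6200U above.
RUNS = [
    (100, 28113.06, 99),
    (200, 59352.84, 199),
    (500, 154145.54, 499),
    (1000, 335019.17, 999),
]


def ms_per_token(eval_ms: float, runs: int) -> float:
    """Average generation time per token in milliseconds."""
    return eval_ms / runs


# Use the shortest run as the baseline for the relative slowdown.
baseline = ms_per_token(RUNS[0][1], RUNS[0][2])

for n_predict, eval_ms, runs in RUNS:
    per_token = ms_per_token(eval_ms, runs)
    print(
        f"n = {n_predict:>4}: {per_token:7.2f} ms/token, "
        f"{1000.0 / per_token:5.2f} tokens/s, "
        f"x{per_token / baseline:.2f} relative to n = 100"
    )
```

Each result here is a single run, so small differences between lengths may just be noise.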

Running model qwen2-7B.q8, generation length 100:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724501237
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = qwen2-7b-instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = ../Qwen2/gguf/qwen2-7b-imatrix/imatri...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = ../sft_2406.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 1937
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:  198 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 7.54 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = qwen2-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  7717.68 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =  1884.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story and it is very complicated.
(part omitted)
llama_print_timings:        load time =    5355.79 ms
llama_print_timings:      sample time =      16.50 ms /   100 runs   (    0.17 ms per token,  6059.14 tokens per second)
llama_print_timings: prompt eval time =    1727.39 ms /     9 tokens (  191.93 ms per token,     5.21 tokens per second)
llama_print_timings:        eval time =   41066.65 ms /    99 runs   (  414.81 ms per token,     2.41 tokens per second)
llama_print_timings:       total time =   42914.72 ms /   108 tokens
Log end
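
Although qwen2 runs with n_ctx = 32768 and llama2 with n_ctx = 4096, the qwen2 KV buffer in the log is smaller (1792 MiB versus 2048 MiB). The logged sizes are consistent with n_ctx * n_layer * n_embd_k_gqa * 2 bytes for K, plus the same amount for V. qwen2 has n_head_kv = 4, so its n_embd_k_gqa is 512 instead of 4096. Below is a Python sketch of that arithmetic, assuming f16 storage (2 bytes per element) as the log reports. The helper name kv_cache_mib is made up for illustration.

```python
def kv_cache_mib(
    n_ctx: int,
    n_layer: int,
    n_embd_k_gqa: int,
    n_embd_v_gqa: int,
    bytes_per_elem: int = 2,  # assumption: f16, as reported in the log
) -> float:
    """Estimate K + V cache size in MiB.

    One K vector and one V vector are stored per layer
    for every position in the context window.
    """
    k_bytes = n_ctx * n_layer * n_embd_k_gqa * bytes_per_elem
    v_bytes = n_ctx * n_layer * n_embd_v_gqa * bytes_per_elem
    return (k_bytes + v_bytes) / (1024 * 1024)


# Values taken from the llm_load_print_meta / llama_new_context_with_model lines.
llama2_mib = kv_cache_mib(n_ctx=4096, n_layer=32, n_embd_k_gqa=4096, n_embd_v_gqa=4096)
qwen2_mib = kv_cache_mib(n_ctx=32768, n_layer=28, n_embd_k_gqa=512, n_embd_v_gqa=512)

print(f"llama2-7B: {llama2_mib:.2f} MiB")  # llama2-7B: 2048.00 MiB
print(f"qwen2-7B:  {qwen2_mib:.2f} MiB")   # qwen2-7B:  1792.00 MiB
```

Both results match the "KV self size" lines in the logs above.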

Running model qwen2-7B.q8, generation length 200:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
(part omitted)
llama_print_timings:        load time =    4641.45 ms
llama_print_timings:      sample time =      34.69 ms /   200 runs   (    0.17 ms per token,  5765.85 tokens per second)
llama_print_timings: prompt eval time =    1735.51 ms /     9 tokens (  192.83 ms per token,     5.19 tokens per second)
llama_print_timings:        eval time =   84374.46 ms /   199 runs   (  423.99 ms per token,     2.36 tokens per second)
llama_print_timings:       total time =   86360.14 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
(part omitted)
llama_print_timings:        load time =    5026.41 ms
llama_print_timings:      sample time =      91.64 ms /   500 runs   (    0.18 ms per token,  5456.37 tokens per second)
llama_print_timings: prompt eval time =    1713.90 ms /     9 tokens (  190.43 ms per token,     5.25 tokens per second)
llama_print_timings:        eval time =  214729.88 ms /   499 runs   (  430.32 ms per token,     2.32 tokens per second)
llama_print_timings:       total time =  217097.31 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
(part omitted)
llama_print_timings:        load time =    4939.31 ms
llama_print_timings:      sample time =     194.02 ms /  1000 runs   (    0.19 ms per token,  5154.00 tokens per second)
llama_print_timings: prompt eval time =    1879.29 ms /     9 tokens (  208.81 ms per token,     4.79 tokens per second)
llama_print_timings:        eval time =  440575.12 ms /   999 runs   (  441.02 ms per token,     2.27 tokens per second)
llama_print_timings:       total time =  443841.74 ms /  1008 tokens

3.2 CPU (E5-2650v3, 10C/10T/3.0GHz) x86_64 AVX2

Run on machine No. 5 (physical machine). Version:

fc-test@MiWiFi-RA74-srv:~/llama-cpp$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli --version
./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli: /lib64/libcurl.so.4: no version information available (required by ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli)
version: 3617 (a07c32ea)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Running model llama2-7B.q4, generation length 100:

fc-test@MiWiFi-RA74-srv:~/llama-cpp$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli: /lib64/libcurl.so.4: no version information available (required by ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli)
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724498199
(part omitted)
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, but this is the only way I could explain what I did to solve this problem. everyone here said it cannot be done, but I did it. I don't know why I can solve it, but I did.
(part omitted)
llama_print_timings:        load time =    1542.10 ms
llama_print_timings:      sample time =       4.82 ms /   100 runs   (    0.05 ms per token, 20768.43 tokens per second)
llama_print_timings: prompt eval time =     493.57 ms /    10 tokens (   49.36 ms per token,    20.26 tokens per second)
llama_print_timings:        eval time =   10175.47 ms /    99 runs   (  102.78 ms per token,     9.73 tokens per second)
llama_print_timings:       total time =   10693.97 ms /   109 tokens
Log end

Running model llama2-7B.q4, generation length 200:

$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
(part omitted)
llama_print_timings:        load time =    1607.02 ms
llama_print_timings:      sample time =       9.29 ms /   200 runs   (    0.05 ms per token, 21528.53 tokens per second)
llama_print_timings: prompt eval time =     494.35 ms /    10 tokens (   49.44 ms per token,    20.23 tokens per second)
llama_print_timings:        eval time =   20434.74 ms /   199 runs   (  102.69 ms per token,     9.74 tokens per second)
llama_print_timings:       total time =   20978.91 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
(part omitted)
llama_print_timings:        load time =    1583.59 ms
llama_print_timings:      sample time =      23.55 ms /   500 runs   (    0.05 ms per token, 21226.92 tokens per second)
llama_print_timings: prompt eval time =     499.12 ms /    10 tokens (   49.91 ms per token,    20.04 tokens per second)
llama_print_timings:        eval time =   52358.53 ms /   499 runs   (  104.93 ms per token,     9.53 tokens per second)
llama_print_timings:       total time =   52987.01 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
(part omitted)
llama_print_timings:        load time =    3247.78 ms
llama_print_timings:      sample time =      47.13 ms /  1000 runs   (    0.05 ms per token, 21218.81 tokens per second)
llama_print_timings: prompt eval time =    2596.30 ms /    10 tokens (  259.63 ms per token,     3.85 tokens per second)
llama_print_timings:        eval time =  118042.47 ms /   999 runs   (  118.16 ms per token,     8.46 tokens per second)
llama_print_timings:       total time =  120896.74 ms /  1009 tokens

Running model qwen2-7B.q8, generation length 100:

fc-test@MiWiFi-RA74-srv:~/llama-cpp$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli: /lib64/libcurl.so.4: no version information available (required by ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli)
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724498632
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part omitted)
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  7717.68 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =  1884.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 10 / 10 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story, so i will split it into parts.
(part omitted)
llama_print_timings:        load time =    1626.44 ms
llama_print_timings:      sample time =      14.31 ms /   100 runs   (    0.14 ms per token,  6987.63 tokens per second)
llama_print_timings: prompt eval time =     507.61 ms /     9 tokens (   56.40 ms per token,    17.73 tokens per second)
llama_print_timings:        eval time =   14615.79 ms /    99 runs   (  147.63 ms per token,     6.77 tokens per second)
llama_print_timings:       total time =   15238.41 ms /   108 tokens
Log end

Running model qwen2-7B.q8, generation length 200:

$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
(part omitted)
llama_print_timings:        load time =    1577.00 ms
llama_print_timings:      sample time =      28.41 ms /   200 runs   (    0.14 ms per token,  7039.03 tokens per second)
llama_print_timings: prompt eval time =     503.02 ms /     9 tokens (   55.89 ms per token,    17.89 tokens per second)
llama_print_timings:        eval time =   28940.41 ms /   199 runs   (  145.43 ms per token,     6.88 tokens per second)
llama_print_timings:       total time =   29668.90 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
(part omitted)
llama_print_timings:        load time =    1598.72 ms
llama_print_timings:      sample time =      72.10 ms /   500 runs   (    0.14 ms per token,  6935.01 tokens per second)
llama_print_timings: prompt eval time =     502.73 ms /     9 tokens (   55.86 ms per token,    17.90 tokens per second)
llama_print_timings:        eval time =   72983.23 ms /   499 runs   (  146.26 ms per token,     6.84 tokens per second)
llama_print_timings:       total time =   74061.66 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

$ ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
(part omitted)
llama_print_timings:        load time =    1602.06 ms
llama_print_timings:      sample time =     144.15 ms /  1000 runs   (    0.14 ms per token,  6937.31 tokens per second)
llama_print_timings: prompt eval time =     509.66 ms /     9 tokens (   56.63 ms per token,    17.66 tokens per second)
llama_print_timings:        eval time =  149336.77 ms /   999 runs   (  149.49 ms per token,     6.69 tokens per second)
llama_print_timings:       total time =  150983.01 ms /  1008 tokens

3.3 CPU (r5-5600g, 6C/12T/4.4GHz) x86_64 AVX2

Run on PC No. 6 (physical machine). Version:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli --version
version: 3617 (a07c32ea)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Running model llama2-7B.q4, generation length 100:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724488187
(part omitted)
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3891.24 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   296.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, but i think it's important to read.
(part omitted)
llama_print_timings:        load time =     649.76 ms
llama_print_timings:      sample time =       2.40 ms /   100 runs   (    0.02 ms per token, 41701.42 tokens per second)
llama_print_timings: prompt eval time =     311.37 ms /    10 tokens (   31.14 ms per token,    32.12 tokens per second)
llama_print_timings:        eval time =    9771.88 ms /    99 runs   (   98.71 ms per token,    10.13 tokens per second)
llama_print_timings:       total time =   10092.46 ms /   109 tokens
Log end
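
The tokens-per-second figures in the `llama_print_timings` lines can be recomputed from the raw milliseconds and run counts. Below is a minimal Python sketch for doing so. The names `TIMING_RE`, `parse_timings` and `tokens_per_second` are hypothetical helpers written for this illustration and are not part of llama.cpp. The regular expression only assumes the line layout shown in the log above.

```python
import re

# Matches lines such as:
# llama_print_timings:        eval time =    9771.88 ms /    99 runs   (...)
# The count after "/" is optional because the "load time" line has none.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {name: (total_ms, count_or_None)} for each timing line found."""
    result = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        result[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count else None,
        )
    return result


def tokens_per_second(total_ms, count):
    """Convert a total time in milliseconds and a token count to tokens/s."""
    return count / (total_ms / 1000.0)


# Timing lines copied from the run above (llama2-7B.q4, -n 100).
SAMPLE = """\
llama_print_timings:        load time =     649.76 ms
llama_print_timings:      sample time =       2.40 ms /   100 runs   (    0.02 ms per token, 41701.42 tokens per second)
llama_print_timings: prompt eval time =     311.37 ms /    10 tokens (   31.14 ms per token,    32.12 tokens per second)
llama_print_timings:        eval time =    9771.88 ms /    99 runs   (   98.71 ms per token,    10.13 tokens per second)
llama_print_timings:       total time =   10092.46 ms /   109 tokens
"""

timings = parse_timings(SAMPLE)
eval_ms, eval_runs = timings["eval"]
print(f"generation: {tokens_per_second(eval_ms, eval_runs):.2f} tokens/s")
```

The `eval` entry is the one that describes generation speed. `prompt eval` covers only the 10 prompt tokens, so its rate is based on a very small sample.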

Running model llama2-7B.q4, generation length 200:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
(part omitted here)
llama_print_timings:        load time =     650.76 ms
llama_print_timings:      sample time =       5.08 ms /   200 runs   (    0.03 ms per token, 39331.37 tokens per second)
llama_print_timings: prompt eval time =     308.01 ms /    10 tokens (   30.80 ms per token,    32.47 tokens per second)
llama_print_timings:        eval time =   19887.24 ms /   199 runs   (   99.94 ms per token,    10.01 tokens per second)
llama_print_timings:       total time =   20214.70 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
(part omitted here)
llama_print_timings:        load time =     648.51 ms
llama_print_timings:      sample time =      12.16 ms /   500 runs   (    0.02 ms per token, 41128.57 tokens per second)
llama_print_timings: prompt eval time =     308.95 ms /    10 tokens (   30.89 ms per token,    32.37 tokens per second)
llama_print_timings:        eval time =   51687.76 ms /   499 runs   (  103.58 ms per token,     9.65 tokens per second)
llama_print_timings:       total time =   52043.21 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
(part omitted here)
llama_print_timings:        load time =     648.60 ms
llama_print_timings:      sample time =      24.13 ms /  1000 runs   (    0.02 ms per token, 41438.75 tokens per second)
llama_print_timings: prompt eval time =     311.58 ms /    10 tokens (   31.16 ms per token,    32.09 tokens per second)
llama_print_timings:        eval time =  107409.32 ms /   999 runs   (  107.52 ms per token,     9.30 tokens per second)
llama_print_timings:       total time =  107815.70 ms /  1009 tokens
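
Across the four runs above, the time per generated token rises slightly as the generation length grows. A small Python sketch makes the trend explicit. The numbers are copied from the four `eval time` lines, and `slowdown_percent` is a hypothetical helper written for this illustration. One plausible explanation, stated here as an assumption and not something these logs measure, is that each new token attends over a longer KV cache.

```python
# (n_predict, ms per token) copied from the four "eval time" lines above.
RUNS = [(100, 98.71), (200, 99.94), (500, 103.58), (1000, 107.52)]


def slowdown_percent(runs):
    """Per-token time of each run relative to the first run, in percent."""
    base = runs[0][1]
    return [(n, (ms / base - 1.0) * 100.0) for n, ms in runs]


for n, pct in slowdown_percent(RUNS):
    print(f"-n {n:4d}: {pct:+.1f}% ms/token vs. -n {RUNS[0][0]}")
```

Each configuration was run only once, so differences of a percent or two may be noise.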

Running model qwen2-7B.q8, generation length 100:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724489633
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part omitted here)
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors:        CPU buffer size =  7717.68 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =  1884.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0

hello, this is a very very long story about my friend and her husband, so please bear with me.
(part omitted here)
llama_print_timings:        load time =    1158.78 ms
llama_print_timings:      sample time =       8.32 ms /   100 runs   (    0.08 ms per token, 12025.01 tokens per second)
llama_print_timings: prompt eval time =     457.69 ms /     9 tokens (   50.85 ms per token,    19.66 tokens per second)
llama_print_timings:        eval time =   17878.08 ms /    99 runs   (  180.59 ms per token,     5.54 tokens per second)
llama_print_timings:       total time =   18402.49 ms /   108 tokens
Log end

Running model qwen2-7B.q8, generation length 200:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
(part omitted here)
llama_print_timings:        load time =    1109.41 ms
llama_print_timings:      sample time =      13.17 ms /   200 runs   (    0.07 ms per token, 15181.42 tokens per second)
llama_print_timings: prompt eval time =     496.57 ms /     9 tokens (   55.17 ms per token,    18.12 tokens per second)
llama_print_timings:        eval time =   35791.00 ms /   199 runs   (  179.85 ms per token,     5.56 tokens per second)
llama_print_timings:       total time =   36411.02 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
(part omitted here)
llama_print_timings:        load time =    1061.77 ms
llama_print_timings:      sample time =      40.61 ms /   500 runs   (    0.08 ms per token, 12311.03 tokens per second)
llama_print_timings: prompt eval time =     409.44 ms /     9 tokens (   45.49 ms per token,    21.98 tokens per second)
llama_print_timings:        eval time =   90250.99 ms /   499 runs   (  180.86 ms per token,     5.53 tokens per second)
llama_print_timings:       total time =   90991.53 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

> ./llama-b3617-bin-ubuntu-x64/build/bin/llama-cli -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
(part omitted here)
llama_print_timings:        load time =     977.25 ms
llama_print_timings:      sample time =      60.87 ms /  1000 runs   (    0.06 ms per token, 16428.99 tokens per second)
llama_print_timings: prompt eval time =     479.25 ms /     9 tokens (   53.25 ms per token,    18.78 tokens per second)
llama_print_timings:        eval time =  182514.10 ms /   999 runs   (  182.70 ms per token,     5.47 tokens per second)
llama_print_timings:       total time =  183593.03 ms /  1008 tokens

3.4 iGPU (Intel HD520, i5-6200U) vulkan

Run on PC No. 4 (physical machine). Version:

> ./llama-cli-vulkan-b3617 --version
version: 1 (a07c32e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Running model llama2-7B.q4, generation length 100:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 100
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724502840
(part omitted here)
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 520 (SKL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors: Intel(R) HD Graphics 520 (SKL GT2) buffer size =  3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) HD Graphics 520 (SKL GT2) KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: Intel(R) HD Graphics 520 (SKL GT2) compute buffer size =   296.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 2 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1

hello, this is a very very long story but i will try and make it short
(part omitted here)
llama_print_timings:        load time =   27305.92 ms
llama_print_timings:      sample time =      20.64 ms /   100 runs   (    0.21 ms per token,  4844.49 tokens per second)
llama_print_timings: prompt eval time =   10725.27 ms /    10 tokens ( 1072.53 ms per token,     0.93 tokens per second)
llama_print_timings:        eval time =  104246.69 ms /    99 runs   ( 1053.00 ms per token,     0.95 tokens per second)
llama_print_timings:       total time =  115065.04 ms /   109 tokens
Log end

Running model llama2-7B.q4, generation length 200:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 200
(part omitted here)
llama_print_timings:        load time =   26358.11 ms
llama_print_timings:      sample time =      43.34 ms /   200 runs   (    0.22 ms per token,  4615.21 tokens per second)
llama_print_timings: prompt eval time =   10579.07 ms /    10 tokens ( 1057.91 ms per token,     0.95 tokens per second)
llama_print_timings:        eval time =  209900.70 ms /   199 runs   ( 1054.78 ms per token,     0.95 tokens per second)
llama_print_timings:       total time =  220666.27 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 500
(part omitted here)
llama_print_timings:        load time =   27769.47 ms
llama_print_timings:      sample time =     100.38 ms /   500 runs   (    0.20 ms per token,  4981.17 tokens per second)
llama_print_timings: prompt eval time =   10573.54 ms /    10 tokens ( 1057.35 ms per token,     0.95 tokens per second)
llama_print_timings:        eval time =  532338.80 ms /   499 runs   ( 1066.81 ms per token,     0.94 tokens per second)
llama_print_timings:       total time =  543350.42 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -n 1000
(part omitted here)
llama_print_timings:        load time =   29646.65 ms
llama_print_timings:      sample time =     179.74 ms /  1000 runs   (    0.18 ms per token,  5563.62 tokens per second)
llama_print_timings: prompt eval time =   10538.36 ms /    10 tokens ( 1053.84 ms per token,     0.95 tokens per second)
llama_print_timings:        eval time = 1089916.74 ms /   999 runs   ( 1091.01 ms per token,     0.92 tokens per second)
llama_print_timings:       total time = 1101057.43 ms /  1009 tokens

Running model qwen2-7B.q8. This fails: the model cannot run, and the log reports that device memory is insufficient:

> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -n 100
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724508115
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part omitted here)
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) HD Graphics 520 (SKL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   552.23 MiB
llm_load_tensors: Intel(R) HD Graphics 520 (SKL GT2) buffer size =  7165.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_vulkan: Device memory allocation of size 1879048192 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'qwen2-7b-instruct-q8_0.gguf'
main: error: unable to load model
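
The failed allocation of 1879048192 bytes is the KV cache, and its size can be reproduced by hand. The Python sketch below assumes an f16 cache (2 bytes per element, as the `K (f16)` / `V (f16)` lines in the successful runs indicate) and one K and one V vector per layer, per context position, per KV head. `kv_cache_bytes` is a hypothetical helper written for this illustration. The llama-2-7b parameters come from the metadata dump in section 3.1. For qwen2-7b, the 28 layers and `n_ctx = 32768` are in the log above, while 4 KV heads of dimension 128 are an assumption, because the corresponding metadata lines are omitted here.

```python
MIB = 1024 * 1024


def kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, bytes_per_elem=2):
    """K cache plus V cache: one vector per layer, position and KV head."""
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem


# llama-2-7b: block_count = 32, head_count_kv = 32,
# head_dim = embedding_length / head_count = 4096 / 32 = 128, n_ctx = 4096.
llama2 = kv_cache_bytes(n_layer=32, n_ctx=4096, n_head_kv=32, head_dim=128)

# qwen2-7b: 28 layers and n_ctx = 32768 are taken from the log.
# ASSUMPTION: 4 KV heads with head_dim = 128 (not shown in the log).
qwen2 = kv_cache_bytes(n_layer=28, n_ctx=32768, n_head_kv=4, head_dim=128)

print(f"llama-2-7b: {llama2} bytes = {llama2 / MIB:.2f} MiB")
print(f"qwen2-7b:   {qwen2} bytes = {qwen2 / MIB:.2f} MiB")
```

Under these assumptions the qwen2-7b result equals the 1879048192 bytes in the error message, and the llama-2-7b result equals the 2048.00 MiB reported as `KV self size`. In this formula the size scales linearly with `n_ctx`, so a smaller context (llama-cli's `-c` option) would need a proportionally smaller cache. Whether that would be enough for this model to load on the HD520 was not tested here.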

3.5 iGPU (AMD Radeon Vega 7, r5-5600g) vulkan

Run on PC No. 6 (physical machine). Version:

> ./llama-cli-vulkan-b3617 --version
version: 1 (a07c32e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Running model llama2-7B.q4, generation length 100:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724488777
(part omitted here)
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors: AMD Radeon Graphics (RADV RENOIR) buffer size =  3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon Graphics (RADV RENOIR) KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: AMD Radeon Graphics (RADV RENOIR) compute buffer size =   296.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1

hello, this is a very very long story and it's only the first episode.
(part omitted here)
llama_print_timings:        load time =    3300.29 ms
llama_print_timings:      sample time =       4.41 ms /   100 runs   (    0.04 ms per token, 22686.03 tokens per second)
llama_print_timings: prompt eval time =    1028.22 ms /    10 tokens (  102.82 ms per token,     9.73 tokens per second)
llama_print_timings:        eval time =   23080.64 ms /    99 runs   (  233.14 ms per token,     4.29 tokens per second)
llama_print_timings:       total time =   24122.46 ms /   109 tokens
Log end

Running model llama2-7B.q4, generation length 200:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part omitted here)
llama_print_timings:        load time =    3410.94 ms
llama_print_timings:      sample time =       8.64 ms /   200 runs   (    0.04 ms per token, 23153.51 tokens per second)
llama_print_timings: prompt eval time =    1027.37 ms /    10 tokens (  102.74 ms per token,     9.73 tokens per second)
llama_print_timings:        eval time =   46620.34 ms /   199 runs   (  234.27 ms per token,     4.27 tokens per second)
llama_print_timings:       total time =   47674.32 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part omitted here)
llama_print_timings:        load time =    3389.70 ms
llama_print_timings:      sample time =      21.42 ms /   500 runs   (    0.04 ms per token, 23339.40 tokens per second)
llama_print_timings: prompt eval time =    1026.09 ms /    10 tokens (  102.61 ms per token,     9.75 tokens per second)
llama_print_timings:        eval time =  118409.44 ms /   499 runs   (  237.29 ms per token,     4.21 tokens per second)
llama_print_timings:       total time =  119502.95 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

> ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part omitted here)
llama_print_timings:        load time =    3362.42 ms
llama_print_timings:      sample time =      43.25 ms /  1000 runs   (    0.04 ms per token, 23120.85 tokens per second)
llama_print_timings: prompt eval time =    1027.78 ms /    10 tokens (  102.78 ms per token,     9.73 tokens per second)
llama_print_timings:        eval time =  242531.02 ms /   999 runs   (  242.77 ms per token,     4.12 tokens per second)
llama_print_timings:       total time =  243694.80 ms /  1009 tokens

Running model qwen2-7B.q8, generation length 100:

> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724490279
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part omitted here)
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   552.23 MiB
llm_load_tensors: AMD Radeon Graphics (RADV RENOIR) buffer size =  7165.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon Graphics (RADV RENOIR) KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.58 MiB
llama_new_context_with_model: AMD Radeon Graphics (RADV RENOIR) compute buffer size =  1884.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    71.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0

hello, this is a very very long story, but I'm going to do my best to explain in a concise manner:
(part omitted here)
llama_print_timings:        load time =    8781.85 ms
llama_print_timings:      sample time =       9.19 ms /   100 runs   (    0.09 ms per token, 10880.21 tokens per second)
llama_print_timings: prompt eval time =     913.76 ms /     9 tokens (  101.53 ms per token,     9.85 tokens per second)
llama_print_timings:        eval time =   34897.82 ms /    99 runs   (  352.50 ms per token,     2.84 tokens per second)
llama_print_timings:       total time =   35889.02 ms /   108 tokens
Log end

Running model qwen2-7B.q8, generation length 200:

> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part omitted here)
llama_print_timings:        load time =    8249.67 ms
llama_print_timings:      sample time =      17.88 ms /   200 runs   (    0.09 ms per token, 11185.68 tokens per second)
llama_print_timings: prompt eval time =     909.22 ms /     9 tokens (  101.02 ms per token,     9.90 tokens per second)
llama_print_timings:        eval time =   70426.63 ms /   199 runs   (  353.90 ms per token,     2.83 tokens per second)
llama_print_timings:       total time =   71489.45 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part omitted here)
llama_print_timings:        load time =    6014.76 ms
llama_print_timings:      sample time =      46.23 ms /   500 runs   (    0.09 ms per token, 10815.96 tokens per second)
llama_print_timings: prompt eval time =     916.14 ms /     9 tokens (  101.79 ms per token,     9.82 tokens per second)
llama_print_timings:        eval time =  177508.81 ms /   499 runs   (  355.73 ms per token,     2.81 tokens per second)
llama_print_timings:       total time =  178809.12 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

> ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part omitted here)
llama_print_timings:        load time =    6662.38 ms
llama_print_timings:      sample time =      89.55 ms /  1000 runs   (    0.09 ms per token, 11167.57 tokens per second)
llama_print_timings: prompt eval time =     916.79 ms /     9 tokens (  101.87 ms per token,     9.82 tokens per second)
llama_print_timings:        eval time =  358831.15 ms /   999 runs   (  359.19 ms per token,     2.78 tokens per second)
llama_print_timings:       total time =  360504.90 ms /  1008 tokens

3.6 dGPU (A770) vulkan

Run on No. 6 (virtual machine). Version:

a2@a2s:~$ ./llama-cli-vulkan-b3617 --version
version: 1 (a07c32e)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Running model llama2-7B.q4, generation length 100:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724492722
(part omitted here)
llm_load_print_meta: max token length = 48
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors: Intel(R) Arc(tm) A770 Graphics (DG2) buffer size =  3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Arc(tm) A770 Graphics (DG2) KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: Intel(R) Arc(tm) A770 Graphics (DG2) compute buffer size =   296.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1

hello, this is a very very long story, and you can ignore most of it, i'm just putting it here for reference in case anyone has a similar issue.
(part omitted here)
llama_print_timings:        load time =    2274.09 ms
llama_print_timings:      sample time =       4.14 ms /   100 runs   (    0.04 ms per token, 24148.76 tokens per second)
llama_print_timings: prompt eval time =     440.70 ms /    10 tokens (   44.07 ms per token,    22.69 tokens per second)
llama_print_timings:        eval time =    3809.51 ms /    99 runs   (   38.48 ms per token,    25.99 tokens per second)
llama_print_timings:       total time =    4262.46 ms /   109 tokens
Log end

Running model llama2-7B.q4, generation length 200:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part omitted here)
llama_print_timings:        load time =    2308.60 ms
llama_print_timings:      sample time =       8.50 ms /   200 runs   (    0.04 ms per token, 23518.34 tokens per second)
llama_print_timings: prompt eval time =     441.26 ms /    10 tokens (   44.13 ms per token,    22.66 tokens per second)
llama_print_timings:        eval time =    7704.86 ms /   199 runs   (   38.72 ms per token,    25.83 tokens per second)
llama_print_timings:       total time =    8171.87 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part omitted here)
llama_print_timings:        load time =    2296.68 ms
llama_print_timings:      sample time =      21.31 ms /   500 runs   (    0.04 ms per token, 23460.96 tokens per second)
llama_print_timings: prompt eval time =     440.77 ms /    10 tokens (   44.08 ms per token,    22.69 tokens per second)
llama_print_timings:        eval time =   19597.74 ms /   499 runs   (   39.27 ms per token,    25.46 tokens per second)
llama_print_timings:       total time =   20102.66 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part omitted here)
llama_print_timings:        load time =    2273.46 ms
llama_print_timings:      sample time =      42.10 ms /  1000 runs   (    0.04 ms per token, 23751.84 tokens per second)
llama_print_timings: prompt eval time =     441.47 ms /    10 tokens (   44.15 ms per token,    22.65 tokens per second)
llama_print_timings:        eval time =   40262.07 ms /   999 runs   (   40.30 ms per token,    24.81 tokens per second)
llama_print_timings:       total time =   40827.46 ms /  1009 tokens
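
With the A770 numbers in, the llama2-7B.q4 generation speeds of this part can be put side by side. The Python sketch below uses the tokens-per-second values from the `eval time` lines of the `-n 100` runs. The dictionary labels and the `relative_speed` helper are hypothetical names chosen for this illustration, and the CPU run (the one reporting `n_threads = 6 / 12` and 0/33 offloaded layers) serves as the baseline.

```python
# Generation speed (tokens per second) from the "eval time" lines of the
# llama2-7B.q4, -n 100 runs in this part of the article.
EVAL_TPS = {
    "CPU (n_threads = 6 / 12)": 10.13,
    "iGPU Intel HD520 (vulkan)": 0.95,
    "iGPU AMD Radeon Vega 7 (vulkan)": 4.29,
    "dGPU A770 (vulkan)": 25.99,
}


def relative_speed(results, baseline):
    """Speed of every entry divided by the speed of the baseline entry."""
    base = results[baseline]
    return {name: tps / base for name, tps in results.items()}


ratios = relative_speed(EVAL_TPS, "CPU (n_threads = 6 / 12)")
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:34s} {EVAL_TPS[name]:6.2f} tokens/s  {ratio:5.2f}x")
```

These ratios come from single runs of one prompt, so they are a rough indication and not a rigorous benchmark.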

Running model qwen2-7B.q8, generation length 100:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 1 (a07c32e)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1724493121
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part omitted here)
llm_load_print_meta: max token length = 256
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   552.23 MiB
llm_load_tensors: Intel(R) Arc(tm) A770 Graphics (DG2) buffer size =  7165.44 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Intel(R) Arc(tm) A770 Graphics (DG2) KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.58 MiB
llama_new_context_with_model: Intel(R) Arc(tm) A770 Graphics (DG2) compute buffer size =  1884.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    71.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story with multiple characters, but i will try to write it in a way that makes it easy to follow. 
(part of the output omitted here)
llama_print_timings:        load time =    8202.05 ms
llama_print_timings:      sample time =      10.16 ms /   100 runs   (    0.10 ms per token,  9839.61 tokens per second)
llama_print_timings: prompt eval time =     587.73 ms /     9 tokens (   65.30 ms per token,    15.31 tokens per second)
llama_print_timings:        eval time =    4755.44 ms /    99 runs   (   48.03 ms per token,    20.82 tokens per second)
llama_print_timings:       total time =    5460.46 ms /   108 tokens
Log end
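
The `KV self size` line in the log above can be reproduced by hand. A minimal sketch, assuming Qwen2-7B uses 4 KV heads with a head dimension of 128 (neither value appears in the excerpt above; the 28 layers and `n_ctx = 32768` do), and that K and V are each stored as 2-byte f16 values. The function name `kv_cache_mib` is made up for this illustration:

```python
def kv_cache_mib(n_ctx, n_layer, n_head_kv, head_dim, bytes_per_elem=2):
    """Return (K, V, total) KV cache sizes in MiB."""
    # One K (or V) entry per context position, layer, KV head and head dimension.
    per_tensor_bytes = n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem
    per_tensor_mib = per_tensor_bytes / (1024 * 1024)
    return per_tensor_mib, per_tensor_mib, 2 * per_tensor_mib


# Qwen2-7B: n_ctx and layer count from the log; 4 KV heads x 128 dims are assumed.
k_mib, v_mib, total_mib = kv_cache_mib(32768, 28, 4, 128)
print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {total_mib:.2f} MiB")
```

Under these assumptions the result is 896 MiB each for K and V, 1792 MiB in total, which matches the log. The large default context (32768) is what makes the cache this big; a smaller context would shrink it proportionally.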

Run model qwen2-7B.q8, generation length 200:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
(part of the output omitted here)
llama_print_timings:        load time =    6642.05 ms
llama_print_timings:      sample time =      19.91 ms /   200 runs   (    0.10 ms per token, 10043.19 tokens per second)
llama_print_timings: prompt eval time =     587.07 ms /     9 tokens (   65.23 ms per token,    15.33 tokens per second)
llama_print_timings:        eval time =    9581.81 ms /   199 runs   (   48.15 ms per token,    20.77 tokens per second)
llama_print_timings:       total time =   10348.91 ms /   208 tokens

Run model qwen2-7B.q8, generation length 500:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
(part of the output omitted here)
llama_print_timings:        load time =    6756.91 ms
llama_print_timings:      sample time =      51.43 ms /   500 runs   (    0.10 ms per token,  9722.33 tokens per second)
llama_print_timings: prompt eval time =     588.10 ms /     9 tokens (   65.34 ms per token,    15.30 tokens per second)
llama_print_timings:        eval time =   24196.44 ms /   499 runs   (   48.49 ms per token,    20.62 tokens per second)
llama_print_timings:       total time =   25212.38 ms /   508 tokens

Run model qwen2-7B.q8, generation length 1000:

a2@a2s:~$ ./llama-cli-vulkan-b3617 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
(part of the output omitted here)
llama_print_timings:        load time =    6664.69 ms
llama_print_timings:      sample time =      92.37 ms /  1000 runs   (    0.09 ms per token, 10825.91 tokens per second)
llama_print_timings: prompt eval time =     586.92 ms /     9 tokens (   65.21 ms per token,    15.33 tokens per second)
llama_print_timings:        eval time =   48610.18 ms /   999 runs   (   48.66 ms per token,    20.55 tokens per second)
llama_print_timings:       total time =   49939.72 ms /  1008 tokens
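
The `tokens per second` figures in these timing blocks follow directly from the elapsed time and the token count. A minimal sketch that parses such a block and recomputes the rate; the regular expression is written against the lines shown above and is only an assumption about the format, which other llama.cpp builds may not share:

```python
import re

# Pattern written against the log lines shown above; other llama.cpp builds
# may print timings in a different format.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Return {name: (milliseconds, count)}; count is None when absent."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        count = match.group("count")
        timings[match.group("name")] = (
            float(match.group("ms")),
            int(count) if count is not None else None,
        )
    return timings


def tokens_per_second(ms, count):
    """Throughput for `count` tokens processed in `ms` milliseconds."""
    return count / (ms / 1000.0)


SAMPLE_LOG = """\
llama_print_timings:        load time =    6664.69 ms
llama_print_timings:      sample time =      92.37 ms /  1000 runs   (    0.09 ms per token, 10825.91 tokens per second)
llama_print_timings: prompt eval time =     586.92 ms /     9 tokens (   65.21 ms per token,    15.33 tokens per second)
llama_print_timings:        eval time =   48610.18 ms /   999 runs   (   48.66 ms per token,    20.55 tokens per second)
llama_print_timings:       total time =   49939.72 ms /  1008 tokens
"""

timings = parse_timings(SAMPLE_LOG)
eval_ms, eval_runs = timings["eval"]
print(f"eval: {tokens_per_second(eval_ms, eval_runs):.2f} tokens per second")
```

The `eval` line is the one that reflects generation speed; `prompt eval` covers only the 9 prompt tokens, so its rate is noisy with a prompt this short.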

3.7 dGPU (A770) SYCL

Run on machine No. 6 (virtual machine). Preparation:

a2@a2s:~$ source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
-bash: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments: 
:: ccl -- latest
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: oneAPI environment initialized ::
a2@a2s:~$ export ZES_ENABLE_SYSMAN=1
a2@a2s:~$ export USE_XETLA=OFF
a2@a2s:~$ export SYCL_CACHE_PERSISTENT=1
a2@a2s:~$ export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
a2@a2s:~$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 5 5600G with Radeon Graphics          OpenCL 3.0 (Build 0) [2024.18.7.0.11_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [24.22.29735.27]
[level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.29735]

Version:

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 --version
version: 1 (a07c32e)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
a2@a2s:~$ ./llama-cli-sycl-b3617-f16 --version
version: 1 (a07c32e)
built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu

Run model llama2-7B.q4, generation length 100 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed  = 1724493845
(part of the output omitted here)
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3820.94 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, but i promise it's worth it.
(part of the output omitted here)
llama_print_timings:        load time =    2066.71 ms
llama_print_timings:      sample time =       2.90 ms /   100 runs   (    0.03 ms per token, 34542.31 tokens per second)
llama_print_timings: prompt eval time =     180.84 ms /    10 tokens (   18.08 ms per token,    55.30 tokens per second)
llama_print_timings:        eval time =    2852.87 ms /    99 runs   (   28.82 ms per token,    34.70 tokens per second)
llama_print_timings:       total time =    3044.63 ms /   109 tokens
Log end

Run model llama2-7B.q4, generation length 200 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings:        load time =    2040.98 ms
llama_print_timings:      sample time =       5.98 ms /   200 runs   (    0.03 ms per token, 33450.41 tokens per second)
llama_print_timings: prompt eval time =     179.29 ms /    10 tokens (   17.93 ms per token,    55.78 tokens per second)
llama_print_timings:        eval time =    5765.54 ms /   199 runs   (   28.97 ms per token,    34.52 tokens per second)
llama_print_timings:       total time =    5968.10 ms /   209 tokens

Run model llama2-7B.q4, generation length 500 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings:        load time =    1994.74 ms
llama_print_timings:      sample time =      15.04 ms /   500 runs   (    0.03 ms per token, 33246.89 tokens per second)
llama_print_timings: prompt eval time =     177.09 ms /    10 tokens (   17.71 ms per token,    56.47 tokens per second)
llama_print_timings:        eval time =   14675.46 ms /   499 runs   (   29.41 ms per token,    34.00 tokens per second)
llama_print_timings:       total time =   14911.41 ms /   509 tokens

Run model llama2-7B.q4, generation length 1000 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings:        load time =    2071.28 ms
llama_print_timings:      sample time =      28.05 ms /  1000 runs   (    0.03 ms per token, 35646.81 tokens per second)
llama_print_timings: prompt eval time =     178.45 ms /    10 tokens (   17.85 ms per token,    56.04 tokens per second)
llama_print_timings:        eval time =   30044.60 ms /   999 runs   (   30.07 ms per token,    33.25 tokens per second)
llama_print_timings:       total time =   30329.49 ms /  1009 tokens
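
Across the four llama2-7B.q4 f32 runs above, the per-token latency creeps up as the generation gets longer. A minimal sketch that quantifies this from the `eval time` lines; the helper names are made up for this illustration. One plausible cause is that attention has to read a growing KV cache, but the logs do not confirm it:

```python
# (eval runs, eval time in ms) for -n 100, 200, 500, 1000, copied from the
# llama2-7B.q4 SYCL f32 runs above.
EVAL_RUNS = [
    (99, 2852.87),
    (199, 5765.54),
    (499, 14675.46),
    (999, 30044.60),
]


def ms_per_token(runs, total_ms):
    """Average latency of one generated token in milliseconds."""
    return total_ms / runs


def relative_slowdown(first, last):
    """Fractional increase in per-token latency from the first run to the last."""
    return ms_per_token(*last) / ms_per_token(*first) - 1.0


for runs, total_ms in EVAL_RUNS:
    print(f"{runs:4d} tokens: {ms_per_token(runs, total_ms):.2f} ms per token")

slowdown = relative_slowdown(EVAL_RUNS[0], EVAL_RUNS[-1])
print(f"slowdown from shortest to longest run: {slowdown:.1%}")
```

The increase is on the order of 4% between the 100-token and 1000-token runs, so for these lengths the generation speed is close to constant.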

Run model qwen2-7B.q8, generation length 100 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed  = 1724494148
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted here)
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  7165.44 MiB
llm_load_tensors:        CPU buffer size =   552.23 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1884.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    71.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story and I would like to ask for help.
(part of the output omitted here)
llama_print_timings:        load time =    9055.51 ms
llama_print_timings:      sample time =       8.22 ms /   100 runs   (    0.08 ms per token, 12158.05 tokens per second)
llama_print_timings: prompt eval time =     395.27 ms /     9 tokens (   43.92 ms per token,    22.77 tokens per second)
llama_print_timings:        eval time =    5195.18 ms /    99 runs   (   52.48 ms per token,    19.06 tokens per second)
llama_print_timings:       total time =    5679.84 ms /   108 tokens
Log end
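
The buffer sizes printed in the log above show how much of the A770's memory this run needs. A minimal sketch that adds up the three GPU-side buffers; it assumes the `16225M` in the device table means MiB, and it ignores anything the driver allocates on top:

```python
# Buffer sizes in MiB, copied from the qwen2-7B.q8 SYCL log above.
GPU_BUFFERS_MIB = {
    "model weights (SYCL0 buffer)": 7165.44,
    "KV cache (SYCL0 KV buffer)": 1792.00,
    "compute (SYCL0 compute buffer)": 1884.00,
}

# The device table reports "16225M"; treating that as MiB is an assumption.
DEVICE_MEM_MIB = 16225.0


def gpu_memory_summary(buffers, device_total):
    """Return (used MiB, fraction of device memory used)."""
    used = sum(buffers.values())
    return used, used / device_total


used, fraction = gpu_memory_summary(GPU_BUFFERS_MIB, DEVICE_MEM_MIB)
print(f"{used:.2f} MiB used, {fraction:.0%} of device memory")
```

Roughly two thirds of the card is in use, and about a third of that is KV cache plus compute buffer, both of which scale with the 32768-token default context.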

Run model qwen2-7B.q8, generation length 200 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings:        load time =    8413.38 ms
llama_print_timings:      sample time =      16.47 ms /   200 runs   (    0.08 ms per token, 12141.08 tokens per second)
llama_print_timings: prompt eval time =     405.85 ms /     9 tokens (   45.09 ms per token,    22.18 tokens per second)
llama_print_timings:        eval time =   10455.78 ms /   199 runs   (   52.54 ms per token,    19.03 tokens per second)
llama_print_timings:       total time =   11017.44 ms /   208 tokens

Run model qwen2-7B.q8, generation length 500 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings:        load time =    9179.45 ms
llama_print_timings:      sample time =      47.42 ms /   500 runs   (    0.09 ms per token, 10544.74 tokens per second)
llama_print_timings: prompt eval time =     402.42 ms /     9 tokens (   44.71 ms per token,    22.36 tokens per second)
llama_print_timings:        eval time =   26367.77 ms /   499 runs   (   52.84 ms per token,    18.92 tokens per second)
llama_print_timings:       total time =   27130.93 ms /   508 tokens

Run model qwen2-7B.q8, generation length 1000 (f32):

a2@a2s:~$ ./llama-cli-sycl-b3617-f32 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings:        load time =    9531.60 ms
llama_print_timings:      sample time =      96.63 ms /  1000 runs   (    0.10 ms per token, 10348.86 tokens per second)
llama_print_timings: prompt eval time =     401.50 ms /     9 tokens (   44.61 ms per token,    22.42 tokens per second)
llama_print_timings:        eval time =   53212.71 ms /   999 runs   (   53.27 ms per token,    18.77 tokens per second)
llama_print_timings:       total time =   54321.34 ms /  1008 tokens
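
The four runs above differ only in the `-n` value, so the command lines can be generated instead of typed. A minimal sketch that builds them using only the flags that appear in this section (`-m`, `-p`, `-ngl`, `-sm`, `-n`); the function name `build_command` is made up for this illustration, and the commands are printed, not executed:

```python
import shlex


def build_command(binary, model, n_predict, prompt, n_gpu_layers=33):
    """Build the llama-cli argument list used in the runs above."""
    return [
        binary,
        "-m", model,
        "-p", prompt,
        "-ngl", str(n_gpu_layers),
        "-sm", "none",
        "-n", str(n_predict),
    ]


PROMPT = "hello, this is a very very long story"
LENGTHS = [100, 200, 500, 1000]

commands = [
    build_command("./llama-cli-sycl-b3617-f32", "qwen2-7b-instruct-q8_0.gguf", n, PROMPT)
    for n in LENGTHS
]

for cmd in commands:
    # Print only; pass `cmd` to subprocess.run(cmd) to actually execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids quoting problems with the prompt, which contains spaces and a comma.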

Run model llama2-7B.q4, generation length 100 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed  = 1724494475
(part of the output omitted here)
llm_load_print_meta: max token length = 48
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3820.94 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   296.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    16.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 1
hello, this is a very very long story, and I hope you will read it, because I need to tell you something.
(part of the output omitted here)
llama_print_timings:        load time =    1866.40 ms
llama_print_timings:      sample time =       3.23 ms /   100 runs   (    0.03 ms per token, 30998.14 tokens per second)
llama_print_timings: prompt eval time =     187.70 ms /    10 tokens (   18.77 ms per token,    53.28 tokens per second)
llama_print_timings:        eval time =    2873.84 ms /    99 runs   (   29.03 ms per token,    34.45 tokens per second)
llama_print_timings:       total time =    3074.08 ms /   109 tokens
Log end

Run model llama2-7B.q4, generation length 200 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings:        load time =    1867.46 ms
llama_print_timings:      sample time =       5.99 ms /   200 runs   (    0.03 ms per token, 33411.29 tokens per second)
llama_print_timings: prompt eval time =     194.39 ms /    10 tokens (   19.44 ms per token,    51.44 tokens per second)
llama_print_timings:        eval time =    5783.95 ms /   199 runs   (   29.07 ms per token,    34.41 tokens per second)
llama_print_timings:       total time =    6003.07 ms /   209 tokens

Run model llama2-7B.q4, generation length 500 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings:        load time =    1909.92 ms
llama_print_timings:      sample time =      15.56 ms /   500 runs   (    0.03 ms per token, 32123.35 tokens per second)
llama_print_timings: prompt eval time =     186.10 ms /    10 tokens (   18.61 ms per token,    53.73 tokens per second)
llama_print_timings:        eval time =   14680.81 ms /   499 runs   (   29.42 ms per token,    33.99 tokens per second)
llama_print_timings:       total time =   14925.64 ms /   509 tokens

Run model llama2-7B.q4, generation length 1000 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings:        load time =    2017.43 ms
llama_print_timings:      sample time =      13.53 ms /   461 runs   (    0.03 ms per token, 34067.40 tokens per second)
llama_print_timings: prompt eval time =     189.74 ms /    10 tokens (   18.97 ms per token,    52.70 tokens per second)
llama_print_timings:        eval time =   13480.19 ms /   460 runs   (   29.30 ms per token,    34.12 tokens per second)
llama_print_timings:       total time =   13722.36 ms /   470 tokens

Run model qwen2-7B.q8, generation length 100 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 100
Log start
main: build = 1 (a07c32e)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1 (2024.2.1.20240711) for x86_64-unknown-linux-gnu
main: seed  = 1724494717
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from qwen2-7b-instruct-q8_0.gguf (version GGUF V3 (latest))
(part of the output omitted here)
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  7165.44 MiB
llm_load_tensors:        CPU buffer size =   552.23 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.29735|
llama_kv_cache_init:      SYCL0 KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1884.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    71.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 100, n_keep = 0
hello, this is a very very long story but I will try to make it as short as possible.
(part of the output omitted here)
llama_print_timings:        load time =    8893.71 ms
llama_print_timings:      sample time =       9.80 ms /   100 runs   (    0.10 ms per token, 10204.08 tokens per second)
llama_print_timings: prompt eval time =     295.81 ms /     9 tokens (   32.87 ms per token,    30.42 tokens per second)
llama_print_timings:        eval time =    5931.59 ms /    99 runs   (   59.91 ms per token,    16.69 tokens per second)
llama_print_timings:       total time =    6305.29 ms /   108 tokens
Log end

Run model qwen2-7B.q8, generation length 200 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 200
(part of the output omitted here)
llama_print_timings:        load time =    8474.98 ms
llama_print_timings:      sample time =      18.22 ms /   200 runs   (    0.09 ms per token, 10978.15 tokens per second)
llama_print_timings: prompt eval time =     298.85 ms /     9 tokens (   33.21 ms per token,    30.12 tokens per second)
llama_print_timings:        eval time =   11935.47 ms /   199 runs   (   59.98 ms per token,    16.67 tokens per second)
llama_print_timings:       total time =   12379.13 ms /   208 tokens

Run model qwen2-7B.q8, generation length 500 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 500
(part of the output omitted here)
llama_print_timings:        load time =    8836.66 ms
llama_print_timings:      sample time =      41.76 ms /   500 runs   (    0.08 ms per token, 11972.32 tokens per second)
llama_print_timings: prompt eval time =     304.28 ms /     9 tokens (   33.81 ms per token,    29.58 tokens per second)
llama_print_timings:        eval time =   30052.85 ms /   499 runs   (   60.23 ms per token,    16.60 tokens per second)
llama_print_timings:       total time =   30722.98 ms /   508 tokens

Run model qwen2-7B.q8, generation length 1000 (f16):

a2@a2s:~$ ./llama-cli-sycl-b3617-f16 -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -ngl 33 -sm none -n 1000
(part of the output omitted here)
llama_print_timings:        load time =    8206.19 ms
llama_print_timings:      sample time =      96.05 ms /  1000 runs   (    0.10 ms per token, 10411.24 tokens per second)
llama_print_timings: prompt eval time =     312.47 ms /     9 tokens (   34.72 ms per token,    28.80 tokens per second)
llama_print_timings:        eval time =   60716.89 ms /   999 runs   (   60.78 ms per token,    16.45 tokens per second)
llama_print_timings:       total time =   61768.29 ms /  1008 tokens
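
With the Vulkan run and both SYCL builds measured on the same machine, the qwen2-7B.q8 results can be put side by side. A minimal sketch that ranks the three `-n 100` eval rates copied from the logs above; the helper name is made up for this illustration:

```python
# Eval throughput in tokens per second for qwen2-7B.q8 with -n 100,
# copied from the Vulkan run before section 3.7 and the two SYCL runs in it.
EVAL_TPS = {
    "Vulkan": 20.82,
    "SYCL f32": 19.06,
    "SYCL f16": 16.69,
}


def relative_to_best(results):
    """Return {name: fraction of the best throughput}, fastest first."""
    best = max(results.values())
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return {name: tps / best for name, tps in ranked}


shares = relative_to_best(EVAL_TPS)
for name, share in shares.items():
    print(f"{name:9s} {EVAL_TPS[name]:6.2f} tokens/s  ({share:.0%} of best)")
```

For this model the f16 SYCL build generates more slowly than the f32 build, although its prompt evaluation is faster (about 30 versus 22 tokens per second). For llama2-7B.q4 the two builds are nearly identical (34.70 versus 34.45 tokens per second at `-n 100`).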

3.8 Windows (CPU) r5-5600g AVX2

Run on PC No. 6 (physical machine). Version:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64

Run model llama2-7B.q4, generation length 100:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1724480697
llama_print_timings:        load time =    1005.41 ms
llama_print_timings:      sample time =       4.11 ms /   100 runs   (    0.04 ms per token, 24354.60 tokens per second)
llama_print_timings: prompt eval time =     399.08 ms /    10 tokens (   39.91 ms per token,    25.06 tokens per second)
llama_print_timings:        eval time =    9688.39 ms /    99 runs   (   97.86 ms per token,    10.22 tokens per second)
llama_print_timings:       total time =   10110.42 ms /   109 tokens

Run model llama2-7B.q4, generation length 200:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
llama_print_timings:        load time =    1045.93 ms
llama_print_timings:      sample time =       8.82 ms /   200 runs   (    0.04 ms per token, 22673.17 tokens per second)
llama_print_timings: prompt eval time =     436.84 ms /    10 tokens (   43.68 ms per token,    22.89 tokens per second)
llama_print_timings:        eval time =   19960.35 ms /   199 runs   (  100.30 ms per token,     9.97 tokens per second)
llama_print_timings:       total time =   20439.79 ms /   209 tokens

Run model llama2-7B.q4, generation length 500:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
llama_print_timings:        load time =    1028.02 ms
llama_print_timings:      sample time =      18.32 ms /   500 runs   (    0.04 ms per token, 27300.03 tokens per second)
llama_print_timings: prompt eval time =     382.15 ms /    10 tokens (   38.22 ms per token,    26.17 tokens per second)
llama_print_timings:        eval time =   51622.99 ms /   499 runs   (  103.45 ms per token,     9.67 tokens per second)
llama_print_timings:       total time =   52107.10 ms /   509 tokens

Run model llama2-7B.q4, generation length 1000:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
llama_print_timings:        load time =    1241.78 ms
llama_print_timings:      sample time =      41.52 ms /  1000 runs   (    0.04 ms per token, 24084.78 tokens per second)
llama_print_timings: prompt eval time =     484.10 ms /    10 tokens (   48.41 ms per token,    20.66 tokens per second)
llama_print_timings:        eval time =  114393.05 ms /   999 runs   (  114.51 ms per token,     8.73 tokens per second)
llama_print_timings:       total time =  115084.29 ms /  1009 tokens

Run model qwen2-7B.q8, generation length 100:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
llama_print_timings:        load time =    1429.29 ms
llama_print_timings:      sample time =      15.21 ms /   100 runs   (    0.15 ms per token,  6572.89 tokens per second)
llama_print_timings: prompt eval time =     523.07 ms /     9 tokens (   58.12 ms per token,    17.21 tokens per second)
llama_print_timings:        eval time =   17786.69 ms /    99 runs   (  179.66 ms per token,     5.57 tokens per second)
llama_print_timings:       total time =   18409.82 ms /   108 tokens

Run model qwen2-7B.q8, generation length 200:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
llama_print_timings:        load time =    1424.62 ms
llama_print_timings:      sample time =      31.78 ms /   200 runs   (    0.16 ms per token,  6292.47 tokens per second)
llama_print_timings: prompt eval time =     564.79 ms /     9 tokens (   62.75 ms per token,    15.93 tokens per second)
llama_print_timings:        eval time =   36148.33 ms /   199 runs   (  181.65 ms per token,     5.51 tokens per second)
llama_print_timings:       total time =   36919.37 ms /   208 tokens

Run model qwen2-7B.q8, generation length 500:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
llama_print_timings:        load time =    1462.26 ms
llama_print_timings:      sample time =      80.31 ms /   500 runs   (    0.16 ms per token,  6225.64 tokens per second)
llama_print_timings: prompt eval time =     720.86 ms /     9 tokens (   80.10 ms per token,    12.49 tokens per second)
llama_print_timings:        eval time =   90566.92 ms /   499 runs   (  181.50 ms per token,     5.51 tokens per second)
llama_print_timings:       total time =   91801.55 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
llama_print_timings:        load time =    1439.21 ms
llama_print_timings:      sample time =     165.06 ms /  1000 runs   (    0.17 ms per token,  6058.48 tokens per second)
llama_print_timings: prompt eval time =     555.15 ms /     9 tokens (   61.68 ms per token,    16.21 tokens per second)
llama_print_timings:        eval time =  184706.64 ms /   999 runs   (  184.89 ms per token,     5.41 tokens per second)
llama_print_timings:       total time =  186313.82 ms /  1008 tokens
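
The per-token and tokens-per-second figures in these logs follow directly from the reported time and count. The sketch below recomputes them from the timing lines of the 100-token qwen2-7B.q8 run. The regular expression and the helper names (`parse_timings`, `tokens_per_second`) are illustrative assumptions written against the b3617 log format shown here; they are not part of llama.cpp.

```python
import re

# Timing lines copied from the qwen2-7B.q8, 100-token run above.
LOG = """\
llama_print_timings:        load time =    1429.29 ms
llama_print_timings:      sample time =      15.21 ms /   100 runs   (    0.15 ms per token,  6572.89 tokens per second)
llama_print_timings: prompt eval time =     523.07 ms /     9 tokens (   58.12 ms per token,    17.21 tokens per second)
llama_print_timings:        eval time =   17786.69 ms /    99 runs   (  179.66 ms per token,     5.57 tokens per second)
llama_print_timings:       total time =   18409.82 ms /   108 tokens
"""

# Hypothetical pattern for the log format in this article; other llama.cpp
# builds may print timings differently.
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time\s+=\s+(?P<ms>[\d.]+) ms"
    r"(?:\s+/\s+(?P<count>\d+)\s+(?:runs|tokens))?"
)


def parse_timings(log_text):
    """Map each timing name to (milliseconds, count); count is None for load time."""
    timings = {}
    for m in TIMING_RE.finditer(log_text):
        count = int(m.group("count")) if m.group("count") else None
        timings[m.group("name")] = (float(m.group("ms")), count)
    return timings


def tokens_per_second(ms, count):
    """Throughput = items processed / elapsed seconds."""
    return count / (ms / 1000.0)


timings = parse_timings(LOG)
for name in ("prompt eval", "eval"):
    ms, count = timings[name]
    print(f"{name}: {ms / count:.2f} ms per token, "
          f"{tokens_per_second(ms, count):.2f} tokens per second")
```

The same function can be applied to the output of any other run in this article by pasting its timing lines into `LOG`.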

3.9 Windows (GPU) A770 vulkan

Run on PC No. 6 (physical machine). Version:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64

Running model llama2-7B.q4, generation length 100:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1724482103
llama_print_timings:        load time =    3375.14 ms
llama_print_timings:      sample time =       4.04 ms /   100 runs   (    0.04 ms per token, 24764.74 tokens per second)
llama_print_timings: prompt eval time =     471.87 ms /    10 tokens (   47.19 ms per token,    21.19 tokens per second)
llama_print_timings:        eval time =    5913.11 ms /    99 runs   (   59.73 ms per token,    16.74 tokens per second)
llama_print_timings:       total time =    6408.49 ms /   109 tokens

Running model llama2-7B.q4, generation length 200:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
llama_print_timings:        load time =    2932.55 ms
llama_print_timings:      sample time =       8.03 ms /   200 runs   (    0.04 ms per token, 24915.91 tokens per second)
llama_print_timings: prompt eval time =     471.34 ms /    10 tokens (   47.13 ms per token,    21.22 tokens per second)
llama_print_timings:        eval time =   11931.98 ms /   199 runs   (   59.96 ms per token,    16.68 tokens per second)
llama_print_timings:       total time =   12452.04 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
llama_print_timings:        load time =    2913.84 ms
llama_print_timings:      sample time =      19.84 ms /   500 runs   (    0.04 ms per token, 25204.15 tokens per second)
llama_print_timings: prompt eval time =     471.64 ms /    10 tokens (   47.16 ms per token,    21.20 tokens per second)
llama_print_timings:        eval time =   30253.41 ms /   499 runs   (   60.63 ms per token,    16.49 tokens per second)
llama_print_timings:       total time =   30844.12 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
llama_print_timings:        load time =    2909.30 ms
llama_print_timings:      sample time =      40.91 ms /  1000 runs   (    0.04 ms per token, 24443.90 tokens per second)
llama_print_timings: prompt eval time =     471.58 ms /    10 tokens (   47.16 ms per token,    21.21 tokens per second)
llama_print_timings:        eval time =   61725.41 ms /   999 runs   (   61.79 ms per token,    16.18 tokens per second)
llama_print_timings:       total time =   62433.39 ms /  1009 tokens
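
Across the four llama2-7B.q4 runs above, the reported eval speed drops slightly as the generation length grows (16.74 down to 16.18 tokens per second). As a rough check, the sketch below recomputes the per-token cost from each run's eval time and run count and reports the relative change. The helper names are illustrative assumptions, and the logs themselves do not say what causes the change.

```python
# (requested length, eval time in ms, eval runs) copied from the four
# llama2-7B.q4 Vulkan runs above.
RUNS = [
    (100, 5913.11, 99),
    (200, 11931.98, 199),
    (500, 30253.41, 499),
    (1000, 61725.41, 999),
]


def ms_per_token(eval_ms, eval_runs):
    """Average decode cost of one generated token, in milliseconds."""
    return eval_ms / eval_runs


def relative_slowdown(runs):
    """Fractional increase in per-token cost from the first run to the last."""
    first = ms_per_token(runs[0][1], runs[0][2])
    last = ms_per_token(runs[-1][1], runs[-1][2])
    return last / first - 1.0


for length, eval_ms, eval_runs in RUNS:
    print(f"-n {length}: {ms_per_token(eval_ms, eval_runs):.2f} ms per token")
print(f"slowdown from -n 100 to -n 1000: {relative_slowdown(RUNS) * 100:.1f}%")
```

On these numbers the per-token cost rises by only a few percent between the shortest and the longest run, so decode speed is close to constant over this range.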

Running model qwen2-7B.q8, generation length 100:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
llama_print_timings:        load time =    4785.92 ms
llama_print_timings:      sample time =       9.08 ms /   100 runs   (    0.09 ms per token, 11016.86 tokens per second)
llama_print_timings: prompt eval time =     609.77 ms /     9 tokens (   67.75 ms per token,    14.76 tokens per second)
llama_print_timings:        eval time =    6401.98 ms /    99 runs   (   64.67 ms per token,    15.46 tokens per second)
llama_print_timings:       total time =    7100.18 ms /   108 tokens

Running model qwen2-7B.q8, generation length 200:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
llama_print_timings:        load time =    4783.54 ms
llama_print_timings:      sample time =      18.63 ms /   200 runs   (    0.09 ms per token, 10735.37 tokens per second)
llama_print_timings: prompt eval time =     610.60 ms /     9 tokens (   67.84 ms per token,    14.74 tokens per second)
llama_print_timings:        eval time =   12910.01 ms /   199 runs   (   64.87 ms per token,    15.41 tokens per second)
llama_print_timings:       total time =   13698.94 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
llama_print_timings:        load time =    4798.07 ms
llama_print_timings:      sample time =      46.32 ms /   500 runs   (    0.09 ms per token, 10794.47 tokens per second)
llama_print_timings: prompt eval time =     610.28 ms /     9 tokens (   67.81 ms per token,    14.75 tokens per second)
llama_print_timings:        eval time =   32517.07 ms /   499 runs   (   65.16 ms per token,    15.35 tokens per second)
llama_print_timings:       total time =   33565.60 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
llama_print_timings:        load time =    4802.01 ms
llama_print_timings:      sample time =      93.21 ms /   989 runs   (    0.09 ms per token, 10610.22 tokens per second)
llama_print_timings: prompt eval time =     610.76 ms /     9 tokens (   67.86 ms per token,    14.74 tokens per second)
llama_print_timings:        eval time =   64868.89 ms /   988 runs   (   65.66 ms per token,    15.23 tokens per second)
llama_print_timings:       total time =   66351.20 ms /   997 tokens
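
To put the Vulkan numbers next to the earlier qwen2-7B.q8 runs made with the win-avx2-x64 build, the sketch below divides the reported eval throughputs at each generation length. The values are copied from the logs above, and the dictionary and function names are illustrative assumptions. Rates are compared rather than totals, since the -n 1000 Vulkan run stopped after 988 eval runs.

```python
# Eval throughput (tokens per second) for qwen2-7B.q8, copied from the logs
# above and keyed by the -n value.
AVX2_TPS = {100: 5.57, 200: 5.51, 500: 5.51, 1000: 5.41}
VULKAN_TPS = {100: 15.46, 200: 15.41, 500: 15.35, 1000: 15.23}


def speedups(baseline, candidate):
    """Return {length: candidate / baseline} for lengths present in both."""
    return {n: candidate[n] / baseline[n] for n in sorted(baseline) if n in candidate}


for n, ratio in speedups(AVX2_TPS, VULKAN_TPS).items():
    print(f"-n {n}: {ratio:.2f}x")
```

For this model the ratio stays between roughly 2.7x and 2.9x at every length tested.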

(To be continued)




http://www.chinasem.cn/article/1109026
