-
That is because they are still trying to figure out how to allocate more than half of the physical memory for Metal.
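For context, a Metal device advertises two relevant limits: maxBufferLength (the largest single buffer it will allocate) and recommendedMaxWorkingSetSize (how much memory it suggests keeping resident). A minimal Objective-C sketch to print them on your machine (illustrative only, not llama.cpp code; the file name is mine):
```objc
// metal_limits.m
// build: clang -framework Foundation -framework Metal metal_limits.m -o metal_limits
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

int main(void) {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    if (device == nil) {
        fprintf(stderr, "no Metal device found\n");
        return 1;
    }
    // Largest single MTLBuffer this device will create.
    printf("maxBufferLength:              %llu MB\n",
           (unsigned long long) device.maxBufferLength / (1024 * 1024));
    // Memory Metal recommends keeping resident; how this relates to
    // physical RAM varies by machine.
    printf("recommendedMaxWorkingSetSize: %llu MB\n",
           (unsigned long long) device.recommendedMaxWorkingSetSize / (1024 * 1024));
    return 0;
}
```
For comparison, the ggml_metal_add_buffer allocations in the trace below total roughly 6.4 GB (3745.52 + 768 + 1026 + 512 + 512 MB), so a 7B Q4_0 model already sits close to these limits on smaller machines.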
-
I tried to load and run inference on llama-7b merged with a Chinese LoRA adapter in CLI mode, but got different completions with and without the '-ngl 1' option.
Output without the '-ngl' option:
> Capital of United States
Washington, D.C.
>
The corrupted output with '-ngl 1':
> Capital of United States
W胥其中包括zat droit胥胥其中包括胥胥胥其中包括胥胥胥胥胥zat胥胥zatstronom胥zat胥胥zat胥 varying胥其中包括胥胥胥胥胥胥zat胥胥胥其中包括其中包括胥胥胥胥胥胥胥胥胥胥胥 droit胥其中包括胥胥zatzat胥 varying其中包括胥胥胥胥avanozat胥胥胥胥胥胥zat胥胥胥其中包括胥胥胥其中包括胥胥胥zat胥胥胥胥胥zat胥胥胥胥胥其中包括胥胥胥胥胥胥胥胥胥其中包括zat胥胥胥zat胥胥胥胥其中包括胥zat胥胥胥胥 droit胥其中包括 varying胥其中包括zat胥胥胥胥胥胥胥avanozat胥胥其中包括胥胥胥胥胥其中包括胥胥胥胥胥胥胥胥其中包括胥zat胥胥其中包括胥胥胥zat其中包括其中包括胥胥胥胥胥胥胥胥其中包括胥其中包括胥胥胥胥zat胥其中包括胥其中包括胥zatzat胥胥其中包括胥胥 droit其中包括胥其中包括其中包括胥zat胥其中包括胥zat胥其中包括诙胥胥胥 varying胥zat其中包括胥胥胥胥胥其中包括胥胥胥其中包括胥胥胥其中包括胥胥zat胥胥胥胥 droitzat droit胥胥其中包括zat droit
>
Thanks in advance for any help.
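For reference, the command looked roughly like this (reconstructed from the settings in the trace below; the model path is taken from the log, the remaining flags are assumptions):
```sh
./main -m zh-models/7B/ggml-model-q4_0.bin -c 2048 -n 256 -t 4 --temp 0.2 -i -ngl 1
```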
The following is the model loading trace; does it look good?
main: build = 669 (9254920)
main: seed = 1686735230
llama.cpp: loading model from zh-models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5536.92 MB (+ 1026.00 MB per state)
...............................................................................................
llama_init_from_file: kv self size = 1024.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/wujianmin/bak-from-mac/Code/git/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x132904f10
ggml_metal_init: loaded kernel_mul 0x145e0aba0
ggml_metal_init: loaded kernel_mul_row 0x145e0b1e0
ggml_metal_init: loaded kernel_scale 0x145e0b700
ggml_metal_init: loaded kernel_silu 0x145e0bc20
ggml_metal_init: loaded kernel_relu 0x145e0c140
ggml_metal_init: loaded kernel_gelu 0x145e0c660
ggml_metal_init: loaded kernel_soft_max 0x145e0cd10
ggml_metal_init: loaded kernel_diag_mask_inf 0x145e0d370
ggml_metal_init: loaded kernel_get_rows_f16 0x132905790
ggml_metal_init: loaded kernel_get_rows_q4_0 0x132905f30
ggml_metal_init: loaded kernel_get_rows_q4_1 0x132906720
ggml_metal_init: loaded kernel_get_rows_q2_k 0x132906da0
ggml_metal_init: loaded kernel_get_rows_q3_k 0x145f04510
ggml_metal_init: loaded kernel_get_rows_q4_k 0x145f05360
ggml_metal_init: loaded kernel_get_rows_q5_k 0x145f059e0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x145e0d8d0
ggml_metal_init: loaded kernel_rms_norm 0x145e0e0a0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x145e0e900
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x145e0f270
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x145e0f950
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x145e10030
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x145e10730
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x145e10f90
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x145e11670
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x145e11d50
ggml_metal_init: loaded kernel_rope 0x145e12640
ggml_metal_init: loaded kernel_cpy_f32_f16 0x145e130d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x145e13960
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3745.52 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 768.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1026.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
... ...
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 256, n_keep = 21
-Jianmin