Qwen 千问
Mac
Mac configuration: M2 chip (ARM architecture).
Create the environment:

```shell
conda create -n qwen python=3.11 -y
conda activate qwen
python -m pip install -U pip setuptools wheel
pip install mlx mlx-lm huggingface_hub
```
Download:

```shell
python -m mlx_lm.chat --model Qwen/Qwen3.5-0.8B
python -m mlx_lm.chat --model Qwen/Qwen3.5-4B
```
Test:

```shell
mlx_lm.generate --model Qwen/Qwen3.5-0.8B --prompt "你好"
```

```shell
mlx_lm.generate --model Qwen/Qwen3.5-4B --prompt "你好"
```
Calling from Python:

With thinking enabled, the Qwen3.5-0.8B model easily gets stuck in an infinite reasoning loop, so it is best to keep thinking off.

```python
from mlx_lm import load, generate

MODEL_ID = "Qwen/Qwen3.5-0.8B"

model, tokenizer = load(MODEL_ID)

messages = [
    {"role": "user", "content": "你好,请用一句话介绍你自己。"},
]

# Disable thinking so the small model does not loop in its reasoning trace.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

text = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=256,
    verbose=False,
)

print(text)
```
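For intuition about what `apply_chat_template` produces: Qwen's chat template follows the ChatML convention (`<|im_start|>` / `<|im_end|>` markers). The sketch below hand-builds such a prompt purely for illustration; the real template (including how `enable_thinking` is encoded) is defined by the tokenizer, so treat this as an approximation, not the actual template. `build_chatml_prompt` is a name invented here, not an mlx-lm API.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Approximate ChatML-style prompt, for illustration only."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Cue the model to begin writing the assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [{"role": "user", "content": "你好,请用一句话介绍你自己。"}]
print(build_chatml_prompt(messages))
```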
The Qwen3.5-4B model has thinking enabled by default. Since the thinking trace can get very long, be sure to set `max_tokens` generously, otherwise the output will stop partway through:

```python
from mlx_lm import load, generate

MODEL_ID = "Qwen/Qwen3.5-4B"

model, tokenizer = load(MODEL_ID)

messages = [
    {"role": "user", "content": "你好,请用一句话介绍你自己。"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Thinking is on by default, so leave plenty of room for the reasoning trace.
text = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=16384,
    verbose=False,
)

print(text)
```
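One way to detect that `max_tokens` was too small: with thinking on, the final answer only appears after the closing `</think>` tag, so generated text without that tag was almost certainly cut off mid-thought. A minimal heuristic check (`thinking_closed` is a hypothetical helper name, and depending on the template the opening `<think>` tag may live in the prompt rather than the output, so only the closing tag is checked):

```python
def thinking_closed(text: str) -> bool:
    """Heuristic: if "</think>" is missing from a thinking-mode generation,
    it likely hit max_tokens before producing the final answer."""
    return "</think>" in text

print(thinking_closed("<think>reasoning…</think>你好!"))  # True: answer was reached
print(thinking_closed("<think>still reasoning"))          # False: likely truncated
```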
If you only want the final answer after the thinking phase:

```python
from mlx_lm import load, generate

MODEL_ID = "Qwen/Qwen3.5-4B"

model, tokenizer = load(MODEL_ID)

messages = [
    {"role": "user", "content": "你好,请用一句话介绍你自己。"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

text = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=16384,
    verbose=False,
)

# The reasoning trace ends with a closing </think> tag; everything after
# it is the final answer.
if "</think>" in text:
    thinking, answer = text.split("</think>", 1)
else:
    thinking, answer = "", text

print(answer)
```
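The split at the end of the block above can be factored into a small reusable helper. `split_thinking` is a name chosen here, not an mlx-lm API; `strip()` and the `<think>`-prefix removal are added because the trace and answer are usually padded with newlines around the tags.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Separate a thinking-mode generation into (thinking, answer).

    thinking is "" when no </think> tag is present (thinking disabled,
    or the output was truncated mid-thought).
    """
    if "</think>" in text:
        thinking, answer = text.split("</think>", 1)
        thinking = thinking.removeprefix("<think>")  # Python 3.9+
    else:
        thinking, answer = "", text
    return thinking.strip(), answer.strip()

thinking, answer = split_thinking("<think>\nLet me think.\n</think>\n你好!")
print(answer)  # → 你好!
```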