Deploy a DeepSeek distilled model inference service

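To address the increasingly demanding GPU requirements of DeepSeek inference services, you can use an ACK Edge cluster to manage the GPU machines in your on-premises IDC and use the cluster's virtual nodes to quickly access ACS Serverless GPU computing power in the cloud. With this solution, inference tasks run on IDC GPUs first. When on-premises IDC GPU resources are insufficient, tasks are automatically scheduled to ACS Serverless GPUs in the cloud, which meets business scaling needs while reducing costs.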

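Background

DeepSeek-R1 model

DeepSeek-R1 is the first-generation reasoning model released by DeepSeek. It is designed to improve the reasoning ability of large language models through large-scale reinforcement learning. Experimental results show that DeepSeek-R1 performs strongly on tasks such as mathematical reasoning and competitive programming: it outperforms other closed-source models and, on some tasks, approaches or exceeds the OpenAI o1 series. DeepSeek-R1 also does well on knowledge tasks and a broad range of other tasks, including creative writing and general question answering. DeepSeek has also distilled this reasoning capability into smaller models by fine-tuning existing models such as Qwen and Llama. The distilled 14B model significantly outperforms the open-source QwQ-32B-Preview model, and the distilled 32B and 70B models set new records on reasoning benchmarks among dense models. For more information about DeepSeek models, see the DeepSeek AI GitHub repository.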

vLLM

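vLLM is an efficient, easy-to-use inference and serving framework for large language models. It supports many popular models, including Qwen (Tongyi Qianwen). vLLM achieves high inference efficiency through optimizations such as PagedAttention, continuous batching, and model quantization. For more information about vLLM, see the vLLM GitHub repository.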

Arena

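Arena is a lightweight, Kubernetes-based machine learning solution that covers the full lifecycle of data preparation, model development, model training, and model inference, improving the productivity of data scientists. It is also deeply integrated with Alibaba Cloud's basic cloud services, supports services such as GPU sharing and CPFS, and can run deep learning frameworks optimized by Alibaba Cloud to get the best performance and cost efficiency from Alibaba Cloud's heterogeneous devices. For more information about Arena, see the Arena GitHub repository.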

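Solution overview

Architecture

This solution builds on the cloud-edge integrated management capability of ACK Edge clusters: the Kubernetes control plane is hosted in the cloud, and IDC machines join as data-plane nodes of the cluster. IDC machines are thereby managed as containerized Kubernetes nodes, and the cluster's virtual nodes provide quick access to ACS Serverless GPU computing power in the cloud. Cloud and on-premises compute resources are managed in one place, and compute tasks are allocated dynamically between them.

Connect on-premises IDC resources to the cloud VPC over a leased line.
Add on-premises IDC resources to the ACK Edge cluster as edge nodes so that IDC workloads are managed and scheduled centrally from the cloud.
Configure a custom scheduling policy (ResourcePolicy) for the workload so that tasks are scheduled to on-premises IDC resources first and to cloud virtual nodes only when on-premises resources are insufficient.
Configure an HPA (Horizontal Pod Autoscaler) for the workload so that scale-out is triggered automatically when resource usage reaches the threshold.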

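Advantages

Extreme elasticity: provides large-scale scaling within seconds to handle traffic peaks.
Fine-grained cost control: no need to purchase your own servers; pay-as-you-go billing keeps costs transparent and controllable.
Diverse elastic resources: supports different instance types, such as CPU and GPU.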

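Prerequisites

Select a region as the central region and create an ACK Edge cluster in that region.
Install the ack-virtual-node component.
Install the ack-kserve component.
Configure the Arena client.
Deploy the monitoring component and configure GPU monitoring metrics.
Create an edge node pool of the dedicated network type and add the IDC resources to it.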

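Procedure

Step 1: Prepare the DeepSeek-R1-Distill-Qwen-7B model files

Note

Downloading and uploading the model files usually takes 1 to 2 hours. You can submit a ticket to have the model files copied to your OSS bucket quickly.

Run the following commands to download the DeepSeek-R1-Distill-Qwen-7B model from ModelScope.
Note
Make sure the git-lfs plugin is installed. If it is not, run yum install git-lfs or apt-get install git-lfs to install it. For other installation methods, see Install git-lfs.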
```
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B.git
cd DeepSeek-R1-Distill-Qwen-7B/
git lfs pull
```
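Because the clone above uses `GIT_LFS_SKIP_SMUDGE=1`, the large weight files exist only as small Git LFS pointer stubs until `git lfs pull` completes. The following is a minimal sketch of a pre-upload check, assuming the standard Git LFS pointer format (a text file whose first line begins with `version https://git-lfs.github.com/spec/v1`). The helper name `find_lfs_pointers` is hypothetical and not part of any tool used in this document.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (standard pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as f:
            head = f.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("DeepSeek-R1-Distill-Qwen-7B")` after `git lfs pull` should return an empty list. Any path it returns is a file whose real content has not been downloaded yet and should not be uploaded to OSS.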

Create a directory in OSS and upload the model to it.

Note

For how to install and use ossutil, see Install ossutil.

ossutil mkdir oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B
ossutil cp -r ./DeepSeek-R1-Distill-Qwen-7B oss://<your-bucket-name>/models/DeepSeek-R1-Distill-Qwen-7B

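Create a PV and a PVC. Configure a persistent volume (PV) and a persistent volume claim (PVC), both named llm-model, for the target cluster. For details, see Mount a statically provisioned OSS volume.

Basic configuration of the sample PV:

Parameter	Description
Volume type	OSS
Name	llm-model
Access certificate	The AccessKey ID and AccessKey Secret used to access OSS.
Bucket ID	Select the OSS bucket created in the previous step.
OSS Path	The path where the model is stored, for example `/models/DeepSeek-R1-Distill-Qwen-7B`.

Basic configuration of the sample PVC:

Parameter	Description
Claim type	OSS
Name	llm-model
Allocation mode	Select Existing Volumes.
Existing volumes	Click the Select Existing Volumes link and select the PV you created.

Sample YAML: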

apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: <your-oss-ak> # AccessKey ID used to access OSS
  akSecret: <your-oss-sk> # AccessKey Secret used to access OSS
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi 
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <your-bucket-name> # Bucket name
      url: <your-bucket-endpoint> # Endpoint, for example oss-cn-hangzhou-internal.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: <your-model-path> # /models/DeepSeek-R1-Distill-Qwen-7B/ in this example
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model

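Step 2: Create a custom scheduling policy (ResourcePolicy)

Create a ResourcePolicy custom resource to define priority-based scheduling rules for elastic resources. In this example, the selector matches applications labeled app: isvc.deepseek-predictor. The rule states that matching Pods are scheduled to the edge IDC resource pool first and to cloud virtual nodes when edge IDC resources are insufficient. For more on ResourcePolicy, see Custom elastic resource priority scheduling.

Important

Pods created later must carry a label that matches the selector below so that they are associated with the scheduling policy defined here.

Create the ResourcePolicy and save it as nginx-resourcepolicy.yaml.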

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: deepseek
  namespace: default
spec:
  selector:
    app: isvc.deepseek-predictor # Must match the label of the Pods created later.
  strategy: prefer
  units:
  - resource: ecs
    nodeSelector:
      alibabacloud.com/nodepool-id: np*********  # ID of the edge node pool.
  - resource: eci

Deploy the custom scheduling policy to the cluster to define the scheduling priority.
```
kubectl create -f nginx-resourcepolicy.yaml
```
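The effect of the `prefer` strategy with ordered `units` can be pictured as filling the first unit until it runs out of capacity and then spilling over to the next. The sketch below is a deliberately simplified model of that ordering, not the scheduler's actual algorithm. The function `place_replicas` and the capacity numbers are hypothetical and serve only as an illustration.

```python
def place_replicas(replicas, units):
    """Assign replicas to units in priority order.

    units: list of (name, free_gpu_slots) in priority order;
           a capacity of None is treated as effectively unbounded.
    Returns a dict mapping unit name -> number of replicas placed there.
    """
    placement = {}
    remaining = replicas
    for name, capacity in units:
        if remaining == 0:
            break
        take = remaining if capacity is None else min(remaining, capacity)
        if take:
            placement[name] = take
            remaining -= take
    if remaining:
        # Nothing left to spill over to: these replicas would stay Pending.
        placement["pending"] = remaining
    return placement


# One free GPU in the IDC pool (assumed), cloud capacity treated as unbounded.
print(place_replicas(3, [("idc-nodepool", 1), ("virtual-node", None)]))
```

With one free GPU in the IDC pool and three requested replicas, the model places one replica on the IDC pool and two on the virtual node, which is the distribution this walkthrough ends up with after scale-out.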

Step 3: Deploy the model

Check the status of the nodes in the cluster.

kubectl get nodes -owide

Expected output:

NAME                            STATUS   ROLES    AGE     VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                              KERNEL-VERSION           CONTAINER-RUNTIME
cn-hangzhou.10.4.XX.25           Ready    <none>   10d     v1.30.7-aliyun.1   10.4.0.25     <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
cn-hangzhou.10.4.XX.26           Ready    <none>   10d     v1.30.7-aliyun.1   10.4.0.26     <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
idc001                           Ready    <none>   31s     v1.30.7-aliyun.1   10.4.0.185    <none>        Alibaba Cloud Linux 3.2104 U11 (OpenAnolis Edition)   5.10.134-18.al8.x86_64   containerd://1.6.36
virtual-kubelet-cn-hangzhou-b    Ready    agent    7d21h   v1.30.7-aliyun.1   10.4.0.180    <none>        <unknown>                                             <unknown>                <unknown>

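The output shows that the cluster contains one IDC node (idc001) and one virtual node (virtual-kubelet-cn-hangzhou-b). The IDC node has one V100 GPU.

Deploy the DeepSeek model inference service on the vLLM inference framework.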

arena serve kserve \
    --name=deepseek \
    --annotation=k8s.aliyun.com/eci-use-specs=ecs.gn6e-c12g1.3xlarge \
    --annotation=k8s.aliyun.com/eci-vswitch=vsw-*********,vsw-********* \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6 \
    --gpus=1 \
    --cpu=4 \
    --memory=12Gi \
    --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
    --scale-target=50 \
    --min-replicas=1  \
    --max-replicas=3  \
    --data=llm-model:/model/DeepSeek-R1-Distill-Qwen-7B \
    "vllm serve /model/DeepSeek-R1-Distill-Qwen-7B --port 8080 --trust-remote-code --served-model-name deepseek-r1 --max-model-len 32768 --gpu-memory-utilization 0.95 --enforce-eager --dtype=half"

The main parameters are described below:

Parameter	Description	Example
`--name`	Name of the inference service to submit. Must be globally unique.	deepseek
`--image`	Image address of the inference service. This example uses the vLLM inference framework.	kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.6.6
`--gpus`	Number of GPUs the inference service requires. Default: 0.	1
`--cpu`	Number of CPU cores the inference service requires.	4
`--memory`	Amount of memory the inference service requires.	12Gi
`--scale-metric`	Metric used for autoscaling. This example scales on the GPU utilization metric `DCGM_CUSTOM_PROCESS_SM_UTIL`. For more metrics, see Configure HPA.	DCGM_CUSTOM_PROCESS_SM_UTIL
`--scale-target`	Autoscaling target. Replicas are scaled out when GPU utilization exceeds 50%.	50
`--min-replicas`	Minimum number of replicas.	1
`--max-replicas`	Maximum number of replicas.	3
`--data`	Location of the model. In this example the model is stored in the llm-model PVC, which is mounted at /model/DeepSeek-R1-Distill-Qwen-7B in the container.	llm-model:/model/DeepSeek-R1-Distill-Qwen-7B

Expected output:

WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/bingchang/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/bingchang/.kube/config
horizontalpodautoscaler.autoscaling/deepseek-hpa created
inferenceservice.serving.kserve.io/deepseek created
INFO[0002] The Job deepseek has been submitted successfully
INFO[0002] You can run `arena serve get deepseek --type kserve -n default` to check the job status
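The `--dtype=half` and `--gpu-memory-utilization 0.95` flags passed to `vllm serve` in this step bound how much GPU memory the service can use. As a rough capacity check, half precision stores 2 bytes per parameter, and whatever remains inside the utilization budget after loading the weights is available for activations and the KV cache. The sketch below does this arithmetic; the parameter count and GPU memory size are assumptions chosen for illustration, not values taken from this document.

```python
GIB = 1024 ** 3


def weight_gib(num_params, bytes_per_param=2):
    """Approximate weight size in GiB; half precision stores 2 bytes per parameter."""
    return num_params * bytes_per_param / GIB


def remaining_budget_gib(gpu_gib, utilization, weights_gib):
    """GiB left for activations and KV cache inside the memory budget."""
    return gpu_gib * utilization - weights_gib


# Assumed values for illustration only (not stated in this document):
ASSUMED_PARAMS = 7.6e9   # rough parameter count of a "7B" Qwen-based model
ASSUMED_GPU_GIB = 32     # V100 cards ship with 16 GB or 32 GB; check your hardware

weights = weight_gib(ASSUMED_PARAMS)
budget = remaining_budget_gib(ASSUMED_GPU_GIB, 0.95, weights)
print(f"weights ~ {weights:.1f} GiB, left for KV cache and activations ~ {budget:.1f} GiB")
```

Under these assumptions the weights alone take a little over 14 GiB, so a 16 GB card would leave very little room for the KV cache at `--max-model-len 32768`. If the service fails to start on smaller GPUs, lowering `--max-model-len` is one option to consider.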

View the details of the inference service.

arena serve get deepseek

Expected output:

Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    1
Available:  1
Age:        1m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        1


Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-6b9455f8c5-wl5lc  Running  1m   1/1    0         1    idc001

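The output shows that the Pod of the inference service was scheduled to the IDC node, which matches the priority defined in the custom scheduling policy.

Send the following request to verify that the inference service is deployed successfully. You can get the request address from the details of the Ingress resource that KServe creates automatically.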

curl -H "Host: deepseek-default.example.com" -H "Content-Type: application/json" http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"chatcmpl-efc1225ad2f33cc39a8ddbc4039a41b9","object":"chat.completion","created":1739861087,"model":"deepseek-r1","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, so I need to figure out how to say \"This is a test!\" in Spanish. Hmm, I'm not super fluent in Spanish, but I know some basic phrases. Let me think about how to approach this.\n\nFirst, I remember that \"test\" is \"prueba\" in Spanish. So maybe I can start with \"Esto es una prueba.\" But I'm not sure if that's the best way to say it. Maybe there's a more common expression or a different structure.\n\nWait, I think there's a phrase that's commonly used in tests. Isn't it something like \"This is a test.\" or \"This is a quiz.\"? I think the Spanish equivalent would be \"Este es un test.\" That sounds more natural. Let me check if that makes sense.\n\nI can also think about how people use phrases in tests. Maybe they use \"This is the test\" or \"This is an exam.\" So perhaps \"Este es el test.\" or \"Este es el examen.\" I'm not sure which one is more appropriate.\n\nI should also consider the grammar. \"This is a test\" is a simple statement, so the subject is \"this\" (using \"este\"), the verb is \"is\" (using \"es\"), and the object is \"a test\" (using \"un test\"). So putting it together, it would be \"Este es un test.\"\n\nWait, but sometimes people use \"This is the test\" when referring to an important one, so maybe \"Este es el test.\" But I'm not entirely sure if that's the correct structure. Let me think about other similar phrases.\n\nI also recall that in some contexts, people might say \"This is a practice test\" or \"This is a sample test.\" But since the user just said \"This is a test,\" the most straightforward translation would be \"Este es un test.\"\n\nI should also consider if there are any idiomatic expressions or common phrases that are used in this context. 
For example, \"This is the test\" is often used to mean a significant exam or evaluation, so \"Este es el test\" might be more appropriate in that context.\n\nBut I'm a bit confused because I'm not 100% sure about the correct structure. Maybe I should look up some examples. Oh, wait, I can't look things up right now, so I'll have to rely on my memory.\n\nI think the basic structure is subject + verb + object. So \"this\" (this is \"este","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":11,"total_tokens":523,"completion_tokens":512,"prompt_tokens_details":null},"prompt_logprobs":null}
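The same verification request can also be sent from Python. The following is a minimal sketch using only the standard library. It mirrors the curl call: the `Host` header carries the Ingress host name while the URL points at the node IP and NodePort. The function names `build_chat_request` and `chat` are hypothetical, and the base URL must be replaced with your own `<idc-node-ip>:<ingress-svc-nodeport>`.

```python
import json
import urllib.request


def build_chat_request(base_url, host, prompt, model="deepseek-r1", max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # The Host header selects the Ingress rule, as in the curl example.
        headers={"Host": host, "Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, host, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, host, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("http://<idc-node-ip>:<ingress-svc-nodeport>", "deepseek-default.example.com", "Say this is a test!")` returns the `content` field of the first choice in the response shown above.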

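Step 4: Simulate peak traffic to trigger scale-out to the cloud

Use the load-testing tool hey to send a large number of requests to the deployed inference service.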

hey -z 5m -c 5 \
-m POST -host deepseek-default.example.com \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Say this is a test!"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10}' \
http://<idc-node-ip>:<ingress-svc-nodeport>/v1/chat/completions

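These requests are sent to the existing Pod. Because of the high request volume, GPU utilization rises above the 50% threshold, which triggers Pod scale-out.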
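How many replicas the HPA asks for follows the standard Kubernetes rule: desired = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below applies that rule to the values used in this walkthrough (target 50, 1 to 3 replicas). The 10% tolerance is the Kubernetes default and is assumed here, and the function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=3, tolerance=0.1):
    """Replica count the standard HPA formula would request."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# One replica at 90% GPU utilization against a 50% target.
print(desired_replicas(1, 90, 50))
```

Under the assumed default tolerance, utilization only slightly above 50% (for example 52%) does not yet trigger scale-out, and the result never exceeds `--max-replicas` no matter how high utilization climbs.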

View the details of the inference service.

arena serve get deepseek

Expected output:

Name:       deepseek
Namespace:  default
Type:       KServe
Version:    1
Desired:    3
Available:  2
Age:        18m
Address:    http://deepseek-default.example.com
Port:       :80
GPU:        3


Instances:
  NAME                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                 ------   ---  -----  --------  ---  ----
  deepseek-predictor-6b9455f8c5-dtzdv  Running  1m   0/1    0         1    virtual-kubelet-cn-hangzhou-h
  deepseek-predictor-6b9455f8c5-wl5lc  Running  18m  1/1    0         1    idc001
  deepseek-predictor-6b9455f8c5-zmpg8  Running  5m   1/1    0         1    virtual-kubelet-cn-hangzhou-h

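At this point, two additional Pod replicas of the inference service have been scaled out on the virtual node. One of them is still starting up (READY 0/1).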
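To confirm where replicas landed without reading the table by eye, the Instances section of the `arena serve get` output can be tallied per node. The sketch below assumes the column layout shown in the expected output above (NAME, STATUS, AGE, READY, RESTARTS, GPU, NODE). Other Arena versions may format the table differently, and the function name `replicas_by_node` is hypothetical.

```python
from collections import Counter


def replicas_by_node(arena_output):
    """Count Running instances per node from `arena serve get` text output."""
    counts = Counter()
    in_table = False
    for line in arena_output.splitlines():
        fields = line.split()
        if not in_table:
            # The instance rows start after the dashed separator row.
            in_table = bool(fields) and all(set(f) == {"-"} for f in fields)
            continue
        if len(fields) >= 7:
            status, node = fields[1], fields[6]
            if status == "Running":
                counts[node] += 1
    return dict(counts)
```

Applied to the output above, this yields one replica on idc001 and two on virtual-kubelet-cn-hangzhou-h. The text can be captured with, for example, `subprocess.run(["arena", "serve", "get", "deepseek"], capture_output=True, text=True).stdout`.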
