Unified alert management for multiple clusters

  • Monitoring management
  • Published 2025-04-18

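With unified multi-cluster alert management, you can configure or modify alert rules on the Fleet instance. The Fleet instance distributes the rules to the associated clusters you specify and keeps the rules consistent across those clusters. When a new cluster is associated, the Fleet instance synchronizes the alert rules to it automatically. This topic describes how to set up unified alert management for the clusters in a fleet.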

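Prerequisites

  • Fleet management is enabled.

  • Two clusters (a service provider cluster and a service consumer cluster) are associated with the Fleet instance of the fleet.

  • The latest version of Alibaba Cloud CLI is installed and configured.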

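Background information

When all clusters in a multi-cluster setup share the same alert rule configuration, logging on to the console and editing each cluster one by one is tedious and risks leaving the configurations inconsistent. With the unified multi-cluster alert management feature of the Fleet instance, administrators define alert rules in one central place, including the anomaly types that trigger alerts and the contacts to notify. For more information, see Container Service alert management. The following figure shows the architecture of unified multi-cluster alert management: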

[Figure: architecture of unified multi-cluster alert management]

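Step 1: Create alert contacts and contact groups

Follow the steps below to create alert contacts and contact groups. You need to create them only once; they are shared by all Container Service clusters.

  1. Log on to the Container Service management console and choose Clusters in the left-side navigation pane.

  2. On the Clusters page, click the name of the target cluster, and then choose Operations > Alerts in the left-side navigation pane.

  3. On the Alerts page, click Start Installation. The console checks the prerequisites automatically and installs or upgrades the required components.

  4. On the Alerts page, create a contact and a contact group as follows.

    1. Click the Alert Contacts tab, and then click Create.

    2. On the Create Alert Contact page, enter the name, phone number, and email address, and then click OK.

      After the contact is created, the contact receives an activation text message or email. Follow the instructions in it to activate the contact.

    3. Click the Alert Contact Groups tab, and then click Create.

    4. On the Create Alert Contact Group page, enter a group name, select the contacts for the group, and then click OK.

      When selecting contacts, move contacts from the available list to the selected list. You can also remove contacts that are already selected.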

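Step 2: Obtain the alert contact group ID

  1. Run the following Alibaba Cloud CLI command to query contact groups and obtain their internal IDs in other cloud services. The IDs are needed later when you configure the alert rule.

    aliyun cs GET /alert/contact_groups

    Expected output: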

    {
        "contact_groups": [
            {
                "ali_uid": 14783****,
                "binding_info": "{\"sls_id\":\"ack_14783****_***\",\"cms_contact_group_name\":\"ack_Default Contact Group\",\"arms_id\":\"1****\"}",
                "contacts": null,
                "created": "2021-07-21T12:18:34+08:00",
                "group_contact_ids": [
                    2***
                ],
                "group_name": "Default Contact Group",
                "id": 3***,
                "updated": "2022-09-19T19:23:57+08:00"
            }
        ],
        "page_info": {
            "page_number": 1,
            "page_size": 100,
            "total_count": 1
        }
    }
  2. Extract the following values from the query result to build the contactGroups field.

    contactGroups:
    - arms_contact_group_id: "1****"                       # From contact_groups.binding_info.arms_id in the previous query result.
      cms_contact_group_name: ack_Default Contact Group    # From contact_groups.binding_info.cms_contact_group_name in the previous query result.
      id: "3***"                                           # From contact_groups.id in the previous query result.
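
The extraction in this step can also be scripted. The following is a minimal Python sketch, not part of the product tooling: `build_contact_groups` is a hypothetical helper name, and the sample input uses made-up IDs in place of the masked values shown above. It assumes the CLI output has the shape of the expected output, where `binding_info` is a JSON document stored as a string.

```python
import json


def build_contact_groups(cli_output):
    """Turn the JSON printed by `aliyun cs GET /alert/contact_groups`
    into a list of entries for the contactGroups field."""
    entries = []
    for group in json.loads(cli_output)["contact_groups"]:
        # binding_info is a JSON string nested inside the JSON response.
        binding = json.loads(group["binding_info"])
        entries.append({
            "arms_contact_group_id": str(binding["arms_id"]),
            "cms_contact_group_name": binding["cms_contact_group_name"],
            "id": str(group["id"]),
        })
    return entries


# Hypothetical sample with made-up IDs (real output contains your own IDs).
sample_output = json.dumps({
    "contact_groups": [{
        "binding_info": json.dumps({
            "sls_id": "ack_100_example",
            "cms_contact_group_name": "ack_Default Contact Group",
            "arms_id": "12345",
        }),
        "group_name": "Default Contact Group",
        "id": 3001,
    }],
})

contact_groups = build_contact_groups(sample_output)
print(contact_groups)
```

Both IDs are converted to strings because the contactGroups example above quotes them.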

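Step 3: Create the alert rule

Use the following template to create the alert rule. The template includes every alert rule that Container Service ACK supports. The steps below enable the error-events alert rule as an example.

Note
  • The alert rule must be named default and must be in the kube-system namespace. For a detailed description of the rules, see Container Service alert management.

  • Creating the alert rule on the Fleet instance does not make it take effect. You must also create a distribution rule that distributes the alert rule to the associated clusters, where it then takes effect.

  1. In the error-events alert rule, set rules.enable to enable.

  2. Add the contactGroups field built in the previous step, and save the modified template as ackalertrule.yaml.

  3. Run kubectl apply -f ackalertrule.yaml to create the alert rule on the Fleet instance.

    The alert rule template is as follows: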

    apiVersion: alert.alibabacloud.com/v1beta1
    kind: AckAlertRule
    metadata:
      name: default
      namespace: kube-system
    spec:
      groups:
      - name: error-events
        rules:
        - enable: enable
          contactGroups:
          - arms_contact_group_id: "1****"
            cms_contact_group_name: ack_Default Contact Group
            id: "3***"
          expression: sls.app.ack.error
          name: error-event
          notification:
            message: kubernetes cluster error event.
          type: event
      - name: warn-events
        rules:
        - enable: disable
          expression: sls.app.ack.warn
          name: warn-event
          notification:
            message: kubernetes cluster warn event.
          type: event
      - name: cluster-core-error
        rules:
        - enable: disable
          expression: prom.apiserver.notHealthy.down
          name: apiserver-unhealthy
          notification:
            message: "Cluster APIServer not healthy. \nPromQL: ((sum(up{job=\"apiserver\"})
              <= 0) or (absent(sum(up{job=\"apiserver\"})))) > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.etcd.notHealthy.down
          name: etcd-unhealthy
          notification:
            message: "Cluster ETCD not healthy. \nPromQL: ((sum(up{job=\"etcd\"}) <= 0)
              or (absent(sum(up{job=\"etcd\"})))) > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.scheduler.notHealthy.down
          name: scheduler-unhealthy
          notification:
            message: "Cluster Scheduler not healthy. \nPromQL: ((sum(up{job=\"ack-scheduler\"})
              <= 0) or (absent(sum(up{job=\"ack-scheduler\"})))) > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.kcm.notHealthy.down
          name: kcm-unhealthy
          notification:
            message: "Cluster kube-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-kube-controller-manager\"})
              <= 0) or (absent(sum(up{job=\"ack-kube-controller-manager\"})))) > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.ccm.notHealthy.down
          name: ccm-unhealthy
          notification:
            message: "Cluster cloud-controller-manager not healthy. \nPromQL: ((sum(up{job=\"ack-cloud-controller-manager\"})
              <= 0) or (absent(sum(up{job=\"ack-cloud-controller-manager\"})))) > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.coredns.notHealthy.requestdown
          name: coredns-unhealthy-requestdown
          notification:
            message: "Cluster CoreDNS not healthy, continuously request down. \nPromQL:
              (sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or
              (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)"
          type: metric-prometheus
        - enable: disable
          expression: prom.coredns.notHealthy.panic
          name: coredns-unhealthy-panic
          notification:
            message: "Cluster CoreDNS not healthy, continuously panic. \nPromQL: sum(rate(coredns_panic_count_total{}[3m]))
              > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.ingress.request.errorRateHigh
          name: ingress-err-request
          notification:
            message: Cluster Ingress Controller request error rate high (default error
              rate is 85%).
          type: metric-prometheus
        - enable: disable
          expression: prom.ingress.ssl.expire
          name: ingress-ssl-expire
          notification:
            message: "Cluster Ingress Controller SSL will expire in a few days (default
              14 days). \nPromQL: ((nginx_ingress_controller_ssl_expire_time_seconds -
              time()) / 24 / 3600) < 14"
          type: metric-prometheus
      - name: cluster-error
        rules:
        - enable: disable
          expression: sls.app.ack.docker.hang
          name: docker-hang
          notification:
            message: kubernetes node docker hang.
          type: event
        - enable: disable
          expression: sls.app.ack.eviction
          name: eviction-event
          notification:
            message: kubernetes eviction event.
          type: event
        - enable: disable
          expression: sls.app.ack.gpu.xid_error
          name: gpu-xid-error
          notification:
            message: kubernetes gpu xid error event.
          type: event
        - enable: disable
          expression: sls.app.ack.image.pull_back_off
          name: image-pull-back-off
          notification:
            message: kubernetes image pull back off event.
          type: event
        - enable: disable
          expression: sls.app.ack.node.down
          name: node-down
          notification:
            message: kubernetes node down event.
          type: event
        - enable: disable
          expression: sls.app.ack.node.restart
          name: node-restart
          notification:
            message: kubernetes node restart event.
          type: event
        - enable: disable
          expression: sls.app.ack.ntp.down
          name: node-ntp-down
          notification:
            message: kubernetes node ntp down.
          type: event
        - enable: disable
          expression: sls.app.ack.node.pleg_error
          name: node-pleg-error
          notification:
            message: kubernetes node pleg error event.
          type: event
        - enable: disable
          expression: sls.app.ack.ps.hang
          name: ps-hang
          notification:
            message: kubernetes ps hang event.
          type: event
        - enable: disable
          expression: sls.app.ack.node.fd_pressure
          name: node-fd-pressure
          notification:
            message: kubernetes node fd pressure event.
          type: event
        - enable: disable
          expression: sls.app.ack.node.pid_pressure
          name: node-pid-pressure
          notification:
            message: kubernetes node pid pressure event.
          type: event
        - enable: disable
          expression: sls.app.ack.ccm.del_node_failed
          name: node-del-err
          notification:
            message: kubernetes delete node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.ccm.add_node_failed
          name: node-add-err
          notification:
            message: kubernetes add node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.run_command_fail
          name: nlc-run-cmd-err
          notification:
            message: kubernetes node pool nlc run command failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.empty_task_cmd
          name: nlc-empty-cmd
          notification:
            message: kubernetes node pool nlc delete node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.url_mode_unimpl
          name: nlc-url-m-unimp
          notification:
            message: kubernetes node pool nlc delete node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.op_not_found
          name: nlc-opt-no-found
          notification:
            message: kubernetes node pool nlc delete node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.destroy_node_fail
          name: nlc-des-node-err
          notification:
            message: kubernetes node pool nlc destroy node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.drain_node_fail
          name: nlc-drain-node-err
          notification:
            message: kubernetes node pool nlc drain node failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.restart_ecs_wait_fail
          name: nlc-restart-ecs-wait
          notification:
            message: kubernetes node pool nlc restart ecs wait timeout.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.restart_ecs_fail
          name: nlc-restart-ecs-err
          notification:
            message: kubernetes node pool nlc restart ecs failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.reset_ecs_fail
          name: nlc-reset-ecs-err
          notification:
            message: kubernetes node pool nlc reset ecs failed.
          type: event
        - enable: disable
          expression: sls.app.ack.nlc.repair_fail
          name: nlc-sel-repair-err
          notification:
            message: kubernetes node pool nlc self repair failed.
          type: event
      - name: res-exceptions
        rules:
        - enable: disable
          expression: cms.host.cpu.utilization
          name: node_cpu_util_high
          notification:
            message: kubernetes cluster node cpu utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.host.memory.utilization
          name: node_mem_util_high
          notification:
            message: kubernetes cluster node memory utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.host.disk.utilization
          name: node_disk_util_high
          notification:
            message: kubernetes cluster node disk utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.host.public.network.utilization
          name: node_public_net_util_high
          notification:
            message: kubernetes cluster node public network utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.host.fs.inode.utilization
          name: node_fs_inode_util_high
          notification:
            message: kubernetes cluster node file system inode utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.slb.qps.utilization
          name: slb_qps_util_high
          notification:
            message: kubernetes cluster slb qps utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.slb.traffic.tx.utilization
          name: slb_traff_tx_util_high
          notification:
            message: kubernetes cluster slb traffic utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.slb.max.connection.utilization
          name: slb_max_con_util_high
          notification:
            message: kubernetes cluster max connection utilization too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: percent
            value: "85"
          type: metric-cms
        - enable: disable
          expression: cms.slb.drop.connection
          name: slb_drop_con_high
          notification:
            message: kubernetes cluster drop connection count per second too high.
          thresholds:
          - key: CMS_ESCALATIONS_CRITICAL_Threshold
            unit: count
            value: "1"
          type: metric-cms
        - enable: disable
          expression: sls.app.ack.node.disk_pressure
          name: node-disk-pressure
          notification:
            message: kubernetes node disk pressure event.
          type: event
        - enable: disable
          expression: sls.app.ack.resource.insufficient
          name: node-res-insufficient
          notification:
            message: kubernetes node resource insufficient.
          type: event
        - enable: disable
          expression: sls.app.ack.ip.not_enough
          name: node-ip-pressure
          notification:
            message: kubernetes ip not enough event.
          type: event
        - enable: disable
          expression: sls.app.ack.csi.no_enough_disk_space
          name: disk_space_press
          notification:
            message: kubernetes csi not enough disk space.
          type: event
      - name: cluster-scale
        rules:
        - enable: disable
          expression: sls.app.ack.autoscaler.scaleup_group
          name: autoscaler-scaleup
          notification:
            message: kubernetes autoscaler scale up.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.scaledown
          name: autoscaler-scaledown
          notification:
            message: kubernetes autoscaler scale down.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.scaleup_timeout
          name: autoscaler-scaleup-timeout
          notification:
            message: kubernetes autoscaler scale up timeout.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.scaledown_empty
          name: autoscaler-scaledown-empty
          notification:
            message: kubernetes autoscaler scale down empty node.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.scaleup_group_failed
          name: autoscaler-up-group-failed
          notification:
            message: kubernetes autoscaler scale up failed.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.cluster_unhealthy
          name: autoscaler-cluster-unhealthy
          notification:
            message: kubernetes autoscaler error, cluster not healthy.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.delete_started_timeout
          name: autoscaler-del-started
          notification:
            message: kubernetes autoscaler delete node started long ago.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.delete_unregistered
          name: autoscaler-del-unregistered
          notification:
            message: kubernetes autoscaler delete unregistered node.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.scaledown_failed
          name: autoscaler-scale-down-failed
          notification:
            message: kubernetes autoscaler scale down failed.
          type: event
        - enable: disable
          expression: sls.app.ack.autoscaler.instance_expired
          name: autoscaler-instance-expired
          notification:
            message: kubernetes autoscaler scale down instance expired.
          type: event
      - name: workload-exceptions
        rules:
        - enable: disable
          expression: prom.job.failed
          name: job-failed
          notification:
            message: "Cluster Job failed. \nPromQL: kube_job_status_failed{job=\"_kube-state-metrics\"}
              > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.deployment.replicaError
          name: deployment-rep-err
          notification:
            message: "Cluster Deployment replication status error. \nPromQL: kube_deployment_spec_replicas{job=\"_kube-state-metrics\"}
              != kube_deployment_status_replicas_available{job=\"_kube-state-metrics\"}"
          type: metric-prometheus
        - enable: disable
          expression: prom.daemonset.scheduledError
          name: daemonset-status-err
          notification:
            message: "Cluster Daemonset pod status or scheduled error. \nPromQL: ((100
              - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{}
              * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{}))
              > 0"
          type: metric-prometheus
        - enable: disable
          expression: prom.daemonset.misscheduled
          name: daemonset-misscheduled
          notification:
            message: "Cluster Daemonset misscheduled. \nPromQL: kube_daemonset_status_number_misscheduled{job=\"_kube-state-metrics\"}
              \ > 0"
          type: metric-prometheus
      - name: pod-exceptions
        rules:
        - enable: disable
          expression: sls.app.ack.pod.oom
          name: pod-oom
          notification:
            message: kubernetes pod oom event.
          type: event
        - enable: disable
          expression: sls.app.ack.pod.failed
          name: pod-failed
          notification:
            message: kubernetes pod start failed event.
          type: event
        - enable: disable
          expression: prom.pod.status.notHealthy
          name: pod-status-err
          notification:
            message: 'Pod status exception. \nPromQL: min_over_time(sum by (namespace,
              pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed", job="_kube-state-metrics"})[${mins}m:1m])
              > 0'
          type: metric-prometheus
        - enable: disable
          expression: prom.pod.status.crashLooping
          name: pod-crashloop
          notification:
            message: 'Pod status exception. \nPromQL: sum_over_time(increase(kube_pod_container_status_restarts_total{job="_kube-state-metrics"}[1m])[${mins}m:1m])
              > 3'
          type: metric-prometheus
      - name: cluster-storage-err
        rules:
        - enable: disable
          expression: sls.app.ack.csi.invalid_disk_size
          name: csi_invalid_size
          notification:
            message: kubernetes csi invalid disk size.
          type: event
        - enable: disable
          expression: sls.app.ack.csi.disk_not_portable
          name: csi_not_portable
          notification:
            message: kubernetes csi not portable.
          type: event
        - enable: disable
          expression: sls.app.ack.csi.deivce_busy
          name: csi_device_busy
          notification:
            message: kubernetes csi disk device busy.
          type: event
        - enable: disable
          expression: sls.app.ack.csi.no_ava_disk
          name: csi_no_ava_disk
          notification:
            message: kubernetes csi no available disk.
          type: event
        - enable: disable
          expression: sls.app.ack.csi.disk_iohang
          name: csi_disk_iohang
          notification:
            message: kubernetes csi ioHang.
          type: event
        - enable: disable
          expression: sls.app.ack.csi.latency_too_high
          name: csi_latency_high
          notification:
            message: kubernetes csi pvc latency load too high.
          type: event
        - enable: disable
          expression: prom.pv.failed
          name: pv-failed
          notification:
            message: 'Cluster PersistentVolume failed. \nPromQL: kube_persistentvolume_status_phase{phase=~"Failed|Pending",
              job="_kube-state-metrics"} > 0'
          type: metric-prometheus
      - name: cluster-network-err
        rules:
        - enable: disable
          expression: sls.app.ack.ccm.no_ava_slb
          name: slb-no-ava
          notification:
            message: kubernetes slb not available.
          type: event
        - enable: disable
          expression: sls.app.ack.ccm.sync_slb_failed
          name: slb-sync-err
          notification:
            message: kubernetes slb sync failed.
          type: event
        - enable: disable
          expression: sls.app.ack.ccm.del_slb_failed
          name: slb-del-err
          notification:
            message: kubernetes slb delete failed.
          type: event
        - enable: disable
          expression: sls.app.ack.ccm.create_route_failed
          name: route-create-err
          notification:
            message: kubernetes create route failed.
          type: event
        - enable: disable
          expression: sls.app.ack.ccm.sync_route_failed
          name: route-sync-err
          notification:
            message: kubernetes sync route failed.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.invalid_resource
          name: terway-invalid-res
          notification:
            message: kubernetes terway have invalid resource.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.alloc_ip_fail
          name: terway-alloc-ip-err
          notification:
            message: kubernetes terway allocate ip error.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.parse_fail
          name: terway-parse-err
          notification:
            message: kubernetes terway parse k8s.aliyun.com/ingress-bandwidth annotation
              error.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.allocate_failure
          name: terway-alloc-res-err
          notification:
            message: kubernetes parse resource error.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.dispose_failure
          name: terway-dispose-err
          notification:
            message: kubernetes dispose resource error.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.virtual_mode_change
          name: terway-virt-mod-err
          notification:
            message: kubernetes virtual mode changed.
          type: event
        - enable: disable
          expression: sls.app.ack.terway.config_check
          name: terway-ip-check
          notification:
            message: kubernetes terway execute pod ip config check.
          type: event
        - enable: disable
          expression: sls.app.ack.ingress.err_reload_nginx
          name: ingress-reload-err
          notification:
            message: kubernetes ingress reload config error.
          type: event
      - name: security-err
        rules:
        - enable: disable
          expression: sls.app.ack.si.config_audit_high_risk
          name: si-c-a-risk
          notification:
            message: kubernetes high risks have been found after running config audit.
          type: event
      ruleVersion: v1.0.9
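
Editing the template by hand (steps 1 and 2 above) can be error-prone in a file of this size. The following is a minimal Python sketch of the same edit. It assumes the template has already been loaded into a plain dictionary by a YAML parser of your choice; `enable_group` is a hypothetical helper name, and the dictionary below is a trimmed stand-in for the full template with made-up contact group IDs.

```python
import copy


def enable_group(alert_rule, group_name, contact_groups):
    """Return a copy of the AckAlertRule with every rule in one group
    enabled and bound to the given contact groups."""
    updated = copy.deepcopy(alert_rule)  # leave the caller's data untouched
    for group in updated["spec"]["groups"]:
        if group["name"] != group_name:
            continue
        for rule in group["rules"]:
            rule["enable"] = "enable"
            rule["contactGroups"] = contact_groups
        return updated
    raise KeyError(f"rule group not found: {group_name}")


# Trimmed stand-in for the template; the IDs are made-up placeholders.
template = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default", "namespace": "kube-system"},
    "spec": {"groups": [
        {"name": "error-events", "rules": [
            {"enable": "disable", "expression": "sls.app.ack.error",
             "name": "error-event", "type": "event"}]},
        {"name": "warn-events", "rules": [
            {"enable": "disable", "expression": "sls.app.ack.warn",
             "name": "warn-event", "type": "event"}]},
    ]},
}
contacts = [{"arms_contact_group_id": "12345",
             "cms_contact_group_name": "ack_Default Contact Group",
             "id": "3001"}]

enabled = enable_group(template, "error-events", contacts)
print(enabled["spec"]["groups"][0]["rules"][0]["enable"])
```

The function works on a copy so that the original template can be reused to enable other groups later.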

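Step 4: Distribute the alert rule to the associated clusters

An alert rule is itself a Kubernetes resource. Alert rules are distributed the same way applications are: the open source KubeVela project distributes Kubernetes resources from the Fleet instance to the associated clusters. To distribute the alert rule:

  1. Create the distribution rule ackalertrule-app.yaml from one of the following templates.

    • Method 1: Distribute the alert rule to associated clusters labeled production=true.

      1. Run the following commands to label the associated clusters.

        kubectl get managedclusters    # Get the clusterid of each associated cluster.
        kubectl label managedclusters <clusterid> production=true
      2. Distribute the alert rule to the associated clusters labeled production=true.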

      apiVersion: core.oam.dev/v1beta1
      kind: Application
      metadata:
        name: alertrules
        namespace: kube-system
        annotations:
          app.oam.dev/publishVersion: version1
      spec:
        components:
          - name: alertrules
            type: ref-objects
            properties:
              objects:
                - resource: ackalertrules
                  name: default
        policies:
          - type: topology
            name: prod-clusters
            properties:
              clusterSelector:
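                production: "true"  # Select clusters by label.
    • Method 2: Specify cluster IDs directly to distribute the alert rule to specific associated clusters.

      Replace <clusterid1> and <clusterid2> below with the IDs of the associated clusters you want to distribute to.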

      apiVersion: core.oam.dev/v1beta1
      kind: Application
      metadata:
        name: alertrules
        namespace: kube-system
        annotations:
          app.oam.dev/publishVersion: version1
      spec:
        components:
          - name: alertrules
            type: ref-objects
            properties:
              objects:
                - resource: ackalertrules
                  name: default
        policies:
          - type: topology
            name: prod-clusters
            properties:
              clusters: ["<clusterid1>", "<clusterid2>"]  # Select clusters by clusterid.
  2. Run the following command to create the distribution rule.

    kubectl apply -f ackalertrule-app.yaml
  3. Run the following command to check the distribution status.

    kubectl amc appstatus alertrules -n kube-system --tree --detail

    Expected output:

    CLUSTER                  NAMESPACE       RESOURCE             STATUS    APPLY_TIME          DETAIL
    c565e4**** (cluster1)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
    cbaa12**** (cluster2)─── kube-system─── AckAlertRule/default updated   2022-**-** **:**:** Age: **
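
If you want to check the status output in a script, it can be parsed as text. The following is a minimal Python sketch under a strong assumption: it relies on the layout shown in the expected output above (columns joined by `───`), which is not documented here as a stable interface. `parse_appstatus` and `clusters_not_updated` are hypothetical helper names, and the sample text uses made-up cluster IDs and a made-up placeholder status.

```python
def parse_appstatus(output):
    """Parse rows of the tree view into dictionaries.
    Lines without two '───' separators (header, blanks) are skipped."""
    rows = []
    for line in output.splitlines():
        parts = [part.strip() for part in line.split("───")]
        if len(parts) != 3:
            continue
        fields = parts[2].split()
        if len(fields) < 2:
            continue
        rows.append({
            "cluster": parts[0],
            "namespace": parts[1],
            "resource": fields[0],
            "status": fields[1],
        })
    return rows


def clusters_not_updated(rows, resource="AckAlertRule/default"):
    """List clusters whose row for the resource is not 'updated'."""
    return [row["cluster"] for row in rows
            if row["resource"] == resource and row["status"] != "updated"]


# Made-up sample; 'not-updated' is a placeholder, not a real status value.
sample_status = """\
CLUSTER                  NAMESPACE       RESOURCE             STATUS    APPLY_TIME          DETAIL
c1111 (cluster1)─── kube-system─── AckAlertRule/default updated   2022-01-01 00:00:00 Age: 1m
c2222 (cluster2)─── kube-system─── AckAlertRule/default not-updated   2022-01-01 00:00:00 Age: 1m
"""

status_rows = parse_appstatus(sample_status)
lagging = clusters_not_updated(status_rows)
print(lagging)
```

An empty result means every listed cluster reports the alert rule as updated.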

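Modify the alert rule

Follow the steps below to modify the alert rule.

  1. Edit the alert rule template ackalertrule.yaml and run kubectl apply -f ackalertrule.yaml to update the alert rule on the Fleet instance.

  2. Edit the distribution template ackalertrule-app.yaml, update the value of annotations: app.oam.dev/publishVersion, and run kubectl apply -f ackalertrule-app.yaml to distribute the alert rule.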
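
The steps above only say to update the publishVersion value. One possible convention, assumed here and not prescribed by the product, is to increment the numeric suffix used in the templates (version1, version2, and so on). The following is a minimal Python sketch of that convention on a distribution rule loaded into a dictionary; `bump_publish_version` is a hypothetical helper name.

```python
import re

ANNOTATION = "app.oam.dev/publishVersion"


def bump_publish_version(application):
    """Increment the trailing number of the publishVersion annotation
    in place and return the new value."""
    annotations = application["metadata"].setdefault("annotations", {})
    current = annotations.get(ANNOTATION, "version0")
    match = re.fullmatch(r"(.*?)(\d+)", current)
    if match is None:
        raise ValueError(f"no numeric suffix in {current!r}")
    prefix, number = match.groups()
    annotations[ANNOTATION] = f"{prefix}{int(number) + 1}"
    return annotations[ANNOTATION]


# Metadata taken from the distribution templates in step 4.
app = {
    "apiVersion": "core.oam.dev/v1beta1",
    "kind": "Application",
    "metadata": {
        "name": "alertrules",
        "namespace": "kube-system",
        "annotations": {ANNOTATION: "version1"},
    },
}

new_version = bump_publish_version(app)
print(new_version)
```

After bumping the value, write the dictionary back to ackalertrule-app.yaml and apply it as described in step 2.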
