Metrics Monitoring
Note
Before starting the installation procedure, please download the installation resources as explained here and make sure that all prerequisites are satisfied.
This page also assumes that the main Seldon components are already installed.
Installation
The analytics component is configured with a Prometheus integration. Monitoring for Seldon Deploy is based on the open source analytics package (seldon-core-analytics), which brings in Prometheus (and Grafana) and is required for metrics collection.
Before installing, set up a recording rules file named model-usage.rules.yml. Its full contents are given in the Recording Rules section at the end of this page.
Create a ConfigMap from the file with:
kubectl create configmap -n seldon-system model-usage-rules --from-file=model-usage.rules.yml --dry-run=client -o yaml | kubectl apply -f -
(On kubectl versions older than 1.18, use --dry-run instead of --dry-run=client.)
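To confirm the ConfigMap was created with the expected contents, you can, for example, run:
# Check that the rules file was stored under the expected key
kubectl describe configmap model-usage-rules -n seldon-system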
This ConfigMap can then be mounted by setting extraConfigmapMounts in an analytics-values.yaml, for example:
grafana:
  resources:
    limits:
      cpu: 200m
      memory: 220Mi
    requests:
      cpu: 50m
      memory: 110Mi
prometheus:
  alertmanager:
    resources:
      limits:
        cpu: 50m
        memory: 64Mi
      requests:
        cpu: 10m
        memory: 32Mi
  nodeExporter:
    service:
      hostPort: 9200
      servicePort: 9200
    resources:
      limits:
        cpu: 200m
        memory: 220Mi
      requests:
        cpu: 50m
        memory: 110Mi
  server:
    retention: "90d"
    extraArgs:
      storage.tsdb.retention.size: 27GB
    persistentVolume:
      enabled: true
      existingClaim: ""
      mountPath: /data
      size: 32Gi
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 600m
        memory: 1Gi
    extraConfigmapMounts:
      - name: prometheus-config-volume
        mountPath: /etc/prometheus/conf/
        subPath: ""
        configMap: prometheus-server-conf
        readOnly: true
      - name: prometheus-rules-volume
        mountPath: /etc/prometheus-rules
        subPath: ""
        configMap: prometheus-rules
        readOnly: true
      - name: model-usage-rules-volume
        mountPath: /etc/prometheus-rules/model-usage/
        subPath: ""
        configMap: model-usage-rules
        readOnly: true
The resource and storage settings above are suggestions only; adjust them to suit your cluster capacity and disk availability. Then install the analytics component with Helm:
helm repo add seldonio https://storage.googleapis.com/seldon-charts
helm repo update
helm upgrade seldon-core-analytics seldonio/seldon-core-analytics \
    --version 1.4.0 \
    --namespace seldon-system \
    --install \
    -f analytics-values.yaml
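Once the chart is installed, check that the Prometheus and Grafana pods come up. Pod names vary by chart version, so a simple listing is usually enough:
# Look for the prometheus and grafana pods reaching the Running state
kubectl get pods -n seldon-system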
This Prometheus installation is already configured to scrape metrics from Seldon Deployments. The Seldon Core documentation on analytics covers the available metrics and the configuration of Prometheus itself.
It is possible to set further custom parameters provided by the helm chart (see the example below), such as:
* grafana_prom_admin_password - the admin password for Grafana
* persistence.enabled - enables Prometheus persistence
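For example, these parameters can be passed at install time with --set flags (the password value below is a placeholder):
helm upgrade seldon-core-analytics seldonio/seldon-core-analytics \
    --version 1.4.0 \
    --namespace seldon-system \
    --install \
    --set grafana_prom_admin_password=<admin-password> \
    --set persistence.enabled=true \
    -f analytics-values.yaml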
Bringing your own Prometheus
It is possible to use your own Prometheus instance - see the prometheus section in the default values file seldon-deploy-install/sd-setup/helm-charts/seldon-deploy/values.yaml.
If you want to monitor usage of models over time, you need the recording rules in place - see below.
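Whichever Prometheus you use, one way to check that the rules have been loaded is to query the standard /api/v1/rules endpoint. The service name and port below are assumptions for a default seldon-core-analytics install; substitute your own Prometheus address, and note jq is used only for readability:
# Forward the Prometheus port locally (adjust the service name/port to your install)
kubectl port-forward -n seldon-system svc/seldon-core-analytics-prometheus-seldon 9090:80
# In another terminal, list the loaded rule groups
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'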
Recording Rules
The full contents of model-usage.rules.yml are:
groups:
- name: model-usage.rules
  interval: 2m
  rules:
  - record: model_containers_average
    expr: label_replace(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) max by (namespace,pod,container,namespace) (avg_over_time(kube_pod_container_info[2m] ))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
  - record: model_memory_usage_bytes
    expr: label_replace(sort_desc(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) sum by (namespace,pod,container) (rate(container_memory_usage_bytes[2m] )))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
  - record: model_cpu_usage_seconds_total
    expr: label_replace(sort_desc(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) sum by (namespace,pod,container) (rate(container_cpu_usage_seconds_total[2m] )))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
  - record: model_cpu_requests
    expr: label_replace(sort_desc(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) sum by (namespace,pod,container) (kube_pod_container_resource_requests_cpu_cores ))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
  - record: model_cpu_limits
    expr: label_replace(sort_desc(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) sum by (namespace,pod,container) (kube_pod_container_resource_limits_cpu_cores ))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
  - record: model_memory_requests_bytes
    expr: label_replace(sort_desc(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) sum by (namespace,pod,container) (kube_pod_container_resource_requests_memory_bytes ))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
  - record: model_memory_limits_bytes
    expr: label_replace(sort_desc(sum by (label_seldon_deployment_id,namespace) ((sum_over_time(kube_pod_labels{label_app_kubernetes_io_managed_by=~"seldon-core"}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_seldon_deployment_id) sum by (namespace,pod,container) (kube_pod_container_resource_limits_memory_bytes ))), "name","$1","label_seldon_deployment_id", "(.+)")
    labels:
      type: "SeldonDeployment"
- name: model-usage-kfserving.rules
  interval: 2m
  rules:
  - record: model_containers_average
    expr: label_replace(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) max by (namespace,pod,container,namespace) (avg_over_time(kube_pod_container_info[2m] ))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
  - record: model_memory_usage_bytes
    expr: label_replace(sort_desc(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) sum by (namespace,pod,container) (rate(container_memory_usage_bytes[2m] )))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
  - record: model_cpu_usage_seconds_total
    expr: label_replace(sort_desc(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) sum by (namespace,pod,container) (rate(container_cpu_usage_seconds_total[2m] )))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
  - record: model_cpu_requests
    expr: label_replace(sort_desc(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) sum by (namespace,pod,container) (kube_pod_container_resource_requests_cpu_cores ))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
  - record: model_cpu_limits
    expr: label_replace(sort_desc(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) sum by (namespace,pod,container) (kube_pod_container_resource_limits_cpu_cores ))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
  - record: model_memory_requests_bytes
    expr: label_replace(sort_desc(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) sum by (namespace,pod,container) (kube_pod_container_resource_requests_memory_bytes ))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
  - record: model_memory_limits_bytes
    expr: label_replace(sort_desc(sum by (label_serving_kubeflow_org_inferenceservice,namespace) ((sum_over_time(kube_pod_labels{label_serving_kubeflow_org_inferenceservice!=""}[2m] ) / scalar(max(sum_over_time(kube_pod_labels[2m] )))) * on(pod,namespace) group_right(label_serving_kubeflow_org_inferenceservice) sum by (namespace,pod,container) (kube_pod_container_resource_limits_memory_bytes ))), "name","$1","label_serving_kubeflow_org_inferenceservice", "(.+)")
    labels:
      type: "InferenceService"
These recording rules should be installed with seldon-core-analytics as described above, or added to your own Prometheus if you are providing one. Without them you may see warnings about missing usage data in Seldon Deploy.
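As a quick sanity check that the rules are producing data, you can query one of the recorded series through the standard Prometheus HTTP API (this assumes Prometheus is reachable on localhost:9090, for example via the port-forward shown earlier):
# Should return a non-empty result set once models are deployed and the rules have run
curl -s 'http://localhost:9090/api/v1/query?query=model_memory_usage_bytes' | jq '.data.result'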