doc(*) replenish documentations for katalyst

This commit is contained in:
shaowei.wayne 2023-02-23 21:18:33 +08:00
parent 6cc2a36bf7
commit cd563f9ea2
30 changed files with 2725 additions and 126 deletions

View File

@ -21,7 +21,7 @@ More Detailed Introduction will be presented in the future.
<div align="center">
<picture>
<img src="docs/katalyst-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
<img src="docs/imgs/katalyst-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
@ -36,7 +36,7 @@ Since Kubewharf enhanced kubernetes is developed based on specific versions of u
## Getting started
Katalyst provides several example yaml to demonstrate the common use cases. (todo)
Katalyst provides several example yamls to demonstrate common use cases. For more information, please refer to [tutorials](./docs/tutorial/collocation.md).
## Community

View File

@ -19,7 +19,7 @@ Katalyst consists of three main projects
<div align="center">
<picture>
<img src="docs/katalyst-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
<img src="docs/imgs/katalyst-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>

169
docs/concepts.md Normal file
View File

@ -0,0 +1,169 @@
# Concepts - katalyst core concepts
Katalyst contains many components, which can make it difficult to dive in. This document introduces the basic concepts of katalyst to help developers understand how the system works, how it abstracts the QoS model, and how you can configure the system dynamically.
## Architecture
As shown in the architecture below, katalyst mainly contains three layers. For the user-side API, katalyst defines a suite of QoS models along with multiple enhancements to match the QoS requirements of different kinds of workloads. Users can deploy their workloads with different QoS requirements, and the katalyst daemon will try to allocate proper resources and devices for those pods to satisfy their QoS requirements. This allocation process works both at pod admission and at runtime, taking into consideration the resource usage and QoS class of pods running on the same node. Besides, centralized components cooperate with the daemons to provide better resource adjustments for each workload from a cluster-level perspective.
<div align="center">
<picture>
<img src="./imgs/katalyst-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
## Components
Katalyst contains centralized components that are deployed as deployments, and agents that run as daemonsets on every node.
### Centralized Components
#### Katalyst Controllers/Webhooks
Katalyst controllers provide cluster-level abilities, including service profiling, elastic resource recommendation, core Custom Resource lifecycle management, and centralized eviction strategies that act as a backstop. Katalyst webhooks are responsible for validating QoS configurations and mutating resource requests according to service profiling.
#### Katalyst Scheduler
Katalyst scheduler is developed based on the scheduler v2 framework to provide scheduling functionality for hybrid deployment and topology-aware scheduling scenarios.
#### Custom Metrics API
Custom metrics API implements the standard custom-metrics-apiserver interface, and is responsible for collecting, storing, and querying metrics. It is mainly used by the elastic resource recommendation and re-scheduling modules in the katalyst system.
### Daemon Components
#### QoS Resource Manager
QoS Resource Manager (QRM for short) is designed as an extension framework in kubelet, and it works as a new hint provider similar to Device Manager. But unlike Device Manager, QRM aims at allocating non-discrete resources (i.e. cpu/memory) rather than discrete devices, and it can adjust allocation results dynamically and periodically based on container running status. QRM is implemented in kubewharf enhanced kubernetes; if you want more information about QRM, please refer to [qos-resource-manager](./proposals/qos-management/qos-resource-manager/20221018-qos-resource-manager.md).
#### Katalyst agent
Katalyst Agent is designed as the core daemon component that implements resource management according to QoS requirements and container running status. Katalyst agent contains several individual modules that are responsible for different functionalities. These modules can be deployed either as a monolithic container or as separate ones.
- Eviction Manager is a framework for eviction strategies. Users can implement their own eviction plugins to handle contention for each resource type. For more information about eviction manager, please refer to [eviction-manager](./proposals/qos-management/eviction-manager/20220424-eviction-manager.md).
- Resource Reporter is a framework for reporting different CRDs, or different fields in the same CRD. For instance, different fields in CNR may be collected from different sources, and this framework makes it possible for users to implement each resource reporter as a plugin. For more information about the reporter manager, please refer to [reporter-manager](./proposals/qos-management/reporter-manager/20220515-reporter-manager.md).
- SysAdvisor is the core node-level resource recommendation module, and it uses statistics-based, indicator-based, and ML-based algorithms for different scenarios. For more information about sysadvisor, please refer to [sys-advisor](proposals/qos-management/wip-20220615-sys-advisor.md).
- QRM Plugin works as a plugin for each resource with static or dynamic policies. Generally, QRM plugins receive resource recommendations from SysAdvisor and export control configs through the CRI interface embedded in the QRM framework.
#### Malachite
Malachite is a unified metrics-collecting component. It is implemented out-of-tree, and serves node, NUMA, pod and container level metrics through an HTTP endpoint from which katalyst queries real-time metrics data. In a real-world production environment, you can replace malachite with your own metrics implementation.
## QoS
To extend the ability of kubernetes' original QoS levels, katalyst defines its own QoS levels with CPU as the dominant resource. Unlike memory, CPU is considered a divisible resource and is easier to isolate, and for cloud-native workloads, CPU is usually the dominant resource that causes performance problems. So katalyst names its QoS classes after CPU, and other resources are implicitly accompanied by it.
### Definition
<br>
<table>
<tbody>
<tr>
<th align="center">Qos level</th>
<th align="center">Feature</th>
<th align="center">Target Workload</th>
<th align="center">Mapped k8s QoS</th>
</tr>
<tr>
<td>dedicated_cores</td>
<td>
<ul>
<li>Bind with a quantity of dedicated cpu cores</li>
<li>Without sharing with any other pod</li>
</ul>
</td>
<td>
<ul>
<li>Workload that's very sensitive to latency</li>
<li>such as online advertising, recommendation.</li>
</ul>
</td>
<td>Guaranteed</td>
</tr>
<tr>
<td>shared_cores</td>
<td>
<ul>
<li>Share a set of dedicated cpu cores with other shared_cores pods</li>
</ul>
</td>
<td>
<ul>
<li>Workload that can tolerate a little cpu throttle or neighbor spikes</li>
<li>such as microservices for webservice.</li>
</ul>
</td>
<td>Guaranteed/Burstable</td>
</tr>
<tr>
<td>reclaimed_cores</td>
<td>
<ul>
<li>Over-committed resources that are squeezed from dedicated_cores or shared_cores</li>
<li>Whenever dedicated_cores or shared_cores need to claim their resources back, reclaimed_cores will be suppressed or evicted</li>
</ul>
</td>
<td>
<ul>
<li>Workload that mainly cares about throughput rather than latency</li>
<li>such as batch bigdata, offline training.</li>
</ul>
</td>
<td>BestEffort</td>
</tr>
<tr>
<td>system_cores</td>
<td>
<ul>
<li>Reserved for core system agents to ensure performance</li>
</ul>
</td>
<td>
<ul>
<li>Core system agents.</li>
</ul>
</td>
<td>Burstable</td>
</tr>
</tbody>
</table>
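To make the table concrete, the minimal sketch below submits a pod declared as shared_cores. It assumes the QoS level is expressed through the `katalyst.kubewharf.io/qos_level` pod annotation; treat the exact key and values as assumptions and check the example yamls shipped with this repo for the authoritative definitions.
```bash
# Minimal sketch: submit a pod declared as shared_cores.
# The annotation key below is an assumption; verify it against the example yamls in this repo.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: shared-cores-demo
  annotations:
    "katalyst.kubewharf.io/qos_level": shared_cores
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
EOF
```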
#### Pool
As introduced above, katalyst uses the term `pool` to indicate a group of resources that a batch of pods share with each other. For instance, pods with shared_cores may share a common shared pool, meaning that they share the same cpusets, memory limits and so on; in the meantime, if the `cpuset_pool` enhancement is enabled, the single shared pool will be split into several pools based on the configurations.
### Enhancement
Besides the core QoS levels, katalyst also provides a mechanism to enhance the abilities of the standard QoS levels. Enhancements work as flexible extensions, and more may be added over time.
<br>
<table>
<tbody>
<tr>
<th align="center">Enhancement</th>
<th align="center">Feature</th>
</tr>
<tr>
<td>numa_binding</td>
<td>
<ul>
<li>Indicates that the pod should be bound to one (or several) NUMA node(s) to gain further performance improvements</li>
<li>Only supported by dedicated_cores</li>
</ul>
</td>
</tr>
<tr>
<td>cpuset_pool</td>
<td>
<ul>
<li>Allocate a separate cpuset in the shared_cores pool to isolate the scheduling domain for designated pods.</li>
<li>Only supported by shared_cores</li>
</ul>
</td>
</tr>
<tr>
<td>...</td>
<td>
</td>
</tr>
</tbody>
</table>
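As a sketch of how an enhancement rides on top of a QoS level, the example below marks a dedicated_cores pod for numa_binding. The enhancement annotation key and its JSON payload shown here are illustrative assumptions only; the authoritative keys live in the katalyst-api definitions.
```bash
# Illustrative sketch: the enhancement annotation key and payload are assumptions.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-numa-demo
  annotations:
    "katalyst.kubewharf.io/qos_level": dedicated_cores
    # hypothetical enhancement annotation requesting numa_binding
    "katalyst.kubewharf.io/memory_enhancement": '{"numa_binding": "true"}'
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "4"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 8Gi
EOF
```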
## Configurations
To make configuration more flexible, katalyst designs a new mechanism to set configs on the fly, working as a supplement to the static configs defined via command-line flags. In katalyst, the implementation of this mechanism is called `KatalystCustomConfig` (`KCC` for short). It enables each daemon component to adjust its working status dynamically without restarting or re-deploying.
For more information about KCC, please refer to [dynamic-configuration](proposals/qos-management/wip-20220706-dynamic-configuration.md).

View File

Before

Width:  |  Height:  |  Size: 333 KiB

After

Width:  |  Height:  |  Size: 333 KiB

View File

Before

Width:  |  Height:  |  Size: 477 KiB

After

Width:  |  Height:  |  Size: 477 KiB

View File

Before

Width:  |  Height:  |  Size: 710 KiB

After

Width:  |  Height:  |  Size: 710 KiB

View File

@ -10,22 +10,21 @@ The following instructions are tested on veLinux on Volcengine with kubeadm, and
## Download binaries
Some components and tools are delivered as binary artifacts. The release versions for these binaries are referenced from kubernetes changelog. Notice that kubelet/kubeadm/kubectl are redistributed by kubewharf.
```
```bash
mkdir deploy
cd deploy
cd deploy
wget https://github.com/containerd/containerd/releases/download/v1.4.12/containerd-1.6.9-linux-amd64.tar.gz
wget https://github.com/opencontainers/runc/releases/download/v1.1.1/runc.amd64
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.25.0/crictl-v1.25.0-linux-amd64.tar.gz
# TODO replace with release tarball url later
wget https://storage.googleapis.com/kubernetes-release/release/v1.24.6/kubernetes-server-linux-amd64.tar.gz
wget https://github.com/kubewharf/enhanced-k8s/releases/download/v1.24.6-kubewharf.4/kubernetes-node-linux-amd64.tar.gz
cd -
```
## Set up linux host
```
Install packages and make configuration changes which are required by the setup process or by kubernetes itself.
```bash
apt-get update && \
apt-get install -y apt-transport-https ca-certificates curl && \
apt install socat ebtables conntrack -y
@ -50,7 +49,7 @@ swapoff -a
## Set up container runtime
As a more favored setup, containerd is used as the container runtime. This is required for both master and node.
```
```bash
cd deploy
tar xvf containerd-1.4.12-linux-amd64.tar.gz
@ -67,36 +66,36 @@ root = "/var/lib/containerd"
state = "/run/containerd"
[grpc]
max_recv_message_size = 16777216
max_send_message_size = 16777216
max_recv_message_size = 16777216
max_send_message_size = 16777216
[metrics]
address = ""
grpc_histogram = false
address = ""
grpc_histogram = false
[plugins."io.containerd.grpc.v1.cri"]
stream_server_address = "127.0.0.1"
sandbox_image = "kubewharf/pause:3.7"
max_container_log_line_size = -1
disable_cgroup = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
NoPivotRoot = false
NoNewKeyring = false
SystemdCgroup = false
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["registry-1.docker.io"]
stream_server_address = "127.0.0.1"
sandbox_image = "kubewharf/pause:3.7"
max_container_log_line_size = -1
disable_cgroup = false
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
default_runtime_name = "runc"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
NoPivotRoot = false
NoNewKeyring = false
SystemdCgroup = false
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["registry-1.docker.io"]
EOF
cat <<EOF | tee /etc/systemd/system/containerd.service
@ -165,10 +164,10 @@ crictl ps
## Set up kubernetes components
Install kubeadm, kubelet and kubectl. This is required for both master and node.
```
tar xvf kubernetes-server-linux-amd64.tar.gz
```bash
tar xvf kubernetes-node-linux-amd64.tar.gz
cd kubernetes/server/bin
cd kubernetes/node/bin
install -p -D -m 0755 kubeadm /usr/bin
install -p -D -m 0755 kubelet /usr/bin
install -p -D -m 0755 kubectl /usr/bin
@ -196,7 +195,7 @@ cat <<EOF | tee /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \\
--kubeconfig=/etc/kubernetes/kubelet.conf"
--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
@ -205,18 +204,18 @@ EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/sysconfig/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet \\
\$KUBELET_KUBECONFIG_ARGS \\
\$KUBELET_CONFIG_ARGS \\
\$KUBELET_KUBEADM_ARGS \\
\$KUBELET_EXTRA_ARGS \\
\$KUBELET_HOSTNAME
\$KUBELET_KUBECONFIG_ARGS \\
\$KUBELET_CONFIG_ARGS \\
\$KUBELET_KUBEADM_ARGS \\
\$KUBELET_EXTRA_ARGS \\
\$KUBELET_HOSTNAME
EOF
```
## Set up kubernetes master
Set up the kubernetes master. You can adapt some configurations to your needs, e.g. `podSubnet`,`serviceSubnet`,`clusterDNS` etc. Take down the outputs of the last two commands as these will be used later when joining the nodes. Notice that this guide only covers the setup of a single master cluster.
```
```bash
mkdir -p /etc/kubernetes
export KUBEADM_TOKEN=`kubeadm token generate`
export APISERVER_ADDR=<your master ip address>
@ -225,93 +224,93 @@ cat <<EOF | tee /etc/kubernetes/kubeadm-client.yaml
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
- system:bootstrappers:kubeadm:default-node-token
token: $KUBEADM_TOKEN
ttl: 24h0m0s
usages:
- signing
- authentication
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: $APISERVER_ADDR
bindPort: 6443
nodeRegistration:
criSocket: unix:///var/run/containerd/containerd.sock
imagePullPolicy: IfNotPresent
name: $APISERVER_ADDR
taints: []
- system:bootstrappers:kubeadm:default-node-token
token: $KUBEADM_TOKEN
ttl: 24h0m0s
usages:
- signing
- authentication
kind: InitConfiguration
localAPIEndpoint:
advertiseAddress: $APISERVER_ADDR
bindPort: 6443
nodeRegistration:
criSocket: unix:///var/run/containerd/containerd.sock
imagePullPolicy: IfNotPresent
name: $APISERVER_ADDR
taints: []
---
apiServer:
timeoutForControlPlane: 4m0s
extraArgs:
enable-aggregator-routing: "true"
timeoutForControlPlane: 4m0s
extraArgs:
enable-aggregator-routing: "true"
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}
etcd:
local:
dataDir: /var/lib/etcd
local:
dataDir: /var/lib/etcd
imageRepository: kubewharf
kind: ClusterConfiguration
kubernetesVersion: v1.24.6
kubernetesVersion: v1.24.6-kubewharf.4
networking:
dnsDomain: cluster.local
serviceSubnet: 172.23.192.0/18
podSubnet: 172.28.208.0/20
dnsDomain: cluster.local
serviceSubnet: 172.23.192.0/18
podSubnet: 172.28.208.0/20
scheduler: {}
controlPlaneEndpoint: $APISERVER_ADDR:6443
---
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
anonymous:
enabled: false
webhook:
cacheTTL: 0s
enabled: true
x509:
clientCAFile: /etc/kubernetes/pki/ca.crt
anonymous:
enabled: false
webhook:
cacheTTL: 0s
enabled: true
x509:
clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
mode: Webhook
webhook:
cacheAuthorizedTTL: 0s
cacheUnauthorizedTTL: 0s
mode: Webhook
webhook:
cacheAuthorizedTTL: 0s
cacheUnauthorizedTTL: 0s
cgroupDriver: cgroupfs
clusterDNS:
- 172.23.192.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
featureGates:
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
featureGates:
KubeletPodResources: true
KubeletPodResourcesGetAllocatable: true
QoSResourceManager: true
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
readOnlyPort: 10255
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
readOnlyPort: 10255
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
flushFrequency: 0
options:
json:
infoBufferSize: "0"
json:
infoBufferSize: "0"
verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
EOF
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
EOF
kubeadm init --config=/etc/kubernetes/kubeadm-client.yaml --upload-certs -v=5
@ -332,7 +331,7 @@ kubeadm token create
To join a node to cluster, you need to copy some of the outputs when setting up master:
- `CA_CERTS_HASH`: the output of `openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'` on master
- `KUBEADM_TOKEN`: the output of `kubeadm token create` on master. If you forgot the token, you can try `kubeadm token list` on master to get the previously created token. If the token expires, you can create a new one.
```
```bash
mkdir -p /etc/kubernetes/manifests
export APISERVER_ADDR=<your apiserver ip address>
export KUBEADM_TOKEN=<token created with 'kubeadm token create'>
@ -343,17 +342,17 @@ cat <<EOF | tee /etc/kubernetes/kubeadm-client.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
bootstrapToken:
apiServerEndpoint: $APISERVER_ADDR:6443
token: $KUBEADM_TOKEN
caCertHashes:
- sha256:$CA_CERTS_HASH
timeout: 60s
tlsBootstrapToken: $KUBEADM_TOKEN
bootstrapToken:
apiServerEndpoint: $APISERVER_ADDR:6443
token: $KUBEADM_TOKEN
caCertHashes:
- sha256:$CA_CERTS_HASH
timeout: 60s
tlsBootstrapToken: $KUBEADM_TOKEN
nodeRegistration:
name: $NODE_NAME
criSocket: unix:///var/run/containerd/containerd.sock
tains: []
name: $NODE_NAME
criSocket: unix:///var/run/containerd/containerd.sock
taints: []
EOF
# Join the node to cluster

View File

@ -0,0 +1,144 @@
---
- title: Katalyst Custom Metrics APIServer
- authors:
- "waynepeking348"
- reviewers:
- "luomingmeng"
- "xuchen-xiaoying"
- creation-date: 2022-12-02
- last-updated: 2023-02-23
- status: implemented
---
# Katalyst Custom Metrics APIServer
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Design Overview [Optional]](#design-overview-optional)
- [API [Optional]](#api-optional)
- [Design Details](#design-details)
- [Alternatives](#alternatives)
<!-- /toc -->
## Summary
Katalyst Custom Metrics APIServer (KCMS for short) is responsible for collecting and storing customized metrics in the production environment, and it implements the standard custom-metrics-apiserver interface for querying. It is mainly used by elastic resource recommendation and re-scheduling in the katalyst system.
## Motivation
### Goals
- Provide a customized solution to obtain real-time metrics for systems or modules in katalyst.
### Non-Goals
- Replace the native custom metrics server.
- Work as a general solution for observability.
## Proposal
In katalyst, many centralized components may depend on customized real-time metrics as critical inputs to generate control strategies.
### User Stories
#### Story 1
Resource autoscaling (both horizontal and vertical) is a core functionality for cloud-native services. It is usually deployed as a centralized component, and it dynamically adjusts resource requests at the pod or workload level based on real-time running states. To achieve this, the autoscaling module needs to obtain metrics, both general ones like cpu usage and customized ones like request QPS. Those metrics must be as fresh as possible, and it must be easy enough to add new metrics for more intelligent recommendation strategies.
#### Story 2
A rescheduler (or descheduler) is a must-have ability in a large cluster, since the running state may change frequently, especially compared with the moment when pods were scheduled. So katalyst will periodically try to rebalance the global scheduling state based on real-time metrics. The requirements for time-efficiency, stability, reliability, and convenience of adding new metrics are almost the same as for autoscaling.
### Design Overview
<div align="center">
<picture>
<img src="custom-metrics-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
There are four main components in KCMS.
- KCMS Agent is embedded as a module in SysAdvisor, and it's responsible for collecting metrics from multiple sources, including raw metrics obtained from malachite, encapsulated metrics generated by SysAdvisor and so on. Those metrics will be exported through standard Prometheus interface and opentelemetry framework. Each metric item will be kept for 5 minutes before GC.
- KCMS Collector scrapes metric items from KCMS Agent periodically (every 5 seconds by default). It uses label selectors to match KCMS Agents, and always uses the Node IP of the KCMS Agent Pod as the querying endpoint. After scraping metrics, KCMS Collector sends them to KCMS Store.
- KCMS Store is an in-memory cache for metrics. It receives metric items from KCMS Collector, and builds indexes in local cache for KCMS Server to query. Each metric item will be kept for 10 minutes before GC.
- KCMS Server is an extended APIServer that implements the native custom-metrics-apiserver interface. The native APIServer passes metric requests through to KCMS Server according to the registered APIService; KCMS Server then proxies the requests to KCMS Store and returns the resulting data back to the native APIServer.
### API
KCMS implements the standard custom-metrics-apiserver interface, so users can use a metricsClient to query those metrics without any extra effort. For more information about custom-metrics-apiserver (including the basic concepts and common use cases), please refer to [sig-custom-apiserver](https://github.com/kubernetes-sigs/custom-metrics-apiserver).
```go
type CustomMetricsProvider interface {
// GetMetricByName fetches a particular metric for a particular object.
// The namespace will be empty if the metric is root-scoped.
GetMetricByName(ctx context.Context, name types.NamespacedName, info CustomMetricInfo, metricSelector labels.Selector) (*custom_metrics.MetricValue, error)
// GetMetricBySelector fetches a particular metric for a set of objects matching
// the given label selector. The namespace will be empty if the metric is root-scoped.
GetMetricBySelector(ctx context.Context, namespace string, selector labels.Selector, info CustomMetricInfo, metricSelector labels.Selector) (*custom_metrics.MetricValueList, error)
// ListAllMetrics provides a list of all available metrics at
// the current time. Note that this is not allowed to return
// an error, so it is recommended that implementors cache and
// periodically update this list, instead of querying every time.
ListAllMetrics() []CustomMetricInfo
}
// ExternalMetricsProvider is a source of external metrics.
// Metric is normally identified by a name and a set of labels/tags. It is up to a specific
// implementation how to translate metricSelector to a filter for metric values.
// Namespace can be used by the implementation for metric identification, access control or ignored.
type ExternalMetricsProvider interface {
GetExternalMetric(ctx context.Context, namespace string, metricSelector labels.Selector, info ExternalMetricInfo) (*external_metrics.ExternalMetricValueList, error)
ListAllExternalMetrics() []ExternalMetricInfo
}
type MetricsProvider interface {
CustomMetricsProvider
ExternalMetricsProvider
}
```
To get information about the detailed custom and external metrics that katalyst supports, please refer to [Katalyst-API](https://github.com/kubewharf/katalyst-api.git); we plan to provide more metrics along with the implementation of autoscaling and the rescheduler.
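For a quick smoke test of the registered API paths (without going through a metricsClient), raw queries against the aggregated API look roughly like the following; the group versions and the metric name are assumptions here, so substitute whatever your KCMS deployment actually registers.
```bash
# Discover the custom metrics resources exposed through the aggregated API.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

# Query a pod-scoped custom metric in the default namespace
# ("pod_cpu_usage" is a hypothetical metric name).
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/pod_cpu_usage"

# External metrics are served under the external.metrics.k8s.io group instead.
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
```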
### Design Details
In this section, we will try to introduce the deployment modes that KCMS supports.
#### StandAlone Mode
In StandAlone mode, all centralized components (KCMS Collector, KCMS Store, and KCMS Server) run in a single monolithic container. This means that all data dependencies are satisfied by function calls rather than RPC calls. The benefits are strong performance and convenient deployment, but reliability is limited. So it is usually used in testing environments, or in small clusters where it is acceptable to lose some metrics.
<div align="center">
<picture>
<img src="custom-metrics-monolith-mode.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
#### Multi Store Mode
In Multi Store mode, centralized components run in separate containers, and each component has multiple instances.
- For KCMS Collector, only one instance will be the leader while the others are cold standbys. Since the Collector is stateless, cold standby is enough in case the leader instance crashes and fails to restart.
- For KCMS Store, all instances share the same metric contents. KCMS Collector must perform quorum writes and KCMS Server must perform quorum reads to ensure reliability.
- For KCMS Server, all instances handle query requests equally since they are only proxies.
Multi Store mode sacrifices some performance in exchange for reliability, and is often used in production environments.
<div align="center">
<picture>
<img src="custom-metrics-multi-store-mode.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
#### Future Work: Sharding mode
Sharding mode is future work for the case where metric items are too large to be stored in one instance. In this mode, both KCMS Collector and KCMS Store will be sharded, and each instance will be responsible for an independent subset of items. To ensure reliability, each metric shard will still contain duplicated replicas.
<div align="center">
<picture>
<img src="custom-metrics-sharding-mode.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
## Alternatives
- Expand metrics in the native custom metrics server, but it doesn't support external metrics. Besides, it only contains several metrics from cadvisor, and the cost of adding new metrics is high, let alone adding labels and the like. It also doesn't treat reliability, time-efficiency, or stability as its main concerns.

Binary file not shown.

After

Width:  |  Height:  |  Size: 180 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 246 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 230 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 327 KiB

View File

@ -0,0 +1,157 @@
---
- title: Katalyst Eviction Manager
- authors:
- "csfldf"
- reviewers:
- "waynepeking348"
- "luomingmeng"
- "caohe"
- creation-date: 2022-04-24
- last-updated: 2023-02-23
- status: implemented
---
# Katalyst Eviction Manager
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Design Overview [Optional]](#design-overview-optional)
- [API [Optional]](#api-optional)
- [Design Details](#design-details)
- [Alternatives](#alternatives)
<!-- /toc -->
## Summary
Eviction is usually used as a common last-resort measure when QoS requirements fail to be satisfied. The eviction manager works as a general framework and is the single entry point for eviction logic. Different vendors can implement their own eviction strategies for their customized scenarios, and the eviction manager will gather that info from all plugins, analyze it with sorting and filtering algorithms, and then trigger eviction requests.
## Motivation
### Goals
- Make it easier for vendors or administrators to implement customized eviction strategies.
- Implement a common framework to converge eviction info from multiple eviction plugins.
### Non-Goals
- Replace the original implementation of eviction manager in kubelet.
- Implement a fully functional eviction strategy to cover all scenarios.
## Proposal
### User Stories
For a production environment containing pods with multiple QoS levels, there may be different resource or device vendors, and they usually focus on their own scenarios. For instance, disk vendors mainly keep an eye on whether contention happens at the disk level, such as disk IOPS exceeding a threshold, while NIC vendors usually care about contention at the network interface or in the protocol stack.
Compared with the kubelet eviction manager's static threshold strategy, the katalyst eviction manager provides more flexible eviction interfaces. Vendors or administrators just need to focus on implementing customized eviction strategies in plugins for pressure detection and for picking eviction candidates. There is no need for them to perform eviction requests or mark pressure conditions on Node or CNR; the katalyst eviction manager will do that as a coordinator. Without a coordinator, each plugin may choose pods from its own perspective and thus evict too many pods. The katalyst eviction manager analyzes candidates from all plugins with sorting and filtering algorithms and performs eviction requests under the control of a throttling algorithm, thus reducing disturbance.
### Design Overview
<div align="center">
<picture>
<img src="eviction-manager-overview.png" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
As an architecture overview, the system mainly contains two modules: the eviction manager and eviction plugins.
- Eviction Manager is a coordinator, and communicates with multiple Eviction Plugins. It receives pressure conditions and eviction candidates from each plugin, and makes the final eviction decision based on sorting and filtering algorithms.
- Eviction Plugins are implemented according to each individual vendor or scenario. Each plugin will only output eviction candidates or resource pressure status based on its own knowledge, and report those info to Eviction Manager periodically.
### API [Optional]
Eviction Plugins communicate with the Eviction Manager over gRPC, and the interface is sketched below.
```go
type ThresholdMetType int
const (
NotMet ThresholdMetType = iota
SoftMet
HardMet
)
type ConditionType int
const (
NodeCondition = iota
CNRCondition
)
type Condition struct {
ConditionType ConditionType
ConditionName string
MetCondition bool
}
type ThresholdMetResponse struct {
ThresholdValue float64
ObservedValue float64
ThresholdOperator string
MetType ThresholdMetType
EvictionScope string // eg. resource name
Condition Condition
}
type GetTopEvictionPodsRequest struct {
ActivePods []*v1.Pod
topN uint64
}
type GetTopEvictionPodsResponse struct {
TargetPods []*v1.Pod // length is less than or equal to topN in GetTopEvictionPodsRequest
GracePeriodSeconds uint64
}
type EvictPod struct {
Pod *v1.Pod
Reason string
GracePeriod time.Duration
ForceEvict bool
}
type GetEvictPodsResponse struct {
EvictPods []*EvictPod
Condition Condition
}
func ThresholdMet(Empty) (ThresholdMetResponse, error)
func GetTopEvictionPods(GetTopEvictionPodsRequest) (GetTopEvictionPodsResponse, error)
func GetEvictPods(Empty) (GetEvictPodsResponse, error)
```
Based on the API, the workflow is as below.
- Eviction Manager periodically calls the ThresholdMet function of each Eviction Plugin through its endpoint to get the pressure condition status, and filters out returned values that are NotMet. After comparing the smoothed pressure with the target threshold, the manager will update pressure conditions for both Node and CNR. If the hard threshold is met, the eviction manager calls the GetTopEvictionPods function of the corresponding plugin to get eviction candidates.
- Eviction Manager also periodically calls the GetEvictPods function of each Eviction Plugin to get eviction candidates explicitly. Those candidates include forced ones and soft ones; the former means the manager should trigger eviction immediately, while the latter means the manager should choose a selected subset of pods to evict.
- Eviction Manager will then aggregate all candidates, perform filtering, sorting, and rate-limiting logic, and finally send eviction requests for all selected pods.
### Design Details
In this part, we will introduce the detailed responsibilities of the Eviction Manager, along with the eviction plugins embedded in katalyst.
#### Eviction Manager
- Plugin Manager is responsible for the registration process, and constructs the endpoint for each plugin.
- Plugin Endpoints maintain the endpoint info for each plugin, including client, descriptions and so on.
- Launcher is the core calculation module. It communicates with each plugin through gRPC periodically, and performs eviction strategies to select pods for eviction.
- Evictor is the utility module to communicate with APIServer. When candidates are finally confirmed, the Launcher will call Evictor to trigger eviction.
- Condition Reporter is used to update pressure conditions for Node and CNR to prevent more pods from scheduling into the same node if resource pressure already exists.
#### Plugins
- Inner eviction plugins only depend on the raw metrics/data, and are implemented and deployed along with eviction manager. For instance, cpu suppression eviction plugin and reclaimed resource over-commit eviction plugin both belong to this type.
- Outer eviction plugins depend on the calculation results of other modules. For instance, load eviction plugin depends on the allocation results of QRM, and memory bandwidth eviction depends on the suppression strategy in SysAdvisor, so these eviction plugins should be implemented out-of-tree.
## Alternatives
- Implement pod eviction in the native kubelet eviction manager, but this would be too invasive to the kubelet source code. Besides, we would also have to implement the metrics collecting and analyzing logic in kubelet, which would bind us to a specific metric source. Finally, upgrading kubelet is much heavier than upgrading a daemonset, so frequently adding or adjusting eviction strategies would not be convenient.
- Implement pod eviction in each plugin without a centralized coordinator. When resource contention happens, this may cause a thundering herd, with more than one plugin deciding to trigger pod eviction; this problem cannot be solved without a coordinator.

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.8 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 329 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 254 KiB

View File

@ -0,0 +1,245 @@
---
- title: Katalyst Reporter Manager
- authors:
- "luomingmeng"
- reviewers:
- "waynepeking348"
- "csfldf"
- "NickrenREN"
- creation-date: 2022-05-15
- last-updated: 2023-02-22
- status: implemented
---
# Katalyst Reporter Manager
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Design Overview [Optional]](#design-overview-optional)
- [API [Optional]](#api-optional)
- [Design Details](#design-details)
- [Alternatives](#alternatives)
<!-- /toc -->
## Summary
Katalyst extends node resources with a new CRD, Custom Node Resource (CNR for short). This CRD contains both topology-aware resources and reclaimed resources, but different fields in CNR may be collected from different sources. For instance, reclaimed resources are collected from SysAdvisor according to container running status, while topology-aware resources are collected by pluggable plugins. If each plugin patched CNR by itself, there could be many API conflicts, and the QPS of API update requests could rise uncontrollably.
To solve this, Reporter Manager implements a common framework. Users can implement each resource reporter with a pluggable plugin, and the manager will merge the reported resources into the corresponding fields, and update the CR through one API request.
## Motivation
### Goals
- Implement a common framework to merge different fields for node-level CR, and update through one API request.
- Make the resource reporting component pluggable.
### Non-Goals
- Replace native resource representations in nodes.
- Replace device manager or device plugins for scalar resources in nodes.
- Implement the general resource, metrics, or device reporting logic.
## Proposal
### User Stories
#### Story 1
SysAdvisor is the core node-level resource recommendation module in Katalyst, and it will predict the resource requirements for non-reclaimed pods in real-time. If allocatable resources are surplus after matching up with the resource requirements, Katalyst will try to schedule reclaimed pods into this node for better resource utilization. And thus those reclaimed resources should be reported in CNR.
Since the reporting source is in SysAdvisor, Reporter Manager provides a mechanism for SysAdvisor to register an out-of-tree resource reporting plugin. So the boundary is clear: SysAdvisor is responsible for the calculation logic of reclaimed resources, while Reporter Manager is responsible for updating them in CNR.
#### Story 2
In the production environment, we may have a lot of resources or devices that are affiliated with micro topology. To make that micro topology info visible to the scheduling process, katalyst should report it in CNR. But those resources or devices are usually bound to specific vendors. If each vendor updated CNR by itself, there could be many API conflicts, and the QPS of API update requests could rise uncontrollably.
In this case, Reporter Manager works as a coordinator. Each vendor can implement its own reporter plugin and register it with Reporter Manager. The latter will merge all the reported fields, resolve conflicts, and update CNR through one API request.
### Design Overview
<div align="center">
<picture>
<img src="reporter-manager-overview.jpg" width=80% title="Katalyst Overview" loading="eager" />
</picture>
</div>
- Reporter is responsible for updating CR to APIServer. For each CRD, there should exist one corresponding reporter. It receives all reporting fields from the Reporter Plugin Manager, validates the legality, merges them into the CR, and finally updates by clientset.
- Reporter Plugin Manager is the core framework. It listens for registration events from each Reporter Plugin, periodically polls each plugin to obtain reported info, and dispatches that info to the different Reporters.
- Reporter Plugin is implemented by each resource or device vendor, and it registers itself with Reporter Plugin Manager before it starts. Whenever it is called by Reporter Plugin Manager, it should return fresh resource or device info, along with enough information about which CRD and which fields that info maps to.
### API
Reporter Plugins communicate with Reporter Plugin Manager over gRPC, and the protobuf is shown below. Each Reporter Plugin can report resources for more than one CRD, so it should explicitly nominate the GVK and fields in the protobuf.
```protobuf
message Empty {
}
enum FieldType {
Spec = 0;
Status = 1;
Metadata = 2;
}
message ReportContent {
k8s.io.apimachinery.pkg.apis.meta.v1.GroupVersionKind groupVersionKind = 1;
repeated ReportField field = 2;
}
message ReportField {
FieldType fieldType = 1;
string fieldName = 2;
bytes value = 3;
}
message GetReportContentResponse {
repeated ReportContent content = 1;
}
service ReporterPlugin {
rpc GetReportContent(Empty) returns (GetReportContentResponse) {}
rpc ListAndWatchReportContent(Empty) returns (stream GetReportContentResponse) {}
}
```
Reporter Plugin Manager will implement the interfaces below.
```go
type AgentPluginHandler interface {
GetHandlerType() string // "ReporterPlugin"
PluginHandler
}
type PluginHandler interface {
// Validate returns an error if the information provided by
// the potential plugin is erroneous (unsupported version, ...)
ValidatePlugin(pluginName string, endpoint string, versions []string) error
// RegisterPlugin is called so that the plugin can be registered by any
// plugin consumer
// Error encountered here can still be Notified to the plugin.
RegisterPlugin(pluginName, endpoint string, versions []string) error
// DeRegister is called once the pluginwatcher observes that the socket has
// been deleted.
DeRegisterPlugin(pluginName string)
}
```
Currently, katalyst only supports updating resources for CNR and the native Node object. If you want to add a new node-level CRD and report resources using this mechanism, you should implement a new Reporter in katalyst-core with the following interface.
```go
type Reporter interface {
// Update receives ReportField list from report manager, the reporter implementation
// should be responsible for assembling and updating the specific object
Update(ctx context.Context, fields []*v1alpha1.ReportField) error
// Run starts the syncing logic of reporter
Run(ctx context.Context)
}
```
### Design Details
To simplify, Reporter Manager only supports FieldType at the top level of each object, i.e. only Status, Spec, and Metadata are supported. If the object has embedded structs in Spec, you should explicitly nominate them in `FieldName`.
For `FieldName`, we will introduce the details of how to fill in the protobuf for each type.
#### Map
Different Reporter Plugins can report to the same Map field with different keys, and Reporter Manager will merge them together. But if multiple plugins report to the same Map field with the same key, Reporter Manager will always override the former with the latter, and the priority is based on registration order to make sure the override result is always deterministic.
- Plugin A (the former)
```
Field: []*v1alpha1.ReportField{
{
FieldType: v1alpha1.FieldType_Status,
FieldName: "ReclaimedResourceCapacity",
Value: []byte(`&{"cpu": "10"}`),
},
}
```
- Plugin B (the latter)
```
Field: []*v1alpha1.ReportField{
{
FieldType: v1alpha1.FieldType_Status,
FieldName: "ReclaimedResourceCapacity",
Value: []byte(`&{"memory": "24Gi"}`),
},
}
```
- Result (json)
```
{
"reclaimedResourceCapacity": {
"cpu": "10",
"memory": "24Gi"
}
}
```
#### Slice and Array
Different Reporter Plugins can report to the same Slice or Array field, and Reporter Manager will merge them together.
- Plugin A
```
Field: []*v1alpha1.ReportField{
{
FieldType: v1alpha1.FieldType_Status,
FieldName: "ResourceStatus",
Value: []byte(`[&{"numa": "numa-1", "available": {"device-1": "12"}}]`),
},
}
```
- Plugin B
```
Field: []*v1alpha1.ReportField{
{
FieldType: v1alpha1.FieldType_Status,
FieldName: "ResourceStatus",
Value: []byte(`[&{"numa": "numa-2", "available": {"device-2": "13"}}]`),
},
}
```
- Result (json)
```
{
    "resourceStatus": [
        {"numa": "numa-1", "available": {"device-1": "12"}},
        {"numa": "numa-2", "available": {"device-2": "13"}}
    ]
}
```
#### Common Field
If different Reporter Plugins try to report to the same common field (int, string, ...), Reporter Manager will always override the former with the latter, and the priority is based on registration order to make sure the override result is always deterministic.
- Plugin A (the former)
```
Field: []*v1alpha1.ReportField{
{
FieldType: v1alpha1.FieldType_Status,
FieldName: "TopologyStatus",
Value: []byte(`&{"sockets": [&{"socketID": 0}]}`),
},
}
```
- Plugin B (the latter)
```
Field: []*v1alpha1.ReportField{
{
FieldType: v1alpha1.FieldType_Status,
FieldName: "TopologyStatus",
Value: []byte(`&{"sockets": [&{"socketID": 1}]}`),
},
}
```
- Result (json)
```
{
    "topologyStatus": {
        "sockets": [{"socketID": 1}]
    }
}
```
## Alternatives
- Have all CNR resources collected and updated by a single monolithic component, but this may make that component too complicated to maintain. In the meantime, all logic would have to be implemented in-tree, which makes it inconvenient to extend when new devices are needed.

Binary file not shown.

After

Width:  |  Height:  |  Size: 128 KiB

View File

@ -0,0 +1,59 @@
---
- title: Katalyst SysAdvisor
- authors:
- "sun-yuliang"
- reviewers:
- "waynepeking348"
- "csfldf"
- "Aiden-cn"
- "pendoragon"
- creation-date: 2022-06-15
- last-updated: 2023-02-23
- status: provisional
---
# Katalyst SysAdvisor (TBD)
SysAdvisor is the core node-level resource recommendation module, and it uses statistics-based, indicator-based, and ML-based algorithms for different scenarios.
The detailed proposal for Katalyst SysAdvisor is in progress and will be released in the next quarter.
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Design Overview [Optional]](#design-overview-optional)
- [API [Optional]](#api-optional)
- [Design Details](#design-details)
- [Alternatives](#alternatives)
<!-- /toc -->
## Summary
## Motivation
### Goals
### Non-Goals
## Proposal
### User Stories
#### Story 1
#### Story 2
### Design Overview
### Design Details
## Alternatives

View File

@ -0,0 +1,59 @@
---
- title: Katalyst Dynamic Configuration
- authors:
- "luomingmeng"
- reviewers:
- "waynepeking348"
- "csfldf"
- "Aiden-cn"
- "sun-yuliang"
- creation-date: 2022-07-06
- last-updated: 2023-02-23
- status: provisional
---
# Katalyst Dynamic Configuration (TBD)
Katalyst Dynamic Configuration, i.e. `KatalystCustomConfig` (`KCC` for short), is a general framework that enables each daemon component to dynamically adjust its working status without restarting or re-deploying.
The detailed proposal for Katalyst Dynamic Configuration is in progress and will be released in the next quarter.
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Design Overview](#design-overview-optional)
- [API](#api-optional)
- [Design Details](#design-details)
- [Alternatives](#alternatives)
<!-- /toc -->
## Summary
## Motivation
### Goals
### Non-Goals
## Proposal
### User Stories
#### Story 1
#### Story 2
### Design Overview
### Design Details
## Alternatives

View File

@ -270,7 +270,7 @@ events are registered to the informer.
<div align="center">
<picture>
<img src="/docs/imgs/fit-plugin.jpg" width=80% title="QoSAwareNodeResourcesFit Plugin" loading="eager" />
<img src="/docs/imgs/scheduler-fit-plugin.jpg" width=80% title="QoSAwareNodeResourcesFit Plugin" loading="eager" />
</picture>
</div>
@ -320,7 +320,7 @@ We will implement the following extension points:
<div align="center">
<picture>
<img src="/docs/imgs/balanced-plugin.jpg" width=80% title="QoSAwareNodeResourcesBalancedAllocation Plugin" loading="eager" />
<img src="/docs/imgs/scheduler-balanced-plugin.jpg" width=80% title="QoSAwareNodeResourcesBalancedAllocation Plugin" loading="eager" />
</picture>
</div>

91
docs/roadmap.md Normal file
View File

@ -0,0 +1,91 @@
# RoadMap - katalyst core functionalities
This roadmap defines the core functionalities and features that katalyst plans to deliver; the detailed timeline breakdown will be published in the next quarter.
<br>
<table>
<tbody>
<tr>
<th>Core Functionalities</th>
<th align="center">Core Features</th>
<th align="right">Milestone breakdown</th>
</tr>
<tr>
<td>Collocation for Pods with all QoS</td>
<td>
<ul>
<li>Improve the isolation mechanism</li>
<ul>
<li>Intel RDT, iocost, tc, cgroup v2 ...</li>
</ul>
<li>Improve eviction strategy for more resource dimensions</li>
<ul>
<li>PSI based eviction, IO bandwidth based eviction ...</li>
</ul>
<li>Improve resource advertising</li>
<ul>
<li>numa-aware reclaimed resource ...</li>
</ul>
<li>Support all QoS classes</li>
<li>QoS enhancement</li>
<ul>
<li>numa binding, numa exclusive policy ...</li>
</ul>
</ul>
</td>
<td>
<ul>
<li>2023/07/31: v1 release</li>
</ul>
</td>
</tr>
<tr>
<td>Dynamic Resource Management</td>
<td>
<ul>
<li>Implement resource estimation algorithm to enhance accuracy</li>
<ul>
<li>indicator-based resource estimation ...</li>
<li>ml-based resource estimation ...</li>
</ul>
<li>Design algorithm interface to support more estimation algorithms as plugins</li>
</ul>
</td>
<td>
<ul>
<li>2023/07/31: v1 release</li>
</ul>
</td>
</tr>
<tr>
<td>Rescheduling Based on Workload Profiling</td>
<td>
<ul>
<li>Provide real-time metrics via custom-metrics-apiserver</li>
<li>Collect and extract workload profile</li>
<li>Implement rescheduling to balance the resource usage across the cluster</li>
</ul>
</td>
<td>/</td>
</tr>
<tr>
<td>Topology Aware Resource Scheduling and Allocation</td>
<td>
<ul>
<li>Support all kinds of heterogeneous devices</li>
<li>Support more topology-aware devices (e.g. NIC, disk, etc.)</li>
<li>Improve device utilization with virtualization mechanism</li>
</ul>
</td>
<td>/</td>
</tr>
<tr>
<td>Elastic Resource Management</td>
<td>
<ul>
<li>Optimize HPA and VPA to support elastic resource management</li>
</ul>
</td>
<td>/</td>
</tr>
</tbody>
</table>

View File

@ -0,0 +1,310 @@
# Tutorial - katalyst collocation best-practices
This guide introduces best practices for collocation in katalyst with an end-to-end example. Follow the steps below to get a glimpse of the integrated functionality, and then you can replace the sample yamls with your own workloads when applying the system in your production environment.
## Prerequisite
Please make sure you have deployed all pre-dependent components before moving on to the next step.
- Install enhanced kubernetes based on [install-enhanced-k8s.md](../install-enhanced-k8s.md)
- Install components according to the instructions in [Charts](https://github.com/kubewharf/charts.git). To enable the full functionality of collocation, the following components are required while others are optional.
- Agent
- Controller
- Scheduler
## Functionalities
Before going to the next step, let's assume the following settings and configurations as the baseline:
- Total resources are set as 48 cores and 195924424Ki per node;
- Reserved resources for pods with shared_cores are set as 4 cores and 5Gi, which means we'll always keep at least this amount of resources for those pods to handle bursting requirements.
Based on the assumptions above, you can follow the steps below to dive deep into the collocation workflow.
### Resource Reporting
After installation, the resource reporting module will report reclaimed resources. Since there are no pods running yet, the reclaimed resources will be calculated as:
`reclaimed_resources = total_resources - reserve_resources`
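With the baseline above, that works out to `48 - 4 = 44` cores (44000m, shown as 44k millicpu) and `195924424Ki - 5Gi = 190681544Ki`, i.e. 195257901056 bytes.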
When you refer to CNR, reclaimed resources will be as follows, and it means that pods with reclaimed_cores can be scheduled onto this node.
```
status:
resourceAllocatable:
katalyst.kubewharf.io/reclaimed_memory: "195257901056"
katalyst.kubewharf.io/reclaimed_millicpu: 44k
resourceCapacity:
katalyst.kubewharf.io/reclaimed_memory: "195257901056"
katalyst.kubewharf.io/reclaimed_millicpu: 44k
```
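To check this on your own cluster, a query along the following lines should work; the plural resource name is assumed to be `customnoderesources`, so confirm it with `kubectl api-resources` if your install differs.
```bash
# Inspect the reclaimed resources reported on a node-level CNR object.
# "customnoderesources" is an assumed plural name; check `kubectl api-resources` if it differs.
kubectl get customnoderesources <node-name> -o yaml
```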
Submit several pods with shared_cores, and put pressure on those workloads to make the reclaimed resources fluctuate along with the running state of the workload.
```
$ kubectl create -f ./examples/shared-normal-pod.yaml
```
After being successfully scheduled, the pod starts running with cpu usage ~= 1 core and cpu load ~= 1, and the reclaimed resources will change according to the formula below. We skip memory here since it is more difficult to reproduce with an accurate value than cpu, but the principle is similar.
`reclaimed_cpu = allocatable - round(ceil(reserve + max(usage, load.1min, load.5min)))`
```
status:
resourceAllocatable:
katalyst.kubewharf.io/reclaimed_millicpu: 42k
resourceCapacity:
katalyst.kubewharf.io/reclaimed_millicpu: 42k
```
You can then put pressure on those pods to simulate request peaks with `stress`, and the cpu load will rise to approximately 3, shrinking the reclaimed cpu to 40k.
```
$ kubectl exec shared-normal-pod -it -- stress -c 2
```
```
status:
resourceAllocatable:
katalyst.kubewharf.io/reclaimed_millicpu: 40k
resourceCapacity:
katalyst.kubewharf.io/reclaimed_millicpu: 40k
```
### Scheduling Strategy
Katalyst provides several scheduling strategies for scheduling pods with reclaimed_cores. You can alter the default scheduling config, and then create a deployment with reclaimed_cores.
```
$ kubectl create -f ./examples/reclaimed-deployment.yaml
```
#### Spread
Spread is the default scheduling strategy. It tries to spread pods among all suitable nodes, and is usually used to balance workload contention across the cluster. Apply the Spread policy with the command below, and pods will be scheduled onto each node evenly.
```
$ kubectl apply -f ./examples/scheduler-policy-spread.yaml
$ kubectl get po -owide
```
```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
reclaimed-pod-5f7f69d7b8-4lknl 1/1 Running 0 3m31s 192.168.1.169 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-656bz 1/1 Running 0 3m31s 192.168.2.103 node-2 <none> <none>
reclaimed-pod-5f7f69d7b8-89n46 1/1 Running 0 3m31s 192.168.0.129 node-3 <none> <none>
reclaimed-pod-5f7f69d7b8-bcpbs 1/1 Running 0 3m31s 192.168.1.171 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-bq22q 1/1 Running 0 3m31s 192.168.0.126 node-3 <none> <none>
reclaimed-pod-5f7f69d7b8-jblgk 1/1 Running 0 3m31s 192.168.0.128 node-3 <none> <none>
reclaimed-pod-5f7f69d7b8-kxqdl 1/1 Running 0 3m31s 192.168.0.127 node-3 <none> <none>
reclaimed-pod-5f7f69d7b8-mdh2d 1/1 Running 0 3m31s 192.168.1.170 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-p2q7s 1/1 Running 0 3m31s 192.168.2.104 node-2 <none> <none>
reclaimed-pod-5f7f69d7b8-x7lqh 1/1 Running 0 3m31s 192.168.2.102 node-2 <none> <none>
```
#### Binpack
Binpack tries to schedule pods onto a single node until that node cannot accommodate more, and it is usually used to squeeze workloads into a bounded set of nodes to raise utilization. Apply the Binpack policy with the command below, and pods will be scheduled onto one intensive node.
```
$ kubectl apply -f ./examples/scheduler-policy-binpack.yaml
$ kubectl get po -owide
```
```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
reclaimed-pod-5f7f69d7b8-7mjbz 1/1 Running 0 36s 192.168.1.176 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-h8nmk 1/1 Running 0 36s 192.168.1.177 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-hfhqt 1/1 Running 0 36s 192.168.1.181 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-nhx4h 1/1 Running 0 36s 192.168.1.182 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-s8sx7 1/1 Running 0 36s 192.168.1.178 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-szn8z 1/1 Running 0 36s 192.168.1.180 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-vdm7c 1/1 Running 0 36s 192.168.0.133 node-3 <none> <none>
reclaimed-pod-5f7f69d7b8-vrr8w 1/1 Running 0 36s 192.168.1.179 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-w9hv4 1/1 Running 0 36s 192.168.2.109 node-2 <none> <none>
reclaimed-pod-5f7f69d7b8-z2wqv 1/1 Running 0 36s 192.168.4.200 node-4 <none> <none>
```
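Under the hood, the Spread and Binpack policies roughly correspond to least-allocated and most-allocated scoring over the reclaimed resources (see the example scheduler ConfigMaps shipped with this repository). The Go sketch below illustrates how such scorers rank a candidate node, assuming a 0-100 score range and ignoring per-resource weights:
```
package main

import "fmt"

// leastAllocatedScore favors emptier nodes (Spread-like behavior), while
// mostAllocatedScore favors fuller nodes (Binpack-like behavior).
// requested and capacity are per-resource totals on the candidate node.
func leastAllocatedScore(requested, capacity float64) float64 {
    if capacity == 0 {
        return 0
    }
    return (capacity - requested) / capacity * 100
}

func mostAllocatedScore(requested, capacity float64) float64 {
    if capacity == 0 {
        return 0
    }
    return requested / capacity * 100
}

func main() {
    // A node with 10k of 40k reclaimed_millicpu already requested.
    fmt.Println(leastAllocatedScore(10000, 40000)) // 75: attractive under Spread
    fmt.Println(mostAllocatedScore(10000, 40000))  // 25: unattractive under Binpack
}
```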
#### Custom
Besides those in-tree policies, you can also use self-defined scoring functions to customize the scheduling strategy. In the example below, we use a self-defined RequestedToCapacityRatio scorer as the scheduling policy, and it behaves the same as the Binpack policy.
```
$ kubectl apply -f ./examples/scheduler-policy-custom.yaml
$ kubectl get po -owide
```
```
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
reclaimed-pod-5f7f69d7b8-547zk 1/1 Running 0 7s 192.168.1.191 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-6jzbs 1/1 Running 0 6s 192.168.1.193 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-6v7kr 1/1 Running 0 7s 192.168.2.111 node-2 <none> <none>
reclaimed-pod-5f7f69d7b8-9vrb9 1/1 Running 0 6s 192.168.1.192 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-dnn7n 1/1 Running 0 7s 192.168.4.204 node-4 <none> <none>
reclaimed-pod-5f7f69d7b8-jtgx9 1/1 Running 0 7s 192.168.1.189 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-kjrlv 1/1 Running 0 7s 192.168.0.139 node-3 <none> <none>
reclaimed-pod-5f7f69d7b8-mr85t 1/1 Running 0 6s 192.168.1.194 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-q4dz5 1/1 Running 0 7s 192.168.1.188 node-1 <none> <none>
reclaimed-pod-5f7f69d7b8-v28nv 1/1 Running 0 7s 192.168.1.190 node-1 <none> <none>
```
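The custom policy scores a node by linearly interpolating its requested-to-capacity ratio over the configured shape points (utilization 0 maps to score 0 and utilization 100 maps to score 10 in the example policy), so fuller nodes score higher, which is why it behaves like Binpack. Below is a minimal Go sketch of that interpolation, assuming the two-point shape from the example:
```
package main

import "fmt"

// shapePoint mirrors a shape entry in the RequestedToCapacityRatio arguments.
type shapePoint struct {
    utilization float64 // percentage of capacity that is requested
    score       float64
}

// interpolate walks the shape points (assumed sorted by utilization) and
// linearly interpolates the score between the two surrounding points.
func interpolate(shape []shapePoint, utilization float64) float64 {
    if utilization <= shape[0].utilization {
        return shape[0].score
    }
    for i := 1; i < len(shape); i++ {
        if utilization <= shape[i].utilization {
            p0, p1 := shape[i-1], shape[i]
            ratio := (utilization - p0.utilization) / (p1.utilization - p0.utilization)
            return p0.score + ratio*(p1.score-p0.score)
        }
    }
    return shape[len(shape)-1].score
}

func main() {
    // shape: utilization 0 -> score 0, utilization 100 -> score 10.
    shape := []shapePoint{{0, 0}, {100, 10}}
    fmt.Println(interpolate(shape, 25)) // 2.5
    fmt.Println(interpolate(shape, 80)) // 8: fuller nodes score higher
}
```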
### QoS Controlling
After pods are successfully scheduled, katalyst enters its main QoS-control loop to dynamically adjust resource allocations on each node. In the current version, we use cpusets to isolate the scheduling domains of pods in each pool, and memory limits to cap memory usage.
Before going to the next step, remember to delete the previous pods to start from a clean environment.
Apply a pod with shared_cores using the command below. After the ramp-up period, the cpuset for the shared pool will contain 6 cores in total (i.e. 4 cores reserved for bursting, plus 2 cores for regular requirements), and the remaining cores are considered available for reclaimed pods.
```
$ kubectl create -f ./examples/shared-normal-pod.yaml
```
```
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan 3 16:18:31 CST 2023
11,22-23,35,46-47
```
Apply a pod with reclaimed_cores using the command below, and the cpuset for the reclaimed pool will contain the remaining 42 cores.
```
kubectl create -f ./examples/reclaimed-normal-pod.yaml
```
```
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan 3 16:23:20 CST 2023
0-10,12-21,24-34,36-45
```
Put pressure on the previous pod with shared_cores to make its load rise to 3, and the cpuset for the shared pool will grow to 8 cores in total (i.e. 4 cores reserved for bursting, plus 4 cores for regular requirements), while the reclaimed pool shrinks to 40 cores accordingly.
```
$ kubectl exec shared-normal-pod -it -- stress -c 2
```
```
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan 3 16:25:23 CST 2023
10-11,22-23,34-35,46-47
```
```
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan 3 16:28:32 CST 2023
0-9,12-21,24-33,36-45
```
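The `get_cpuset.sh` helper used above simply prints the cpuset assigned to the pod's container. Below is a minimal Go sketch of the same check, assuming cgroup v1 and that you already know the container's cpuset cgroup directory (resolving that path from a pod name is what the helper script handles for you):
```
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// Prints the cpuset.cpus value of the cgroup directory passed as the first
// argument, e.g. /sys/fs/cgroup/cpuset/kubepods/.../<container-id>.
func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: cpuset <cgroup-dir>")
        os.Exit(1)
    }
    data, err := os.ReadFile(filepath.Join(os.Args[1], "cpuset.cpus"))
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println(strings.TrimSpace(string(data)))
}
```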
### Pod Eviction
Eviction is commonly used as a backstop when QoS cannot be satisfied, and we should always make sure pods with higher priority (i.e. shared_cores) meet their QoS by evicting pods with lower priority (i.e. reclaimed_cores). Katalyst contains both agent-side and centralized evictions to meet different requirements.
Before going to the next step, remember to delete the previous pods to start from a clean environment.
#### Agent Eviction
Currently, katalyst provides several in-tree agent eviction implementations.
##### Resource OverCommit
Since reclaimed resources fluctuate with the running state of pods with shared_cores, they may shrink to a critical point where reclaimed pods can no longer run properly. In this case, katalyst will evict those pods so that they can be rebalanced onto other nodes, and the comparison formula is as follows:
`sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold`
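A Go sketch of that check, using hypothetical request numbers for illustration (the real threshold is configured dynamically via KCC, as shown later):
```
package main

import "fmt"

// overcommitted implements the comparison above:
// sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold.
func overcommitted(requested []float64, allocatable, threshold float64) bool {
    var sum float64
    for _, r := range requested {
        sum += r
    }
    return sum > allocatable*threshold
}

func main() {
    // Hypothetical numbers: two reclaimed pods requesting 12k and 10k millicpu on a
    // node whose allocatable reclaimed_millicpu has shrunk to 4k; with threshold 5,
    // 22000 > 20000, so the eviction plugin reports pressure.
    fmt.Println(overcommitted([]float64{12000, 10000}, 4000, 5))
}
```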
Apply several pods (including shared_cores and reclaimed_cores), and put some pressure on them to reduce the allocatable reclaimed resources until they fall below the tolerance threshold. This will eventually trigger the eviction of pod reclaimed-large-pod-2.
```
$ kubectl create -f ./examples/shared-large-pod.yaml ./examples/reclaimed-large-pod.yaml
```
```
$ kubectl exec shared-large-pod-2 -it -- stress -c 40
```
```
status:
resourceAllocatable:
katalyst.kubewharf.io/reclaimed_millicpu: 4k
resourceCapacity:
katalyst.kubewharf.io/reclaimed_millicpu: 4k
```
```
$ kubectl get event -A | grep evict
default 43s Normal EvictCreated pod/reclaimed-large-pod-2 Successfully create eviction; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
default 8s Normal EvictSucceeded pod/reclaimed-large-pod-2 Evicted pod has been deleted physically; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
```
The default threshold for reclaimed resources is 5; you can change it dynamically with KCC.
```$ kubectl create -f ./examples/kcc-eviction-reclaimed-resource-config.yaml```
##### Memory
Memory eviction is implemented in two parts: NUMA-level eviction and system-level eviction. The former is used together with the numa-binding enhancement, while the latter covers more general cases; in this tutorial we mainly demonstrate the latter. For each level, katalyst triggers memory eviction based on memory usage and the Kswapd active rate, to avoid falling into the kernel's slow path for memory allocation.
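Below is a rough Go sketch of the system-level detection, assuming a free-memory watermark plus the Kswapd-rate thresholds from the example KCC applied later in this section (the real plugin evaluates these signals per detection cycle; the numbers here are hypothetical):
```
package main

import "fmt"

// memoryPressure sketches the two system-level signals: free memory dropping
// below a watermark, and the kswapd rate exceeding its threshold for more than
// the allowed number of detection cycles (systemKswapdRateThreshold and
// systemKswapdRateExceedTimesThreshold in the example KCC).
func memoryPressure(freeBytes, freeWatermarkBytes uint64,
    kswapdRate, kswapdRateThreshold uint64,
    exceedTimes, exceedTimesThreshold int) bool {
    if freeBytes < freeWatermarkBytes {
        return true
    }
    return kswapdRate > kswapdRateThreshold && exceedTimes > exceedTimesThreshold
}

func main() {
    // 2 GiB free against a 4 GiB watermark already reports pressure; a kswapd
    // rate of 3000 over a 2000 threshold for 2 cycles (1 allowed) would as well.
    fmt.Println(memoryPressure(2<<30, 4<<30, 3000, 2000, 2, 1))
}
```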
Apply several pods (including shared_cores and reclaimed_cores).
```$ kubectl create -f ./examples/shared-large-pod.yaml ./examples/reclaimed-large-pod.yaml```
Apply KCC to alter the default free-memory and Kswapd rate thresholds.
```$ kubectl create -f ./examples/kcc-eviction-memory-system-config.yaml```
###### For Memory Usage
Exec into reclaimed-large-pod-2 and request enough memory. When free memory falls below the target, katalyst will trigger eviction for pods with reclaimed_cores, choosing the pod that uses the most memory, and a MemoryPressure taint will be added to the CNR.
```
$ kubectl exec -it reclaimed-large-pod-2 -- bash
$ stress --vm 1 --vm-bytes 175G --vm-hang 1000 --verbose
```
```
$ kubectl get event -A | grep evict
default 2m40s Normal EvictCreated pod/reclaimed-large-pod-2 Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default 2m5s Normal EvictSucceeded pod/reclaimed-large-pod-2 Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
```
```
taints:
- effect: NoSchedule
key: node.katalyst.kubewharf.io/MemoryPressure
timeAdded: "2023-01-09T06:32:08Z"
```
###### For Kswapd
Log in to the working node and put some pressure on system memory. When the Kswapd active rate exceeds the target threshold (default = 1), katalyst will trigger eviction for pods with both reclaimed_cores and shared_cores, but reclaimed_cores pods are evicted before shared_cores pods, and a MemoryPressure taint will again be added to the CNR.
```
$ stress --vm 1 --vm-bytes 180G --vm-hang 1000 --verbose
```
```
$ kubectl get event -A | grep evict
default 2m2s Normal EvictCreated pod/reclaimed-large-pod-2 Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default 92s Normal EvictSucceeded pod/reclaimed-large-pod-2 Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
```
```
taints:
- effect: NoSchedule
key: node.katalyst.kubewharf.io/MemoryPressure
timeAdded: "2023-01-09T06:32:08Z"
```
##### Load
For pods with shared_cores, if any pod creates too many threads, the CFS scheduling period may be split into small slices, making throttling more frequent and thus hurting workload performance. To solve this, katalyst implements load eviction: it watches the load level and triggers taint and eviction actions based on thresholds, using the comparison formulas below.
``` soft: load > resource_pool_cpu_amount ```
``` hard: load > resource_pool_cpu_amount * threshold ```
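A Go sketch of the soft/hard comparison, using a hypothetical hard threshold for illustration:
```
package main

import "fmt"

type loadAction int

const (
    none      loadAction = iota // below the soft bound
    taintOnly                   // soft bound exceeded: taint the CNR, keep pods running
    evict                       // hard bound exceeded: evict the heaviest thread creators
)

// classifyLoad implements the two comparisons above.
func classifyLoad(load, poolCPUs, hardThreshold float64) loadAction {
    switch {
    case load > poolCPUs*hardThreshold:
        return evict
    case load > poolCPUs:
        return taintOnly
    default:
        return none
    }
}

func main() {
    // Shared pool of 8 cores, hypothetical hard threshold of 2.0:
    fmt.Println(classifyLoad(6, 8, 2.0))  // 0: below the soft bound
    fmt.Println(classifyLoad(10, 8, 2.0)) // 1: taint only
    fmt.Println(classifyLoad(50, 8, 2.0)) // 2: evict
}
```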
Apply several pods (including shared_cores and reclaimed_cores).
```
$ kubectl create -f ./examples/shared-large-pod.yaml ./examples/reclaimed-large-pod.yaml
```
Put some pressure on the shared pool until the load exceeds the soft bound. In this case, a taint will be added to the CNR to prevent new pods from being scheduled, but the existing pods will keep running.
```
$ kubectl exec shared-large-pod-2 -it -- stress -c 50
```
```
taints:
- effect: NoSchedule
key: node.katalyst.kubewharf.io/CPUPressure
timeAdded: "2023-01-05T05:26:51Z"
```
Put more pressure on the shared pool until the load exceeds the hard bound. In this case, katalyst will evict the pods that create the largest number of threads.
```
$ kubectl exec shared-large-pod-2 -it -- stress -c 100
```
```
$ kubectl get event -A | grep evict
67s Normal EvictCreated pod/shared-large-pod-2 Successfully create eviction; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
68s Normal Killing pod/shared-large-pod-2 Stopping container stress
32s Normal EvictSucceeded pod/shared-large-pod-2 Evicted pod has been deleted physically; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
```
#### Centralized Eviction
In some cases, the agents may suffer from single-point failures, i.e. in a large cluster a daemon may stop working due to various abnormal conditions, leaving the pods running on that node out of control. To relieve this, katalyst's centralized eviction will try to evict all reclaimed pods on such nodes.
By default, if the node's readiness state keeps failing for 10 minutes, katalyst will taint the CNR as unschedulable to make sure no more pods with reclaimed_cores can be scheduled onto this node; if the readiness state keeps failing for 20 minutes, it will try to evict all pods with reclaimed_cores.
```
taints:
- effect: NoScheduleForReclaimedTasks
key: node.kubernetes.io/unschedulable
```
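A Go sketch of that escalation logic, assuming the default 10-minute taint and 20-minute eviction windows mentioned above:
```
package main

import (
    "fmt"
    "time"
)

// centralAction decides what the centralized controller does based on how long
// the node's readiness state has been failing.
func centralAction(notReadyFor time.Duration) string {
    switch {
    case notReadyFor >= 20*time.Minute:
        return "evict all reclaimed_cores pods"
    case notReadyFor >= 10*time.Minute:
        return "taint CNR with NoScheduleForReclaimedTasks"
    default:
        return "no action"
    }
}

func main() {
    fmt.Println(centralAction(5 * time.Minute))  // no action
    fmt.Println(centralAction(12 * time.Minute)) // taint CNR with NoScheduleForReclaimedTasks
    fmt.Println(centralAction(25 * time.Minute)) // evict all reclaimed_cores pods
}
```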
## What's Next
We will provide more tutorials along with future feature releases.

View File

@@ -15,21 +15,22 @@
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: KatalystCustomConfig
metadata:
name: katalyst-agent-config
name: eviction-configuration
spec:
targetType:
group: config.katalyst.kubewharf.io
resource: katalystagentconfigs
resource: evictionconfigurations
version: v1alpha1
---
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: KatalystAgentConfig
kind: EvictionConfiguration
metadata:
name: default
spec:
config:
reclaimedResourcesEvictionPluginConfig:
evictionThreshold:
"katalyst.kubewharf.io/reclaimed_millicpu": 10
"katalyst.kubewharf.io/reclaimed_memory": 10
evictionPluginsConfig:
memoryEvictionPluginConfig:
enableNumaLevelDetection: false
systemKswapdRateExceedTimesThreshold: 1
systemKswapdRateThreshold: 2000

View File

@@ -0,0 +1,36 @@
# Copyright 2022 The Katalyst Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: KatalystCustomConfig
metadata:
name: eviction-configuration
spec:
targetType:
group: config.katalyst.kubewharf.io
resource: evictionconfigurations
version: v1alpha1
---
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: EvictionConfiguration
metadata:
name: default
spec:
config:
evictionPluginsConfig:
reclaimedResourcesEvictionPluginConfig:
evictionThreshold:
"katalyst.kubewharf.io/reclaimed_millicpu": 10
"katalyst.kubewharf.io/reclaimed_memory": 10

View File

@@ -0,0 +1,80 @@
# Copyright 2022 The Katalyst Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ConfigMap
metadata:
name: katalyst-scheduler-config
namespace: katalyst-system
data:
scheduler-config.yaml: |-
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
resourceLock: leases
resourceName: katalyst-scheduler
resourceNamespace: katalyst-system
profiles:
- schedulerName: katalyst-scheduler
plugins:
preFilter:
enabled:
- name: QoSAwareNodeResourcesFit
filter:
enabled:
- name: QoSAwareNodeResourcesFit
score:
enabled:
- name: QoSAwareNodeResourcesFit
weight: 15
- name: QoSAwareNodeResourcesBalancedAllocation
weight: 1
disabled:
- name: NodeResourcesFit
- name: NodeResourcesBalancedAllocation
reserve:
enabled:
- name: QoSAwareNodeResourcesFit
pluginConfig:
- name: NodeResourcesFit
args:
ignoredResourceGroups:
- katalyst.kubewharf.io
- name: QoSAwareNodeResourcesFit
args:
scoringStrategy:
type: MostAllocated
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
reclaimedResources:
- name: "katalyst.kubewharf.io/reclaimed_millicpu"
weight: 1
- name: "katalyst.kubewharf.io/reclaimed_memory"
weight: 1
- name: QoSAwareNodeResourcesBalancedAllocation
args:
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
reclaimedResources:
- name: "katalyst.kubewharf.io/reclaimed_millicpu"
weight: 1
- name: "katalyst.kubewharf.io/reclaimed_memory"
weight: 1

View File

@@ -0,0 +1,92 @@
# Copyright 2022 The Katalyst Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ConfigMap
metadata:
name: katalyst-scheduler-config
namespace: katalyst-system
data:
scheduler-config.yaml: |-
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
resourceLock: leases
resourceName: katalyst-scheduler
resourceNamespace: katalyst-system
profiles:
- schedulerName: katalyst-scheduler
plugins:
preFilter:
enabled:
- name: QoSAwareNodeResourcesFit
filter:
enabled:
- name: QoSAwareNodeResourcesFit
score:
enabled:
- name: QoSAwareNodeResourcesFit
weight: 15
- name: QoSAwareNodeResourcesBalancedAllocation
weight: 1
disabled:
- name: NodeResourcesFit
- name: NodeResourcesBalancedAllocation
reserve:
enabled:
- name: QoSAwareNodeResourcesFit
pluginConfig:
- name: NodeResourcesFit
args:
ignoredResourceGroups:
- katalyst.kubewharf.io
- name: QoSAwareNodeResourcesFit
args:
scoringStrategy:
type: RequestedToCapacityRatio
requestedToCapacityRatio:
shape:
- utilization: 0
score: 0
- utilization: 100
score: 10
reclaimedRequestedToCapacityRatio:
shape:
- utilization: 0
score: 0
- utilization: 100
score: 10
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
reclaimedResources:
- name: "katalyst.kubewharf.io/reclaimed_millicpu"
weight: 1
- name: "katalyst.kubewharf.io/reclaimed_memory"
weight: 1
- name: QoSAwareNodeResourcesBalancedAllocation
args:
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
reclaimedResources:
- name: "katalyst.kubewharf.io/reclaimed_millicpu"
weight: 1
- name: "katalyst.kubewharf.io/reclaimed_memory"
weight: 1

View File

@@ -0,0 +1,80 @@
# Copyright 2022 The Katalyst Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: v1
kind: ConfigMap
metadata:
name: katalyst-scheduler-config
namespace: katalyst-system
data:
scheduler-config.yaml: |-
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
leaderElection:
leaderElect: true
resourceLock: leases
resourceName: katalyst-scheduler
resourceNamespace: katalyst-system
profiles:
- schedulerName: katalyst-scheduler
plugins:
preFilter:
enabled:
- name: QoSAwareNodeResourcesFit
filter:
enabled:
- name: QoSAwareNodeResourcesFit
score:
enabled:
- name: QoSAwareNodeResourcesFit
weight: 4
- name: QoSAwareNodeResourcesBalancedAllocation
weight: 1
disabled:
- name: NodeResourcesFit
- name: NodeResourcesBalancedAllocation
reserve:
enabled:
- name: QoSAwareNodeResourcesFit
pluginConfig:
- name: NodeResourcesFit
args:
ignoredResourceGroups:
- katalyst.kubewharf.io
- name: QoSAwareNodeResourcesFit
args:
scoringStrategy:
type: LeastAllocated
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
reclaimedResources:
- name: "katalyst.kubewharf.io/reclaimed_millicpu"
weight: 1
- name: "katalyst.kubewharf.io/reclaimed_memory"
weight: 1
- name: QoSAwareNodeResourcesBalancedAllocation
args:
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
reclaimedResources:
- name: "katalyst.kubewharf.io/reclaimed_millicpu"
weight: 1
- name: "katalyst.kubewharf.io/reclaimed_memory"
weight: 1