Commit Graph

204 Commits

Author SHA1 Message Date
Mingmeng Luo 0aebfd765d
separate dynamic agent configuration from static agent configuration (#96) 2023-06-08 21:30:28 +08:00
zhou hongbin 73f78da5b3
chore(eviction): dry-run metric add pod name tag (#100) 2023-06-08 15:58:55 +08:00
SunYuliang 697fb6a9d7
exchange round and slowdown in cpu regulator (#99) 2023-06-08 14:31:06 +08:00
Lin Zhecheng 3b403aada6
fix: avoid fetching empty metrics (#98)
As metricsFetcher.GetContainerMetric and cpuResourceAdvisor.update are called in separate goroutines,
the latter may be executed before the former, resulting in obtaining empty metrics and
falling back to using pod requests as estimated resources, which leads to a sharp decrease in headroom.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-06-08 11:21:49 +08:00
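The race described in this commit can be illustrated with a minimal sketch. It assumes a hypothetical `metricStore` type (not the project's `metricsFetcher` or `cpuResourceAdvisor`) that reports "not ready" until the first fetch has completed, so the caller can skip an update instead of falling back to pod requests too early.

```go
package main

import (
	"fmt"
	"sync"
)

// metricStore is a hypothetical stand-in for a metrics cache that is filled
// by one goroutine and read by another.
type metricStore struct {
	mu     sync.RWMutex
	synced bool
	usage  map[string]float64 // container name -> CPU usage in cores
}

// Sync records one round of fetched metrics and marks the store as ready.
func (s *metricStore) Sync(samples map[string]float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.usage = samples
	s.synced = true
}

// Estimate returns the estimated CPU for a container. Before the first Sync
// it returns ok=false so the caller can skip the update rather than treat
// the missing sample as a reason to fall back to the request.
func (s *metricStore) Estimate(container string, request float64) (float64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.synced {
		return 0, false
	}
	if u, found := s.usage[container]; found {
		return u, true
	}
	return request, true // metrics are synced but this container has no sample
}

func main() {
	store := &metricStore{}
	if _, ok := store.Estimate("app", 4); !ok {
		fmt.Println("metrics not synced yet, skipping update")
	}
	store.Sync(map[string]float64{"app": 1.5})
	est, _ := store.Estimate("app", 4)
	fmt.Println("estimated cores:", est)
}
```

The readiness flag is one possible design; the actual fix in the commit may order the two calls differently.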
SunYuliang 06be0f4aea
refine sysadvisor (#86)
* refactor(sysadvisor): refine cpu advisor to improve clarity and fix several bugs

* refactor(sysadvisor): abstract provision and headroom assembler for extensibility

* test(sysadvisor): fix cpu advisor tests and bugs
2023-06-07 18:03:04 +08:00
shaowei 80483379ba
support to use logger with pre-defined prefix (#97) 2023-06-07 15:22:39 +08:00
shaowei 26f75c14a3
change log format to export the entire pkg name (#95) 2023-06-06 16:14:04 +08:00
Lin Zhecheng 5d1313dbcf
fix: estimating cpu (#92)
We need to set binding numas for policy, otherwise total usage of Pod
will be treated as per numa usage of Pod.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-06-06 16:10:33 +08:00
Lin Zhecheng 932361060b
fix(sysadvisor): set estimated resource as pod request when ramping up (#74)
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-06-05 20:43:24 +08:00
Jianyu Sun 44ec0867c2
fix(qrm): fix checkMemorySet skipping judgement (#94) 2023-06-05 20:09:11 +08:00
Jianyu Sun ce8b3033a0
feat(qrm): qrm cpu/mem plugins support identifying debug pod (#88) 2023-06-05 15:33:19 +08:00
Mingmeng Luo 2588c6eb64
add GetContainerID and GetContainerEnvs to pod util (#93) 2023-06-05 14:48:08 +08:00
Jianyu Sun adf11eaf98
fix(external_mgr): fix adding cgroup id log format typo (#90) 2023-06-05 12:45:34 +08:00
Lin Zhecheng 358a13b2a6
fix(sysadvisor): create new checkpoint if fetching it failed (#91)
Modifying the data struct of the checkpoint causes a hash mismatch,
so we should create a new checkpoint in this case.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-06-05 12:44:25 +08:00
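A rough sketch of the recovery behaviour described in this commit: if a stored checkpoint cannot be verified or decoded, start from a fresh one. The `checkpoint` layout, the SHA-256 checksum, and the `loadOrCreate` helper are assumptions made for illustration, not the project's checkpoint format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// checkpoint is a hypothetical on-disk layout: payload plus its checksum.
type checkpoint struct {
	Data     []byte `json:"data"`
	Checksum string `json:"checksum"`
}

var errCorrupt = errors.New("checkpoint checksum mismatch")

func sum(b []byte) string {
	h := sha256.Sum256(b)
	return hex.EncodeToString(h[:])
}

// encode serializes v and wraps it with a checksum of the payload.
func encode(v interface{}) ([]byte, error) {
	data, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	return json.Marshal(checkpoint{Data: data, Checksum: sum(data)})
}

// decode verifies the checksum before unmarshalling the payload into out.
func decode(raw []byte, out interface{}) error {
	var cp checkpoint
	if err := json.Unmarshal(raw, &cp); err != nil {
		return err
	}
	if sum(cp.Data) != cp.Checksum {
		return errCorrupt
	}
	return json.Unmarshal(cp.Data, out)
}

// loadOrCreate restores state from raw. On any failure it returns an empty
// state and created=true, so the caller can write a new checkpoint.
func loadOrCreate(raw []byte) (state map[string]string, created bool) {
	state = map[string]string{}
	if err := decode(raw, &state); err != nil {
		return map[string]string{}, true
	}
	return state, false
}

func main() {
	raw, err := encode(map[string]string{"pod-1": "region-a"})
	if err != nil {
		panic(err)
	}
	state, created := loadOrCreate(raw)
	fmt.Println(state, created)

	state, created = loadOrCreate([]byte(`{"data":"e30=","checksum":"stale"}`))
	fmt.Println(state, created)
}
```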
Lin Zhecheng 6fdc547324
fix(sysadvisor): return notFound error when spd name not found (#83)
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-06-05 11:20:16 +08:00
zhou hongbin d190e6cc7f
feat(eviction): rss overuse eviction based on request (#87) 2023-06-04 21:58:58 +08:00
shaowei c1139d4e4a
fix memory plugin ut for mac (#84) 2023-05-31 15:39:46 +08:00
shaowei 03c5861f9f
refine coding styles for cpu&memory plugin (#54)
* refine coding styles for cpu&memory plugin

* fix styles and fix bugs for cpu/memory plugin according to comments

* refine(qrm): add detail comments for complex codes of qrm plugins

* refine(qrm): fix conflicts in rebase numa_exclusive feature

* fix(qrm): ensures dedicated_cores owner pool non-empty

* fix comment typo and refine numa-exclusive judgement

* remove useless functions

---------

Co-authored-by: 孙健俞 <sunjianyu@bytedance.com>
2023-05-31 15:22:12 +08:00
zhou hongbin 478db28357
feat(eviction): support dryrun plugin (#79)
* feat(eviction): support dryrun plugin

* feat(eviction): support dry run plugin

* chore(eviction): licence format

* chore(eviction): change flag name

* chore(eviction): reuse general function

* chore(eviction): print dry run plugins

* chore(eviction): rename inner eviction plugin initializers
2023-05-31 11:03:03 +08:00
Jianyu Sun cd04bac195
feat(qrm): katalyst network qrm plugin supports nic affinitive allocation (#69)
* fix(qos): fix QoSEnhancementAnnotationSelector parser

* feat(qrm): katalyst network qrm plugin supports nic affinitive allocation

* move the network detection logic to general util

* fix(qrm): fix network plugin bugs

* switch to the latest api main and fix bugs

---------

Co-authored-by: shaowei.wayne <shaowei.wayne@bytedance.com>
2023-05-30 21:08:50 +08:00
Jianyu Sun f840717d3d
fix(qrm): ensures dedicated_cores owner pool non-empty (#80) 2023-05-30 15:48:53 +08:00
Mingmeng Luo 9fb76173aa
dynamic config manager: if ConfigSkipFailedInitialization is true, try updating the config only once during initialization (#81) 2023-05-30 12:12:39 +08:00
Mingmeng Luo e5591ecb51
add pod fetcher health check (#78) 2023-05-25 11:54:01 +08:00
Mingmeng Luo ed19d660d1
fix meta server failing to start (#76) 2023-05-24 21:05:05 +08:00
Lin Zhecheng af88edba52
fix(spd): fix nil reference (#75)
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-24 17:47:44 +08:00
Jianyu Sun 48338d0003
feat(qrm): support memory enhancement numa_exclusive (#19) 2023-05-24 17:46:54 +08:00
zhou hongbin e6c6e8dd8d
feat(eviction manager): support pod-level rss overuse evict (#57)
* feat: support pod-level rss overuse evict

* refactor: remove duplication

* chore: change api tag
2023-05-24 15:19:33 +08:00
Mingmeng Luo bc6fe4316b
support service profiling manager (#71) 2023-05-24 12:00:31 +08:00
Mingmeng Luo 5d0044aa6f
pod resources filter support pod parameter (#73) 2023-05-23 11:30:27 +08:00
shaowei 715fd595f2
support dynamically switching to transformed informer (#70) 2023-05-23 11:11:50 +08:00
Mingmeng Luo 0be9a136fc
memory headroom canonical policy support enable buffer (#68) 2023-05-19 18:28:25 +08:00
Jianyu Sun 162012e99b
fix(qrm): set sidecar owner pool name same to its main container (#66) 2023-05-18 20:10:28 +08:00
Lin Zhecheng 8fbd195b01
fix(sysadvisor): maximize share pool size when disable reclaim (#60)
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-17 15:28:00 +08:00
Lin Zhecheng 2ffafb0352
fix(sysadvisor): fix failed to find regions and pools (#63)
1. The region names of containers belonging to the same pod
should be the same, so we have to get region by podUID.
2. The regionNames in poolInfo should not be cleaned up when updatePoolInfo.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-17 15:14:19 +08:00
Mingmeng Luo f42d83af88
add dynamic config disable configuration and fix zone allocations out of order (#67)
* support disable dynamic configuration

* merge topologyZone Attributes and Allocations in generateTopologyZoneStatus to make sure the final zone status sorted
2023-05-17 15:00:47 +08:00
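The ordering fix in this commit can be pictured with a small sketch: merge per-zone attributes and allocations into one list, then sort it so repeated runs produce the same output. The `zoneStatus` type and `buildZoneStatus` function are hypothetical and are not the project's `generateTopologyZoneStatus`.

```go
package main

import (
	"fmt"
	"sort"
)

// zoneStatus is a hypothetical, simplified view of one topology zone.
type zoneStatus struct {
	Name        string
	Attributes  map[string]string
	Allocations []string
}

// buildZoneStatus merges attributes and allocations keyed by zone name and
// returns the zones sorted by name, with each allocation list sorted too.
// Go map iteration order is random, so the explicit sort makes the result
// deterministic.
func buildZoneStatus(attrs map[string]map[string]string, allocs map[string][]string) []zoneStatus {
	names := map[string]struct{}{}
	for n := range attrs {
		names[n] = struct{}{}
	}
	for n := range allocs {
		names[n] = struct{}{}
	}

	out := make([]zoneStatus, 0, len(names))
	for n := range names {
		a := append([]string(nil), allocs[n]...) // copy before sorting
		sort.Strings(a)
		out = append(out, zoneStatus{Name: n, Attributes: attrs[n], Allocations: a})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	attrs := map[string]map[string]string{
		"numa1": {"socket": "0"},
		"numa0": {"socket": "0"},
	}
	allocs := map[string][]string{
		"numa1": {"pod-b", "pod-a"},
		"numa2": {"pod-c"},
	}
	for _, z := range buildZoneStatus(attrs, allocs) {
		fmt.Println(z.Name, z.Attributes, z.Allocations)
	}
}
```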
Lin Zhecheng e1ac91b3c6
feat(sysadvisor): support persist regionEntries (#64)
In order to maintain the same region names after rebuilding regionMap of sysadvisor,
we need to persist regionEntries.
And checkpoint corruption should be ignored if MetaCacheCheckpoint struct changed after upgrading.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-16 17:51:59 +08:00
shaowei 6b3a86250d
support to mark unhealthy for acquiring file locks (#65) 2023-05-16 14:59:33 +08:00
Mingmeng Luo b857532402
support utilization based canonical cpu headroom policy (#59)
* support adaptive cpu headroom policy

* fix network policy register not import path

* rename sysadvisor RegisterHealthzCheckRules to RegisterAdvisorPlugin

* change policy name adaptive to utilization
2023-05-16 12:02:05 +08:00
Lin Zhecheng 68be93726f
fix(sysadvisor): clean up the containers that do not exist in checkpoint (#62)
The `AddContainer` request may time out due to `storeState`,
causing container leaks in metaCache, so it is necessary to clean up any excess containers.

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-15 14:13:54 +08:00
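As a minimal sketch of this kind of cleanup, assume the checkpoint is the source of truth and the in-memory cache may hold extra entries. The `pruneLeaked` helper and the set-shaped maps are hypothetical simplifications of metaCache.

```go
package main

import "fmt"

// pruneLeaked removes every container ID from cache that is absent from
// persisted and returns how many entries were removed.
func pruneLeaked(cache, persisted map[string]struct{}) int {
	removed := 0
	for id := range cache {
		if _, ok := persisted[id]; !ok {
			delete(cache, id) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	cache := map[string]struct{}{"c1": {}, "c2": {}, "leaked": {}}
	persisted := map[string]struct{}{"c1": {}, "c2": {}}
	n := pruneLeaked(cache, persisted)
	fmt.Println("removed:", n, "remaining:", len(cache))
}
```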
Mingmeng Luo d0f5af755d
feat(resource): refine resource manager to support the new cnr definition for reporting enhanced topology information (#45)
* 1. refine kubelet plugin to support reporting topology zones to cnr
2. refine eviction, scheduler, reporter to support the new cnr definition
3. cnr reporter supports merging cnr's TopologyZone field using Type and Name as the unique key
4. add conversion framework to reporter manager to support transformation from the old ReportField to the new one
5. support resetting cnr to default when getting cnr from remote fails with UnmarshalTypeError

* refactor(test): go test add -race flag

* fix pod resources server topology adapter restart
2023-05-12 19:20:54 +08:00
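Item 3 above, merging zones with Type and Name as the unique key, could look roughly like the following. The `zone` struct, its `Capacity` field, and `mergeZones` are hypothetical; the real TopologyZone type carries more fields.

```go
package main

import "fmt"

// zone is a hypothetical, trimmed-down topology zone.
type zone struct {
	Type     string
	Name     string
	Capacity int
}

// zoneKey is the composite key used to detect the same zone in two lists.
type zoneKey struct{ Type, Name string }

// mergeZones overlays reported onto existing. A reported zone replaces an
// existing one with the same (Type, Name); otherwise it is appended. The
// order of first appearance is preserved.
func mergeZones(existing, reported []zone) []zone {
	index := make(map[zoneKey]int, len(existing)+len(reported))
	merged := make([]zone, 0, len(existing)+len(reported))
	upsert := func(z zone) {
		k := zoneKey{z.Type, z.Name}
		if i, ok := index[k]; ok {
			merged[i] = z
			return
		}
		index[k] = len(merged)
		merged = append(merged, z)
	}
	for _, z := range existing {
		upsert(z)
	}
	for _, z := range reported {
		upsert(z)
	}
	return merged
}

func main() {
	existing := []zone{{"Socket", "0", 64}, {"Numa", "0", 32}}
	reported := []zone{{"Numa", "0", 30}, {"Numa", "1", 32}}
	for _, z := range mergeZones(existing, reported) {
		fmt.Printf("%s/%s capacity=%d\n", z.Type, z.Name, z.Capacity)
	}
}
```

A struct key avoids the ambiguity that string concatenation of Type and Name could introduce.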
Lin Zhecheng be47c2535b
feat(sysadvisor): support multi share regions (#47)
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-11 17:50:58 +08:00
Lin Zhecheng 1b8db6f118
fix(sysadvisor): fix race between AddPod, RemovePod and RangeAndUpdateContainer (#52)
1. add new interface RangeAndDeleteContainer
2. break the mutex lock into finer-grained locks: podMutex, poolMutex and poolMutex,
so that poolEntries or regionEntries can be accessed during RangeContainer or RangeAndUpdateContainer

Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-10 16:52:44 +08:00
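The lock split described in this commit can be sketched as follows. The `metaCache` layout here is a simplified assumption, and `regionMutex` is an assumed name for the third lock (the commit message lists poolMutex twice). The point is that a callback running under the pod lock can still read pool or region data without deadlocking.

```go
package main

import (
	"fmt"
	"sync"
)

// metaCache is a hypothetical cache with one lock per data set.
type metaCache struct {
	podMutex sync.RWMutex
	pods     map[string][]string // pod UID -> container names

	poolMutex sync.RWMutex
	pools     map[string]int // pool name -> size

	regionMutex sync.RWMutex      // assumed name, see lead-in
	regions     map[string]string // pod UID -> region name
}

// RangeAndDeleteContainer removes every container for which shouldDelete
// returns true. Only podMutex is held while the callback runs.
func (c *metaCache) RangeAndDeleteContainer(shouldDelete func(podUID, container string) bool) {
	c.podMutex.Lock()
	defer c.podMutex.Unlock()
	for uid, containers := range c.pods {
		kept := containers[:0]
		for _, name := range containers {
			if !shouldDelete(uid, name) {
				kept = append(kept, name)
			}
		}
		if len(kept) == 0 {
			delete(c.pods, uid)
		} else {
			c.pods[uid] = kept
		}
	}
}

// PoolSize reads pool data under poolMutex only.
func (c *metaCache) PoolSize(name string) (int, bool) {
	c.poolMutex.RLock()
	defer c.poolMutex.RUnlock()
	size, ok := c.pools[name]
	return size, ok
}

// RegionOf reads region data under regionMutex only.
func (c *metaCache) RegionOf(podUID string) (string, bool) {
	c.regionMutex.RLock()
	defer c.regionMutex.RUnlock()
	r, ok := c.regions[podUID]
	return r, ok
}

func main() {
	c := &metaCache{
		pods:    map[string][]string{"uid-1": {"main", "sidecar"}},
		pools:   map[string]int{"share": 8},
		regions: map[string]string{"uid-1": "share-0"},
	}
	// With a single shared mutex, calling PoolSize inside the callback
	// would block forever; with separate locks it is fine.
	c.RangeAndDeleteContainer(func(uid, name string) bool {
		_, hasPool := c.PoolSize("share")
		_, hasRegion := c.RegionOf(uid)
		return hasPool && hasRegion && name == "sidecar"
	})
	fmt.Println(c.pods)
}
```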
Lin Zhecheng e7f1116b4e
fix ut failed on macOS (#58)
Signed-off-by: linzhecheng <linzhecheng@bytedance.com>
2023-05-09 23:32:12 +08:00
shaowei b37964eaeb
Merge pull request #55 from caohe/license
chore(*): add license header check to ci pipeline and fix some related issues
2023-05-09 20:41:14 +08:00
shaowei 68b5950474
Merge pull request #56 from csfldf/dev/fix_passing_ramp_up
fix(qrm): pass ramp up information to sys advisor
2023-05-09 15:56:01 +08:00
孙健俞 8e00fe2e02 fix(qrm): pass ramp up information to sys advisor 2023-05-09 15:33:59 +08:00
caohe 21d86d9a08 chore(*): add license header checking to ci pipeline
Signed-off-by: caohe <caohe9603@gmail.com>
2023-05-09 11:59:07 +08:00
caohe 0d4829ed0c chore(*): remove redundant config files for lint checking
Signed-off-by: caohe <caohe9603@gmail.com>
2023-05-09 11:53:48 +08:00
caohe c088ca1ec8 chore(*): fix incorrect license headers
Signed-off-by: caohe <caohe9603@gmail.com>
2023-05-09 11:53:48 +08:00
caohe 4037b6f89e chore(*): add license header scripts
Signed-off-by: caohe <caohe9603@gmail.com>
2023-05-09 11:53:48 +08:00