AIOps: AI-Driven Investigation in Kubernetes with HolmesGPT, Ollama and RunPod …

Karim


In the world of container orchestration, Kubernetes has become the standard for managing containerized workloads. However, managing and troubleshooting Kubernetes clusters can be complex and time-consuming. This article explores how artificial intelligence (AI) can be integrated into Kubernetes to improve incident investigation and management. I had already touched on this subject in a previous article.

Here I will focus on HolmesGPT. Developed by Robusta, HolmesGPT is an open-source troubleshooting agent that uses AI to investigate incidents in Kubernetes clusters, with the following characteristics:

  • Integration with incident management tools: HolmesGPT connects to tools such as PagerDuty, OpsGenie and Prometheus to collect data and analyze alerts.
  • Automated investigation: using AI, HolmesGPT can identify and resolve issues such as expired SSL certificates, insufficient resources and node affinity problems, significantly reducing the time and effort troubleshooting requires (a minimal sketch of this interaction follows the list).
  • Customization: HolmesGPT lets you write custom runbooks to handle specific problems, using custom APIs and tools where needed.
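To give a concrete idea of the interaction model before building anything, the CLI boils down to asking questions in natural language. A minimal sketch (the ask subcommand is the one used throughout this article; the question itself is just an example):

holmes ask "what is wrong with the payment-processing deployment and how do I fix it?"

Everything that follows sets up a playground in which this kind of query can run against a real cluster.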

For this exercise, I will first spin up an Ubuntu 24.04 LTS instance, once again with the cloud provider DigitalOcean.

On it I will install Incus, a fork of LXD, which will serve as the basis for building a Kubernetes cluster out of several containers.

As with LXD, I will create several profiles. But first, installation of Incus on the instance:

root@k0s-incus:~# curl -fsSL https://pkgs.zabbly.com/key.asc | gpg --show-keys --fingerprint
gpg: directory '/root/.gnupg' created
gpg: keybox '/root/.gnupg/pubring.kbx' created
pub rsa3072 2023-08-23 [SC] [expires: 2025-08-22]
4EFC 5906 96CB 15B8 7C73 A3AD 82CC 8797 C838 DCFD
uid Zabbly Kernel Builds <info@zabbly.com>
sub rsa3072 2023-08-23 [E] [expires: 2025-08-22]

root@k0s-incus:~# mkdir -p /etc/apt/keyrings/

root@k0s-incus:~# curl -fsSL https://pkgs.zabbly.com/key.asc -o /etc/apt/keyrings/zabbly.asc

root@k0s-incus:~# sh -c 'cat <<EOF > /etc/apt/sources.list.d/zabbly-incus-stable.sources
Enabled: yes
Types: deb
URIs: https://pkgs.zabbly.com/incus/stable
Suites: $(. /etc/os-release && echo ${VERSION_CODENAME})
Components: main
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/zabbly.asc

EOF'
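Before trusting the new repository, it costs nothing to re-check the fingerprint of the key that was actually written to disk; the same gpg invocation as above, pointed at the saved file, should print the 4EFC 5906 … C838 DCFD fingerprint shown earlier:

root@k0s-incus:~# gpg --show-keys --fingerprint /etc/apt/keyrings/zabbly.asc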
root@k0s-incus:~# apt-get update

Hit:1 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:2 http://mirrors.digitalocean.com/ubuntu noble InRelease
Hit:3 https://repos-droplet.digitalocean.com/apt/droplet-agent main InRelease
Hit:4 http://mirrors.digitalocean.com/ubuntu noble-updates InRelease
Hit:5 http://mirrors.digitalocean.com/ubuntu noble-backports InRelease
Get:6 https://pkgs.zabbly.com/incus/stable noble InRelease [7358 B]
Get:7 https://pkgs.zabbly.com/incus/stable noble/main amd64 Packages [3542 B]
Fetched 10.9 kB in 1s (13.3 kB/s)
Reading package lists... Done
root@k0s-incus:~# apt-get install incus incus-client incus-ui-canonical -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
attr dconf-gsettings-backend dconf-service dns-root-data dnsmasq-base fontconfig genisoimage glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas
gstreamer1.0-plugins-base gstreamer1.0-plugins-good gstreamer1.0-x incus-base iw libaa1 libasyncns0 libavc1394-0 libboost-iostreams1.83.0 libboost-thread1.83.0 libbtrfs0t64 libcaca0
libcairo-gobject2 libcairo2 libcdparanoia0 libdatrie1 libdaxctl1 libdconf1 libdv4t64 libflac12t64 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgraphite2-3
libgstreamer-plugins-base1.0-0 libgstreamer-plugins-good1.0-0 libharfbuzz0b libiec61883-0 libmp3lame0 libmpg123-0t64 libndctl6 libnet1 libogg0 libopus0 liborc-0.4-0t64 libpango-1.0-0
libpangocairo-1.0-0 libpangoft2-1.0-0 libpixman-1-0 libpmem1 libpmemobj1 libproxy1v5 libpulse0 librados2 libraw1394-11 librbd1 librdmacm1t64 libshout3 libsndfile1 libsoup-3.0-0
libsoup-3.0-common libspeex1 libspice-server1 libtag1v5 libtag1v5-vanilla libthai-data libthai0 libtheora0 libtwolame0 libusbredirparser1t64 libv4l-0t64 libv4lconvert0t64 libvisual-0.4-0
libvorbis0a libvorbisenc2 libvpx9 libwavpack1 libx11-xcb1 libxcb-render0 libxcb-shm0 libxdamage1 libxfixes3 libxi6 libxrender1 libxtst6 libxv1 session-migration sshfs wireless-regdb
x11-common xdelta3
root@k0s-incus:~# incus
Description:
Command line client for Incus

All of Incus's features can be driven through the various commands below.
For help with any of those, simply call them with --help.

Custom commands can be defined through aliases, use "incus alias" to control those.

Usage:
incus [command]

Available Commands:
admin Manage incus daemon
cluster Manage cluster members
config Manage instance and server configuration options
console Attach to instance consoles
copy Copy instances within or in between servers
create Create instances from images
delete Delete instances
exec Execute commands in instances
export Export instance backups
file Manage files in instances
help Help about any command
image Manage images
import Import instance backups
info Show instance or server information
launch Create and start instances from images
list List instances
move Move instances within or in between servers
network Manage and attach instances to networks
pause Pause instances
profile Manage profiles
project Manage projects
publish Publish instances as images
rebuild Rebuild instances
remote Manage the list of remote servers
rename Rename instances
restart Restart instances
resume Resume instances
snapshot Manage instance snapshots
start Start instances
stop Stop instances
storage Manage storage pools and volumes
top Display resource usage info per instance
version Show local and remote versions
webui Open the web interface

Flags:
--all Show less common commands
--debug Show all debug messages
--force-local Force using the local unix socket
-h, --help Print help
--project Override the source project
-q, --quiet Don't show progress information
--sub-commands Use with help or --help to view sub-commands
-v, --verbose Show all information messages
--version Print version number

Use "incus [command] --help" for more information about a command.

Initializing Incus with a minimal configuration:

root@k0s-incus:~# incus admin init
Would you like to use clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]:
Name of the storage backend to use (btrfs, dir, lvm) [default=btrfs]: dir
Where should this storage pool store its data? [default=/var/lib/incus/storage-pools/default]:
Would you like to create a new local network bridge? (yes/no) [default=yes]:
What should the new bridge be called? [default=incusbr0]:
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
Would you like the server to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "init" preseed to be printed? (yes/no) [default=no]:

root@k0s-incus:~# incus list
+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+

root@k0s-incus:~# incus profile list
+---------+-----------------------+---------+
| NAME | DESCRIPTION | USED BY |
+---------+-----------------------+---------+
| default | Default Incus profile | 0 |
+---------+-----------------------+---------+

root@k0s-incus:~# incus profile show default
config: {}
description: Default Incus profile
devices:
  eth0:
    name: eth0
    network: incusbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by: []
project: default

root@k0s-incus:~# incus profile create k8s

Incus ships with a control dashboard that can be served temporarily with incus webui.

Activating it:

root@k0s-incus:~# nohup incus webui &
[1] 4104
root@k0s-incus:~# nohup: ignoring input and appending output to 'nohup.out'

root@k0s-incus:~# cat nohup.out
Web server running at: http://127.0.0.1:34363/ui?auth_token=3c5f5d4b-f9ed-4bf9-a174-d5ea2366cfbf

Using pinggy.io to reach it:

root@k0s-incus:~# ssh -p 443 -R0:127.0.0.1:34363 a.pinggy.io

I fetch the same profile that LXD uses for MicroK8s (https://microk8s.io/docs/install-lxd):

root@k0s-incus:~# wget https://raw.githubusercontent.com/ubuntu/microk8s/master/tests/lxc/microk8s.profile -O k8s.profile
--2025-01-14 20:58:42-- https://raw.githubusercontent.com/ubuntu/microk8s/master/tests/lxc/microk8s.profile
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 816 [text/plain]
Saving to: ‘k8s.profile’

k8s.profile 100%[=====================================================================================================>] 816 --.-KB/s in 0s

2025-01-14 20:58:42 (33.4 MB/s) - ‘k8s.profile’ saved [816/816]

root@k0s-incus:~# cat k8s.profile | incus profile edit k8s
root@k0s-incus:~# rm k8s.profile
root@k0s-incus:~# incus profile show k8s
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_vs,ip_vs_rr,ip_vs_wrr,ip_vs_sh,ip_tables,ip6_tables,netlink_diag,nf_nat,overlay,br_netfilter
  raw.lxc: |
    lxc.apparmor.profile=unconfined
    lxc.mount.auto=proc:rw sys:rw cgroup:rw
    lxc.cgroup.devices.allow=a
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /sys/module/nf_conntrack/parameters/hashsize
    type: disk
  aadisable2:
    path: /dev/kmsg
    source: /dev/kmsg
    type: unix-char
  aadisable3:
    path: /sys/fs/bpf
    source: /sys/fs/bpf
    type: disk
  aadisable4:
    path: /proc/sys/net/netfilter/nf_conntrack_max
    source: /proc/sys/net/netfilter/nf_conntrack_max
    type: disk
name: k8s
used_by: []
project: default

Since Incus can consume cloud-init, I create a new profile for that purpose:

root@k0s-incus:~# incus profile show cloud
config:
  cloud-init.user-data: |
    #cloud-config
    package_update: true
    package_upgrade: true
    package_reboot_if_required: true
    packages:
      - vim
      - wget
      - git
      - curl
      - htop
      - openssh-server
    bootcmd:
      - systemctl enable ssh
      - systemctl start ssh
    ssh_authorized_keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCpbsaaVUMa2TM9q8VkeBmbKvJpbreXTcqI5F5N3riGsoZ7Z/IIN7eR6J47UP2bj3IBTdgHmij1uOexm60QBO2PY4abIhsN+xnVS4a0LSyI8v6nYECWbEehL/gFn6uDmSLA4m0hZCF5BSpLxQYzKS28dHIdXsLC4CDd67nAXIhOiVpM0q/AUCuSy+mA0VwFa/JAkFCk8TpQBorgwJIq635imrgxYIpEUA2wHXOhw23mO3zTUlay13LSlA2a1xyTkP8hSDWdRYVxr2DEB/MtmTX2BdWlA5rDRmzXE7R2/csE245WAxG+XfSu4zNqhHzm8Df3zmZn3/UyKLcx4eJF//mVZyrM7RQHRteA/im8I4IavrReGyCUKY+OsSfygYVFyO87rYQ+IOauOnB4LxBohBjSBN3Skk4X7krYFIi8D9R1lmL+VvBfpvy0YMurOahY1VJFzD0dUeK2bDUdeWzfFkcX039d9/RRXRxieNpxwp1BLPi5/DXG8FihzgwVTf6h60J9/fkYzY+BO8CKG2kYTUsy1ykuXLzLY5sTCREiEoEKcJ9IGz8OimZ1AmkgJJCrQnI6mT/KiNDU6YCc75ONKTKX5HKVPhZWT255Aw4f5LBbBrj06cJX3GuunV0I30+BYyHwLbPBoqgd4GUk3YJlr8wS3qre/YUSc2iKNDTOzFCC8Q== root@k0s-incus
description: incus with cloud-init
devices: {}
name: cloud
used_by: []
project: default
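The public key embedded in the profile above is simply the root key of the Incus host; if none exists yet, it can be generated beforehand (a sketch assuming the default /root/.ssh/id_rsa path, which is also the keyPath that k0sctl uses further down):

root@k0s-incus:~# ssh-keygen -t rsa -b 4096 -N "" -f /root/.ssh/id_rsa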

I am now ready to create the three containers that will serve as the foundation of the Kubernetes cluster:

root@k0s-incus:~# for i in {1..3}; do incus launch -p default -p k8s -p cloud images:ubuntu/24.04/cloud k0s-$i; done
Launching k0s-1
Launching k0s-2
Launching k0s-3

root@k0s-incus:~# incus list
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-1 | RUNNING | 10.224.160.99 (eth0) | fd42:4641:b619:c782:216:3eff:fea4:53d3 (eth0) | CONTAINER | 0 |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-2 | RUNNING | 10.224.160.54 (eth0) | fd42:4641:b619:c782:216:3eff:feee:7af8 (eth0) | CONTAINER | 0 |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-3 | RUNNING | 10.224.160.215 (eth0) | fd42:4641:b619:c782:216:3eff:fef3:709b (eth0) | CONTAINER | 0 |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
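Before connecting, it is worth making sure cloud-init has finished provisioning each container; the --wait flag blocks until provisioning is complete (cloud-init ships with the images:ubuntu/24.04/cloud image used here):

root@k0s-incus:~# incus exec k0s-1 -- cloud-init status --wait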

root@k0s-incus:~# cat .ssh/config
Host *
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null

root@k0s-incus:~# ssh ubuntu@10.224.160.99

Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-51-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/pro

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

ubuntu@k0s-1:~$

Fetching k0sctl to build a Kubernetes cluster with k0s:

root@k0s-incus:~# wget -c https://github.com/k0sproject/k0sctl/releases/download/v0.21.0/k0sctl-linux-amd64 && chmod +x k0sctl-linux-amd64 && mv k0sctl-linux-amd64 /usr/local/bin/k0sctl

Saving to: ‘k0sctl-linux-amd64’

k0sctl-linux-amd64 100%[=====================================================================================================>] 18.21M --.-KB/s in 0.1s

2025-01-14 21:22:23 (122 MB/s) - ‘k0sctl-linux-amd64’ saved [19091608/19091608]

root@k0s-incus:~# k0sctl
NAME:
k0sctl - k0s cluster management tool

USAGE:
k0sctl [global options] command [command options]

COMMANDS:
version Output k0sctl version
apply Apply a k0sctl configuration
kubeconfig Output the admin kubeconfig of the cluster
init Create a configuration template
reset Remove traces of k0s from all of the hosts
backup Take backup of existing clusters state
config Configuration related sub-commands
completion
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS:
--debug, -d Enable debug logging (default: false) [$DEBUG]
--trace Enable trace logging (default: false) [$TRACE]
--no-redact Do not hide sensitive information in the output (default: false)
--help, -h show help
root@k0s-incus:~# k0sctl init --k0s > k0sctl.yaml
root@k0s-incus:~# cat k0sctl.yaml 
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
  user: admin
spec:
  hosts:
  - ssh:
      address: 10.224.160.99
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: controller
  - ssh:
      address: 10.224.160.54
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: worker
  - ssh:
      address: 10.224.160.215
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: worker
  k0s:
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: Cluster
      metadata:
        name: k0s
      spec:
        api:
          k0sApiPort: 9443
          port: 6443
        installConfig:
          users:
            etcdUser: etcd
            kineUser: kube-apiserver
            konnectivityUser: konnectivity-server
            kubeAPIserverUser: kube-apiserver
            kubeSchedulerUser: kube-scheduler
        konnectivity:
          adminPort: 8133
          agentPort: 8132
        network:
          kubeProxy:
            disabled: false
            mode: iptables
          kuberouter:
            autoMTU: true
            mtu: 0
            peerRouterASNs: ""
            peerRouterIPs: ""
          podCIDR: 10.244.0.0/16
          provider: kuberouter
          serviceCIDR: 10.96.0.0/12
        podSecurityPolicy:
          defaultPolicy: 00-k0s-privileged
        storage:
          type: etcd
        telemetry:
          enabled: true

With the generated template pointed at the three containers, launching the creation:

root@k0s-incus:~# k0sctl apply --config k0sctl.yaml

⠀⣿⣿⡇⠀⠀⢀⣴⣾⣿⠟⠁⢸⣿⣿⣿⣿⣿⣿⣿⡿⠛⠁⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀█████████ █████████ ███
⠀⣿⣿⡇⣠⣶⣿⡿⠋⠀⠀⠀⢸⣿⡇⠀⠀⠀⣠⠀⠀⢀⣠⡆⢸⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀███ ███ ███
⠀⣿⣿⣿⣿⣟⠋⠀⠀⠀⠀⠀⢸⣿⡇⠀⢰⣾⣿⠀⠀⣿⣿⡇⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀███ ███ ███
⠀⣿⣿⡏⠻⣿⣷⣤⡀⠀⠀⠀⠸⠛⠁⠀⠸⠋⠁⠀⠀⣿⣿⡇⠈⠉⠉⠉⠉⠉⠉⠉⠉⢹⣿⣿⠀███ ███ ███
⠀⣿⣿⡇⠀⠀⠙⢿⣿⣦⣀⠀⠀⠀⣠⣶⣶⣶⣶⣶⣶⣿⣿⡇⢰⣶⣶⣶⣶⣶⣶⣶⣶⣾⣿⣿⠀█████████ ███ ██████████
k0sctl v0.21.0 Copyright 2023, k0sctl authors.
By continuing to use k0sctl you agree to these terms:
https://k0sproject.io/licenses/eula
INFO ==> Running phase: Set k0s version
INFO Looking up latest stable k0s version
INFO Using k0s version v1.31.3+k0s.0
INFO ==> Running phase: Connect to hosts
INFO [ssh] 10.224.160.215:22: connected
INFO [ssh] 10.224.160.99:22: connected
INFO [ssh] 10.224.160.54:22: connected
INFO ==> Running phase: Detect host operating systems
INFO [ssh] 10.224.160.215:22: is running Ubuntu 24.04.1 LTS
INFO [ssh] 10.224.160.99:22: is running Ubuntu 24.04.1 LTS
INFO [ssh] 10.224.160.54:22: is running Ubuntu 24.04.1 LTS
INFO ==> Running phase: Acquire exclusive host lock
INFO ==> Running phase: Prepare hosts
INFO ==> Running phase: Gather host facts
INFO [ssh] 10.224.160.215:22: using k0s-3 as hostname
INFO [ssh] 10.224.160.54:22: using k0s-2 as hostname
INFO [ssh] 10.224.160.99:22: using k0s-1 as hostname
INFO [ssh] 10.224.160.215:22: discovered eth0 as private interface
INFO [ssh] 10.224.160.54:22: discovered eth0 as private interface
INFO [ssh] 10.224.160.99:22: discovered eth0 as private interface
INFO ==> Running phase: Validate hosts
INFO ==> Running phase: Validate facts
INFO ==> Running phase: Download k0s on hosts
INFO [ssh] 10.224.160.215:22: downloading k0s v1.31.3+k0s.0
INFO [ssh] 10.224.160.54:22: downloading k0s v1.31.3+k0s.0
INFO [ssh] 10.224.160.99:22: downloading k0s v1.31.3+k0s.0
INFO ==> Running phase: Install k0s binaries on hosts
INFO [ssh] 10.224.160.99:22: validating configuration
INFO ==> Running phase: Configure k0s
INFO [ssh] 10.224.160.99:22: installing new configuration
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 10.224.160.99:22: installing k0s controller
INFO [ssh] 10.224.160.99:22: waiting for the k0s service to start
INFO [ssh] 10.224.160.99:22: wait for kubernetes to reach ready state
INFO ==> Running phase: Install workers
INFO [ssh] 10.224.160.99:22: generating a join token for worker 1
INFO [ssh] 10.224.160.99:22: generating a join token for worker 2
INFO [ssh] 10.224.160.215:22: validating api connection to https://10.224.160.99:6443 using join token
INFO [ssh] 10.224.160.54:22: validating api connection to https://10.224.160.99:6443 using join token
INFO [ssh] 10.224.160.215:22: writing join token to /etc/k0s/k0stoken
INFO [ssh] 10.224.160.54:22: writing join token to /etc/k0s/k0stoken
INFO [ssh] 10.224.160.54:22: installing k0s worker
INFO [ssh] 10.224.160.215:22: installing k0s worker
INFO [ssh] 10.224.160.215:22: starting service
INFO [ssh] 10.224.160.215:22: waiting for node to become ready
INFO [ssh] 10.224.160.54:22: starting service
INFO [ssh] 10.224.160.54:22: waiting for node to become ready
INFO ==> Running phase: Release exclusive host lock
INFO ==> Running phase: Disconnect from hosts
INFO ==> Finished in 42s
INFO k0s cluster version v1.31.3+k0s.0 is now installed
INFO Tip: To access the cluster you can now fetch the admin kubeconfig using:
INFO k0sctl kubeconfig

The cluster is up (note that only the two workers will appear in the node listing: by default a k0s controller does not run a kubelet and so does not register as a node). Fetching kubectl to interact with it:

root@k0s-incus:~# curl -LO https://dl.k8s.io/release/v1.31.3/bin/linux/amd64/kubectl && chmod +x kubectl && mv kubectl /usr/local/bin/
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 138 100 138 0 0 923 0 --:--:-- --:--:-- --:--:-- 926
100 53.7M 100 53.7M 0 0 476k 0 0:01:55 0:01:55 --:--:-- 1023k

root@k0s-incus:~# mkdir .kube
root@k0s-incus:~# k0sctl kubeconfig --config k0sctl.yaml > .kube/config
root@k0s-incus:~# kubectl cluster-info
Kubernetes control plane is running at https://10.224.160.99:6443
CoreDNS is running at https://10.224.160.99:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

root@k0s-incus:~# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k0s-2 Ready <none> 5m1s v1.31.3+k0s 10.224.160.54 <none> Ubuntu 24.04.1 LTS 6.8.0-51-generic containerd://1.7.24
k0s-3 Ready <none> 5m1s v1.31.3+k0s 10.224.160.215 <none> Ubuntu 24.04.1 LTS 6.8.0-51-generic containerd://1.7.24

root@k0s-incus:~# kubectl get po,svc -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-645c5d6f5b-kgnsf 1/1 Running 0 5m2s
kube-system pod/coredns-645c5d6f5b-n2rbk 1/1 Running 0 5m2s
kube-system pod/konnectivity-agent-2dg8l 1/1 Running 0 5m4s
kube-system pod/konnectivity-agent-5l5dl 1/1 Running 0 5m4s
kube-system pod/kube-proxy-cx47n 1/1 Running 0 5m7s
kube-system pod/kube-proxy-sp5fd 1/1 Running 0 5m7s
kube-system pod/kube-router-6l4qv 1/1 Running 0 5m7s
kube-system pod/kube-router-b9t89 1/1 Running 0 5m7s
kube-system pod/metrics-server-78c4ccbc7f-jxpzz 1/1 Running 0 5m1s

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 5m17s
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 5m7s
kube-system service/metrics-server ClusterIP 10.109.44.51 <none> 443/TCP 5m1s

It is now possible to install HolmesGPT via pipx:

root@k0s-incus:~# apt install pipx -y

root@k0s-incus:~# pipx ensurepath

Success! Added /root/.local/bin to the PATH environment variable.

Consider adding shell completions for pipx. Run 'pipx completions' for instructions.

You will need to open a new terminal or re-login for the PATH changes to take effect.

Otherwise pipx is ready to go! ✨ 🌟 ✨

root@k0s-incus:~# pipx install "https://github.com/robusta-dev/holmesgpt/archive/refs/heads/master.zip"
installed package holmesgpt 0.1.0, installed using Python 3.12.3
These apps are now globally available
- holmes
done! ✨ 🌟 ✨
root@k0s-incus:~# holmes version
/root/.local/share/pipx/venvs/holmesgpt/lib/python3.12/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
warnings.warn(message, UserWarning)
HEAD -> master-bfafbde3

Alongside it, I grab K9s, which provides a terminal UI for interacting with Kubernetes clusters. The aim of the project is to make it easier to navigate, observe and manage your applications. K9s continually watches Kubernetes for changes and offers commands to interact with the observed resources.

root@k0s-incus:~# wget -c https://github.com/derailed/k9s/releases/download/v0.32.7/k9s_linux_amd64.deb
HTTP request sent, awaiting response... 200 OK
Length: 31832132 (30M) [application/octet-stream]
Saving to: ‘k9s_linux_amd64.deb’

k9s_linux_amd64.deb 100%[=====================================================================================================>] 30.36M --.-KB/s in 0.1s

2025-01-14 21:40:07 (291 MB/s) - ‘k9s_linux_amd64.deb’ saved [31832132/31832132]

root@k0s-incus:~# apt install -f ./k9s_linux_amd64.deb
root@k0s-incus:~# k9s --help
K9s is a CLI to view and manage your Kubernetes clusters.

Usage:
k9s [flags]
k9s [command]

Available Commands:
completion Generate the autocompletion script for the specified shell
help Help about any command
info List K9s configurations info
version Print version/build info

Flags:
-A, --all-namespaces Launch K9s in all namespaces
--as string Username to impersonate for the operation
--as-group stringArray Group to impersonate for the operation
--certificate-authority string Path to a cert file for the certificate authority
--client-certificate string Path to a client certificate file for TLS
--client-key string Path to a client key file for TLS
--cluster string The name of the kubeconfig cluster to use
-c, --command string Overrides the default resource to load when the application launches
--context string The name of the kubeconfig context to use
--crumbsless Turn K9s crumbs off
--headless Turn K9s header off
-h, --help help for k9s
--insecure-skip-tls-verify If true, the server's caCertFile will not be checked for validity
--kubeconfig string Path to the kubeconfig file to use for CLI requests
--logFile string Specify the log file (default "/root/.local/state/k9s/k9s.log")
-l, --logLevel string Specify a log level (error, warn, info, debug, trace) (default "info")
--logoless Turn K9s logo off
-n, --namespace string If present, the namespace scope for this CLI request
--readonly Sets readOnly mode by overriding readOnly configuration setting
-r, --refresh int Specify the default refresh rate as an integer (sec) (default 2)
--request-timeout string The length of time to wait before giving up on a single server request
--screen-dump-dir string Sets a path to a dir for a screen dumps
--token string Bearer token for authentication to the API server
--user string The name of the kubeconfig user to use
--write Sets write mode by overriding the readOnly configuration setting

Use "k9s [command] --help" for more information about a command.

Ollama, which makes it possible to run open LLMs locally, can be deployed to provide natural-language processing capabilities directly in your environment, without depending on external cloud services.

By integrating Ollama with your troubleshooting tools, you can generate answers and remediation suggestions based on the analysis of your Kubernetes cluster's logs and data.
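What makes this combination work with HolmesGPT is that Ollama exposes an OpenAI-compatible API under the /v1 path, which is what the openai/ model prefix used later in this article expects. A quick sanity check against a running instance (a sketch: the host and port assume a default local Ollama; the model name is the one pulled further down):

curl http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-instruct-q4_K_S", "messages": [{"role": "user", "content": "Say hello"}]}'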

To run it, I turn to RunPod, a platform for executing natural-language processing and other AI workloads. RunPod lets you create custom pod environments to run language models like Ollama serves, or other AI applications.

Creating a GPU Pod, which will let me run Ollama …

I can connect to it over SSH:

Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.5.0-44-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

______ ______ _
(_____ \ (_____ \ | |
_____) ) _ _ ____ _____) )___ __| |
| __ / | | | || _ \ | ____// _ \ / _ |
| | \ \ | |_| || | | || | | |_| |( (_| |
|_| |_||____/ |_| |_||_| \___/ \____|

For detailed documentation and guides, please visit:
https://docs.runpod.io/ and https://blog.runpod.io/


root@5ed8df208cf4:~# nvidia-smi
Tue Jan 14 22:03:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti On | 00000000:81:00.0 Off | N/A |
| 0% 28C P8 11W / 285W | 2MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Running Ollama:

root@5ed8df208cf4:~# apt update 2> /dev/null && apt install -qq lshw -y 2> /dev/null

root@5ed8df208cf4:~# export OLLAMA_HOST=0.0.0.0:11434
root@5ed8df208cf4:~# (curl -fsSL https://ollama.com/install.sh | sh && ollama serve > ollama.log 2>&1) &
[1] 950
root@5ed8df208cf4:~# >>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

root@5ed8df208cf4:~# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7861 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:8001 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 70/sshd: /usr/sbin/
tcp 0 0 0.0.0.0:3001 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:9091 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 127.0.0.11:39145 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN 70/sshd: /usr/sbin/
tcp6 0 0 :::11434 :::* LISTEN 1006/ollama
udp 0 0 127.0.0.11:33663 0.0.0.0:* -

Pulling a model, Llama 3.2:

root@5ed8df208cf4:~# ollama pull llama3.2:3b-instruct-q4_K_S
pulling manifest
pulling d5e517daeee4... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.9 GB
pulling 966de95ca8a6... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 96 B
pulling 9c65e8607c0c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 561 B
verifying sha256 digest
writing manifest
success

root@5ed8df208cf4:~# ollama list
NAME ID SIZE MODIFIED
llama3.2:3b-instruct-q4_K_S 80f2089878c9 1.9 GB 31 seconds ago
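A quick smoke test of the model straight from the CLI before exposing it (one-shot generation; the prompt is arbitrary):

root@5ed8df208cf4:~# ollama run llama3.2:3b-instruct-q4_K_S "Explain a Kubernetes CrashLoopBackOff in one sentence."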

Ollama's endpoint is publicly reachable through the proxy provided by RunPod.
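To confirm the endpoint answers through the proxy, Ollama's native tag-listing route can be queried from anywhere (the hostname below is the one RunPod assigned to this pod; it should return the llama3.2 model pulled above):

root@k0s-incus:~# curl https://vsr6spvysc6jly-11434.proxy.runpod.net/api/tags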

Modifying K9s to wire this endpoint and HolmesGPT in as a plug-in:

root@k0s-incus:~# cat ~/.config/k9s/plugins.yaml
plugins:
  holmesgpt:
    shortCut: Shift-H
    description: Ask HolmesGPT
    scopes:
      - all
    command: bash
    background: false
    confirm: false
    args:
      - -c
      - |
        holmes ask "why is $NAME of $RESOURCE_NAME in -n $NAMESPACE not working as expected" --model="openai/llama3.2:3b-instruct-q4_K_S"
        echo "Press 'q' to exit"
        while : ; do
          read -n 1 k <&1
          if [[ $k = q ]] ; then
            break
          fi
        done

root@k0s-incus:~# export OPENAI_API_BASE="https://vsr6spvysc6jly-11434.proxy.runpod.net/v1"
root@k0s-incus:~# export OPENAI_API_KEY=123
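These two variables are what route HolmesGPT's OpenAI-compatible client to the RunPod proxy; the API key is a dummy value, since Ollama does not verify it. The plug-in's behaviour can first be reproduced directly from the shell, which makes the endpoint configuration easier to debug (a sketch; the question is free-form):

root@k0s-incus:~# holmes ask "what is the health of my Kubernetes cluster?" --model="openai/llama3.2:3b-instruct-q4_K_S"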

Deploying a deliberately broken sample Pod into the Kubernetes cluster, using the examples provided by Robusta:

root@k0s-incus:~# kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/crashpod/broken.yaml
deployment.apps/payment-processing-worker created

root@k0s-incus:~# kubectl get po
NAME READY STATUS RESTARTS AGE
payment-processing-worker-747ccfb9db-njgmw 0/1 CrashLoopBackOff 1 (4s ago) 9s
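For comparison, the manual route to the same diagnosis would start with the pod's logs and events, using the pod name from the listing above:

root@k0s-incus:~# kubectl logs payment-processing-worker-747ccfb9db-njgmw
root@k0s-incus:~# kubectl describe pod payment-processing-worker-747ccfb9db-njgmw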

I can trigger the plug-in's HolmesGPT query with the Shift-H shortcut and get this first answer:

The payment-processing-container container has crashed and is being restarted for the 6th time due to a CrashLoopBackOff. The last state indicates that the container was terminated with an  
exit code of 0, which suggests that the command executed successfully but did not complete as expected.

To investigate further, you can check the logs of the payment-processing-container container to see if there are any error messages or clues about what is causing the issue. You can also
check the Kubernetes events for any other errors or warnings that may be related to this issue.

Additionally, you can try to debug the command executed by the payment-processing-container container to see if it's correct and working as expected. The command is:


if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10;done; fi

This command checks if the DEPLOY_ENV environment variable is set, and if it's not, it prints a message. If it is set, it enters an infinite loop that prints "hello" every 10 seconds.

If you're running this container in a Kubernetes pod, you can try to debug the issue by checking the pod's logs or using a tool like kubectl to inspect the container's state and logs.
Press 'q' to exit

Modifying the query, and another answer:

root@k0s-incus:~# cat ~/.config/k9s/plugins.yaml
plugins:
  holmesgpt:
    shortCut: Shift-H
    description: Ask HolmesGPT
    scopes:
      - all
    command: bash
    background: false
    confirm: false
    args:
      - -c
      - |
        holmes ask "why is $NAME of $RESOURCE_NAME in -n $NAMESPACE not working and why $NAME is crashed?" --model="openai/llama3.2:3b-instruct-q4_K_S"
        echo "Press 'q' to exit"
        while : ; do
          read -n 1 k <&1
          if [[ $k = q ]] ; then
            break
          fi
        done

The payment-processing-container container has crashed and is being restarted for the 6th time due to a CrashLoopBackOff. The last state indicates that the container was terminated with an  
exit code of 0, which suggests that the command executed successfully but did not produce any output.

To investigate further, you can check the logs of the payment-processing-container container to see if there are any error messages or clues about what is causing the crash:


kubectl logs payment-processing-worker-747ccfb9db-njgmw -c payment-processing-container

Additionally, you can check the configuration of the payment-processing-container container to ensure that it is running with the correct environment variables and settings.


kubectl describe pod payment-processing-worker-747ccfb9db-njgmw -c payment-processing-container

This will provide more detailed information about the container's configuration and any errors that may be occurring.

HolmesGPT can also be integrated more broadly with the Robusta platform, by installing it in the Kubernetes cluster with Helm …

To do so, generate a configuration YAML file along these lines:

root@k0s-incus:~# cat generated_values.yaml 
globalConfig:
  signing_key: 568927d5-6e65-4c13-b3fe-fdc50e616fde
  account_id: a4d7cea6-fba3-4ce6-ba3d-941b55ec83db
sinksConfig:
- robusta_sink:
    name: robusta_ui_sink
    token: <TOKEN>
enablePrometheusStack: true
kube-prometheus-stack:
  grafana:
    persistence:
      enabled: true
enablePlatformPlaybooks: true
runner:
  sendAdditionalTelemetry: true
enableHolmesGPT: true
holmes:
  additionalEnvVars:
    - name: ROBUSTA_AI
      value: "true"

Using the commands and the configuration YAML file provided by the Robusta platform. Note that with ROBUSTA_AI set to "true" above, the in-cluster Holmes relies on Robusta's hosted AI backend rather than the local Ollama endpoint used earlier:

root@k0s-incus:~# helm repo add robusta https://robusta-charts.storage.googleapis.com && helm repo update
"robusta" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "robusta" chart repository
Update Complete. ⎈Happy Helming!⎈
root@k0s-incus:~# helm install robusta robusta/robusta -f ./generated_values.yaml --set clusterName="k0s-cluster" \
--set isSmallCluster=true \
--set holmes.resources.requests.memory=512Mi \
--set kube-prometheus-stack.prometheus.prometheusSpec.retentionSize=9GB \
--set kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
--set kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory=512Mi
NAME: robusta
LAST DEPLOYED: Tue Jan 14 22:59:09 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thank you for installing Robusta 0.20.0

As an open source project, we collect general usage statistics.
This data is extremely limited and contains only general metadata to help us understand usage patterns.
If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data.
This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.
Note that if the Robusta UI is enabled, telemetry cannot be disabled even if ENABLE_TELEMETRY=false is set.

Visit the web UI at: https://platform.robusta.dev/

root@k0s-incus:~# helm ls -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
robusta default 2 2025-01-14 23:10:13.935906491 +0000 UTC deployed robusta-0.20.0 0.20.0
root@k0s-incus:~# kubectl get po,svc
NAME READY STATUS RESTARTS AGE
pod/alertmanager-robusta-kube-prometheus-st-alertmanager-0 0/2 Pending 0 2m
pod/payment-processing-worker-747ccfb9db-njgmw 0/1 CrashLoopBackOff 10 (2m33s ago) 28m
pod/prometheus-robusta-kube-prometheus-st-prometheus-0 0/2 Pending 0 2m
pod/robusta-forwarder-cd847ccc-wxc6d 1/1 Running 0 2m5s
pod/robusta-grafana-8588b8fb85-fv5vj 3/3 Running 0 2m5s
pod/robusta-holmes-55dd58ff6d-m4zth 1/1 Running 0 2m5s
pod/robusta-kube-prometheus-st-operator-6885c8f675-szncg 1/1 Running 0 2m5s
pod/robusta-kube-state-metrics-8667fd9775-s49z4 1/1 Running 0 2m5s
pod/robusta-prometheus-node-exporter-c6jvb 1/1 Running 0 2m5s
pod/robusta-prometheus-node-exporter-j6zp5 1/1 Running 0 2m5s
pod/robusta-runner-5d667b7d9c-dm2z7 1/1 Running 0 2m5s

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 2m1s
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 94m
service/prometheus-operated ClusterIP None <none> 9090/TCP 2m1s
service/robusta-forwarder ClusterIP 10.102.7.41 <none> 80/TCP 2m5s
service/robusta-grafana ClusterIP 10.106.69.72 <none> 80/TCP 2m5s
service/robusta-holmes ClusterIP 10.110.124.241 <none> 80/TCP 2m5s
service/robusta-kube-prometheus-st-alertmanager ClusterIP 10.105.101.210 <none> 9093/TCP,8080/TCP 2m5s
service/robusta-kube-prometheus-st-operator ClusterIP 10.103.213.208 <none> 443/TCP 2m5s
service/robusta-kube-prometheus-st-prometheus ClusterIP 10.107.13.104 <none> 9090/TCP,8080/TCP 2m5s
service/robusta-kube-state-metrics ClusterIP 10.103.53.30 <none> 8080/TCP 2m5s
service/robusta-prometheus-node-exporter ClusterIP 10.102.243.65 <none> 9104/TCP 2m5s
service/robusta-runner ClusterIP 10.97.82.15 <none> 80/TCP 2m5s

The alertmanager and prometheus pods are still Pending, most likely waiting for their storage claims, since this bare cluster does not define a default StorageClass. I can nevertheless complete the installation with this command:

root@k0s-incus:~# helm upgrade robusta robusta/robusta -f ./generated_values.yaml --set clusterName="k0s-cluster"
Release "robusta" has been upgraded. Happy Helming!
NAME: robusta
LAST DEPLOYED: Tue Jan 14 23:14:02 2025
NAMESPACE: default
STATUS: deployed
REVISION: 5
NOTES:
Thank you for installing Robusta 0.20.0

As an open source project, we collect general usage statistics.
This data is extremely limited and contains only general metadata to help us understand usage patterns.
If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data.
This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.
Note that if the Robusta UI is enabled, telemetry cannot be disabled even if ENABLE_TELEMETRY=false is set.

Visit the web UI at: https://platform.robusta.dev/

The cluster now shows up in Robusta.

And there as well, HolmesGPT can be used to query the platform about any issues encountered in the Kubernetes cluster.

All of it with a modest resource footprint in the cluster …

Using AI for troubleshooting and incident analysis reduces the time and human effort required, letting teams focus on more strategic tasks.

Tools like HolmesGPT and Ollama can be scaled according to demand, which is particularly useful in production environments where the workload can vary significantly.

We can therefore conclude that integrating AI into Kubernetes clusters with tools such as HolmesGPT and Ollama, together with GPU instance providers like RunPod, brings significant benefits in terms of efficiency, scalability and fault tolerance.

These technologies help streamline the application lifecycle, simplify troubleshooting and improve resource management, making Kubernetes operations more robust and more performant …

To be continued!
