AIOps: AI-Driven Investigation in Kubernetes with HolmesGPT, Ollama and RunPod …

Karim


In the world of container orchestration, Kubernetes has become the standard for managing containerized workloads. However, managing and troubleshooting Kubernetes clusters can be complex and time-consuming. This article explores how artificial intelligence (AI) can be integrated into Kubernetes to improve incident investigation and management. I had already touched on this subject in a previous article.

Here I will focus on HolmesGPT. Developed by Robusta, HolmesGPT is an open-source troubleshooting agent that uses AI to investigate incidents in Kubernetes clusters, with the following characteristics:

  • Integration with incident management tools: HolmesGPT connects to tools such as PagerDuty, OpsGenie and Prometheus to collect data and analyze alerts.
  • Automated investigation: using AI, HolmesGPT can identify and resolve issues such as expired SSL certificates, insufficient resources and node affinity problems, significantly reducing the time and effort troubleshooting requires (a minimal sketch of this interaction follows the list).
  • Customization: HolmesGPT lets you write custom runbooks to handle specific problems, using custom APIs and tools where needed.
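To give a concrete idea of the interaction model before building anything, the CLI boils down to asking questions in natural language. A minimal sketch (the ask subcommand is the one used throughout this article; the question itself is just an example):

holmes ask "what is wrong with the payment-processing deployment and how do I fix it?"

Everything that follows sets up a playground in which this kind of query can run against a real cluster.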

For this exercise, I will first spin up an Ubuntu 24.04 LTS instance, once again with the cloud provider DigitalOcean.

On it I will install Incus, a fork of LXD, which will serve as the basis for building a Kubernetes cluster out of several containers.

As with LXD, I will create several profiles. But first, installation of Incus on the instance:

root@k0s-incus:~# curl -fsSL https://pkgs.zabbly.com/key.asc | gpg --show-keys --fingerprint
gpg: directory '/root/.gnupg' created
gpg: keybox '/root/.gnupg/pubring.kbx' created
pub rsa3072 2023-08-23 [SC] [expires: 2025-08-22]
4EFC 5906 96CB 15B8 7C73 A3AD 82CC 8797 C838 DCFD
uid Zabbly Kernel Builds <info@zabbly.com>
sub rsa3072 2023-08-23 [E] [expires: 2025-08-22]

root@k0s-incus:~# mkdir -p /etc/apt/keyrings/

root@k0s-incus:~# curl -fsSL https://pkgs.zabbly.com/key.asc -o /etc/apt/keyrings/zabbly.asc

root@k0s-incus:~# sh -c 'cat <<EOF > /etc/apt/sources.list.d/zabbly-incus-stable.sources
Enabled: yes
Types: deb
URIs: https://pkgs.zabbly.com/incus/stable
Suites: $(. /etc/os-release && echo ${VERSION_CODENAME})
Components: main
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/zabbly.asc

EOF'
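Before trusting the new repository, it costs nothing to re-check the fingerprint of the key that was actually written to disk; the same gpg invocation as above, pointed at the saved file, should print the 4EFC 5906 … C838 DCFD fingerprint shown earlier:

root@k0s-incus:~# gpg --show-keys --fingerprint /etc/apt/keyrings/zabbly.asc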
root@k0s-incus:~# apt-get update

Hit:1 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:2 http://mirrors.digitalocean.com/ubuntu noble InRelease
Hit:3 https://repos-droplet.digitalocean.com/apt/droplet-agent main InRelease
Hit:4 http://mirrors.digitalocean.com/ubuntu noble-updates InRelease
Hit:5 http://mirrors.digitalocean.com/ubuntu noble-backports InRelease
Get:6 https://pkgs.zabbly.com/incus/stable noble InRelease [7358 B]
Get:7 https://pkgs.zabbly.com/incus/stable noble/main amd64 Packages [3542 B]
Fetched 10.9 kB in 1s (13.3 kB/s)
Reading package lists... Done
root@k0s-incus:~# apt-get install incus incus-client incus-ui-canonical -y
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
attr dconf-gsettings-backend dconf-service dns-root-data dnsmasq-base fontconfig genisoimage glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas
gstreamer1.0-plugins-base gstreamer1.0-plugins-good gstreamer1.0-x incus-base iw libaa1 libasyncns0 libavc1394-0 libboost-iostreams1.83.0 libboost-thread1.83.0 libbtrfs0t64 libcaca0
libcairo-gobject2 libcairo2 libcdparanoia0 libdatrie1 libdaxctl1 libdconf1 libdv4t64 libflac12t64 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgraphite2-3
libgstreamer-plugins-base1.0-0 libgstreamer-plugins-good1.0-0 libharfbuzz0b libiec61883-0 libmp3lame0 libmpg123-0t64 libndctl6 libnet1 libogg0 libopus0 liborc-0.4-0t64 libpango-1.0-0
libpangocairo-1.0-0 libpangoft2-1.0-0 libpixman-1-0 libpmem1 libpmemobj1 libproxy1v5 libpulse0 librados2 libraw1394-11 librbd1 librdmacm1t64 libshout3 libsndfile1 libsoup-3.0-0
libsoup-3.0-common libspeex1 libspice-server1 libtag1v5 libtag1v5-vanilla libthai-data libthai0 libtheora0 libtwolame0 libusbredirparser1t64 libv4l-0t64 libv4lconvert0t64 libvisual-0.4-0
libvorbis0a libvorbisenc2 libvpx9 libwavpack1 libx11-xcb1 libxcb-render0 libxcb-shm0 libxdamage1 libxfixes3 libxi6 libxrender1 libxtst6 libxv1 session-migration sshfs wireless-regdb
x11-common xdelta3
root@k0s-incus:~# incus
Description:
Command line client for Incus

All of Incus's features can be driven through the various commands below.
For help with any of those, simply call them with --help.

Custom commands can be defined through aliases, use "incus alias" to control those.

Usage:
incus [command]

Available Commands:
admin Manage incus daemon
cluster Manage cluster members
config Manage instance and server configuration options
console Attach to instance consoles
copy Copy instances within or in between servers
create Create instances from images
delete Delete instances
exec Execute commands in instances
export Export instance backups
file Manage files in instances
help Help about any command
image Manage images
import Import instance backups
info Show instance or server information
launch Create and start instances from images
list List instances
move Move instances within or in between servers
network Manage and attach instances to networks
pause Pause instances
profile Manage profiles
project Manage projects
publish Publish instances as images
rebuild Rebuild instances
remote Manage the list of remote servers
rename Rename instances
restart Restart instances
resume Resume instances
snapshot Manage instance snapshots
start Start instances
stop Stop instances
storage Manage storage pools and volumes
top Display resource usage info per instance
version Show local and remote versions
webui Open the web interface

Flags:
--all Show less common commands
--debug Show all debug messages
--force-local Force using the local unix socket
-h, --help Print help
--project Override the source project
-q, --quiet Don't show progress information
--sub-commands Use with help or --help to view sub-commands
-v, --verbose Show all information messages
--version Print version number

Use "incus [command] --help" for more information about a command.

Initializing Incus with a minimal configuration:

root@k0s-incus:~# incus admin init
Would you like to use clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]:
Name of the storage backend to use (btrfs, dir, lvm) [default=btrfs]: dir
Where should this storage pool store its data? [default=/var/lib/incus/storage-pools/default]:
Would you like to create a new local network bridge? (yes/no) [default=yes]:
What should the new bridge be called? [default=incusbr0]:
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]:
Would you like the server to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "init" preseed to be printed? (yes/no) [default=no]:

root@k0s-incus:~# incus list
+------+-------+------+------+------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+------+-------+------+------+------+-----------+

root@k0s-incus:~# incus profile list
+---------+-----------------------+---------+
| NAME | DESCRIPTION | USED BY |
+---------+-----------------------+---------+
| default | Default Incus profile | 0 |
+---------+-----------------------+---------+

root@k0s-incus:~# incus profile show default
config: {}
description: Default Incus profile
devices:
  eth0:
    name: eth0
    network: incusbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by: []
project: default

root@k0s-incus:~# incus profile create k8s

Incus ships with a control dashboard that can be served temporarily with incus webui.

Activating it:

root@k0s-incus:~# nohup incus webui &
[1] 4104
root@k0s-incus:~# nohup: ignoring input and appending output to 'nohup.out'

root@k0s-incus:~# cat nohup.out
Web server running at: http://127.0.0.1:34363/ui?auth_token=3c5f5d4b-f9ed-4bf9-a174-d5ea2366cfbf

Using pinggy.io to reach it:

root@k0s-incus:~# ssh -p 443 -R0:127.0.0.1:34363 a.pinggy.io

I fetch the same profile that LXD uses for MicroK8s (https://microk8s.io/docs/install-lxd):

root@k0s-incus:~# wget https://raw.githubusercontent.com/ubuntu/microk8s/master/tests/lxc/microk8s.profile -O k8s.profile
--2025-01-14 20:58:42-- https://raw.githubusercontent.com/ubuntu/microk8s/master/tests/lxc/microk8s.profile
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 816 [text/plain]
Saving to: ‘k8s.profile’

k8s.profile 100%[=====================================================================================================>] 816 --.-KB/s in 0s

2025-01-14 20:58:42 (33.4 MB/s) - ‘k8s.profile’ saved [816/816]

root@k0s-incus:~# cat k8s.profile | incus profile edit k8s
root@k0s-incus:~# rm k8s.profile
root@k0s-incus:~# incus profile show k8s
config:
  boot.autostart: "true"
  linux.kernel_modules: ip_vs,ip_vs_rr,ip_vs_wrr,ip_vs_sh,ip_tables,ip6_tables,netlink_diag,nf_nat,overlay,br_netfilter
  raw.lxc: |
    lxc.apparmor.profile=unconfined
    lxc.mount.auto=proc:rw sys:rw cgroup:rw
    lxc.cgroup.devices.allow=a
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "true"
description: ""
devices:
  aadisable:
    path: /sys/module/nf_conntrack/parameters/hashsize
    source: /sys/module/nf_conntrack/parameters/hashsize
    type: disk
  aadisable2:
    path: /dev/kmsg
    source: /dev/kmsg
    type: unix-char
  aadisable3:
    path: /sys/fs/bpf
    source: /sys/fs/bpf
    type: disk
  aadisable4:
    path: /proc/sys/net/netfilter/nf_conntrack_max
    source: /proc/sys/net/netfilter/nf_conntrack_max
    type: disk
name: k8s
used_by: []
project: default

Since Incus can consume cloud-init, I create a new profile for that purpose:

root@k0s-incus:~# incus profile show cloud
config:
  cloud-init.user-data: |
    #cloud-config
    package_update: true
    package_upgrade: true
    package_reboot_if_required: true
    packages:
      - vim
      - wget
      - git
      - curl
      - htop
      - openssh-server
    bootcmd:
      - systemctl enable ssh
      - systemctl start ssh
    ssh_authorized_keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCpbsaaVUMa2TM9q8VkeBmbKvJpbreXTcqI5F5N3riGsoZ7Z/IIN7eR6J47UP2bj3IBTdgHmij1uOexm60QBO2PY4abIhsN+xnVS4a0LSyI8v6nYECWbEehL/gFn6uDmSLA4m0hZCF5BSpLxQYzKS28dHIdXsLC4CDd67nAXIhOiVpM0q/AUCuSy+mA0VwFa/JAkFCk8TpQBorgwJIq635imrgxYIpEUA2wHXOhw23mO3zTUlay13LSlA2a1xyTkP8hSDWdRYVxr2DEB/MtmTX2BdWlA5rDRmzXE7R2/csE245WAxG+XfSu4zNqhHzm8Df3zmZn3/UyKLcx4eJF//mVZyrM7RQHRteA/im8I4IavrReGyCUKY+OsSfygYVFyO87rYQ+IOauOnB4LxBohBjSBN3Skk4X7krYFIi8D9R1lmL+VvBfpvy0YMurOahY1VJFzD0dUeK2bDUdeWzfFkcX039d9/RRXRxieNpxwp1BLPi5/DXG8FihzgwVTf6h60J9/fkYzY+BO8CKG2kYTUsy1ykuXLzLY5sTCREiEoEKcJ9IGz8OimZ1AmkgJJCrQnI6mT/KiNDU6YCc75ONKTKX5HKVPhZWT255Aw4f5LBbBrj06cJX3GuunV0I30+BYyHwLbPBoqgd4GUk3YJlr8wS3qre/YUSc2iKNDTOzFCC8Q== root@k0s-incus
description: incus with cloud-init
devices: {}
name: cloud
used_by: []
project: default
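The public key embedded in the profile above is simply the root key of the Incus host; if none exists yet, it can be generated beforehand (a sketch assuming the default /root/.ssh/id_rsa path, which is also the keyPath that k0sctl uses further down):

root@k0s-incus:~# ssh-keygen -t rsa -b 4096 -N "" -f /root/.ssh/id_rsa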

I am now ready to create the three containers that will serve as the foundation of the Kubernetes cluster:

root@k0s-incus:~# for i in {1..3}; do incus launch -p default -p k8s -p cloud images:ubuntu/24.04/cloud k0s-$i; done
Launching k0s-1
Launching k0s-2
Launching k0s-3

root@k0s-incus:~# incus list
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-1 | RUNNING | 10.224.160.99 (eth0) | fd42:4641:b619:c782:216:3eff:fea4:53d3 (eth0) | CONTAINER | 0 |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-2 | RUNNING | 10.224.160.54 (eth0) | fd42:4641:b619:c782:216:3eff:feee:7af8 (eth0) | CONTAINER | 0 |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
| k0s-3 | RUNNING | 10.224.160.215 (eth0) | fd42:4641:b619:c782:216:3eff:fef3:709b (eth0) | CONTAINER | 0 |
+-------+---------+-----------------------+-----------------------------------------------+-----------+-----------+
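Before connecting, it is worth making sure cloud-init has finished provisioning each container; the --wait flag blocks until provisioning is complete (cloud-init ships with the images:ubuntu/24.04/cloud image used here):

root@k0s-incus:~# incus exec k0s-1 -- cloud-init status --wait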

root@k0s-incus:~# cat .ssh/config
Host *
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null

root@k0s-incus:~# ssh ubuntu@10.224.160.99

Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-51-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/pro

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

ubuntu@k0s-1:~$

Fetching k0sctl to build a Kubernetes cluster with k0s:

root@k0s-incus:~# wget -c https://github.com/k0sproject/k0sctl/releases/download/v0.21.0/k0sctl-linux-amd64 && chmod +x k0sctl-linux-amd64 && mv k0sctl-linux-amd64 /usr/local/bin/k0sctl

Saving to: ‘k0sctl-linux-amd64’

k0sctl-linux-amd64 100%[=====================================================================================================>] 18.21M --.-KB/s in 0.1s

2025-01-14 21:22:23 (122 MB/s) - ‘k0sctl-linux-amd64’ saved [19091608/19091608]

root@k0s-incus:~# k0sctl
NAME:
k0sctl - k0s cluster management tool

USAGE:
k0sctl [global options] command [command options]

COMMANDS:
version Output k0sctl version
apply Apply a k0sctl configuration
kubeconfig Output the admin kubeconfig of the cluster
init Create a configuration template
reset Remove traces of k0s from all of the hosts
backup Take backup of existing clusters state
config Configuration related sub-commands
completion
help, h Shows a list of commands or help for one command

GLOBAL OPTIONS:
--debug, -d Enable debug logging (default: false) [$DEBUG]
--trace Enable trace logging (default: false) [$TRACE]
--no-redact Do not hide sensitive information in the output (default: false)
--help, -h show help
root@k0s-incus:~# k0sctl init --k0s > k0sctl.yaml
root@k0s-incus:~# cat k0sctl.yaml 
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: k0s-cluster
  user: admin
spec:
  hosts:
  - ssh:
      address: 10.224.160.99
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: controller
  - ssh:
      address: 10.224.160.54
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: worker
  - ssh:
      address: 10.224.160.215
      user: ubuntu
      port: 22
      keyPath: /root/.ssh/id_rsa
    role: worker
  k0s:
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: Cluster
      metadata:
        name: k0s
      spec:
        api:
          k0sApiPort: 9443
          port: 6443
        installConfig:
          users:
            etcdUser: etcd
            kineUser: kube-apiserver
            konnectivityUser: konnectivity-server
            kubeAPIserverUser: kube-apiserver
            kubeSchedulerUser: kube-scheduler
        konnectivity:
          adminPort: 8133
          agentPort: 8132
        network:
          kubeProxy:
            disabled: false
            mode: iptables
          kuberouter:
            autoMTU: true
            mtu: 0
            peerRouterASNs: ""
            peerRouterIPs: ""
          podCIDR: 10.244.0.0/16
          provider: kuberouter
          serviceCIDR: 10.96.0.0/12
        podSecurityPolicy:
          defaultPolicy: 00-k0s-privileged
        storage:
          type: etcd
        telemetry:
          enabled: true

With the generated template pointed at the three containers, launching the creation:

root@k0s-incus:~# k0sctl apply --config k0sctl.yaml

⠀⣿⣿⡇⠀⠀⢀⣴⣾⣿⠟⠁⢸⣿⣿⣿⣿⣿⣿⣿⡿⠛⠁⠀⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀█████████ █████████ ███
⠀⣿⣿⡇⣠⣶⣿⡿⠋⠀⠀⠀⢸⣿⡇⠀⠀⠀⣠⠀⠀⢀⣠⡆⢸⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀███ ███ ███
⠀⣿⣿⣿⣿⣟⠋⠀⠀⠀⠀⠀⢸⣿⡇⠀⢰⣾⣿⠀⠀⣿⣿⡇⢸⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀███ ███ ███
⠀⣿⣿⡏⠻⣿⣷⣤⡀⠀⠀⠀⠸⠛⠁⠀⠸⠋⠁⠀⠀⣿⣿⡇⠈⠉⠉⠉⠉⠉⠉⠉⠉⢹⣿⣿⠀███ ███ ███
⠀⣿⣿⡇⠀⠀⠙⢿⣿⣦⣀⠀⠀⠀⣠⣶⣶⣶⣶⣶⣶⣿⣿⡇⢰⣶⣶⣶⣶⣶⣶⣶⣶⣾⣿⣿⠀█████████ ███ ██████████
k0sctl v0.21.0 Copyright 2023, k0sctl authors.
By continuing to use k0sctl you agree to these terms:
https://k0sproject.io/licenses/eula
INFO ==> Running phase: Set k0s version
INFO Looking up latest stable k0s version
INFO Using k0s version v1.31.3+k0s.0
INFO ==> Running phase: Connect to hosts
INFO [ssh] 10.224.160.215:22: connected
INFO [ssh] 10.224.160.99:22: connected
INFO [ssh] 10.224.160.54:22: connected
INFO ==> Running phase: Detect host operating systems
INFO [ssh] 10.224.160.215:22: is running Ubuntu 24.04.1 LTS
INFO [ssh] 10.224.160.99:22: is running Ubuntu 24.04.1 LTS
INFO [ssh] 10.224.160.54:22: is running Ubuntu 24.04.1 LTS
INFO ==> Running phase: Acquire exclusive host lock
INFO ==> Running phase: Prepare hosts
INFO ==> Running phase: Gather host facts
INFO [ssh] 10.224.160.215:22: using k0s-3 as hostname
INFO [ssh] 10.224.160.54:22: using k0s-2 as hostname
INFO [ssh] 10.224.160.99:22: using k0s-1 as hostname
INFO [ssh] 10.224.160.215:22: discovered eth0 as private interface
INFO [ssh] 10.224.160.54:22: discovered eth0 as private interface
INFO [ssh] 10.224.160.99:22: discovered eth0 as private interface
INFO ==> Running phase: Validate hosts
INFO ==> Running phase: Validate facts
INFO ==> Running phase: Download k0s on hosts
INFO [ssh] 10.224.160.215:22: downloading k0s v1.31.3+k0s.0
INFO [ssh] 10.224.160.54:22: downloading k0s v1.31.3+k0s.0
INFO [ssh] 10.224.160.99:22: downloading k0s v1.31.3+k0s.0
INFO ==> Running phase: Install k0s binaries on hosts
INFO [ssh] 10.224.160.99:22: validating configuration
INFO ==> Running phase: Configure k0s
INFO [ssh] 10.224.160.99:22: installing new configuration
INFO ==> Running phase: Initialize the k0s cluster
INFO [ssh] 10.224.160.99:22: installing k0s controller
INFO [ssh] 10.224.160.99:22: waiting for the k0s service to start
INFO [ssh] 10.224.160.99:22: wait for kubernetes to reach ready state
INFO ==> Running phase: Install workers
INFO [ssh] 10.224.160.99:22: generating a join token for worker 1
INFO [ssh] 10.224.160.99:22: generating a join token for worker 2
INFO [ssh] 10.224.160.215:22: validating api connection to https://10.224.160.99:6443 using join token
INFO [ssh] 10.224.160.54:22: validating api connection to https://10.224.160.99:6443 using join token
INFO [ssh] 10.224.160.215:22: writing join token to /etc/k0s/k0stoken
INFO [ssh] 10.224.160.54:22: writing join token to /etc/k0s/k0stoken
INFO [ssh] 10.224.160.54:22: installing k0s worker
INFO [ssh] 10.224.160.215:22: installing k0s worker
INFO [ssh] 10.224.160.215:22: starting service
INFO [ssh] 10.224.160.215:22: waiting for node to become ready
INFO [ssh] 10.224.160.54:22: starting service
INFO [ssh] 10.224.160.54:22: waiting for node to become ready
INFO ==> Running phase: Release exclusive host lock
INFO ==> Running phase: Disconnect from hosts
INFO ==> Finished in 42s
INFO k0s cluster version v1.31.3+k0s.0 is now installed
INFO Tip: To access the cluster you can now fetch the admin kubeconfig using:
INFO k0sctl kubeconfig

The cluster is up (note that only the two workers will appear in the node listing: by default a k0s controller does not run a kubelet and so does not register as a node). Fetching kubectl to interact with it:

root@k0s-incus:~# curl -LO https://dl.k8s.io/release/v1.31.3/bin/linux/amd64/kubectl && chmod +x kubectl && mv kubectl /usr/local/bin/
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 138 100 138 0 0 923 0 --:--:-- --:--:-- --:--:-- 926
100 53.7M 100 53.7M 0 0 476k 0 0:01:55 0:01:55 --:--:-- 1023k

root@k0s-incus:~# mkdir .kube
root@k0s-incus:~# k0sctl kubeconfig --config k0sctl.yaml > .kube/config
root@k0s-incus:~# kubectl cluster-info
Kubernetes control plane is running at https://10.224.160.99:6443
CoreDNS is running at https://10.224.160.99:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

root@k0s-incus:~# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k0s-2 Ready <none> 5m1s v1.31.3+k0s 10.224.160.54 <none> Ubuntu 24.04.1 LTS 6.8.0-51-generic containerd://1.7.24
k0s-3 Ready <none> 5m1s v1.31.3+k0s 10.224.160.215 <none> Ubuntu 24.04.1 LTS 6.8.0-51-generic containerd://1.7.24

root@k0s-incus:~# kubectl get po,svc -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-645c5d6f5b-kgnsf 1/1 Running 0 5m2s
kube-system pod/coredns-645c5d6f5b-n2rbk 1/1 Running 0 5m2s
kube-system pod/konnectivity-agent-2dg8l 1/1 Running 0 5m4s
kube-system pod/konnectivity-agent-5l5dl 1/1 Running 0 5m4s
kube-system pod/kube-proxy-cx47n 1/1 Running 0 5m7s
kube-system pod/kube-proxy-sp5fd 1/1 Running 0 5m7s
kube-system pod/kube-router-6l4qv 1/1 Running 0 5m7s
kube-system pod/kube-router-b9t89 1/1 Running 0 5m7s
kube-system pod/metrics-server-78c4ccbc7f-jxpzz 1/1 Running 0 5m1s

NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 5m17s
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 5m7s
kube-system service/metrics-server ClusterIP 10.109.44.51 <none> 443/TCP 5m1s

It is now possible to install HolmesGPT via pipx:

root@k0s-incus:~# apt install pipx -y

root@k0s-incus:~# pipx ensurepath

Success! Added /root/.local/bin to the PATH environment variable.

Consider adding shell completions for pipx. Run 'pipx completions' for instructions.

You will need to open a new terminal or re-login for the PATH changes to take effect.

Otherwise pipx is ready to go! ✨ 🌟 ✨

root@k0s-incus:~# pipx install "https://github.com/robusta-dev/holmesgpt/archive/refs/heads/master.zip"
installed package holmesgpt 0.1.0, installed using Python 3.12.3
These apps are now globally available
- holmes
done! ✨ 🌟 ✨
root@k0s-incus:~# holmes version
/root/.local/share/pipx/venvs/holmesgpt/lib/python3.12/site-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
warnings.warn(message, UserWarning)
HEAD -> master-bfafbde3

Alongside it, I grab K9s, which provides a terminal UI for interacting with Kubernetes clusters. The aim of the project is to make it easier to navigate, observe and manage your applications. K9s continually watches Kubernetes for changes and offers commands to interact with the observed resources.

root@k0s-incus:~# wget -c https://github.com/derailed/k9s/releases/download/v0.32.7/k9s_linux_amd64.deb
HTTP request sent, awaiting response... 200 OK
Length: 31832132 (30M) [application/octet-stream]
Saving to: ‘k9s_linux_amd64.deb’

k9s_linux_amd64.deb 100%[=====================================================================================================>] 30.36M --.-KB/s in 0.1s

2025-01-14 21:40:07 (291 MB/s) - ‘k9s_linux_amd64.deb’ saved [31832132/31832132]

root@k0s-incus:~# apt install -f ./k9s_linux_amd64.deb
root@k0s-incus:~# k9s --help
K9s is a CLI to view and manage your Kubernetes clusters.

Usage:
k9s [flags]
k9s [command]

Available Commands:
completion Generate the autocompletion script for the specified shell
help Help about any command
info List K9s configurations info
version Print version/build info

Flags:
-A, --all-namespaces Launch K9s in all namespaces
--as string Username to impersonate for the operation
--as-group stringArray Group to impersonate for the operation
--certificate-authority string Path to a cert file for the certificate authority
--client-certificate string Path to a client certificate file for TLS
--client-key string Path to a client key file for TLS
--cluster string The name of the kubeconfig cluster to use
-c, --command string Overrides the default resource to load when the application launches
--context string The name of the kubeconfig context to use
--crumbsless Turn K9s crumbs off
--headless Turn K9s header off
-h, --help help for k9s
--insecure-skip-tls-verify If true, the server's caCertFile will not be checked for validity
--kubeconfig string Path to the kubeconfig file to use for CLI requests
--logFile string Specify the log file (default "/root/.local/state/k9s/k9s.log")
-l, --logLevel string Specify a log level (error, warn, info, debug, trace) (default "info")
--logoless Turn K9s logo off
-n, --namespace string If present, the namespace scope for this CLI request
--readonly Sets readOnly mode by overriding readOnly configuration setting
-r, --refresh int Specify the default refresh rate as an integer (sec) (default 2)
--request-timeout string The length of time to wait before giving up on a single server request
--screen-dump-dir string Sets a path to a dir for a screen dumps
--token string Bearer token for authentication to the API server
--user string The name of the kubeconfig user to use
--write Sets write mode by overriding the readOnly configuration setting

Use "k9s [command] --help" for more information about a command.

Ollama, which makes it possible to run open LLMs locally, can be deployed to provide natural-language processing capabilities directly in your environment, without depending on external cloud services.

By integrating Ollama with your troubleshooting tools, you can generate answers and remediation suggestions based on the analysis of your Kubernetes cluster's logs and data.
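What makes this combination work with HolmesGPT is that Ollama exposes an OpenAI-compatible API under the /v1 path, which is what the openai/ model prefix used later in this article expects. A quick sanity check against a running instance (a sketch: the host and port assume a default local Ollama; the model name is the one pulled further down):

curl http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b-instruct-q4_K_S", "messages": [{"role": "user", "content": "Say hello"}]}'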

To run it, I turn to RunPod, a platform for executing natural-language processing and other AI workloads. RunPod lets you create custom pod environments to run language models like Ollama serves, or other AI applications.

Creating a GPU Pod, which will let me run Ollama …

I can connect to it over SSH:

Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.5.0-44-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

______ ______ _
(_____ \ (_____ \ | |
_____) ) _ _ ____ _____) )___ __| |
| __ / | | | || _ \ | ____// _ \ / _ |
| | \ \ | |_| || | | || | | |_| |( (_| |
|_| |_||____/ |_| |_||_| \___/ \____|

For detailed documentation and guides, please visit:
https://docs.runpod.io/ and https://blog.runpod.io/


root@5ed8df208cf4:~# nvidia-smi
Tue Jan 14 22:03:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti On | 00000000:81:00.0 Off | N/A |
| 0% 28C P8 11W / 285W | 2MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Running Ollama:

root@5ed8df208cf4:~# apt update 2> /dev/null && apt install -qq lshw -y 2> /dev/null

root@5ed8df208cf4:~# export OLLAMA_HOST=0.0.0.0:11434
root@5ed8df208cf4:~# (curl -fsSL https://ollama.com/install.sh | sh && ollama serve > ollama.log 2>&1) &
[1] 950
root@5ed8df208cf4:~# >>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

root@5ed8df208cf4:~# netstat -tunlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7861 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:8081 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:8001 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 70/sshd: /usr/sbin/
tcp 0 0 0.0.0.0:3001 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 0.0.0.0:9091 0.0.0.0:* LISTEN 52/nginx: master pr
tcp 0 0 127.0.0.11:39145 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN 70/sshd: /usr/sbin/
tcp6 0 0 :::11434 :::* LISTEN 1006/ollama
udp 0 0 127.0.0.11:33663 0.0.0.0:* -

Pulling a model, Llama 3.2:

root@5ed8df208cf4:~# ollama pull llama3.2:3b-instruct-q4_K_S
pulling manifest
pulling d5e517daeee4... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.9 GB
pulling 966de95ca8a6... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 6.0 KB
pulling 56bb8bd477a5... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 96 B
pulling 9c65e8607c0c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 561 B
verifying sha256 digest
writing manifest
success

root@5ed8df208cf4:~# ollama list
NAME ID SIZE MODIFIED
llama3.2:3b-instruct-q4_K_S 80f2089878c9 1.9 GB 31 seconds ago
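A quick smoke test of the model straight from the CLI before exposing it (one-shot generation; the prompt is arbitrary):

root@5ed8df208cf4:~# ollama run llama3.2:3b-instruct-q4_K_S "Explain a Kubernetes CrashLoopBackOff in one sentence."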

Ollama's endpoint is publicly reachable through the proxy provided by RunPod.
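To confirm the endpoint answers through the proxy, Ollama's native tag-listing route can be queried from anywhere (the hostname below is the one RunPod assigned to this pod; it should return the llama3.2 model pulled above):

root@k0s-incus:~# curl https://vsr6spvysc6jly-11434.proxy.runpod.net/api/tags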

Modifying K9s to wire this endpoint and HolmesGPT in as a plug-in:

root@k0s-incus:~# cat ~/.config/k9s/plugins.yaml
plugins:
  holmesgpt:
    shortCut: Shift-H
    description: Ask HolmesGPT
    scopes:
      - all
    command: bash
    background: false
    confirm: false
    args:
      - -c
      - |
        holmes ask "why is $NAME of $RESOURCE_NAME in -n $NAMESPACE not working as expected" --model="openai/llama3.2:3b-instruct-q4_K_S"
        echo "Press 'q' to exit"
        while : ; do
          read -n 1 k <&1
          if [[ $k = q ]] ; then
            break
          fi
        done

root@k0s-incus:~# export OPENAI_API_BASE="https://vsr6spvysc6jly-11434.proxy.runpod.net/v1"
root@k0s-incus:~# export OPENAI_API_KEY=123
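These two variables are what route HolmesGPT's OpenAI-compatible client to the RunPod proxy; the API key is a dummy value, since Ollama does not verify it. The plug-in's behaviour can first be reproduced directly from the shell, which makes the endpoint configuration easier to debug (a sketch; the question is free-form):

root@k0s-incus:~# holmes ask "what is the health of my Kubernetes cluster?" --model="openai/llama3.2:3b-instruct-q4_K_S"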

Deploying a deliberately broken sample Pod into the Kubernetes cluster, using the examples provided by Robusta:

root@k0s-incus:~# kubectl apply -f https://raw.githubusercontent.com/robusta-dev/kubernetes-demos/main/crashpod/broken.yaml
deployment.apps/payment-processing-worker created

root@k0s-incus:~# kubectl get po
NAME READY STATUS RESTARTS AGE
payment-processing-worker-747ccfb9db-njgmw 0/1 CrashLoopBackOff 1 (4s ago) 9s
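For comparison, the manual route to the same diagnosis would start with the pod's logs and events, using the pod name from the listing above:

root@k0s-incus:~# kubectl logs payment-processing-worker-747ccfb9db-njgmw
root@k0s-incus:~# kubectl describe pod payment-processing-worker-747ccfb9db-njgmw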

I can trigger the plug-in's HolmesGPT query with the Shift-H shortcut and get this first answer:

The payment-processing-container container has crashed and is being restarted for the 6th time due to a CrashLoopBackOff. The last state indicates that the container was terminated with an  
exit code of 0, which suggests that the command executed successfully but did not complete as expected.

To investigate further, you can check the logs of the payment-processing-container container to see if there are any error messages or clues about what is causing the issue. You can also
check the Kubernetes events for any other errors or warnings that may be related to this issue.

Additionally, you can try to debug the command executed by the payment-processing-container container to see if it's correct and working as expected. The command is:


if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10;done; fi

This command checks if the DEPLOY_ENV environment variable is set, and if it's not, it prints a message. If it is set, it enters an infinite loop that prints "hello" every 10 seconds.

If you're running this container in a Kubernetes pod, you can try to debug the issue by checking the pod's logs or using a tool like kubectl to inspect the container's state and logs.
Press 'q' to exit

Modifying the query, and another answer:

root@k0s-incus:~# cat ~/.config/k9s/plugins.yaml
plugins:
  holmesgpt:
    shortCut: Shift-H
    description: Ask HolmesGPT
    scopes:
      - all
    command: bash
    background: false
    confirm: false
    args:
      - -c
      - |
        holmes ask "why is $NAME of $RESOURCE_NAME in -n $NAMESPACE not working and why $NAME is crashed?" --model="openai/llama3.2:3b-instruct-q4_K_S"
        echo "Press 'q' to exit"
        while : ; do
          read -n 1 k <&1
          if [[ $k = q ]] ; then
            break
          fi
        done

The payment-processing-container container has crashed and is being restarted for the 6th time due to a CrashLoopBackOff. The last state indicates that the container was terminated with an  
exit code of 0, which suggests that the command executed successfully but did not produce any output.

To investigate further, you can check the logs of the payment-processing-container container to see if there are any error messages or clues about what is causing the crash:


kubectl logs payment-processing-worker-747ccfb9db-njgmw -c payment-processing-container

Additionally, you can check the configuration of the payment-processing-container container to ensure that it is running with the correct environment variables and settings.


kubectl describe pod payment-processing-worker-747ccfb9db-njgmw -c payment-processing-container

This will provide more detailed information about the container's configuration and any errors that may be occurring.

HolmesGPT can also be integrated more broadly with the Robusta platform, by installing it in the Kubernetes cluster with Helm …

To do so, generate a configuration YAML file along these lines:

root@k0s-incus:~# cat generated_values.yaml 
globalConfig:
  signing_key: 568927d5-6e65-4c13-b3fe-fdc50e616fde
  account_id: a4d7cea6-fba3-4ce6-ba3d-941b55ec83db
sinksConfig:
- robusta_sink:
    name: robusta_ui_sink
    token: <TOKEN>
enablePrometheusStack: true
kube-prometheus-stack:
  grafana:
    persistence:
      enabled: true
enablePlatformPlaybooks: true
runner:
  sendAdditionalTelemetry: true
enableHolmesGPT: true
holmes:
  additionalEnvVars:
    - name: ROBUSTA_AI
      value: "true"

Using the commands and the configuration YAML file provided by the Robusta platform. Note that with ROBUSTA_AI set to "true" above, the in-cluster Holmes relies on Robusta's hosted AI backend rather than the local Ollama endpoint used earlier:

root@k0s-incus:~# helm repo add robusta https://robusta-charts.storage.googleapis.com && helm repo update
"robusta" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "robusta" chart repository
Update Complete. ⎈Happy Helming!⎈
root@k0s-incus:~# helm install robusta robusta/robusta -f ./generated_values.yaml --set clusterName="k0s-cluster" \
--set isSmallCluster=true \
--set holmes.resources.requests.memory=512Mi \
--set kube-prometheus-stack.prometheus.prometheusSpec.retentionSize=9GB \
--set kube-prometheus-stack.prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
--set kube-prometheus-stack.prometheus.prometheusSpec.resources.requests.memory=512Mi
NAME: robusta
LAST DEPLOYED: Tue Jan 14 22:59:09 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
NOTES:
Thank you for installing Robusta 0.20.0

As an open source project, we collect general usage statistics.
This data is extremely limited and contains only general metadata to help us understand usage patterns.
If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data.
This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.
Note that if the Robusta UI is enabled, telemetry cannot be disabled even if ENABLE_TELEMETRY=false is set.

Visit the web UI at: https://platform.robusta.dev/

root@k0s-incus:~# helm ls -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
robusta default 2 2025-01-14 23:10:13.935906491 +0000 UTC deployed robusta-0.20.0 0.20.0
root@k0s-incus:~# kubectl get po,svc
NAME READY STATUS RESTARTS AGE
pod/alertmanager-robusta-kube-prometheus-st-alertmanager-0 0/2 Pending 0 2m
pod/payment-processing-worker-747ccfb9db-njgmw 0/1 CrashLoopBackOff 10 (2m33s ago) 28m
pod/prometheus-robusta-kube-prometheus-st-prometheus-0 0/2 Pending 0 2m
pod/robusta-forwarder-cd847ccc-wxc6d 1/1 Running 0 2m5s
pod/robusta-grafana-8588b8fb85-fv5vj 3/3 Running 0 2m5s
pod/robusta-holmes-55dd58ff6d-m4zth 1/1 Running 0 2m5s
pod/robusta-kube-prometheus-st-operator-6885c8f675-szncg 1/1 Running 0 2m5s
pod/robusta-kube-state-metrics-8667fd9775-s49z4 1/1 Running 0 2m5s
pod/robusta-prometheus-node-exporter-c6jvb 1/1 Running 0 2m5s
pod/robusta-prometheus-node-exporter-j6zp5 1/1 Running 0 2m5s
pod/robusta-runner-5d667b7d9c-dm2z7 1/1 Running 0 2m5s

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 2m1s
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 94m
service/prometheus-operated ClusterIP None <none> 9090/TCP 2m1s
service/robusta-forwarder ClusterIP 10.102.7.41 <none> 80/TCP 2m5s
service/robusta-grafana ClusterIP 10.106.69.72 <none> 80/TCP 2m5s
service/robusta-holmes ClusterIP 10.110.124.241 <none> 80/TCP 2m5s
service/robusta-kube-prometheus-st-alertmanager ClusterIP 10.105.101.210 <none> 9093/TCP,8080/TCP 2m5s
service/robusta-kube-prometheus-st-operator ClusterIP 10.103.213.208 <none> 443/TCP 2m5s
service/robusta-kube-prometheus-st-prometheus ClusterIP 10.107.13.104 <none> 9090/TCP,8080/TCP 2m5s
service/robusta-kube-state-metrics ClusterIP 10.103.53.30 <none> 8080/TCP 2m5s
service/robusta-prometheus-node-exporter ClusterIP 10.102.243.65 <none> 9104/TCP 2m5s
service/robusta-runner ClusterIP 10.97.82.15 <none> 80/TCP 2m5s

The alertmanager and prometheus pods are still Pending, most likely waiting for their storage claims, since this bare cluster does not define a default StorageClass. I can nevertheless complete the installation with this command:

root@k0s-incus:~# helm upgrade robusta robusta/robusta -f ./generated_values.yaml --set clusterName="k0s-cluster"
Release "robusta" has been upgraded. Happy Helming!
NAME: robusta
LAST DEPLOYED: Tue Jan 14 23:14:02 2025
NAMESPACE: default
STATUS: deployed
REVISION: 5
NOTES:
Thank you for installing Robusta 0.20.0

As an open source project, we collect general usage statistics.
This data is extremely limited and contains only general metadata to help us understand usage patterns.
If you are willing to share additional data, please do so! It really help us improve Robusta.

You can set sendAdditionalTelemetry: true as a Helm value to send exception reports and additional data.
This is disabled by default.

To opt-out of telemetry entirely, set a ENABLE_TELEMETRY=false environment variable on the robusta-runner deployment.
Note that if the Robusta UI is enabled, telemetry cannot be disabled even if ENABLE_TELEMETRY=false is set.

Visit the web UI at: https://platform.robusta.dev/

The cluster now shows up in Robusta.

And there as well, HolmesGPT can be used to query the platform about any issues encountered in the Kubernetes cluster.

All of it with a modest resource footprint in the cluster …

Using AI for troubleshooting and incident analysis reduces the time and human effort required, letting teams focus on more strategic tasks.

Tools like HolmesGPT and Ollama can be scaled according to demand, which is particularly useful in production environments where the workload can vary significantly.

We can therefore conclude that integrating AI into Kubernetes clusters with tools such as HolmesGPT and Ollama, together with GPU instance providers like RunPod, brings significant benefits in terms of efficiency, scalability and fault tolerance.

These technologies help streamline the application lifecycle, simplify troubleshooting and improve resource management, making Kubernetes operations more robust and more performant …

To be continued!
