Changes from all commits
44 commits
381e252
Add Prime install
hierynomus Oct 24, 2024
6f2f4d3
Fix param order
hierynomus Oct 24, 2024
b73ad1d
Add observability service token methods
hierynomus Oct 25, 2024
4f55817
Add ai-model workload
hierynomus Nov 4, 2024
82fea31
Fixup helm charts and add cli install
hierynomus Nov 5, 2024
afe86ea
Make lint happy
hierynomus Nov 5, 2024
24d8f60
Remove hardcoded clusterissuer
hierynomus Nov 5, 2024
dae759c
Remove hardcoded annotation
hierynomus Nov 5, 2024
37edf32
Added monitor and assets
hierynomus Nov 7, 2024
1affbcb
Switched parameters
hierynomus Nov 7, 2024
a78615b
Fix download script
hierynomus Nov 7, 2024
4b15967
Add fleet assets
hierynomus Nov 7, 2024
05e0290
Added cpu throttling monitor in assets
hierynomus Nov 12, 2024
b092e60
New monitors
hierynomus Jan 16, 2025
01a3da0
Fix monitor naming
hierynomus Jan 16, 2025
4cd3229
New remediation guide
hierynomus Jan 17, 2025
e72cd45
30s interval for cert expiration
hierynomus Jan 21, 2025
d933f57
Added Observability platform install functions
hierynomus Jan 24, 2025
9e048fe
Add longhorn install
hierynomus Jan 24, 2025
742f7eb
Fix quo5te
hierynomus Jan 24, 2025
58346bb
Add checks for setup
hierynomus Jan 30, 2025
15d4f11
added retry limit
Jan 31, 2025
14332cf
Merge pull request #33 from rmahique/develop
hierynomus Jan 31, 2025
6729405
fix
Jan 31, 2025
39bd1bc
Merge pull request #34 from rmahique/develop
rmahique Jan 31, 2025
6ea79dc
Fix for if-statement cert-manager
hierynomus Feb 3, 2025
52fda4d
Remove dependency on opensource.suse.com
hierynomus Feb 3, 2025
17fada2
added nowait
hierynomus Feb 4, 2025
f038e5b
use kubectl wait to wait for pods
hierynomus Feb 6, 2025
c9e9924
Fix script
hierynomus Feb 6, 2025
b8f269b
fix on fix
hierynomus Feb 6, 2025
62cfd82
More logging for create cluster
hierynomus Feb 6, 2025
f844ccf
added functions to retrieve rancher and kubernetes versions
Feb 7, 2025
35903c1
Merge branch 'SUSE:prime' into prime
rmahique Feb 7, 2025
d7f20eb
Merge pull request #35 from rmahique/prime
rmahique Feb 7, 2025
6510ef0
Make download and unzip quieter
hierynomus Feb 21, 2025
6307586
Fix longhorn install
hierynomus Mar 20, 2025
ec33d04
Break if CAPI not ready
hierynomus Mar 24, 2025
56aed3f
Try to get a hand on why CAPI sometimes is not ready
hierynomus Mar 25, 2025
23475fe
Fix ingestion-api-key -> service-token
hierynomus Jul 24, 2025
8dde096
Fix auth of sts command
hierynomus Jul 30, 2025
a9362e3
Fixes for rodeo
hierynomus Oct 3, 2025
ad058c5
Echo resp to see whether it's OK
hierynomus Oct 3, 2025
1206e0e
Added logging
hierynomus Oct 3, 2025
16 changes: 16 additions & 0 deletions assets/fleet/clustergroup.yaml
@@ -0,0 +1,16 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: build-a-dino
  annotations:
    {}
    # key: string
  labels:
    {}
    # key: string
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      gpu-enabled: 'true'
      app: build-a-dino
24 changes: 24 additions & 0 deletions assets/fleet/gitrepo.yaml
@@ -0,0 +1,24 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: build-a-dino
  annotations:
    {}
    # key: string
  labels:
    {}
    # key: string
  namespace: fleet-default
spec:
  branch: main
  correctDrift:
    enabled: true
    # force: boolean
    # keepFailHistory: boolean
  insecureSkipTLSVerify: false
  paths:
    - /fleet/build-a-dino
    # - string
  repo: https://github.com/wiredquill/prime-rodeo
  targets:
    - clusterGroup: build-a-dino
48 changes: 48 additions & 0 deletions assets/monitors/certificate-expiration.yaml
@@ -0,0 +1,48 @@
nodes:
- _type: Monitor
arguments:
criticalThreshold: 1w
deviatingThreshold: 30d
query: type = "secret" AND label = "secret-type:certificate"
resourceName: Certificate
timestampProperty: certificateExpiration
description: Verify certificates that are close to their expiration date
function: {{ get "urn:stackpack:common:monitor-function:topology-timestamp-threshold-monitor" }}
id: -12
identifier: urn:custom:monitor:certificate-expiration-v2
intervalSeconds: 30
name: Certificate Expiration V2
remediationHint: |

Certificate expiration date `\{{certificateExpiration\}}`.

### Obtain new TLS certificates

If you're using a Certificate Authority (CA) or a third-party provider, follow their procedures to obtain a new TLS certificate.
Once validated, download the new TLS certificate and the corresponding private key from the third-party provider's dashboard or via their API.
When you have downloaded these two files, you can update the Secret with the new certificate and key data.

```
kubectl create secret tls \{{name\}} --cert=path/to/new/certificate.crt --key=path/to/new/private.key
```
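
If a Secret with that name already exists, `kubectl create` will fail. One way around this (a sketch) is to generate the manifest client-side and apply it over the existing Secret:

```
kubectl create secret tls \{{name\}} --cert=path/to/new/certificate.crt --key=path/to/new/private.key \
  --dry-run=client -o yaml | kubectl apply -f -
```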

### Generate new self-signed certificates

If you're using self-signed certificates, you can generate new ones locally and update the Secret with the new certificate and key data.
Use tools like OpenSSL to generate new self-signed certificates.

```
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout path/to/new/private.key -out path/to/new/certificate.crt
```

Update the Secret with the new certificate and key data.

```
kubectl create secret tls \{{name\}} --cert=path/to/new/certificate.crt --key=path/to/new/private.key
```

Alternatively, you can edit the existing Secret with **`kubectl edit secret \{{name\}}`** and replace the certificate and key data with the new ones obtained from the third-party provider or generated locally.
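
To confirm the expiration date of the certificate currently stored in the Secret, a sketch (assumes the certificate is stored under the `tls.crt` key and that `openssl` and a Linux shell are available):

```
kubectl get secret \{{name\}} -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
```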
status: ENABLED
tags:
- certificate
- secret
81 changes: 81 additions & 0 deletions assets/monitors/http-error-ratio-for-service.yaml
@@ -0,0 +1,81 @@
_version: 1.0.85
nodes:
- _type: Monitor
arguments:
deviatingThreshold: 0.05
loggingLevel: WARN
timeWindow: 2 minutes
description: |-
HTTP responses with a status code in the 5xx range indicate server-side errors such as a misconfiguration, overload or internal server errors.
To ensure a good user experience, the percentage of 5xx responses should be less than the configured percentage (5% is the default) of the total HTTP responses for a Kubernetes (K8s) service.
To understand the full monitor definition check the details.
Because the exact threshold and severity might be application-dependent, the thresholds can be overridden via a Kubernetes annotation on the service. For example, to override the pre-configured deviating threshold and instead only have a critical threshold at 6%, put this annotation on your service:
```
monitor.kubernetes-v2.stackstate.io/http-error-ratio-for-service: |
  {
    "criticalThreshold": 0.06,
    "deviatingThreshold": null
  }
```
Omitting the deviating threshold from this JSON snippet would have kept it at the configured 5%; with the critical threshold at 6%, that means the monitor would only result in a deviating state for an error ratio between 5% and 6%.
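As an illustration, the same override can be applied from the command line (a sketch; substitute your own service name and namespace):
```
kubectl annotate service <service-name> -n <namespace> \
  'monitor.kubernetes-v2.stackstate.io/http-error-ratio-for-service={"criticalThreshold":0.06,"deviatingThreshold":null}' \
  --overwrite
```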
function: {{ get "urn:stackpack:prime-kubernetes:shared:monitor-function:http-error-ratio-for-service" }}
id: -8
identifier: urn:stackpack:custom:shared:monitor:http-error-ratio-for-service-v2
intervalSeconds: 10
name: HTTP - 5xx error ratio
remediationHint: |-
We have detected that more than 5% of the total responses from your Kubernetes service have a 5xx status code.
This signals that a significant number of users are experiencing downtime and service interruptions.
Take the following steps to diagnose the problem:

## Possible causes
- Slow dependency or dependency serving errors
- Recent update of the application
- Load on the application has increased
- Code has memory leaks
- Environment issues (e.g. certain nodes, database or services that the service depends on)

### Slow dependency or dependency serving errors
Check in the related health violations of this monitor (which can be found in the expanded version if you are reading this in the pinned, minimised version) whether there are any health violations on one of the services or pods that this service depends on (focus on the lowest dependency). If you find a violation (deviating or critical health), click on that component to see the related health violations in the table next to it. You can then click on those health violations and follow the instructions to resolve the issue.

### New behavior of the service
If there are no dependencies that have health violations, it could be that the pod backing this service is returning errors. If this behavior is new, it could be caused by a recent deployment.

This can be checked by looking at the Events shown on the [service highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) and checking whether a `Deployment` event happened recently after which the HTTP Error ratio behaviour changed.

To troubleshoot further, you can have a look at the pod(s) backing this service.
- Click on the "Pods of this service" in the "Related resource" section of the [service highlight page](/#/components/\{{ componentUrnForUrl \}})
- Click on the pod name(s) to go to their highlights pages
- Check the logs of the pod(s) to see if they're returning any errors.
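
For example, a quick way to spot failing responses outside the UI is to grep the recent pod logs for 5xx status codes (a sketch; assumes the application writes access-log style lines and that you substitute real names):

```
kubectl logs -n <namespace> <pod-name> --since=10m | grep -E ' 5[0-9]{2} '
```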

### Recent update of the service
Check if the service was recently updated:
- See the Age in the "About" section on the [service highlight page](/#/components/\{{ componentUrnForUrl \}}) to identify whether it was recently deployed
- Check if any of the pods were recently updated by clicking on "Pods of this service" in the "Related resource" section of
the [service highlight page](/#/components/\{{ componentUrnForUrl \}}) and checking whether their Age is recent.
- If the application has just started, it might be that the service has not warmed up yet. Compare the response time metrics
for the current deployment with the previous deployment by checking the response time metric chart with a time interval that includes both.
- Check if the application is using more resources than before; if so, consider scaling it up or giving it more resources.
- If the increased latency is critical, consider rolling back the service to the previous version (a rollback sketch follows this list):
- if that helps, then the issue is likely with the new deployment
- if that does not help, then the issue may be in the environment (e.g. network issues or issues with the underlying infrastructure, such as the database)
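
A rollback sketch, assuming the service is backed by a Deployment (substitute real names):

```
kubectl rollout history deployment/<deployment-name> -n <namespace>
kubectl rollout undo deployment/<deployment-name> -n <namespace>
```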
### Load on the service has increased
- Check if the amount of requests to the service has increased by looking at the "Throughput (HTTP responses/s)" chart for the "HTTP response metrics for all clients (incoming requests)" on the [service highlight page](/#/components/\{{ componentUrnForUrl \}}).
If so, consider scaling up the service or giving it more resources.
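For example, the Deployment behind the service can be scaled out manually while you investigate (a sketch; substitute real names and replica count):
```
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=3
```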
### Code has memory leaks
- Check if memory or CPU usage has been increasing over time. If so, there might be a memory leak.
You can find the pods supporting this service by clicking on "Pods of this service" in the "Related resource"
section of the [service highlight page](/#/components/\{{ componentUrnForUrl \}}).
Check which pods are using the most memory by clicking on "Pods of this service" on the left side of the [service highlight page](/#/components/\{{ componentUrnForUrl \}})
- Check all the pods supporting this service by clicking on the pod name
- Check the resource usage in the "Resource usage" section
- Restart the pod(s) of this service that are having the issue, or add more memory/CPU
### Environment issues
- Check the latency of particular pods of the service. If only certain pods are having issues, it might be an issue with the node the pod is running on:
- Try to move the pod to another node
- Check if pods of other services on that node also show increased latency. Drain the node if that is the case.
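
A sketch for cordoning and draining a suspect node (substitute the real node name; `--delete-emptydir-data` also evicts pods that use emptyDir volumes):

```
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```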
status: ENABLED
tags:
- services
timestamp: 2025-01-16T13:16:53.208687Z[Etc/UTC]
81 changes: 81 additions & 0 deletions assets/monitors/out-of-memory-containers.yaml
@@ -0,0 +1,81 @@
nodes:
- _type: Monitor
arguments:
comparator: GTE
failureState: DEVIATING
metric:
aliasTemplate: OOM Killed count
query: max(increase(kubernetes_containers_last_state_terminated{reason="OOMKilled"}[10m]))
by (cluster_name, namespace, pod_name, container)
unit: short
threshold: 1.0
urnTemplate: urn:kubernetes:/${cluster_name}:${namespace}:pod/${pod_name}
description: |-
It is important to ensure that the containers running in your Kubernetes cluster have enough memory to function properly. Out of memory (OOM) conditions can cause containers to crash or become unresponsive, leading to restarts and potential data loss.
To monitor for these conditions, we set up a check that detects and reports OOM events in the containers running in the cluster. This check will help you identify any containers that are running out of memory and allow you to take action to prevent issues before they occur.
To understand the full monitor definition check the details.
function: {{ get "urn:stackpack:common:monitor-function:threshold" }}
id: -13
identifier: urn:custom:monitor:out-of-memory-containers-v2
intervalSeconds: 30
name: Out of memory for containers V2
remediationHint: |-
An Out of Memory (OOM) event in Kubernetes occurs when a container's memory usage exceeds the limit set for it.
The Linux kernel's OOM killer process is triggered, which attempts to free up memory by killing one or more processes.
This can cause the container to terminate, leading to issues such as lost data, service interruption, and increased
resource usage.

Check the container [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) for any hints on how the application is behaving.

### Recognize a memory leak

A memory leak can be recognized by looking at the "Memory Usage" metric on the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics).

If the metric resembles a `saw-tooth` pattern, that is a clear indication of a slow memory leak in your application.
The memory usage increases over time, but the memory is not released until the container is restarted.

If the metric resembles a `dash` pattern, that is an indication of a memory leak via a spike.
The memory usage suddenly increases, which causes the limit to be violated and the container to be killed.

You will notice that the container continually restarts.

Common issues that can cause this problem include:
1. New deployments that introduce a memory leak.
2. Elevated traffic that causes a temporary increase of memory usage.
3. Incorrectly configured memory limits.

### 1. New deployments that introduce a memory leak

If the memory leak behaviour is new, it is likely that a recent deployment introduced it.

This can be checked by looking at the Events shown on the [pod highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) and checking whether a `Deployment` event happened recently after which the memory usage behaviour changed.

If the memory leak is caused by a deployment, you can investigate which change led to the memory leak by checking the [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange), which will highlight the latest changeset for the deployment. You can then revert the change or fix the memory leak.

### 2. Elevated traffic that causes a temporary increase of memory usage
This can be checked by looking at the "Network Throughput for pods (received)" metric on the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) and comparing the usage to the "Memory Usage" metric. If the memory usage increases at the same time as the network throughput, it is likely that the memory usage is caused by the increased traffic.

As a temporary fix you can elevate the memory limit for the container. However, this is not a long-term solution, as the memory usage will likely increase again in the future. You can also consider using the Kubernetes autoscaling feature to scale the number of replicas up and down based on resource usage.
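
As a sketch, the memory limit can be raised temporarily from the command line without editing the manifest (substitute real names and sizes):

```
kubectl set resources deployment/<deployment-name> -n <namespace> \
  --limits=memory=1Gi --requests=memory=512Mi
```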

### 3. Incorrectly configured memory limits
This can be checked by looking at the "Memory Usage" metric on the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) and comparing the usage to the requests and limits set for the pod. If the memory usage is higher than the limit set for the pod, the container will be terminated by the OOM killer.

To fix this issue, you can increase the memory limit for the pod by changing the Kubernetes resource YAML and increasing the memory limit values, e.g.
```
metadata:
spec:
  containers:
    - resources:
        limits:
          cpu: "2"
          memory: "3Gi"
        requests:
          cpu: "2"
          memory: "3Gi"
```
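
To compare actual consumption against these values, a sketch (requires the metrics-server add-on; substitute real names):

```
kubectl top pod <pod-name> -n <namespace> --containers
```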
status: ENABLED
tags:
- containers
- pods
85 changes: 85 additions & 0 deletions assets/monitors/pod-cpu-throttling.yaml
@@ -0,0 +1,85 @@
nodes:
- _type: Monitor
arguments:
comparator: GT
failureState: DEVIATING
metric:
aliasTemplate: CPU Throttling for ${container} of ${pod_name}
query: 100 * sum by (cluster_name, namespace, pod_name, container) (container_cpu_throttled_periods{})
/ sum by (cluster_name, namespace, pod_name, container) (container_cpu_elapsed_periods{})
unit: percent
threshold: 95.0
urnTemplate: urn:kubernetes:/${cluster_name}:${namespace}:pod/${pod_name}
description: |-
In Kubernetes, CPU throttling refers to the process where limits are applied to the amount of CPU resources a container can use.
This typically occurs when a container approaches the maximum CPU resources allocated to it, causing the system to throttle or restrict
its CPU usage to prevent a crash.

While CPU throttling can help maintain system stability by avoiding crashes due to CPU exhaustion, it can also significantly slow down workload
performance. Ideally, CPU throttling should be avoided by ensuring that containers have access to sufficient CPU resources.
This proactive approach helps maintain optimal performance and prevents the slowdown associated with throttling.
function: {{ get "urn:stackpack:common:monitor-function:threshold" }}
id: -13
identifier: urn:custom:monitor:pod-cpu-throttling-v2
intervalSeconds: 60
name: CPU Throttling V2
remediationHint: |-

### Application behaviour

Check the container [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) for any hints on how the application is behaving under CPU Throttling

### Understanding CPU Usage and CPU Throttling

On the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) you will find the CPU Usage and CPU Throttling charts.

#### CPU Throttling

The percentage of CPU throttling over time. CPU throttling occurs when a container reaches its CPU limit, restricting its CPU usage to
prevent it from exceeding the specified limit. The higher the percentage, the more throttling is occurring, which means the container's
performance is being constrained.

#### CPU Usage

This chart shows three key CPU metrics over time:

1. Request: The amount of CPU the container requests as its minimum requirement. This sets the baseline CPU resources the container is guaranteed to receive.
2. Limit: The maximum amount of CPU the container can use. If the container's usage reaches this limit, throttling will occur.
3. Current: The actual CPU usage of the container in real-time.

The `Request` and `Limit` settings for the container can be seen in the `Resource` section of the [configuration](/#/components/\{{ componentUrnForUrl\}}#configuration)

#### Correlation

The two charts are correlated in the following way:

- As the `Current` CPU usage approaches the CPU `Limit`, the CPU throttling percentage increases. This is because the container tries to use more CPU than it is allowed, and the system restricts it, causing throttling.
- The aim is to keep the `Current` usage below the `Limit` to minimize throttling. If you see frequent high percentages in the CPU throttling chart, it suggests that you may need to adjust the CPU limits or optimize the container's workload to reduce CPU demand.


### Adjust CPU Requests and Limits

Check the Events shown on the [pod highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) to see whether a `Deployment` event happened recently after which the CPU usage behaviour changed.

You can investigate which change led to the CPU throttling by checking the [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange),
which will highlight the latest changeset for the deployment. You can then revert the change or fix the CPU request and limit.


Review the pod's resource requests and limits to ensure they are set appropriately.
Show component [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration)

If the CPU usage consistently hits the limit, consider increasing the CPU limit of the pod. <br/>
Edit the pod or deployment configuration file to modify the `resources.limits.cpu` and `resources.requests.cpu` as needed.
```
resources:
  requests:
    cpu: "500m" # Adjust this value based on analysis
  limits:
    cpu: "1"    # Adjust this value based on analysis
```
If CPU throttling persists, consider horizontal pod autoscaling to distribute the workload across more pods, or adjust the cluster's node resources to meet the demands. Continuously monitor and fine-tune resource settings to optimize performance and prevent further throttling issues.
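A horizontal pod autoscaling sketch, assuming the workload is a Deployment with CPU requests set (substitute real names and bounds):
```
kubectl autoscale deployment/<deployment-name> -n <namespace> --cpu-percent=80 --min=2 --max=6
```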
status: ENABLED
tags:
- cpu
- performance
- pod