@BenTheElder BenTheElder commented Aug 15, 2025

Prevent release-blocking jobs from setting timeouts that obviously violate the release-blocking jobs policy, which says:

Blocking jobs must:
[...]

  • Have the average of 75% percentile duration of all runs for a week finishing in 120 minutes or less
  • Run at least every 3 hours

I.e., they should nominally take two hours to run, and their scheduling interval should not exceed 3h.

I'm not attempting to handle cron schedules yet.

For now, 2h30m seems like a reasonable middle-ground maximum timeout (jobs should usually take 2h and should run every 3h).

For max interval the policy is clear: <= 3h.

We have a lot of jobs failing these two checks.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/config Issues or PRs related to code in /config sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 15, 2025
@k8s-ci-robot k8s-ci-robot requested review from dims and upodroid August 15, 2025 20:21
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 15, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 15, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder


@BenTheElder

For max interval the policy is clear: <= 3h.

Hmm, I suppose we want to intentionally relax that one for the older release branches ... probably worth simply dropping that commit and focusing on the timeouts for now.

@BenTheElder

Actually, we can restrict the interval check to non-release branches pretty easily and start there.

@BenTheElder BenTheElder force-pushed the release-jobs-not-slow branch from 133ae50 to 73ffdad Compare August 15, 2025 21:15
@k8s-ci-robot k8s-ci-robot added area/release-eng Issues or PRs related to the Release Engineering subproject sig/network Categorizes an issue or PR as relevant to SIG Network. sig/release Categorizes an issue or PR as relevant to SIG Release. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/provider/gcp Issues or PRs related to gcp provider size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 15, 2025
@BenTheElder

There are a few more, but they're all in config/jobs/kubernetes/generated/generated.yaml, which can't be edited directly and isn't easy to work with.

@BenTheElder BenTheElder force-pushed the release-jobs-not-slow branch from 7a9447d to 7f7998a Compare August 15, 2025 21:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 15, 2025
@k8s-ci-robot

k8s-ci-robot commented Aug 15, 2025

@BenTheElder: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-test-infra-unit-test 7f7998a link true /test pull-test-infra-unit-test
pull-test-infra-unit-test-race-detector-nonblocking 7f7998a link false /test pull-test-infra-unit-test-race-detector-nonblocking


@BenTheElder

Remaining jobs:

{Failed  === RUN   TestKubernetesReleaseBlockingJobsCIPolicy
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sbeta-alphafeatures: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sbeta-ingress: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 2h50m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sbeta-reboot: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable1-alphafeatures: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable1-ingress: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 2h50m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable1-reboot: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable2-alphafeatures: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable2-ingress: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 2h50m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable2-reboot: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable3-alphafeatures: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable3-ingress: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 2h50m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gce-cos-k8sstable3-reboot: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1210: ci-kubernetes-e2e-gci-gce-alpha-features: release-blocking job must have timeout <= 2h30m and nominally run in <=2h, yet timeout is: 3h20m0s
    jobs_test.go:1250: summary:   13/3675 jobs fail to meet kubernetes/kubernetes release-blocking CI policy
--- FAIL: TestKubernetesReleaseBlockingJobsCIPolicy (0.04s)
}

These are all from the releng generated jobs #32340

decorate: true
decoration_config:
-  timeout: 240m
+  timeout: 1h

Curious why this one got 1h? The one above has 2h, and that's within the limit.


The commit messages mention this for each change: we don't want to set high timeouts when we don't need to, and these jobs are running well within 1h. If they start to take longer, that's a red flag.

@BenTheElder BenTheElder Aug 30, 2025


I only set them to the maximum to avoid regressing them. In general, these jobs tend to have excessively high timeouts, which mask failure modes. If a job suddenly goes from <1h to 2h, that's almost certainly either an excessive retry/timeout that will lead to failure or a massive regression.

@haircommander haircommander moved this from Triage to Archive-it in SIG Node CI/Test Board Aug 29, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
@k8s-ci-robot

PR needs rebase.

