update to latest base image with containerd 2.0.2 #3848

Open: BenTheElder wants to merge 1 commit into main

BenTheElder
Member

contains the base image built from #3828 in https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/post-kind-push-base-image/1880330906284068864

TODO: update the node image. Prow e2es will use this base image; GitHub Actions will use only the default node image, though I expect we are more likely to catch issues in the full Kubernetes e2e tests anyhow for this particular change.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

k8s-ci-robot requested a review from aojea (January 17, 2025, 19:26)
k8s-ci-robot added the cncf-cla: yes label (Jan 17, 2025)
k8s-ci-robot added the approved and size/XS labels (Jan 17, 2025)
@BenTheElder
Member Author

/hold

k8s-ci-robot added the do-not-merge/hold label (Jan 17, 2025)
@BenTheElder
Member Author

/retest
[pod scheduling timeout, CI never ran]

@BenTheElder
Member Author

BenTheElder commented Jan 17, 2025

All of CI passed on the first try (ignoring an unrelated failure to schedule the CI workload itself, caused by a timing mismatch between autoscaling and the job scheduler). However, building a node image locally fails:

Failed to pull docker.io/kindest/kindnetd:v20241212-9f82dd49 with error: command "docker exec --privileged kind-build-1737147227-770353019 ctr --namespace=k8s.io content fetch --platform=linux/arm64 docker.io/kindest/kindnetd:v20241212-9f82dd49" failed with error: exit status 1
time="2025-01-17T20:53:53Z" level=warning msg="Failed to check deprecations" error="connection error: desc = \"transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout\""
ctr: connection error: desc = "transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout"
Failed to pull docker.io/kindest/local-path-provisioner:v20241212-8ac705d0 with error: command "docker exec --privileged kind-build-1737147227-770353019 ctr --namespace=k8s.io content fetch --platform=linux/arm64 docker.io/kindest/local-path-provisioner:v20241212-8ac705d0" failed with error: exit status 1
time="2025-01-17T20:53:53Z" level=warning msg="Failed to check deprecations" error="connection error: desc = \"transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout\""
ctr: connection error: desc = "transport: Error while dialing: dial unix:///run/containerd/containerd.sock: timeout"
Failed to pull docker.io/kindest/local-path-helper:v20241212-8ac705d0 with error: command "docker exec --privileged kind-build-1737147227-770353019 ctr --namespace=k8s.io content fetch --platform=linux/arm64 docker.io/kindest/local-path-helper:v20241212-8ac705d0" failed with error: exit status 1
time="2025-01-17T20:53:53Z" level=warning msg="Failed to check deprecations" error="connection error: desc = \"error reading server preface: read unix @->/run/containerd/containerd.sock: use of closed network connection\""
ctr: connection error: desc = "error reading server preface: read unix @->/run/containerd/containerd.sock: use of closed network connection"

Debugging: my current suspicion is that containerd now takes longer to start, and we need to wait for it to be ready.

With the new base image (containerd 2.0.2), startup takes around 1s on a fairly large cloud VM:

INFO[2025-01-17T20:58:38.686355016Z] containerd successfully booted in 0.956085s

For comparison, I started containerd (v1.7.24) in the previous base image like this to simulate the build process:

docker run -d --entrypoint=sleep --name="test-old-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20241212-9f82dd49 infinity
docker exec -it test-old-base containerd

INFO[2025-01-17T21:03:28.470803043Z] containerd successfully booted in 0.357321s

These times are representative of repeated attempts: containerd 2.0.2 takes roughly 3x as long to start as v1.7.24.

(NOTE: these timings are from the arm64 cross-build on an amd64 host.)
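
To make the comparison concrete, here is a rough sketch that pulls the boot time out of the two log lines quoted above and computes the ratio. The helper names `boot_seconds` and `slowdown` are made up for illustration; the log lines are copied from this comment.

```shell
#!/bin/sh
# Extract the boot time in seconds from containerd log output read on stdin.
# Only the first matching line is reported.
boot_seconds() {
  sed -n 's/.*containerd successfully booted in \([0-9][0-9.]*\)s.*/\1/p' | head -n 1
}

# Print NEW / OLD to two decimal places.
# Arguments: $1 = old boot time in seconds, $2 = new boot time in seconds.
slowdown() {
  awk -v old_s="$1" -v new_s="$2" 'BEGIN { printf "%.2f\n", new_s / old_s }'
}

# Log lines as quoted in this comment (v1.7.24 first, then 2.0.2).
old=$(printf '%s\n' 'INFO[2025-01-17T21:03:28.470803043Z] containerd successfully booted in 0.357321s' | boot_seconds)
new=$(printf '%s\n' 'INFO[2025-01-17T20:58:38.686355016Z] containerd successfully booted in 0.956085s' | boot_seconds)

echo "old=${old}s new=${new}s slowdown=$(slowdown "$old" "$new")x"
# prints: old=0.357321s new=0.956085s slowdown=2.68x
```

For this pair of samples the ratio comes out a little under 3x, which is consistent with the rough figure above.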

circled back in #3828 (comment)

@BenTheElder
Member Author

BenTheElder commented Jan 17, 2025

To replicate:

  1. start both versions
docker run -d --entrypoint=sleep --name="test-old-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20241212-9f82dd49 infinity
docker run -d --entrypoint=sleep --name="test-new-base" --platform=linux/arm64 --security-opt=seccomp=unconfined docker.io/kindest/base:v20250117-f528b021 infinity
  2. try starting containerd in each of these:
docker exec -it test-old-base containerd
docker exec -it test-new-base containerd

(Then watch for the boot log line in each: something like "containerd successfully booted in 0.325773s" for the old base and "containerd successfully booted in 0.972910s" for the new base.)

@BenTheElder
Member Author

BenTheElder commented Jan 17, 2025

Update: this is probably not worth discussing upstream, because 2.0.2 still boots in < 0.07s with an amd64 image on an amd64 host. It is, however, consistently slower than 1.7.24. Something about arm64 under qemu must be even more pathological.

I think we should add image pull retries plus waiting for containerd to start. We only need to do this for pulling, not imports (we do all pulling first), and retrying pulls is a nice idea anyhow to handle transient network issues.
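
A minimal shell sketch of the "wait, then pull with retries" idea, only to illustrate the shape of it. kind is written in Go, so the real change would live there; the `retry` helper, the attempt counts, and the delays are all assumptions. Using `ctr version` as the readiness probe is also an assumption (it has to reach the containerd socket to report the server version). The commented usage reuses the container name and image from the failure log above.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 once all attempts are used up.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$retry_n" -ge "$retry_attempts" ]; then
      echo "giving up after $retry_n attempts: $*" >&2
      return 1
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}

# Hypothetical usage against a build container (not executed here):
#   node=kind-build-1737147227-770353019
#   image=docker.io/kindest/kindnetd:v20241212-9f82dd49
#   # 1. wait until containerd answers on its socket
#   retry 30 1 docker exec --privileged "$node" ctr --namespace=k8s.io version
#   # 2. pull with retries, which also covers transient network failures
#   retry 5 2 docker exec --privileged "$node" ctr --namespace=k8s.io \
#     content fetch --platform=linux/arm64 "$image"
```

A fixed delay keeps the sketch short; a real implementation might prefer backoff, and would only wrap the pull step since imports happen after all pulls.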

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
size/XS - Denotes a PR that changes 0-9 lines, ignoring generated files.