Intermittently getting "Cannot connect to the Docker daemon at unix:///var/run/docker.sock" #3794
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
@AurimasNav When you tried this without I ran into a similar issue with One workflow would fail with
and the
I think this is because both runners were trying to use Anyway, I was also using
There is no service mesh or any kind of sidecar injection; it is a k3s install on a single-node server. I guess it could potentially be a problem with two runners, though: even though I reduced the max runners to 1 instance, I have another Actions Runner Controller runner set instance for a different GitHub org running on the same k3s.
Whenever a pipeline runs, two pods are being created on our arc-runner-set (even though
NAME                      READY   STATUS    RESTARTS   AGE
comp-9wssh-runner-kftrj   2/2     Running   0          12s
comp-9wssh-runner-k8h45   1/2     Error     0          12s
with the error:
If we are lucky, the job runs on the "healthy" runner and everything is fine, but it seems to be 50/50 which one is selected. If we end up on the failed one, the job fails with:
From searching the internet, it seems to be a concurrency problem with the iptables configuration.
I added a restartPolicy setting to the template:
spec:
  hostNetwork: true
  restartPolicy: OnFailure
So far it seems to restart the failed pod and the jobs are no longer failing, but I wonder whether there is some kind of downside, given that by default it was set to never restart.
We see this too with 0.9.0 and 0.10.1, on self-hosted runners. It comes in waves where most jobs fail, then all is well for a day or so. We are not using
For what it's worth, this completely solved the problem for me. If you are already customizing the template of your docker-in-docker runner, you can move the dind container from a standard container to an init container and set a restart policy on it so that it behaves as a "sidecar container".
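A minimal sketch of what that change could look like, assuming a cluster version that supports native sidecar containers (init containers with `restartPolicy: Always`) and a runner scale set whose pod template is already customized in values.yaml. The container names, images, arguments, and volume names below are illustrative assumptions and are not taken from this thread; adapt them to your existing template.

```yaml
# Hypothetical values.yaml fragment for a customized dind runner template.
# All names, images, and mounts are assumptions for illustration only.
template:
  spec:
    initContainers:
      # dind moved from `containers` to `initContainers`.
      # restartPolicy: Always marks it as a sidecar: it starts before the
      # runner container, keeps running alongside it, and is restarted by
      # the kubelet if dockerd exits (for example on a startup failure).
      - name: dind
        image: docker:dind
        restartPolicy: Always
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
        securityContext:
          privileged: true
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          # Point the docker CLI in the runner at the shared socket.
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
```

Compared with setting a pod-level `restartPolicy: OnFailure`, this approach only restarts the dind container, so the runner container's own never-restart behavior is left unchanged.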
Checks
Controller Version
0.9.3
Helm Chart Version
0.9.3
CertManager Version
1.16.1
Deployment Method
ArgoCD
cert-manager installation
cert-manager is working
Checks
Resource Definitions
To Reproduce
Describe the bug
Running an action that includes a docker command like:
docker build . --file Dockerfile --tag $env:FullImageName --secret id=npm_token,env=NPM_TOKEN --build-arg NODE_ENV=production
intermittently results in an error:
Describe the expected behavior
Being able to connect to unix:///var/run/docker.sock in 100% of the runs.
Whole Controller Logs
Whole Runner Pod Logs
Additional Context
In the dind container log I can see:
I'm not sure why that happens or how it can be solved. Might this have something to do with my config in values.yaml
(if I don't specify this, my containers in actions have no internet access).