Dagster Not Tracking Completion of Long Running Jobs Launched via execute_k8s_job
#27195
Labels
type: bug
What's the issue?
When we trigger jobs using execute_k8s_job, the job is launched on Kubernetes and completes, but Dagster does not update the step or mark it as completed, nor does it return any error. The step just runs indefinitely.
After launching a long-running graph, I went back to check on the status and saw that my steps were still running in Dagster:
When reviewing the logs for one of these steps, I can see that Dagster is still waiting for the job to complete:
However, after reviewing the status of the jobs and pods on GKE, I can see that the job Dagster was waiting for has completed. Here is a sample of some of the logs from GKE:
What did you expect to happen?
I expected the steps that completed on K8s to also be marked as complete in Dagster. This is the case for smaller projects that use this graph, which do not take as long to complete.
How to reproduce?
I'm not sure you can easily replicate this without running on our cluster. But here is the tail end of our op so that you can see where things are getting hung up:
Dagster version
dagster, 1.9.8
Deployment type
Dagster Helm chart
Deployment details
We are running Dagster open source on GKE with Autopilot enabled.
Additional information
Is there any further troubleshooting I can do in GKE to get to the bottom of why jobs that completed on GKE are not being marked as completed in Dagster? If so, let me know and I'll dig more and post back to this issue.
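One check that might help narrow this down is to confirm what the Kubernetes API itself reports for the Job, independently of Dagster. The following is a minimal sketch, assuming the Job object has been exported as JSON (for example with `kubectl get job <job-name> -n <namespace> -o json`). The helper name `classify_job` and the sample data are hypothetical and are not part of Dagster or of our op; the sketch only reads the standard `status.conditions` field of a Job.

```python
import json


def classify_job(job: dict) -> str:
    """Classify a Kubernetes Job object as 'complete', 'failed', or 'running'.

    Relies only on the Job's status.conditions list, where a terminal Job
    carries a condition of type 'Complete' or 'Failed' with status 'True'.
    """
    status = job.get("status") or {}
    for cond in status.get("conditions") or []:
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "complete"
        if cond.get("type") == "Failed":
            return "failed"
    # No terminal condition present: treat the Job as still in progress.
    return "running"


# Hypothetical, trimmed-down Job JSON for illustration only (not real cluster output).
sample_job = json.loads("""
{
  "metadata": {"name": "example-job"},
  "status": {
    "succeeded": 1,
    "conditions": [
      {"type": "Complete", "status": "True"}
    ]
  }
}
""")

print(classify_job(sample_job))
```

If this reports the Job as complete while the Dagster step is still shown as running, that would suggest the problem lies in how completion is being observed rather than in the Job itself.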
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.