Dagster Not Tracking Completion of Long Running Jobs Launched via execute_k8s_job. #27195

Open
jimjeffers opened this issue Jan 17, 2025 · 0 comments
Labels
type: bug Something isn't working

jimjeffers commented Jan 17, 2025

What's the issue?

When we trigger jobs using execute_k8s_job, the job launches on Kubernetes and runs to completion, but Dagster never marks the step as completed, nor does it report any error. The step just runs indefinitely.

After launching a long-running graph, I went back to check on its status and saw that my steps were still running in Dagster:
[Image: screenshot of the Dagster UI showing the steps still running]

When reviewing the logs for one of these steps, I can see that Dagster is still waiting for the job to complete:
[Image: screenshot of the Dagster step logs]

However, after reviewing the status of the jobs and pods on GKE, I can see that the job Dagster was waiting for has completed. Here is a sample of the logs from GKE:
[Image: screenshot of the GKE logs]

What did you expect to happen?

I expected steps that completed on Kubernetes to also be marked complete in Dagster. That is what happens for smaller projects run through this same graph, which do not take as long to complete.

How to reproduce?

I'm not sure you can easily replicate this without running on our cluster. But here is the tail end of our op so that you can see where things are getting hung up:

    try:
        execute_k8s_job(
            context=context,
            image="us-west1-docker.pkg.dev/edna-explorer-web-services/edna-explorer-data-pipelines/t-rex:latest",
            command=command,
            namespace=namespace,
            image_pull_policy="Always",
            resources={
                "requests": {
                    "cpu": "21",
                    "memory": "168Gi",
                    "ephemeral-storage": "500Gi", 
                },
                "limits": {
                    "cpu": "21",
                    "memory": "168Gi",
                    "ephemeral-storage": "500Gi", 
                },
            },
            pod_template_spec_metadata={
                "labels": {"duration": "extended"},
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                },
            },
            pod_spec_config={
                "node_selector": {
                    "cloud.google.com/compute-class": "Performance",
                    "cloud.google.com/machine-family": "c3",
                },
            },
            env_vars=["PROJECT_BUCKET", "STAGE", "GOOGLE_APPLICATION_CREDENTIALS"],
        )
        # STEP NEVER CONTINUES PAST THE ABOVE JOB EXECUTION
        context.log.info(f"Successfully ran assign for primer {primer}")

        # Create new file in queue to trigger output-for-primer job
        bucket.blob(
            f"queue/{project_id}/{primer}.run"
        ).upload_from_string("success")
        context.log.info(f"Triggering output job for primer {primer}")
    except Exception as e:
        context.log.error(f"Error running assign: {e}")
        raise
    finally:
        print("K8S job finished")
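
As a stopgap while the root cause is unknown, a wait can be bounded with a deadline so that a step fails loudly instead of running indefinitely. The sketch below is illustrative only: `wait_until` is a hypothetical helper, not part of Dagster or dagster-k8s, and the predicate it polls is left to the caller (for example, a check of the Kubernetes Job status).

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns the elapsed seconds on success and raises TimeoutError otherwise.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_s:
            raise TimeoutError(f"condition not met after {elapsed:.0f}s")
        # Never sleep past the deadline.
        sleep(min(poll_s, timeout_s - elapsed))
```

Raising on timeout means the existing `except` branch would log the failure and the step would end, rather than sitting in a running state with no error.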

Dagster version

dagster, 1.9.8

Deployment type

Dagster Helm chart

Deployment details

We are running open-source Dagster on GKE with Autopilot enabled.

Additional information

Is there any further troubleshooting I can do in GKE to get to the bottom of why the jobs aren't being marked as completed in Dagster? If so, let me know and I'll dig further and post back to this issue.
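
One check that can be run independently of Dagster is to confirm what the Kubernetes API itself reports for the Job. The following is a minimal sketch, assuming the Job manifest has been saved with `kubectl get job <name> -n <namespace> -o json`. `summarize_job_status` is a hypothetical helper name, and it only looks at the standard `Complete` and `Failed` Job conditions and the `active` pod count.

```python
import json


def summarize_job_status(job: dict) -> str:
    """Classify a Kubernetes Job manifest (parsed JSON) by its status block.

    Returns "complete", "failed", "active", or "unknown".
    """
    status = job.get("status") or {}
    for cond in status.get("conditions") or []:
        # A condition only counts when its status is the string "True".
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "complete"
        if cond.get("type") == "Failed":
            return "failed"
    if (status.get("active") or 0) > 0:
        return "active"
    return "unknown"


def summarize_job_file(path: str) -> str:
    """Load a manifest saved by `kubectl get job ... -o json` and classify it."""
    with open(path) as f:
        return summarize_job_status(json.load(f))
```

If this reports `complete` for a Job whose Dagster step is still shown as running, that would suggest the Job itself finished normally and the problem lies on the waiting side.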

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.
