
Targets randomly go unhealthy and time out when there is more than 1 pod #3979

Open
matisiekpl opened this issue Dec 11, 2024 · 3 comments

Comments


matisiekpl commented Dec 11, 2024

Describe the bug
I'm trying to deploy a web application with a pretty standard stack: a Deployment, a Service, and an Ingress. We are using 3 worker nodes on EKS, and when I scale the replicas to at least 2, the Load Balancer UI shows that targets randomly go unhealthy. The app also responds with 504 Gateway Timeout every few requests. It seems that traffic goes only to one pod/worker node, completely ignoring the other pods on other nodes.

Worth noting: the ALB UI shows the target statuses as somewhat random. At one moment all targets are healthy; after a refresh only one node is healthy; after a second refresh two of them are healthy, with one unhealthy. At least one node is always healthy, as that is the worker that serves the successful requests.

Steps to reproduce

  • set up a v1.30 EKS cluster with 2 or 3 worker nodes
  • install the AWS Load Balancer Controller via the Helm chart:
resource "helm_release" "alb_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  depends_on = [
    kubernetes_service_account.load_balancer_service_account
  ]

  set {
    name  = "serviceAccount.create"
    value = "false"
  }

  set {
    name  = "serviceAccount.name"
    value = "aws-load-balancer-controller"
  }

  set {
    name  = "clusterName"
    value = module.eks.cluster_name
  }
}
  • create a Deployment of the app with at least 2 replicas on different nodes
  • create the simplest NodePort Service for the app
  • create the simplest Ingress for that Service with spec.rules.0.host declared (minimal sketches of these manifests are shown below)
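
A minimal sketch of the Deployment and NodePort Service from these steps; all names, the image, and the port numbers are illustrative assumptions, not taken from the report (the Ingress is sketched further down under Additional Context):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app                      # hypothetical name
spec:
  replicas: 2                        # at least 2 replicas, per the repro steps
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      # spread the pods across worker nodes, as the repro steps describe
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-app
      containers:
        - name: web-app
          image: nginx:1.27          # placeholder image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  type: NodePort                     # simplest NodePort Service, per the repro steps
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 80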

Expected outcome

  • traffic goes to all healthy pods

Environment

AWS Load Balancer controller version: v2.8.2
Kubernetes version: v1.30
Using EKS: eks.20

Additional Context:
Ingress annotations:

kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/group.name: main
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
alb.ingress.kubernetes.io/load-balancer-name: lb-eks-{{ .Values.environment }}
alb.ingress.kubernetes.io/healthcheck-path: /healthz
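
For context, this is roughly how those annotations would sit on the Ingress described in the repro steps; the host, backend Service name, and port are illustrative assumptions, and the load-balancer-name keeps the Helm templating from the report:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app                                  # hypothetical name
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: main
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/load-balancer-name: lb-eks-{{ .Values.environment }}
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
spec:
  rules:
    - host: app.example.com                      # spec.rules.0.host, placeholder value
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app                    # hypothetical Service name
                port:
                  number: 80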

Potentially related issue: kubernetes/ingress-nginx#9990

zac-nixon (Collaborator) commented

Is the health check failing during deployments? I suspect it's because the application is not deployed to each node.

matisiekpl (Author) commented

@zac-nixon
Kubernetes itself reports no issues; the readiness healthchecks on each node report success. The problem is in the Kubernetes-AWS LB layer.
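
For reference, a readiness probe of the kind described here would look roughly like this under spec.containers[] in the Deployment; the path follows the healthcheck-path annotation above, while the port and timings are assumptions, not from the report:

readinessProbe:
  httpGet:
    path: /healthz        # same path the ALB healthcheck-path annotation uses
    port: 80              # assumed container port
  initialDelaySeconds: 5
  periodSeconds: 10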

zac-nixon (Collaborator) commented

I see. I will probably need more info. I did a quick repro with a simple ingress (https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.8/examples/echo_server/), deploying 10 replicas on 5 EKS nodes, and can't reproduce this issue.

Is it possible that your application is reaching out to an external source that can't handle the additional load of more than one replica?
