
Targets randomly go unhealthy and time out when there is more than 1 pod #3979

Open
matisiekpl opened this issue Dec 11, 2024 · 3 comments

Comments


matisiekpl commented Dec 11, 2024

Describe the bug
I'm trying to deploy a web application with a pretty standard stack: a Deployment, a Service, and an Ingress. We are using 3 worker nodes on EKS, and when I scale the replicas to at least 2, the Load Balancer UI shows that targets randomly go unhealthy. The app also responds with 504 Gateway Timeout every few requests. It seems that traffic goes only to one pod/worker node, completely ignoring the other pods on other nodes.

Worth noting: the ALB UI shows the target statuses as somewhat random. At one moment all targets are healthy; after a refresh only one node is healthy; after a second refresh two of them are healthy, with one unhealthy. At least one node is always healthy, as that is the worker that serves the successful requests.

Steps to reproduce

  • set up a v1.30 EKS cluster with 2 or 3 worker nodes
  • install the AWS Load Balancer Controller via the Helm chart:
resource "helm_release" "alb_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"
  depends_on = [
    kubernetes_service_account.load_balancer_service_account
  ]

  set {
    name  = "serviceAccount.create"
    value = "false"
  }

  set {
    name  = "serviceAccount.name"
    value = "aws-load-balancer-controller"
  }

  set {
    name  = "clusterName"
    value = module.eks.cluster_name
  }
}
  • create a Deployment of the app with at least 2 replicas on different nodes
  • create the simplest NodePort Service for the app
  • create the simplest Ingress for that Service with spec.rules.0.host declared (minimal sketches of these manifests are shown below)
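
A minimal sketch of the Deployment and NodePort Service from these steps; all names, the image, and the port numbers are illustrative assumptions, not taken from the report (the Ingress is sketched further down under Additional Context):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app                      # hypothetical name
spec:
  replicas: 2                        # at least 2 replicas, per the repro steps
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      # spread the pods across worker nodes, as the repro steps describe
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-app
      containers:
        - name: web-app
          image: nginx:1.27          # placeholder image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  type: NodePort                     # simplest NodePort Service, per the repro steps
  selector:
    app: web-app
  ports:
    - port: 80
      targetPort: 80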

Expected outcome

  • traffic goes to all healthy pods

Environment

AWS Load Balancer controller version: v2.8.2
Kubernetes version: v1.30
Using EKS: eks.20

Additional Context:
Ingress annotations:

kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/group.name: main
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
alb.ingress.kubernetes.io/load-balancer-name: lb-eks-{{ .Values.environment }}
alb.ingress.kubernetes.io/healthcheck-path: /healthz
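
For context, this is roughly how those annotations would sit on the Ingress described in the repro steps; the host, backend Service name, and port are illustrative assumptions, and the load-balancer-name keeps the Helm templating from the report:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app                                  # hypothetical name
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/group.name: main
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/load-balancer-name: lb-eks-{{ .Values.environment }}
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
spec:
  rules:
    - host: app.example.com                      # spec.rules.0.host, placeholder value
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app                    # hypothetical Service name
                port:
                  number: 80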

Potentially related issue: kubernetes/ingress-nginx#9990

zac-nixon (Collaborator) commented

Is the health check failing during deployments? I suspect it's because the application is not deployed to each node.

matisiekpl (Author) commented

@zac-nixon
Kubernetes itself reports no issues; the readiness healthchecks on each node report success. The problem is in the Kubernetes-AWS LB layer.
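
For reference, a readiness probe of the kind described here would look roughly like this under spec.containers[] in the Deployment; the path follows the healthcheck-path annotation above, while the port and timings are assumptions, not from the report:

readinessProbe:
  httpGet:
    path: /healthz        # same path the ALB healthcheck-path annotation uses
    port: 80              # assumed container port
  initialDelaySeconds: 5
  periodSeconds: 10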

zac-nixon (Collaborator) commented

I see. I will probably need more info. I did a quick repro with a simple ingress (https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.8/examples/echo_server/), deploying 10 replicas on 5 EKS nodes, and can't reproduce this issue.

Is it possible that your application is reaching out to an external source that can't handle the additional load of more than one replica?
