Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a timeout in when the proxy agent dials to the remote #179

Merged
merged 1 commit into from
Jul 23, 2021

Conversation

ScheererJ
Copy link
Contributor

This pull request fixes the issue if the net.Dial call based on a dial request blocks the receive loop and with that all communication from the proxy server. We have seen instances where all traffic was blocked for roughly 10 minutes.
The problem is solved by first making the dial request processing asynchronous. Secondly, we added a timeout to the net.Dial call to the target.

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Mar 3, 2021
@k8s-ci-robot k8s-ci-robot requested review from dberkov and Sh4d1 March 3, 2021 08:41
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 3, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @ScheererJ!

It looks like this is your first PR to kubernetes-sigs/apiserver-network-proxy 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/apiserver-network-proxy has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @ScheererJ. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 3, 2021
@ScheererJ
Copy link
Contributor Author

Signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 3, 2021
@Jefftree
Copy link
Member

Jefftree commented Apr 5, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 5, 2021
@ydp
Copy link

ydp commented Apr 29, 2021

Hi @ScheererJ I am learning the code base, does your change just move all the action in PacketType_DIAL_REQ into a go routine, which solved your problem?

@ScheererJ
Copy link
Contributor Author

Hi @ScheererJ I am learning the code base, does your change just move all the action in PacketType_DIAL_REQ into a go routine, which solved your problem?

As mentioned in the pull request description above, the change essentially consists of only two changes, but the resulting indentation make the change look bigger than it is. We make the processing of PacketType_DIAL_REQ asynchronous by using a go routine. In addition to that, we added a timeout to the net.Dial call.

@caesarxuchao
Copy link
Member

@ydp if you add "?w=1" in to the url, you can see the diff with all the whitespace differences ignored, that would make this change easier to review.

@ScheererJ the dial timeout looks good. Moving the dial to a separate goroutine is reasonable as well. One concern I have is that if a goroutine is spawned to dial, and before the dial completes, the proxy server forwards a PACKET_CLOSE_REQ, then because the connManager doesn't have the connection yet, then the code will go to this else block without calling ctx.cleanup(). That would lead to leaking goroutines of removeToProxy and proxyToRemote, and leak the network connection if there is no timeout mechanism built there.

Perhaps we should bookkeep the pending dials, and cleanup pending dials upon receiving PACKET_CLOSE_REQ.

@ScheererJ
Copy link
Contributor Author

@caesarxuchao do you suggest to have separate book keeping of the pending dials or simply move the a.connManager.Add(connID, ctx) call before the dial call and ensure that the cleanFunc handles the situation properly by synchronizing with the dial (ensuring it finished)?

@caesarxuchao
Copy link
Member

I think we will have to add a separate book keeping, because before the dial returns, ctx.conn is not available.

@caesarxuchao
Copy link
Member

caesarxuchao commented May 14, 2021

something like this:

type connContext struct {
  // add dialDone
  dialDone chan struct{}
  conn      net.Conn
  cleanFunc func()
  ...
  
}

// in client.Serve()
...
case PACKET_DIAL_REQUEST:
  dialDone := new(chan struct{})
  ctx := connContext {
    dialDone: dialDone,
  }
  connManager.Add(ctx, connID)
  go func {
    conn, err := net.DialTimeout(...)
    ctx.conn = conn
    close(dialDone)
  }
  ctx.cleanFunc := func() {
    <- dialDone
    if ctx.conn != nil {
       ctx.conn.Close()
    }
    ...
  }

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 17, 2021
@ScheererJ
Copy link
Contributor Author

I added the proposed change in a new commit, but I did not add the if ctx.conn != nil check as it cannot be nil after a successful dial.
Please advise in case I missed something.

@ScheererJ ScheererJ force-pushed the master branch 2 times, most recently from 29a0c1b to a13be7a Compare May 17, 2021 12:59
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 17, 2021
@ScheererJ
Copy link
Contributor Author

and now it should work again...

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 18, 2021
@ScheererJ
Copy link
Contributor Author

Thanks @ScheererJ.

We will need unit tests to make sure we don't introduce goroutine leaks. Let me know if you have time for adding tests. If not, I can add tests in followups.

@caesarxuchao I would prefer if you could add the tests as a follow-up.

@rtheis
Copy link

rtheis commented Jun 1, 2021

Posted test results with this fix to the associated issue: #180 (comment)

@cheftako
Copy link
Contributor

cheftako commented Jun 1, 2021

/assign @caesarxuchao

@cheftako
Copy link
Contributor

cheftako commented Jun 7, 2021

/test pull-apiserver-network-proxy-make-lint

@relyt0925
Copy link
Contributor

cc @cheftako is this still possible to get in as well since it's an important fix?

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for dropping the ball. I took another pass and left another comment about the race.

Given that @jkh52 and his team are working hard stabilizing ANP right now, we need to be extra cautious merging this PR.

How about we split this PR into two parts? In the first part we only add a 10s timeout on the net.Dial, so that we can avoid a 10min block. And in the second part we make this more risky goroutine change.

Hopefully in the future the ANP repo can have a multi-branch structure like k/k so that we can stabilize the code in one branch and merge the more risky PRs in another branch.

Type: client.PacketType_DIAL_RSP,
Payload: &client.Packet_DialResponse{DialResponse: &client.DialResponse{}},
}
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move the entire handling of DIAL_REQ into a goroutine like this, we can still enter a race with the handling of CLOSE_REQ that I mentioned in #179 (comment).

How about we only move the net.Dial into a goroutine? That way a.connManager.Add is called sequentially and that would prevent the race. This is also what I suggested in #179 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@caesarxuchao So what you are suggesting is that the go routine should start after the a.connManager.Add call. Right?
In the beginning the go routine simply started shortly before the net.Dial call, but after your suggested refactoring we can indeed move the start of the go routine to a later point in time, i.e. directly before the net.Dial call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the goroutine should start after the a.connManager.Add call. Or in other words, "move the start of the go routine to a later point in time, i.e. directly before the net.Dial call".

@ScheererJ
Copy link
Contributor Author

Sorry for dropping the ball. I took another pass and left another comment about the race.

Given that @jkh52 and his team are working hard stabilizing ANP right now, we need to be extra cautious merging this PR.

How about we split this PR into two parts? In the first part we only add a 10s timeout on the net.Dial, so that we can avoid a 10min block. And in the second part we make this more risky goroutine change.

Hopefully in the future the ANP repo can have a multi-branch structure like k/k so that we can stabilize the code in one branch and merge the more risky PRs in another branch.

@caesarxuchao Should I only include the 10s timeout, i.e. the minimal change, into the first pull request or also include the reordering and change with regards to a.connManager.Add?

@relyt0925
Copy link
Contributor

This is just a note that without this it appears the Kubernetes e2e tests will not pass when Konnectivity is in use. cc @caesarxuchao

I think this is critical toward stabilization of the system. See @rtheis's comment #179 (comment) for more details.

I know we (IBM + RedHat) have had to carry this patch in order to utilize Konnectivity in our systems.

Copy link
Member

@caesarxuchao caesarxuchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not being responsive. I wan't monitoring this PR closely. Feel free to ping me on Slack.

Type: client.PacketType_DIAL_RSP,
Payload: &client.Packet_DialResponse{DialResponse: &client.DialResponse{}},
}
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the goroutine should start after the a.connManager.Add call. Or in other words, "move the start of the go routine to a later point in time, i.e. directly before the net.Dial call".

@caesarxuchao
Copy link
Member

Should I only include the 10s timeout, i.e. the minimal change, into the first pull request or also include the reordering and change with regards to a.connManager.Add?

In the first PR, let's only include the 10s timeout.

In the second PR, let's include the goroutine change, including the reordering and change with regards to a.connManager.Add.

@relyt0925
Copy link
Contributor

@ScheererJ do you still have time to make the changes? If not I can see if we can find someone to continue your great work here.

@ScheererJ
Copy link
Contributor Author

Added #252 as the second part of this improvement with the go routine change.

@relyt0925 I will adapt this pull request now to only include the timeout change.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 23, 2021
@ScheererJ
Copy link
Contributor Author

Now, this pull request only includes the timeout change and the extended asynchronous processing via go routine is done in #252.

@caesarxuchao
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 23, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, ScheererJ

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2021
@k8s-ci-robot k8s-ci-robot merged commit d964a21 into kubernetes-sigs:master Jul 23, 2021
@caesarxuchao caesarxuchao changed the title Make dial request processing asynchonous. Add a timeout in when the proxy agent dials to the remote Jul 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants