Add a timeout when the proxy agent dials to the remote #179
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @ScheererJ!
Hi @ScheererJ. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed CLA.
/ok-to-test
Hi @ScheererJ, I am learning the code base; does your change just move all the action in the dial request handling into a goroutine?
As mentioned in the pull request description above, the change essentially consists of only two changes, but the resulting indentation makes the change look bigger than it is. We make the processing of dial requests asynchronous by moving it into a goroutine, and we add a timeout to the dial call.
@ydp if you add "?w=1" to the URL, you can see the diff with all the whitespace differences ignored; that would make this change easier to review. @ScheererJ the dial timeout looks good. Moving the dial to a separate goroutine is reasonable as well. One concern I have is that if a goroutine is spawned to dial, and before the dial completes, the proxy server forwards a CLOSE_REQ, the close handling will not find the connection yet. Perhaps we should bookkeep the pending dials, and clean up pending dials upon receiving CLOSE_REQ.
@caesarxuchao do you suggest having separate bookkeeping of the pending dials or simply reusing the existing connection bookkeeping?
I think we will have to add a separate bookkeeping, because before the dial returns, ctx.conn is not available.
something like this:

type connContext struct {
	// add dialDone so that cleanup can wait for a pending dial
	dialDone  chan struct{}
	conn      net.Conn
	cleanFunc func()
	...
}

// in client.Serve()
...
case PACKET_DIAL_REQUEST:
	dialDone := make(chan struct{})
	ctx := &connContext{
		dialDone: dialDone,
	}
	ctx.cleanFunc = func() {
		// block until the dial has finished or timed out
		<-dialDone
		if ctx.conn != nil {
			ctx.conn.Close()
		}
		...
	}
	connManager.Add(ctx, connID)
	go func() {
		conn, err := net.DialTimeout(...)
		if err == nil {
			ctx.conn = conn
		}
		close(dialDone)
	}()
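The key property of this sketch is that cleanFunc blocks on dialDone, so a cleanup triggered by a CLOSE_REQ waits for the in-flight dial to either succeed or time out, and then closes whatever connection resulted instead of racing with it.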
I added the proposed change in a new commit, but I did not add the tests yet.
Force-pushed from 29a0c1b to a13be7a
and now it should work again... |
@caesarxuchao I would prefer if you could add the tests as a follow-up. |
Posted test results with this fix to the associated issue: #180 (comment)
/assign @caesarxuchao |
/test pull-apiserver-network-proxy-make-lint |
cc @cheftako is it still possible to get this in as well, since it's an important fix?
Sorry for dropping the ball. I took another pass and left another comment about the race.
Given that @jkh52 and his team are working hard to stabilize ANP right now, we need to be extra cautious merging this PR.
How about we split this PR into two parts? In the first part we only add a 10s timeout on the net.Dial, so that we can avoid a 10min block. And in the second part we make the riskier goroutine change.
Hopefully in the future the ANP repo can have a multi-branch structure like k/k, so that we can stabilize the code in one branch and merge the riskier PRs in another branch.
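For the first part, here is a minimal runnable sketch of the bounded dial, assuming the agent currently uses a plain net.Dial; the helper name and target address are illustrative, not the actual patch:

package main

import (
	"log"
	"net"
	"time"
)

// dialWithTimeout bounds how long a single dial request can block.
// The 10s value mirrors the timeout proposed in this thread.
func dialWithTimeout(protocol, address string) (net.Conn, error) {
	return net.DialTimeout(protocol, address, 10*time.Second)
}

func main() {
	conn, err := dialWithTimeout("tcp", "example.com:80")
	if err != nil {
		// With net.Dial an unreachable target could block for roughly
		// 10 minutes; now the dial fails after at most 10 seconds.
		log.Printf("dial failed: %v", err)
		return
	}
	defer conn.Close()
	log.Printf("connected to %v", conn.RemoteAddr())
}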
pkg/agent/client.go
Outdated
	Type: client.PacketType_DIAL_RSP,
	Payload: &client.Packet_DialResponse{DialResponse: &client.DialResponse{}},
}
go func() {
If we move the entire handling of DIAL_REQ into a goroutine like this, we can still enter a race with the handling of CLOSE_REQ that I mentioned in #179 (comment).
How about we only move the net.Dial into a goroutine? That way a.connManager.Add is called sequentially, and that would prevent the race. This is also what I suggested in #179 (comment).
@caesarxuchao So what you are suggesting is that the goroutine should start after the a.connManager.Add call, right?
In the beginning the goroutine simply started shortly before the net.Dial call, but after your suggested refactoring we can indeed move the start of the goroutine to a later point in time, i.e. directly before the net.Dial call.
Yeah, the goroutine should start after the a.connManager.Add call. Or in other words, "move the start of the goroutine to a later point in time, i.e. directly before the net.Dial call".
@caesarxuchao Should I only include the 10s timeout, i.e. the minimal change, in the first pull request, or also include the reordering and the change with regards to a.connManager.Add?
This is just a note that without this change it appears the Kubernetes e2e tests will not pass when Konnectivity is in use. cc @caesarxuchao I think this is critical toward stabilization of the system. See @rtheis's comment #179 (comment) for more details. I know we (IBM + Red Hat) have had to carry this patch in order to utilize Konnectivity in our systems.
Sorry for not being responsive. I wasn't monitoring this PR closely. Feel free to ping me on Slack.
In the first PR, let's only include the 10s timeout. In the second PR, let's include the goroutine change, including the reordering and change with regards to a.connManager.Add.
@ScheererJ do you still have time to make the changes? If not I can see if we can find someone to continue your great work here.
Added #252 as the second part of this improvement with the goroutine change. @relyt0925 I will adapt this pull request now to only include the timeout change.
Now, this pull request only includes the timeout change; the extended asynchronous processing via goroutine is done in #252.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: caesarxuchao, ScheererJ. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This pull request fixes an issue where a net.Dial call triggered by a dial request blocks the receive loop, and with it all communication from the proxy server. We have seen instances where all traffic was blocked for roughly 10 minutes.
The problem is solved by first making the dial request processing asynchronous, and secondly by adding a timeout to the net.Dial call to the target.
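To make the interaction of the two changes concrete, here is a small runnable toy model (all names are illustrative; this is not the actual agent code): one goroutine dials with a timeout, while a concurrent cleanup, standing in for CLOSE_REQ handling, blocks on dialDone so it cannot race the pending dial.

package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

type connContext struct {
	dialDone chan struct{}
	conn     net.Conn
}

// clean waits for the pending dial to finish or time out, then closes
// the resulting connection if there is one, so nothing is leaked.
func (c *connContext) clean() {
	<-c.dialDone
	if c.conn != nil {
		c.conn.Close()
	}
}

func main() {
	ctx := &connContext{dialDone: make(chan struct{})}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // stands in for the DIAL_REQ handler's goroutine
		defer wg.Done()
		conn, err := net.DialTimeout("tcp", "example.com:80", 10*time.Second)
		if err == nil {
			ctx.conn = conn
		}
		close(ctx.dialDone) // signal that the dial attempt is over
	}()
	go func() { // stands in for a CLOSE_REQ arriving mid-dial
		defer wg.Done()
		ctx.clean()
		fmt.Println("cleaned up safely")
	}()
	wg.Wait()
}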