Go (golang bindings): 8 database connections at once causes CGO signal and panic #11856
Details of my fdb setup: version 7.3.57, the latest, running on just one node.
The crash seems to be happening in a finalizer for a tenant. Running under gdb gives a stacktrace indicating as much:

[gdb stacktrace elided]
To the best I can figure, this is the code being called, and then somehow, for some odd reason, a transaction option is being invoked:

[code excerpt elided]
Running on macOS under lldb (the only place I could get the library to build with debug mode enabled), I see [lldb output elided]. So maybe it is simply that the cast to ITenant is bad? Still not clear to me. The backtrace from there:

[lldb backtrace elided]
Going up one stack frame from the halt, and then up again, and again, shows:

[stack frames elided]
Going back down the stack and printing `this` or `ptr` as we go reveals, at the lowest frame (where lldb stopped us on the segfault / EXC_BAD_ACCESS):

[pointer values elided]
Still on the lowest frame, let's list the surrounding source code for more context:

[source listing elided]

The only thing that catches my eye is that db is nullptr. It appears... that the address of delref has gotten messed up somehow.

Let's run it again, since I killed it by continuing, and look at the assembly where it halts:

[disassembly elided]
I'm not an assembly expert, so I'll consult an AI here.

AI: From this disassembly and register state, we can see that the segfault is happening during a virtual function call, specifically at: [instruction elided]. The issue is likely one of these: [candidate causes elided]. To debug further, you could: [suggestions elided]. This looks like a classic use-after-free or memory-corruption issue where the virtual function table pointer is invalid.
AI: Ah, this confirms the diagnosis! The fact that the memory read failed means we're trying to access an invalid memory address (0xbcb37c51c80). This is almost certainly a use-after-free bug. To track this down: [suggestions elided]. Common patterns that lead to this: [list elided]. The fact that this is happening in a delref function suggests it is related to reference counting; possibly the object is being deleted twice, or accessed after its reference count hits zero.
Fascinating: I notice that I must run both tests in db_test.go for the use-after-delete crash to occur. Since it is a finalizer (on a tenant structure that itself has a reference to the database) that is causing the crash, and finalizers can run at any time or never... this suggests that perhaps the problem is a delayed finalizer from the first test that gets run during the second test.

"Note the many caveats listed in the runtime.SetFinalizer documentation. You should never depend on or expect a finalizer to ever get run. Instead, always make sure to call the appropriate cleanup function(s) (e.g. defer file.Close() right after checking the errors of os.Open)." -- from https://stackoverflow.com/questions/8595457/which-objects-are-finalized-in-go-by-default-and-what-are-some-of-the-pitfalls-o

So a fundamental design flaw in the FDB Go bindings seems to be using finalizers at all, and having them decrease refcounts (in this case, on a database that has already been deleted <- current hypothesis, not final).

Reading the docs in the link above on finalizers, these things stand out:

- "If a cyclic structure includes a block with a finalizer, that cycle is not guaranteed to be garbage collected and the finalizer is not guaranteed to run, because there is no ordering that respects the dependencies."
- "The finalizer is scheduled to run at some arbitrary time after the program can no longer reach the object to which obj points."
- "There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program."
- "A finalizer may run as soon as an object becomes unreachable. In order to use finalizers correctly, the program must ensure that the object is reachable until it is no longer required."
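To make the hazard concrete, here is a minimal Go sketch of the pattern described above; all names are hypothetical stand-ins, not the actual binding code. A finalizer set while the first test runs can fire during the second test, after the C-side object it references is already gone:

```go
package main

import (
	"runtime"
	"time"
)

// Hypothetical stand-ins for the C-side handles in the bindings.
type database struct{ destroyed bool }
type tenant struct{ db *database }

// releaseTenant stands in for the cgo delref call; if the database's
// C-side object is already gone, this is a use-after-free.
func releaseTenant(t *tenant) {
	if t.db.destroyed {
		panic("delref on destroyed database: use-after-free")
	}
}

func openTenant(db *database) *tenant {
	t := &tenant{db: db}
	// The finalizer runs at some arbitrary time after t becomes
	// unreachable; possibly during a later test, possibly never.
	runtime.SetFinalizer(t, releaseTenant)
	return t
}

func main() {
	db := &database{}
	openTenant(db)      // tenant immediately unreachable; finalizer pending
	db.destroyed = true // "Test001" tears the database down

	runtime.GC()                      // "Test002" triggers a GC...
	time.Sleep(50 * time.Millisecond) // ...and the stale finalizer fires
}
```

Running this should panic from the finalizer goroutine, which is roughly the shape of the crash described above.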
Ah hah, we have the smoking gun. I instrumented the C++ ref-counts and logged the this-pointer of the ThreadSafeDatabase in fdbclient/include/fdbclient/ThreadSafeTransaction.h. We can clearly see a database from the first test (0x600003e1c000 is its "this" pointer) having its ref-count decremented during Test002 (line 269 of crash.log.3.with.refcounts.and.this.addresses.txt). I've added this log to the repo for reference and inspection. Attaching the C++ diff/patch-like file (actually a git diff) if anyone wishes to reproduce in the C++.
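As an aside, one way to make the delayed finalizer fire on demand rather than at an arbitrary point is to force a few GC cycles at the start of the second test. This helper is my own sketch, not something from the repo:

```go
import (
	"runtime"
	"time"
)

// flushFinalizers forces finalizers queued for objects from a previous
// test to run now, making the use-after-delete crash deterministic.
func flushFinalizers() {
	for i := 0; i < 3; i++ {
		runtime.GC()                      // detect unreachable objects, queue their finalizers
		time.Sleep(10 * time.Millisecond) // let the finalizer goroutine drain the queue
	}
}
```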
This is an interesting issue, though the difficulty is that the people working on the tenant feature are currently at Snowflake rather than Apple. I am familiar with the refcount part, but I have never done Go programming.
But "There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program. " might be the root of the pain. |
Indeed. There are so many issues with the Go bindings. I've fixed a bunch of bugs and races, but there is still more to do. The bindings haven't gotten much love in the last five years; I can't imagine they are widely used. (Update: main does have some updates, but v7.3.57, the latest release, is much older.) Still, I'm trying to keep backwards compatibility even though it is painful. I'll submit a pull request when I'm done; I'm not rushing it because I'd like to get rid of all of the finalizers first.

I was about to mention my branch, but it is based on 7.3.57, while main appears to be substantially different: for instance, there is no tenant.go file at all on main, and the finalizer for the database has already been deleted. However, finalizers for memory reclamation of Futures and Transactions are still in use, which is bad; they are time bombs, bugs waiting to happen. I'll need to read more of the history and try to understand what's going on between the branches.

@xis19 Is tenant support on main, or only on a Snowflake branch? I thought v7.3.57, the latest release, was on a non-Snowflake branch, but it has tenant support, and main (at least the Go bindings; I don't know about the server) does not, so I'm not sure what is going on. Any context would be helpful. Update: for example, the most recent common ancestor between 7.3.57 and main is 21 months old, suggesting serious divergence.
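For concreteness, a minimal sketch of the finalizer-free direction: an explicit, idempotent Close() that user code calls (or defers), instead of a finalizer decrementing the C-side refcount at an arbitrary time. All names here are illustrative, not the actual binding API:

```go
package fdb

import (
	"sync/atomic"
)

// Tenant wraps a C-side handle. Illustrative only; the real bindings
// differ. release stands in for the cgo call that drops the refcount.
type Tenant struct {
	release func()
	closed  atomic.Bool
}

// Close releases the C-side handle exactly once. Safe to call multiple
// times, and safe to leave as a deferred call in user code:
//
//	t, err := db.OpenTenant(name)
//	if err != nil { /* handle error */ }
//	defer t.Close()
func (t *Tenant) Close() {
	if t.closed.CompareAndSwap(false, true) {
		t.release()
	}
}
```

The CompareAndSwap guard makes Close safe to call twice, which would also make it safe to keep a finalizer around temporarily as a backstop during a migration.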
Several SetFinalizer time-bombs remain; we are working to get rid of them. This change:

- fixes broken OpenTenant();
- adds db.TenantExists(), db.EnsureTenant(), and db.DeleteTenantsIfExist();
- fixes a bunch of bugs, races, and design problems;
- adds back tenant.go (why is it missing on the main branch?).
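A hedged sketch of how those helpers might be used; the signatures are guesses from the names above and are not confirmed against the branch:

```go
package main

import (
	"fmt"
	"log"

	// assumes the fork of the Go bindings that adds the tenant helpers
	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	fdb.MustAPIVersion(730)
	db := fdb.MustOpenDefault()

	name := []byte("acme-tenant")

	// Signatures below are guesses from the helper names, not a
	// confirmed API on any branch.
	if err := db.EnsureTenant(name); err != nil { // create if absent
		log.Fatal(err)
	}
	exists, err := db.TenantExists(name)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("tenant exists:", exists)
}
```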
To my understanding, tenant is not being used by us and we are not supporting it (@jzhou77 please correct me if I am wrong); the full tenant support is available in the Snowflake private repository, which is maintained by Snowflake people. So we may not want to touch it unless it breaks something on our side. There is a similar story in Python, where the "destructor"...
I second this; there's no clean way to map C++ destructors to Go finalizers and obtain reliable behavior out of it. Let me know if you need help with this.
I already fixed a few bugs related to finalizers, but there are still occasional issues where destruction on the C++ side (including the network thread) is not coordinated with the Go side; I occasionally still see some SIGSEGV in tests.
@gm42 ouch! That is painful. I'm blocked on doing anything here because I don't understand the versioning structure in use, so I would not know which branch to base any fixes on. Do you comprehend it? Usually patches go on top of master/main, but the latest release, 7.3.57, appears to have nothing to do with main.
I do not understand what you mean by "have nothing to do with main". But if your focus is on the Go part, I would think you can start with 7.3.57. We can deal with the merging issue.
Hi @xis19, thanks for your help. I was referring to my observations above, in #11856 (comment).
@glycerine I think it's better to fix on main.
@glycerine if it can be of any help, I have submitted all my PRs to main.
@gm42 makes perfect sense, thanks. Just to update: I'm not actively working on a patch at the moment. There were other issues besides this one that caused me to reconsider whether FoundationDB is the right choice for my current project. So @gm42 Giuseppe, if you or someone else wants to dig into a fix for this and/or the other Go issues, we won't be duplicating effort. Feel free to take the lead. I can assist with testing and review/discussion if you do, but I'm not actively working toward sending a PR at this point.
go version go1.23.3 linux/amd64
I'm given to understand from reading the FoundationDB documentation [1][2] that the key to getting acceptable write throughput is to write through many connections to the database in parallel. However, while I saw throughput increase going from 1 to 2 connections, and again from 2 to 4, I get consistent panics of the "CGO received a signal" type when I try 8 database connections.
See the crash logs linked below and the repo's kvfile/ directory for a reproducer (run the tests).
https://github.com/glycerine/fdb_demo_golang
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.macos.txt
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.txt
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log2.txt
Run `go test -v` in the ./kvfile directory of the fdb_demo_golang repo to reproduce the panic. Test002 in particular triggers it; you can change the database pool size here: https://github.com/glycerine/fdb_demo_golang/blob/master/kvfile/db_test.go#L97
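For context, the pattern under test is roughly the sketch below; it is a simplification with names of my own, not the actual kvfile code. Note that some versions of the Go bindings cache database handles per cluster file, so fdb.OpenDefault may not create truly independent connections:

```go
package main

import (
	"fmt"
	"log"
	"sync"
	"sync/atomic"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

// dbPool hands out N database handles round-robin to concurrent writers.
type dbPool struct {
	dbs  []fdb.Database
	next atomic.Uint64
}

func newDBPool(n int) (*dbPool, error) {
	p := &dbPool{dbs: make([]fdb.Database, n)}
	for i := range p.dbs {
		db, err := fdb.OpenDefault() // may return a cached handle in some versions
		if err != nil {
			return nil, err
		}
		p.dbs[i] = db
	}
	return p, nil
}

func (p *dbPool) get() fdb.Database {
	return p.dbs[(p.next.Add(1)-1)%uint64(len(p.dbs))]
}

func main() {
	fdb.MustAPIVersion(730)
	pool, err := newDBPool(8) // 8 is the size that triggers the panic
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			db := pool.get()
			_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
				tr.Set(fdb.Key(fmt.Sprintf("k%d", i)), []byte("v"))
				return nil, nil
			})
			if err != nil {
				log.Println(err)
			}
		}(i)
	}
	wg.Wait()
}
```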
Thanks,
Jason
Tagging @johscheuer because other Go issues here mention you (apologies, and please redirect if this is not your area now).
references
[1] https://apple.github.io/foundationdb/features.html#concurrent-connections
"FoundationDB is able to handle large numbers of concurrent client connections. Because it uses a
threadless communications and concurrency model, FoundationDB does not have to create a
thread per connection. This allows full performance even with hundreds of thousands of in-flight
requests."
[2] https://apple.github.io/foundationdb/performance.html
"Its asynchronous design allows it to handle very high concurrency, and for a typical workload with
90% reads and 10% writes, maximum throughput is reached at about 200 concurrent operations.
This number of operations was achieved with 20 concurrent transactions per FoundationDB process
each running 10 operations with 16 byte keys and values between 8 and 100 bytes."
"The implication of this [Little's Law] relation is that, at a given latency, we can maximize throughput
only by concurrently submitting enough outstanding requests. A FoundationDB cluster might have a
commit latency of 2 ms and yet be capable of far more than 500 commits per second. In fact, tens of
thousands of commits per second are easily achievable. To achieve this rate, there must be hundreds
of requests happening concurrently. Not having enough pending requests is the single biggest
reason for low performance."
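As a quick sanity check of the quoted numbers, Little's Law (L = λW) rearranges to throughput λ = L / W. The short program below is just that arithmetic:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Little's Law: L = λ·W, so λ = L / W.
	latency := 2 * time.Millisecond // W: the commit latency from the docs
	for _, inflight := range []float64{4, 100, 500} {
		throughput := inflight / latency.Seconds() // λ, commits per second
		fmt.Printf("%4.0f concurrent requests -> %8.0f commits/s\n",
			inflight, throughput)
	}
	//   4 in flight ->  2,000/s (starved: "not enough pending requests")
	// 100 in flight -> 50,000/s (the "tens of thousands" regime)
}
```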