Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Go (golang bindings): 8 database connections at once causes CGO signal and panic #11856

Open
glycerine opened this issue Jan 1, 2025 · 22 comments

Comments

@glycerine
Copy link

glycerine commented Jan 1, 2025

go version go1.23.3 linux/amd64

I'm given to understand from reading the FoundationDB documentation[1][2] that the key to getting acceptable write throughput is to write through many connections to the database in parallel.

However: when I try to do this, I saw increasing throughput going from 1 to 2, and from 2 to 4 connections;
but I get consistent panics of the "CGO received a signal" type when I try with 8 database connections.

See the crash logs linked below and the repo's kvfile/ directory for a reproducer (run the tests).

https://github.com/glycerine/fdb_demo_golang
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.macos.txt
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.txt
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log2.txt

run go test -v in the ./kvfile directory of the fdb_demo_golang repo to reproduce the panic.
The Test002 test in particular; you can change the database pool size here: https://github.com/glycerine/fdb_demo_golang/blob/master/kvfile/db_test.go#L97

Thanks,
Jason

@johscheuer because other Go issues here mention you (but apologies and please re-direct if this is not your area now)

references
[1] https://apple.github.io/foundationdb/features.html#concurrent-connections

"FoundationDB is able to handle large numbers of concurrent client connections. Because it uses a
threadless communications and concurrency model, FoundationDB does not have to create a
thread per connection. This allows full performance even with hundreds of thousands of in-flight
requests."

[2] https://apple.github.io/foundationdb/performance.html

"Its asynchronous design allows it to handle very high concurrency, and for a typical workload with
90% reads and 10% writes, maximum throughput is reached at about 200 concurrent operations.
This number of operations was achieved with 20 concurrent transactions per FoundationDB process
each running 10 operations with 16 byte keys and values between 8 and 100 bytes."

"The implication of this [Little's Law] relation is that, at a given latency, we can maximize throughput
only by concurrently submitting enough outstanding requests. A FoundationDB cluster might have a
commit latency of 2 ms and yet be capable of far more than 500 commits per second. In fact, tens of
thousands of commits per second are easily achievable. To achieve this rate, there must be hundreds
of requests happening concurrently. Not having enough pending requests is the single biggest
reason for low performance."

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

details of my fdb setup: It is version 7.3.57, the latest, running on just one node.
The client (Go program) and server are co-located on the same node.

$ fdbcli --version
FoundationDB CLI 7.3 (v7.3.57)
source version 63035b5c3ecb65a615126cf3d62e36df58a9c994
protocol fdb00b073000000

jaten@rog ~/go/src/github.com/glycerine/fdb_demo_golang (master) $ fdbcli --exec status
fdbcli --exec status
Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-redwood-1
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 1
  Desired Commit Proxies - 3
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 1
  Zones                  - 1
  Machines               - 1
  Memory availability    - 8.0 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 12/31/24 21:12:00

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 315 MB
  Disk space used        - 5.184 GB

Operating space:
  Storage server         - 749.0 GB free on most full server
  Log server             - 748.8 GB free on most full server

Workload:
  Read rate              - 21 Hz
  Write rate             - 0 Hz
  Transactions started   - 6 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 12/31/24 21:12:00

@glycerine
Copy link
Author

The crash seems to be happening on a finalizer for a tenant,
as the trace shows tenant.go:182, being

func (t *tenant) destroy() {
    if t.ptr == nil {
        return
    }

    C.fdb_tenant_destroy(t.ptr) // line 182 of tenant.go; we are in this call when we segfault
}
// file version: tenant.go<[email protected]>
db.go:272 2024-12-31T21:52:53.052-06:00 goro 2 done
SIGSEGV: segmentation violation
PC=0x7f1ac7820e03 m=12 sigcode=1 addr=0x40
signal arrived during cgo execution

goroutine 5 gp=0xc000007c00 m=12 mp=0xc000500008 [syscall]:
runtime.cgocall(0x580510, 0xc0000b45a8)
        /usr/local/go1.23.3/src/runtime/cgocall.go:167 +0x4b fp=0xc0000b4580 sp=0xc0000b4548 pc=0x46ba0b
github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_tenant_destroy(0x7f1a74006bc0)
        _cgo_gotypes.go:507 +0x3f fp=0xc0000b45a8 sp=0xc0000b4580 pc=0x5573ff
github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy.func1(0xc0000b4690?)
        /home/jaten/go/pkg/mod/github.com/apple/foundationdb/bindings/[email protected]/src/fdb/tenant.go:182 +0x3e fp=0xc0000b45e8 sp=0xc0000b45a8 pc=0x55ab3e
github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy(0x0?)
        /home/jaten/go/pkg/mod/github.com/apple/foundationdb/bindings/[email protected]/src/fdb/tenant.go:182 +0x19 fp=0xc0000b4600 sp=0xc0000b45e8 pc=0x55aad9
runtime.call16(0x0, 0x5db2f8, 0xc0002c0000, 0x10, 0x10, 0x10, 0xc0000b4690)
        /usr/local/go1.23.3/src/runtime/asm_amd64.s:775 +0x43 fp=0xc0000b4620 sp=0xc0000b4600 pc=0x477663
runtime.runfinq()
        /usr/local/go1.23.3/src/runtime/mfinal.go:255 +0x3f1 fp=0xc0000b47e0 sp=0xc0000b4620 pc=0x419c71
runtime.goexit({})

@glycerine
Copy link
Author

running under gdb gives the following stacktrace, indicating
the problem is in the fdb_transaction_set_option() C function.

...
db.go:266 2024-12-31T22:10:12.636-06:00 goro 2 done

db.go:266 2024-12-31T22:10:12.637-06:00 goro 1 done

Thread 3 "kvfile.test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffae0e1700 (LWP 27756)]
0x0000000000000000 in ?? ()
(gdb) bt
bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff77abe06 in fdb_transaction_set_option () from /usr/lib/libfdb_c.so
#2  0x00007ffff77abefe in fdb_transaction_set_option () from /usr/lib/libfdb_c.so
#3  0x00007ffff6ea327a in fdb_tenant_destroy () from /usr/lib/libfdb_c.so
#4  0x0000000000478da4 in runtime.asmcgocall () at /usr/local/go1.23.3/src/runtime/asm_amd64.s:923
#5  0x000000c000184700 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) 

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

To the best I can figure, this is the code being called

fdb_c.cpp:914  at git tag 7.3.57:

extern "C" DLLEXPORT void fdb_tenant_destroy(FDBTenant* tenant) {
        try {
                TENANT(tenant)->delref();
        } catch (...) {
        }
}

and then somehow, for some odd reason, a transaction option is being invoked???
I'm not sure how the delref() gets us to fdb_transaction_set_option_impl, so I'm mystified.
Maybe the stack was corrupted and the backtrace is wrong?

fdb_c.cpp:1196 at git tag 7.3.57:

fdb_error_t fdb_transaction_set_option_impl(FDBTransaction* tr,
                                            FDBTransactionOption option,
                                            uint8_t const* value,
                                            int value_length) {
        CATCH_AND_RETURN(TXN(tr)->setOption((FDBTransactionOptions::Option)option,
                                            value ? StringRef(value, value_length) : Optional<StringRef>()););
}

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

running on macos under lldb (the only place I could get the library to build to turn on debug mode build), I see

...
db.go:266 2025-01-01T11:13:15.670Z goro 2 done

db.go:266 2025-01-01T11:13:15.699Z goro 6 done
Process 48043 stopped                                                                                 
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x5fbbbdb0bc00)
    frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x000060000218bbc0) at FastRef.h:95:7
   92
   93   template <class P>
   94   void delref(P* ptr) {
-> 95           ptr->delref();
   96   }
   97
   98   template <class P>
Target 0: (kvfile.test) stopped.
(lldb) 

so maybe it is simply that the cast to ITenant is bad? Still not clear to me.

the backtrace from there:

(lldb) (lldb) bt
bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x5fbbbdb\
0bc00)
  * frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x000060000218bbc\
0) at FastRef.h:95:7
    frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x0000\
6000034ac850) at FastRef.h:126:4
    frame #2: 0x00000001070f82f5 libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x0000\
6000034ac850) at FastRef.h:124:15
    frame #3: 0x00000001070fcebf libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x0000600003\
4ac840) at ThreadSafeTransaction.cpp:357:1
    frame #4: 0x00000001070fcfb5 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x0000600003\
4ac840) at ThreadSafeTransaction.cpp:353:39
    frame #5: 0x00000001070fcfd9 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x0000600003\
4ac840) at ThreadSafeTransaction.cpp:353:39
    frame #6: 0x0000000107104ca6 libfdb_c.dylib`ThreadSafeReferenceCounted<ThreadSafeTenant>::delref(\
this=0x00006000034ac848) const at FastRef.h:51:4
    frame #7: 0x0000000107103d29 libfdb_c.dylib`ThreadSafeTenant::delref(this=0x00006000034ac840) at \
ThreadSafeTransaction.h:115:73
    frame #8: 0x0000000105f67579 libfdb_c.dylib`fdb_tenant_destroy(tenant=0x00006000034ac840) at fdb_\
c.cpp:916:19
    frame #9: 0x0000000100077f44 kvfile.test`runtime.asmcgocall.abi0 + 100
    frame #10: 0x000000010006a555 kvfile.test`runtime.cgocall + 117
    frame #11: 0x000000010015055f kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb._Cfun\
c_fdb_tenant_destroy.abi0 + 63
    frame #12: 0x0000000100153c9e kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy.func1 + 62
    frame #13: 0x0000000100153c39 kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb.(*tenant).destroy + 25
    frame #14: 0x00000001000767c3 kvfile.test`runtime.call16.abi0 + 67
    frame #15: 0x0000000100018f11 kvfile.test`runtime.runfinq + 1009
    frame #16: 0x00000001000782a1 kvfile.test`runtime.goexit.abi0 + 1
(lldb)

@glycerine
Copy link
Author

going up one stack frame from the halt shows

(lldb)  up
up
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:126:4
   123
   124          ~Reference() {
   125                  if (ptr)
-> 126                          delref(ptr);
   127          }
   128          Reference& operator=(const Reference& r) {
   129                  P* oldPtr = ptr;
(lldb) (lldb) 

and then up again

(lldb) (lldb) up
up
frame #2: 0x00000001070f82f5 libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:124:15
   121                  r.setPtrUnsafe(nullptr);
   122          }
   123
-> 124          ~Reference() {
   125                  if (ptr)
   126                          delref(ptr);
   127          }
(lldb) (lldb) 

and again...

(lldb) (lldb) up
up
frame #3: 0x00000001070fcebf libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x00006000034ac8\
40) at ThreadSafeTransaction.cpp:357:1
   354          Tenant* t = this->tenant;
   355          if (t)
   356                  onMainThreadVoid([t]() { t->delref(); });
-> 357  }
   358
   359  ThreadSafeTransaction::ThreadSafeTransaction(DatabaseContext* cx,
   360                                               ISingleThreadTransaction::Type type,
(lldb) (lldb) 
(lldb) (lldb) up
up
frame #4: 0x00000001070fcfb5 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x00006000034ac8\
40) at ThreadSafeTransaction.cpp:353:39
   350          });
   351  }
   352
-> 353  ThreadSafeTenant::~ThreadSafeTenant() {
   354          Tenant* t = this->tenant;
   355          if (t)
   356                  onMainThreadVoid([t]() { t->delref(); });
(lldb) (lldb) 
(lldb) (lldb) up
up
frame #5: 0x00000001070fcfd9 libfdb_c.dylib`ThreadSafeTenant::~ThreadSafeTenant(this=0x00006000034ac8\
40) at ThreadSafeTransaction.cpp:353:39
   350          });
   351  }
   352
-> 353  ThreadSafeTenant::~ThreadSafeTenant() {
   354          Tenant* t = this->tenant;
   355          if (t)
   356                  onMainThreadVoid([t]() { t->delref(); });
(lldb) (lldb) up
up
frame #6: 0x0000000107104ca6 libfdb_c.dylib`ThreadSafeReferenceCounted<ThreadSafeTenant>::delref(this\
=0x00006000034ac848) const at FastRef.h:51:4
   48           }
   49           void delref() const {
   50                   if (delref_no_destroy())
-> 51                           delete (Subclass*)this;
   52           }
   53           void setrefCountUnsafe(int32_t count) const { referenceCount.store(count); }
   54           int32_t debugGetReferenceCount() const { return referenceCount.load(); }
(lldb) (lldb) up
frame #7: 0x0000000107103d29 libfdb_c.dylib`ThreadSafeTenant::delref(this=0x00006000034ac840) at Thre\
adSafeTransaction.h:115:73
   112          ThreadFuture<bool> flushBlobRange(const KeyRangeRef& keyRange, bool compact, Optional\
<Version> version) override;
   113
   114          void addref() override { ThreadSafeReferenceCounted<ThreadSafeTenant>::addref(); }
-> 115          void delref() override { ThreadSafeReferenceCounted<ThreadSafeTenant>::delref(); }
   116
   117  private:
   118          Reference<ThreadSafeDatabase> db;
(lldb) up
frame #8: 0x0000000105f67579 libfdb_c.dylib`fdb_tenant_destroy(tenant=0x00006000034ac840) at fdb_c.cp\
p:916:19
   913
   914  extern "C" DLLEXPORT void fdb_tenant_destroy(FDBTenant* tenant) {
   915          try {
-> 916                  TENANT(tenant)->delref();
   917          } catch (...) {
   918          }
   919  }
(lldb) up
frame #9: 0x0000000100077f44 kvfile.test`runtime.asmcgocall.abi0 + 100
kvfile.test`runtime.asmcgocall.abi0:
->  0x100077f44 <+100>: movq   0x8(%rsp), %rdi
    0x100077f49 <+105>: movq   0x8(%rdi), %rsi
    0x100077f4d <+109>: subq   (%rsp), %rsi
    0x100077f51 <+113>: movq   %rdi, %gs:0x30
(lldb) (lldb) up
up
frame #10: 0x000000010006a555 kvfile.test`runtime.cgocall + 117
kvfile.test`runtime.cgocall:
->  0x10006a555 <+117>: xorps  %xmm15, %xmm15
    0x10006a559 <+121>: movq   %gs:0x30, %r14
    0x10006a562 <+130>: movl   0x10(%rsp), %eax
    0x10006a566 <+134>: movl   %eax, 0x1c(%rsp)
(lldb) (lldb) up
up
frame #11: 0x000000010015055f kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fd\
b_tenant_destroy.abi0 + 63
kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb._Cfunc_fdb_tenant_destroy.abi0:
->  0x10015055f <+63>: cmpb   $0x0, 0x1f61e8(%rip)      ; runtime.cgoAlwaysFalse
    0x100150566 <+70>: je     0x10015059b               ; <+123>
    0x100150568 <+72>: movq   0x28(%rsp), %rax
    0x10015056d <+77>: xorps  %xmm15, %xmm15
(lldb)

@glycerine
Copy link
Author

going back down the stack and printing this or ptr as we go reveals
that it is actually not the Tenant but a ThreadSafeDatabase* that is
having the problem:

At the lowest frame (where lldb stopped us on the segfault / EXC_BAD_ACCESS):

(lldb) (lldb) down
down
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:126:4
   123
   124          ~Reference() {
   125                  if (ptr)
-> 126                          delref(ptr);
   127          }
   128          Reference& operator=(const Reference& r) {
   129                  P* oldPtr = ptr;
(lldb) (lldb) p this
p this
(Reference<ThreadSafeDatabase> *) 0x00006000034ac850
(lldb) (lldb) p ptr
p ptr
(ThreadSafeDatabase *) 0x000060000218bbc0
(lldb) (lldb) 

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

still on the lowest frame, lets list the surrounding source code for more context

(lldb) source list -l 80 -c 50
source list -l 80 -c 50
   80   };
   81
   82   #if FLOW_THREAD_SAFE
   83   #define ReferenceCounted ThreadSafeReferenceCounted
   84   #else
   85   #define ReferenceCounted ThreadUnsafeReferenceCounted
   86   #endif
   87
   88   template <class P>
   89   void addref(P* ptr) {
   90           ptr->addref();
   91   }
   92
   93   template <class P>
   94   void delref(P* ptr) {
   95           ptr->delref();
   96   }
   97
   98   template <class P>
   99   class Reference {
   100  public:
   101          Reference() : ptr(nullptr) {}
   102          explicit Reference(P* ptr) : ptr(ptr) {}
   103          static Reference<P> addRef(P* ptr) {
   104                  ptr->addref();
   105                  return Reference(ptr);
   106          }
   107
   108          Reference(const Reference& r) : ptr(r.getPtr()) {
   109                  if (ptr)
   110                          addref(ptr);
   111          }
   112          Reference(Reference&& r) noexcept : ptr(r.getPtr()) { r.ptr = nullptr; }
   113
   114          template <class Q>
   115          Reference(const Reference<Q>& r) : ptr(r.getPtr()) {
   116                  if (ptr)
   117                          addref(ptr);
   118          }
   119          template <class Q>
   120          Reference(Reference<Q>&& r) : ptr(r.getPtr()) {
   121                  r.setPtrUnsafe(nullptr);
   122          }
   123
   124          ~Reference() {
   125                  if (ptr)
-> 126                          delref(ptr);
   127          }
   128          Reference& operator=(const Reference& r) {
   129                  P* oldPtr = ptr;

 (lldb) f
f
frame #1: 0x0000000107108dfa libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference(this=0x00006000\
034ac850) at FastRef.h:126:4
   123
   124          ~Reference() {
   125                  if (ptr)
-> 126                          delref(ptr);
   127          }
   128          Reference& operator=(const Reference& r) {
   129                  P* oldPtr = ptr;

(lldb) p *ptr
p *ptr
(ThreadSafeDatabase) {
  ThreadSafeReferenceCounted<ThreadSafeDatabase> = {
    referenceCount = {
      Value = 441
    }
  }
  isConfigDB = false
  db = nullptr
}
(lldb) (lldb) p *this
p *this
(Reference<ThreadSafeDatabase>) {
  ptr = 0x000060000218bbc0
}
(lldb) (lldb) p ptr
p ptr
(ThreadSafeDatabase *) 0x000060000218bbc0
(lldb) (lldb) p this
p this
(Reference<ThreadSafeDatabase> *) 0x00006000034ac850
(lldb)

The only thing that catches is my eye is that db is nullptr.
I wonder what the exact definition of delref is here.

(lldb) disas
disas
libfdb_c.dylib`Reference<ThreadSafeDatabase>::~Reference:
    0x107108dd0 <+0>:  pushq  %rbp
    0x107108dd1 <+1>:  movq   %rsp, %rbp
    0x107108dd4 <+4>:  subq   $0x10, %rsp
    0x107108dd8 <+8>:  movq   %rdi, -0x8(%rbp)
    0x107108ddc <+12>: movq   -0x8(%rbp), %rax
    0x107108de0 <+16>: movq   %rax, -0x10(%rbp)
    0x107108de4 <+20>: cmpq   $0x0, (%rax)
    0x107108de8 <+24>: je     0x107108e04               ; <+52> at FastRef.h:127:2
    0x107108dee <+30>: movq   -0x10(%rbp), %rax
    0x107108df2 <+34>: movq   (%rax), %rdi
    0x107108df5 <+37>: callq  0x107108e20               ; delref<ThreadSafeDatabase> at FastRef.h:94
->  0x107108dfa <+42>: jmp    0x107108dff               ; <+47> at FastRef.h:126:4
    0x107108dff <+47>: jmp    0x107108e04               ; <+52> at FastRef.h:127:2
    0x107108e04 <+52>: addq   $0x10, %rsp
    0x107108e08 <+56>: popq   %rbp
    0x107108e09 <+57>: retq
    0x107108e0a <+58>: movq   %rax, %rdi
    0x107108e0d <+61>: callq  0x105f5e1a0               ; __clang_call_terminate

(lldb) c
Process 48043 resuming
Process 48043 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
 address=0x5fbbbdb0bc00)
    frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x000060000218bbc\
0) at FastRef.h:95:7
   92
   93   template <class P>
   94   void delref(P* ptr) {
-> 95           ptr->delref();
   96   }
   97
   98   template <class P>
Target 0: (kvfile.test) stopped.
(lldb) 

It appears... that the address of delref has gotten messed up some how???

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

Lets run it again, since I killed it by continuing, and look at the assembly where it halts:


db.go:266 2025-01-01T11:51:02.484Z goro 2 done

db.go:266 2025-01-01T11:51:02.484Z goro 6 done
Process 48581 stopped
* thread #10, stop reason = EXC_BAD_ACCESS (code=1, address=0xbcb37c51cc0)
    frame #0: 0x0000000107108e33 libfdb_c.dylib`void delref<ThreadSafeDatabase>(ptr=0x0000600003771c8\
0) at FastRef.h:95:7
   92
   93   template <class P>
   94   void delref(P* ptr) {
-> 95           ptr->delref();
   96   }
   97
   98   template <class P>
Target 0: (kvfile.test) stopped.
(lldb) (lldb) disas
disas
libfdb_c.dylib`delref<ThreadSafeDatabase>:
    0x107108e20 <+0>:  pushq  %rbp
    0x107108e21 <+1>:  movq   %rsp, %rbp
    0x107108e24 <+4>:  subq   $0x10, %rsp
    0x107108e28 <+8>:  movq   %rdi, -0x8(%rbp)
    0x107108e2c <+12>: movq   -0x8(%rbp), %rdi
    0x107108e30 <+16>: movq   (%rdi), %rax
->  0x107108e33 <+19>: callq  *0x40(%rax)
    0x107108e36 <+22>: addq   $0x10, %rsp
    0x107108e3a <+26>: popq   %rbp
    0x107108e3b <+27>: retq
(lldb) (lldb) register read
register read
General Purpose Registers:
       rax = 0x00000bcb37c51c80
       rbx = 0x000000c0000585a8
       rcx = 0x0000000100861080
       rdx = 0x0000000000000005
       rdi = 0x0000600003771c80
       rsi = 0x0000000000000004
       rbp = 0x000070000b502de0
       rsp = 0x000070000b502dd0
        r8 = 0x000000c000400008
        r9 = 0x0000000000000000
       r10 = 0x0000000000000000
       r11 = 0x0000000000000000
       r12 = 0x0000000100153c20  kvfile.test`github.com/apple/foundationdb/bindings/go/src/fdb.(*tena\
nt).destroy
       r13 = 0x0000000000000000
       r14 = 0x000000c000007c00
       r15 = 0x0000000000000000
       rip = 0x0000000107108e33  libfdb_c.dylib`void delref<ThreadSafeDatabase>(ThreadSafeDatabase*) \
+ 19 at FastRef.h:95:7
    rflags = 0x0000000000010202
        cs = 0x000000000000002b
        fs = 0x0000000000000000
        gs = 0x0000000000000000

(lldb)
(lldb) memory read -f x $rax
memory read -f x $rax
error: memory read failed for 0xbcb37c51c00
(lldb) 

I'm not an assembly expert, so I'll consult with an AI here...

AI: From this disassembly and register state, we can see that the segfault is happening during a virtual function call, specifically at:

->  0x107108e33 <+19>: callq  *0x40(%rax)

The issue is likely one of these:

  1. Invalid vtable pointer:
  • The code is trying to call a virtual function through the vtable
  • It's accessing offset 0x40 (64 bytes) into what should be the vtable
  • rax = 0x00000bcb37c51c80 - this is the vtable pointer
  • The pointer might be corrupted or the object might have been deleted
  1. Common causes:
  • Use-after-free: Object was deleted but code still tries to use it
  • Memory corruption: vtable pointer was overwritten
  • Uninitialized pointer: Object was never properly constructed

To debug further, you could:

# Check what's at the vtable address
(lldb) memory read -f x $rax

# Look at the object pointer (in rdi)
(lldb) register read rdi

# Examine the stack trace
(lldb) bt

This looks like a classic use-after-free or memory corruption issue where the virtual function table pointer is invalid.

Me: when I try the first suggestion, I get:
(lldb) memory read -f x $rax
memory read -f x $rax
error: memory read failed for 0xbcb37c51c00
(lldb) 

AI:

Ah, this confirms the diagnosis! The fact that the memory read failed means we're trying to access an invalid memory address (0xbcb37c51c80). This is almost certainly a use-after-free bug where:

  1. The object was allocated and had a valid vtable
  2. The object was deleted/freed
  3. Code is still trying to use the object through a dangling pointer
  4. The memory address is no longer valid/accessible

To track this down:

# Get backtrace to see how we got here
(lldb) bt

# If you have source code, set a breakpoint earlier and watch the pointer
(lldb) b delref<ThreadSafeDatabase>
(lldb) run

# When stopped, watch the object pointer
(lldb) watchpoint set expression -- ptr  # replace 'ptr' with actual pointer name

Common patterns that lead to this:

  1. Reference counting errors
  2. Storing raw pointers to deleted objects
  3. Async/threading issues where object lifetime isn't properly managed
  4. Double-delete scenarios

The fact this is happening in a delref function suggests this might be related to reference counting - possibly the object is being deleted twice or accessed after its reference count hits zero.

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

Fascinating: I notice that I must run both tests in the db_test.go for the use-after-delete crash to occur.
If I run only one test (either) then there is no crash. Test001... always runs before Test002.

Since it is a finalizer (on a tenant structure that itself has a reference to the database) that is causing the crash, and they can run at any time or never... this suggests... perhaps the problem is a delayed finalizer from the first test that gets run during the second test.

"Note the many caveats listed in the runtime.SetFinalizer documentation. You should never depend on or expect a finalizer to ever get run. Instead, always make sure to call the appropriate cleanup function(s) (e.g. defer file.Close() right after checking the errors of os.Open)." -- from https://stackoverflow.com/questions/8595457/which-objects-are-finalized-in-go-by-default-and-what-are-some-of-the-pitfalls-o

So a fundamental design flaw in the FDB Go bindings seems to be using finalizers at all... and having them decrease refcounts (in this case on a database that has already been deleted... <- current hypothesis, not final)

reading the docs in the link above on finalizers, these things stand out:

"If a cyclic structure includes a block with a finalizer, that cycle is not guaranteed to be garbage collected and the finalizer is not guaranteed to run, because there is no ordering that respects the dependencies."

"The finalizer is scheduled to run at some arbitrary time after the program can no longer reach the object to which obj points."

"There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program. "

"A finalizer may run as soon as an object becomes unreachable. In order to use finalizers correctly, the program must ensure that the object is reachable until it is no longer required. "

@glycerine
Copy link
Author

glycerine commented Jan 1, 2025

Ah hah. We have the smoking gun.

See
https://github.com/glycerine/fdb_demo_golang/blob/master/crash.log.3.with.refcounts.and.this.addresses.txt#L269

where I instrumented the C++ ref-counts and log the this-pointer of the ThreadSafeDatabase in fdbclient/include/fdbclient/ThreadSafeTransaction.h

We can clearly see a database from the first test (0x600003e1c000 is its "this" pointer) having its ref-count decremented during Test002 (line 269 of crash.log.3.with.refcounts.and.this.addresses.txt). I've added this log to the repo for reference and inspection.

Attaching the C++ diff/patch-like file (actually a git diff) if anyone wishes to reproduce in the C++.
git.diffs.txt

glycerine pushed a commit to glycerine/foundationdb that referenced this issue Jan 1, 2025
glycerine pushed a commit to glycerine/foundationdb that referenced this issue Jan 1, 2025
@xis19
Copy link
Collaborator

xis19 commented Jan 1, 2025

This is an interesting issue, although the problem is that tenant feature people are currently in Snowflake rather than Apple. I am familiar with the refcount part, but I have never thought go programming using go.

@xis19
Copy link
Collaborator

xis19 commented Jan 1, 2025

But

"There is no guarantee that finalizers will run before a program exits, so typically they are useful only for releasing non-memory resources associated with an object during a long-running program. "

might be the root of the pain.

@glycerine
Copy link
Author

glycerine commented Jan 2, 2025

might be the root of the pain.

Indeed.

There are so many issues with the Go bindings. I've fixed a bunch of bugs and races, but there are still more to do. It hasn't gotten much love in the last five years; I can't imagine it is widely used. (Update: main does have some updates, but apparently v7.3.57 the latest release is much older). Still, I'm trying to keep backwards-compatibility even though this is painful.

I'll submit a pull request when I'm done. I'm not rushing it because I'd like to get rid of all of the finalizers first.

I was about to mention my branch, but it is based on 7.3.57, while trunk main appears to be substantially different; for instance there is no tenant.go file at all on main; and the finalizer for the database has already been deleted. However finalizers for memory reclamation for Futures and Transactions are still in use, which is bad; time bombs /bugs waiting to happen.

I'll need to read more of the history and try and understand what's going on between the branches.

@xis19 Is tenant support on main, or only on a Snowflake branch? I thought v7.3.57, the latest released, was a non-Snowflake branch, but it has tenant support, and main (at least the Go bindings, I don't know about the server) does not, so I'm not sure what is going on. Any context would be helpful.

Update: for example, the most recent common ancestor between 7.3.57 and main is 21 months old; suggesting serious divergence.

$ git merge-base main 7.3.57
warning: refname '7.3.57' is ambiguous.
c5b8908828d7cd1e209d08a54c7c751a482cfa3b

$ git log c5b8908828d7cd1e209d08a54c7c751a482cfa3b
commit c5b8908828d7cd1e209d08a54c7c751a482cfa3b
Author: Jingyu Zhou <[email protected]>
Date:   Tue Apr 4 14:31:21 2023 -0700

    Update 7.1.30 release notes (#9887)

glycerine pushed a commit to glycerine/foundationdb that referenced this issue Jan 2, 2025
  several SetFinalizer time-bombs remain. We
  are working to get rid of them.

  fix broken OpenTenant().

  add db.TenantExists(), db.EnsureTenant(), db.DeleteTenantsIfExist().

  fix a bunch of bugs, races, and design problems

  adds back tenant.go; why is it missing on main branch?
@xis19
Copy link
Collaborator

xis19 commented Jan 2, 2025

To my understanding, tenant is not being used by us and we are not supporting it (@jzhou77 please correct me if I am wrong), the full tenant support is available in Snowflake private repository, which is maintained by Snowflake people. So we may not want to touch it unless it breaks something on our side.

There is a similar story in Python, where the "destructor" __del__ is not guaranteed to be called. But we did not have this kind of problem though. Not sure if this information would help or not.

@gm42
Copy link
Collaborator

gm42 commented Jan 6, 2025

I'd like to get rid of all of the finalizers first.

I second this; there's no clean way to map C++ destructors to Go finalizers and obtain reliable behavior out of it. Let me know if you need help with this.

I was about to mention my branch, but it is based on 7.3.57, while trunk main appears to be substantially different

I fixed already a few bugs related to finalizers, but there are still occasional issues related to network thread and destruction on C++ side which is not coordinated with the Go side, since I occasionally still see some SIGSEGV in tests.

@glycerine
Copy link
Author

I fixed already a few bugs related to finalizers, but there are still occasional issues related to network thread and destruction on C++ side which is not coordinated with the Go side, since I occasionally still see some SIGSEGV in tests.

@gm42 ouch! that is painful.

I'm blocked on doing anything here because I don't understand the versioning structure in use here, so I would not know on which branch to base any fixes. Do you comprehend it? Usually patches go on top of master/main, but the latest release 7.3.57 appears to have nothing to do with main.

@xis19
Copy link
Collaborator

xis19 commented Jan 6, 2025

Usually patches go on top of master/main, but the latest release 7.3.57 appears to have nothing to do with main.

I do not understand what do you mean by "have nothing to do withmain". But if your focus is on Go part, I would think you can start with 7.3.57. We can deal with the merging issue.

@glycerine
Copy link
Author

I do not understand what do you mean by "have nothing to do with main". But if your focus is on Go part, I would think you can start with 7.3.57. We can deal with the merging issue.

Hi @xis19 - Thanks your your help. I was referring to my observations, above, in #11856 (comment),
where I note alot of divergence which was confusing to me. The Go part on main seems more up-to-date; so would it not be better to start there? Is main due to be released as 7.3.58 at some point? How is 7.1 different from 7.3, as I've seen it implied (however right or wrong) that both are being released from?

[my branch] is based on 7.3.57, while trunk main appears to be substantially
different; for instance there is no tenant.go file at all on main; and the finalizer for
the database has already been deleted. However finalizers for memory reclamation
for Futures and Transactions are still in use, which is bad; time bombs /bugs
waiting to happen.

I'll need to read more of the history and try and understand what's going on
between the branches.

...

Update: for example, the most recent common ancestor between 7.3.57
and main is 21 months old; suggesting serious divergence.

$ git merge-base main 7.3.57
warning: refname '7.3.57' is ambiguous.
c5b8908

$ git log c5b8908
commit c5b8908
Author: Jingyu Zhou [email protected]
Date: Tue Apr 4 14:31:21 2023 -0700

Update 7.1.30 release notes (#9887)

@jzhou77
Copy link
Contributor

jzhou77 commented Jan 6, 2025

@glycerine I think it's better to fix on main branch first. if needed, we can cherrypick to 7.3 branch. Since we are going to cut 8.0 release in March, maybe we don't have to cherrypick to 7.3.

@gm42
Copy link
Collaborator

gm42 commented Jan 7, 2025

@glycerine if it can be of any help, I have submitted all my PRs to main. I think that needs to be done anyways before any backport on release branches.

@glycerine
Copy link
Author

@gm42 makes perfect sense, thanks.

Just to update: I'm not actively working on at a patch at the moment. There were other issues besides this one that caused me to reconsider if FoundationDB is the right choice for my current project. So @gm42 Giuseppe, if you or someone else wants to dig into a fix for this and/or other Go issues, we won't be duplicating effort. Feel free to take the lead. I can assist with testing and review/discussion if you do, but I'm not actively working towards sending a PR at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants