Sentry encountering a transient problem where it cannot load certain issues #6569

Open
wxu-stripe opened this issue Oct 2, 2024 · 3 comments

Comments

@wxu-stripe

Environment

self-hosted (https://develop.sentry.dev/self-hosted/)

What are you trying to accomplish?

I'm trying to investigate a problem where our customers are occasionally unable to load certain issues. The problem is transient, but when it occurs, the same issues are affected each time. In our logs, we see the following errors:

2024-10-02 17:01:43,284 Error submitting packet, dropping the packet and closing the socket
2024-10-02 17:02:13,294 Error running query: SELECT (project_id AS _snuba_project_id), (release AS `_snuba_tags[sentry:release]`), (count() AS _snuba_times_seen), (min((timestamp AS _snuba_timestamp)) AS _snuba_first_seen), (max(_snuba_timestamp) AS _snuba_last_seen) FROM errors_local FINAL PREWHERE in(_snuba_project_id, tuple(2, 3, 7, 12, 1039, 1040, 537, 1054, 547, 551, 1064, 1066, 1071, 561, 564, 1077, 1076, 1083, 574, 63, 1099, 1104, 1105, 1112, 90, 1116, 608, 1131, 621, 1135, 624, 1139, 629, 1146, 126, 644, 645, 646, 647, 1162, 1163, 1164, 651, 654, 656, 659, 155, 157, 1183, 161, 676, 680, 175, 1199, 1204, 699, 189, 1215, 193, 710, 1224, 1227, 212, 213, 219, 1243, 223, 741, 1258, 235, 236, 752, 1264, 764, 1276, 256, 773, 267, 268, 272, 784, 787, 277, 1305, 285, 798, 800, 1312, 1315, 295, 307, 309, 313, 314, 1339, 834, 844, 332, 348, 861, 1372, 863, 862, 864, 354, 355, 1381, 869, 871, 872, 369, 372, 373, 885, 890, 891, 381, 896, 901, 902, 904, 395, 909, 910, 911, 404, 408, 924, 420, 436, 445, 446, 961, 962, 449, 972, 464, 466, 981, 474, 990, 992, 993, 996, 491, 494, 1010, 1013, 1023)) WHERE equals(deleted, 0) AND greaterOrEquals(_snuba_timestamp, toDateTime('2024-08-31T01:16:00', 'Universal')) AND less(_snuba_timestamp, toDateTime('2024-10-02T17:01:44', 'Universal')) AND in(`_snuba_tags[sentry:release]`, tuple('358fab634f1e016657937d3f27a98641bea33bda', '6a134099338d69726988f0f6530a4437985d6ccd')) AND 1 AND in(_snuba_project_id, tuple(267, 1112, 272, 1315, 773, 1135, 1023, 645, 861, 314, 1183, 491, 911, 996, 309, 436, 646, 863, 348, 910, 236, 408, 680, 547, 268, 764, 784, 1064, 961, 798, 1204, 189, 161, 235, 420, 659, 561, 676, 990, 1372, 1258, 741, 1066, 551, 710, 564, 644, 3, 1243, 212, 1116, 800, 981, 404, 1227, 1077, 307, 1162, 834, 1164, 574, 992, 624, 1305, 1312, 972, 466, 63, 90, 844, 904, 1039, 395, 295, 213, 891, 537, 445, 446, 1163, 1071, 1224, 223, 373, 219, 175, 1199, 285, 752, 464, 1013, 862, 277, 494, 155, 608, 1104, 1076, 2, 699, 372, 369, 126, 
871, 993, 909, 1264, 629, 902, 1146, 924, 1215, 872, 355, 157, 1054, 1139, 1083, 7, 1381, 896, 651, 381, 313, 656, 647, 354, 193, 12, 869, 864, 787, 1099, 1040, 890, 474, 1105, 901, 621, 256, 654, 332, 1131, 885, 962, 449, 1276, 1010, 1339)) GROUP BY _snuba_project_id, `_snuba_tags[sentry:release]` ORDER BY _snuba_times_seen DESC LIMIT 1000 OFFSET 0
timed out waiting for value
Traceback (most recent call last):
File "./snuba/state/cache/redis/backend.py", line 160, in get_readthrough
value = self.__executor.submit(function).result(task_timeout)
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 446, in result
raise TimeoutError()
concurrent.futures._base.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./snuba/web/db_query.py", line 426, in raw_query
result = execute_query_strategy(
File "./snuba/util.py", line 264, in wrapper
return func(*args, **kwargs)
File "./snuba/web/db_query.py", line 359, in execute_query_with_readthrough_caching
return cache.get_readthrough(
File "./snuba/state/cache/redis/backend.py", line 165, in get_readthrough
raise TimeoutError("timed out waiting for value") from error
TimeoutError: timed out waiting for value
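
For readers unfamiliar with the chained traceback above, here is a minimal sketch of the pattern it shows: a call is submitted to an executor, the caller waits on the future with a timeout, and the resulting concurrent.futures timeout is re-raised as a built-in TimeoutError with the original as its cause. This is not Snuba's code. The function name get_readthrough and the message string are copied from the traceback, while slow_query and the timeout values are assumptions made up for illustration.

```python
import concurrent.futures
import time


def get_readthrough(function, task_timeout):
    """Run `function` in a worker thread, waiting at most `task_timeout` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(function).result(task_timeout)
    except concurrent.futures.TimeoutError as error:
        # Chaining with `from` produces the "direct cause" line in the traceback.
        raise TimeoutError("timed out waiting for value") from error
    finally:
        # Do not block on the still-running worker; it finishes in the background.
        executor.shutdown(wait=False)


def slow_query():
    """Hypothetical stand-in for a query that takes longer than the timeout."""
    time.sleep(0.5)
    return "rows"
```

Calling get_readthrough(slow_query, 0.05) raises the chained TimeoutError, even though the worker itself is still running and has not failed.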

How are you getting stuck?

We think the problem is with ClickHouse, as we see the following error: Error on clickhouse-srv-http.service.consul:9000 ping: Unexpected EOF while reading bytes. Looking at the code:

elif result[0] == RESULT_EXECUTE:

it looks like when we see this issue with ClickHouse, anything not in our Redis cache fails to load.

I'm wondering if y'all have any ideas on what could be wrong here? From a resource usage standpoint, I don't see any abnormal CPU/memory/disk usage from ClickHouse.

Where in the product are you?

Issues

Link

No response

DSN

No response

Version

21.4.1

@getsantry
Contributor

getsantry bot commented Oct 2, 2024

Auto-routing to @getsentry/product-owners-issues for triage ⏲️

@roggenkemper roggenkemper transferred this issue from getsentry/sentry Oct 2, 2024
@aldy505
Contributor

aldy505 commented Oct 16, 2024

@hubertdeng123 Should be transferred to snuba

@hubertdeng123 hubertdeng123 transferred this issue from getsentry/self-hosted Nov 14, 2024
@evanh
Member

evanh commented Dec 9, 2024

This is a bug we have been struggling with as well. We haven't been able to figure out the root issue yet. The actual problem flow appears to be:

  1. A query comes to Snuba, and Snuba starts running the query on ClickHouse (CH)
  2. A duplicate query comes to Snuba. The readthrough cache puts the duplicate into a waiting thread, which will wait for the first query to return and write the result to the cache
  3. The first query never writes a value (error or otherwise) to the cache
  4. The duplicate query times out and returns this error.

If this is happening a lot for you, you can disable the readthrough cache with the snuba setting randomize_query_id=True.
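
The four-step flow described above can be reproduced with a toy model. This is a sketch under stated assumptions, not Snuba's implementation: ToyReadthroughCache, hung_query, and demo are hypothetical names, and an in-process dict plus threading.Event stands in for the Redis backend. The last part of the demo assumes that the cache is keyed by query id, so a random id per query means no duplicate ever waits on another query's result.

```python
import threading
import uuid


class ToyReadthroughCache:
    """Toy model: one in-flight computation per key; duplicates wait for it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}
        self._pending = {}  # key -> Event, set once the first caller writes a value

    def get_readthrough(self, key, function, timeout):
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._pending.get(key)
            is_first = event is None
            if is_first:
                event = self._pending[key] = threading.Event()
        if is_first:
            value = function()  # if this never returns, every waiter is stuck
            with self._lock:
                self._values[key] = value
                del self._pending[key]
            event.set()
            return value
        # Duplicate query: wait for the first caller to write a value.
        if not event.wait(timeout):
            raise TimeoutError("timed out waiting for value")
        with self._lock:
            return self._values[key]


def demo():
    cache = ToyReadthroughCache()
    started, release = threading.Event(), threading.Event()

    def hung_query():
        started.set()
        release.wait()  # simulates a first query that never writes to the cache
        return "late"

    first = threading.Thread(
        target=cache.get_readthrough, args=("q1", hung_query, 1.0), daemon=True
    )
    first.start()
    started.wait()

    try:
        cache.get_readthrough("q1", lambda: "rows", 0.1)
        outcome = "ok"
    except TimeoutError:
        outcome = "timeout"

    # Assumed effect of a random id per query: the key is unique, so the
    # query runs itself instead of waiting on the hung one.
    randomized = cache.get_readthrough(uuid.uuid4().hex, lambda: "rows", 0.1)

    release.set()
    first.join()
    return outcome, randomized
```

In this model the duplicate of "q1" times out while the query with a random key returns immediately, at the cost of never sharing results between identical queries.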
