
Frequent timeouts from TDR #7452

@dsotirho-ucsc

MaxRetryError caused by ReadTimeoutError for https://data.terra.bio/api/repository/v1/snapshots/roleMap.

A similar error has been observed on the dev deployments during requests for https://jade.datarepo-dev.broadinstitute.org/api/repository/v1/snapshots/roleMap.

Over the last 7 days, this error (timeout) and #7448 (hangup) have occurred the following number of times per deployment:

deployment  timeouts  hangups
dev               18        7
anvildev          23       12
prod               2      722
anvilprod          3       59

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/apigateway/azul-service-prod, /aws/apigateway/azul-indexer-prod
start-time: 2025-09-29T11:43:09.009Z
end-time: 2025-09-30T03:17:04.792Z
query-string:

fields @timestamp, httpMethod as method, identity_sourceIp as sourceIp, responseLatency as latency, identity_userAgent as userAgent
| filter @message like 'a83e42c2-4bb1-4ad8-b69a-224b15923fa2'
| parse path /(?<path_str>.*?)(\/[a-zA-Z0-9]+-[a-zA-Z0-9-]+)?(\/(lMQ|ksQ)[a-zA-Z0-9_=-]+)?$/ # chop off UUID & manifest hash from path
| display @timestamp, sourceIp, method, path_str, error_responseType, status, latency, userAgent, integration_requestId
| sort by @timestamp asc
| limit 1000
@timestamp sourceIp method path_str error_responseType status latency userAgent integration_requestId
2025-09-29 18:46:35.498 208.59.176.247 GET /fetch/repository/files - 503 5100 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.6 Safari/605.1.15 a83e42c2-4bb1-4ad8-b69a-224b15923fa2
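The `parse` step in the query above strips a trailing UUID and manifest hash from the request path, which is why `path_str` shows `/fetch/repository/files` rather than the full path seen in the Lambda log. A rough Python equivalent is sketched below. It assumes that Python's `re` module matches these paths the same way CloudWatch does, and the function name `strip_path` is made up for illustration.

```python
import re

# Python spelling of the pattern from the query's `parse` step. Python needs
# (?P<name>...) for named groups; the rest of the pattern is unchanged.
PATH_PATTERN = re.compile(
    r'(?P<path_str>.*?)'
    r'(/[a-zA-Z0-9]+-[a-zA-Z0-9-]+)?'  # optional trailing UUID-like segment
    r'(/(lMQ|ksQ)[a-zA-Z0-9_=-]+)?$'   # optional trailing manifest hash
)


def strip_path(path: str) -> str:
    """Return the path without a trailing UUID and/or manifest hash."""
    match = PATH_PATTERN.match(path)
    return path if match is None else match.group('path_str')


# The path below is the one from the Lambda log for this request.
print(strip_path('/fetch/repository/files/00867126-2924-4fbe-9166-cf11545ccf0d'))
```

Grouping by the stripped path lets requests for different files aggregate under one endpoint when counting errors.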

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-prod
start-time: 2025-09-29T16:10:54.440Z
end-time: 2025-09-29T21:44:54.824Z
query-string:

fields @timestamp, @message
| filter @message like /a83e42c2/
| sort @timestamp asc
| limit 1000
@timestamp @message
2025-09-29 18:46:35.522 START RequestId: a83e42c2-4bb1-4ad8-b69a-224b15923fa2 Version: $LATEST
2025-09-29 18:46:35.524 [INFO] 2025-09-29T18:46:35.524Z a83e42c2-4bb1-4ad8-b69a-224b15923fa2 azul.chalice Received GET request for '/fetch/repository/files/00867126-2924-4fbe-9166-cf11545ccf0d', with {"query": {"catalog": "dcp52", "version": "2021-07-16T17:33:53.696000Z"}, "headers": {"accept": "*/*", "accept-encoding": "gzip, deflate, br", "accept-language": "en-US,en;q=0.9", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "6079", "cloudfront-viewer-country": "US", "host": "service.azul.data.humancellatlas.org", "origin": "https://explore.data.humancellatlas.org", "priority": "u=3, i", "referer": "https://explore.data.humancellatlas.org/", "sec-fetch-dest": "empty", "sec-fetch-mode": "cors", "sec-fetch-site": "same-site", "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.6 Safari/605.1.15", "via": "2.0 df50de0db91dfb2bbd3a11e8d0539c5c.cloudfront.net (CloudFront)", "x-amz-cf-id": "NDSIy9xHu2T-1DyhYGJfTBnCjWnIX4nnZtW43H1zlZgaj7G5DEbTcA==", "x-amzn-trace-id": "Root=1-68dad40b-7af019b711969bf7187bcd54", "x-forwarded-for": "208.59.176.247, 15.158.2.71", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
...
2025-09-29 18:46:35.582 [INFO] 2025-09-29T18:46:35.582Z a83e42c2-4bb1-4ad8-b69a-224b15923fa2 azul.terra Making GET request to 'https://data.terra.bio/api/repository/v1/snapshots/roleMap'
2025-09-29 18:46:35.583 [INFO] 2025-09-29T18:46:35.583Z a83e42c2-4bb1-4ad8-b69a-224b15923fa2 azul.terra … without a request body
2025-09-29 18:46:40.593 [WARNING] 2025-09-29T18:46:40.589Z a83e42c2-4bb1-4ad8-b69a-224b15923fa2 root Exception during request or response
Traceback (most recent call last):
  File "/opt/python/urllib3/connectionpool.py", line 468, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/opt/python/urllib3/connectionpool.py", line 463, in _make_request
    httplib_response = conn.getresponse()
                       ^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/http/client.py", line 1430, in getresponse
    response.begin()
  File "/var/lang/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/ssl.py", line 1251, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lang/lib/python3.12/ssl.py", line 1103, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/python/urllib3/connectionpool.py", line 716, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/opt/python/urllib3/connectionpool.py", line 470, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/opt/python/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='data.terra.bio', port=443): Read timed out. (read timeout=5.0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/task/azul/http.py", line 238, in urlopen
    response = super().urlopen(method,
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 45, in urlopen
    return self._inner.urlopen(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/google/auth/transport/urllib3.py", line 395, in urlopen
    response = self.http.urlopen(
               ^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 45, in urlopen
    return self._inner.urlopen(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 371, in urlopen
    response = super().urlopen(method, url, *args, retries=inner_retries, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 45, in urlopen
    return self._inner.urlopen(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 82, in urlopen
    response = super().urlopen(method, url, *args, body=body, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 45, in urlopen
    return self._inner.urlopen(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 107, in urlopen
    return super().urlopen(method, url, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/azul/http.py", line 45, in urlopen
    return self._inner.urlopen(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/python/urllib3/connectionpool.py", line 802, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/opt/python/urllib3/util/retry.py", line 594, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='data.terra.bio', port=443): Max retries exceeded with url: /api/repository/v1/snapshots/roleMap (Caused by ReadTimeoutError("HTTPSConnectionPool(host='data.terra.bio', port=443): Read timed out. (read timeout=5.0)"))
2025-09-29 18:46:40.594 [INFO] 2025-09-29T18:46:40.594Z a83e42c2-4bb1-4ad8-b69a-224b15923fa2 azul.chalice Returning 503 response with headers {"headers": {"Access-Control-Allow-Origin": "*", "Access-Control-Allow-Headers": "Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key", "Content-Security-Policy": "default-src 'self';img-src 'self' data:;script-src 'self';style-src 'self';frame-ancestors 'none';form-action 'self'", "Referrer-Policy": "strict-origin-when-cross-origin", "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload", "X-Content-Type-Options": "nosniff", "X-Frame-Options": "DENY", "X-XSS-Protection": "1; mode=block", "Cache-Control": "no-store"}}.
2025-09-29 18:46:40.594 [INFO] 2025-09-29T18:46:40.594Z a83e42c2-4bb1-4ad8-b69a-224b15923fa2 azul.chalice … with a response body of length 142 being {"Code": "ServiceUnavailableError", "Message": "No response from https://data.terra.bio/api/repository/v1/snapshots/roleMap within 5 seconds"}
2025-09-29 18:46:40.596 END RequestId: a83e42c2-4bb1-4ad8-b69a-224b15923fa2
2025-09-29 18:46:40.596 REPORT RequestId: a83e42c2-4bb1-4ad8-b69a-224b15923fa2 Duration: 5073.84 ms Billed Duration: 5074 ms Memory Size: 2048 MB Max Memory Used: 253 MB
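The log above shows the failure mode: the connection to TDR is established, no response bytes arrive within the 5 second read timeout, and the service answers with a 503. The sketch below reproduces that pattern locally using only the Python standard library. It is not Azul's `azul/http.py` or urllib3, and the helper names, the number of attempts and the short timeout are assumptions chosen for illustration.

```python
import http.client
import socket


def silent_listener() -> socket.socket:
    """Listen on an ephemeral local port but never accept or answer.

    The kernel completes the TCP handshake, so clients connect fine and
    then wait for a response that never comes.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))
    listener.listen(8)
    return listener


def get_or_503(host, port, path, timeout, attempts):
    """GET the path, giving up with a 503-style result on read timeouts.

    `attempts` is a hypothetical retry budget, not Azul's actual setting.
    """
    url = f'http://{host}:{port}{path}'
    for _ in range(attempts):
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request('GET', path)
            response = conn.getresponse()  # blocks until data or timeout
            return response.status, response.read().decode()
        except socket.timeout:
            continue  # read timed out; try again if attempts remain
        finally:
            conn.close()
    return 503, f'No response from {url} within {timeout} seconds'
```

Calling `get_or_503` against the address returned by `silent_listener().getsockname()` with a short timeout returns the 503 result once every attempt has timed out, mirroring the `ServiceUnavailableError` body in the log.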

Metadata

Labels

demo: [process] To be demonstrated at the end of the sprint
service: [subject] The service part of Azul
