Commit 17ae0f0

authored
docs: Troubleshooting readiness (#20838)
1 parent 8e18f10 commit 17ae0f0

File tree

1 file changed: +188 -4 lines changed


docs/sources/operations/troubleshooting/troubleshoot-operations.md

Lines changed: 188 additions & 4 deletions
@@ -1250,9 +1250,9 @@ The ring contains too many unhealthy instances to satisfy the replication factor
**Resolution:**

1. **Check the health of ring members**:

   Open a browser and navigate to http://localhost:3100/ring. You should see the Loki ring page.

   OR

   ```bash
@@ -1524,11 +1524,195 @@ After being disconnected from the memberlist cluster, the instance failed to rej

## Component readiness errors

Readiness errors occur when Loki components are not ready to serve requests. These errors are returned by the [`/ready` health check endpoint](http://localhost:3100/ready) and prevent load balancers from routing traffic to unready instances.
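
As a quick check, you can query the endpoint directly and inspect both the status code and the body. This sketch assumes the instance's HTTP server listens on localhost:3100:

```bash
# Print the readiness status code, then the response body.
# A ready instance returns 200; an unready one returns 503 with one
# of the messages documented in this section.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3100/ready
curl -s http://localhost:3100/ready
```
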

### Error: Application is stopping

**Error message:**

```text
Application is stopping
```

**Cause:**

Loki is shutting down and no longer accepting new requests. This is normal during graceful shutdown.

**Resolution:**

1. **Wait for the instance to restart** if this is a rolling update.
1. **Check if the shutdown is expected** (maintenance, scaling down).
1. **Review orchestrator logs** (Kubernetes, systemd) if the shutdown is unexpected.

**Properties:**

- Enforced by: Loki readiness handler
- Retryable: Yes (after restart)
- HTTP status: 503 Service Unavailable
- Configurable per tenant: No
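
If the shutdown is unexpected and Loki runs on Kubernetes, the pod's previous container state and recent events usually show why it was stopped. A minimal sketch, assuming a pod named `loki-0` in namespace `loki` (both names are placeholders for your deployment):

```bash
# Show why the previous container exited (for example OOMKilled)
# and any recent scheduling or probe events in the namespace.
kubectl -n loki describe pod loki-0 | grep -A 5 "Last State"
kubectl -n loki get events --sort-by=.lastTimestamp | tail -n 20
```
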
### Error: Some services are not running

**Error message:**

```text
Some services are not Running:
<state>: <count>
<state>: <count>
```

For example:

```text
Some services are not Running:
Starting: 1
Failed: 2
```

**Cause:**

One or more internal Loki services have failed to start or have stopped unexpectedly. The error message lists each service state with a count of services in that state.

**Resolution:**

1. **Check Loki logs** for errors from the listed services.
1. **Verify configuration** for the affected services.
1. **Check resource availability** (memory, disk, CPU).
1. **Restart the instance** if services are stuck.

**Properties:**

- Enforced by: Loki service manager
- Retryable: Yes (after services recover)
- HTTP status: 503 Service Unavailable
- Configurable per tenant: No
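
Because the readiness body lists one `<state>: <count>` pair per line, it is easy to summarize which states are blocking readiness. A minimal sketch over a saved response (the example body below is hypothetical):

```bash
# Parse a saved readiness response and print each blocking state.
body='Some services are not Running:
Starting: 1
Failed: 2'
echo "$body" | awk -F': ' 'NR > 1 { print $1 " -> " $2 " service(s)" }'
```

The per-service detail behind this summary is available from the instance's `/services` endpoint, which lists each internal service and its current state.
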
### Error: Ingester not ready

**Error message:**

```text
Ingester not ready: <details>
```

When the ingester's own state check fails, `<details>` contains the ingester state, giving the full message:

```text
Ingester not ready: ingester not ready: <state>
```

Where `<state>` is the service state, for example `Starting`, `Stopping`, or `Failed`.

**Cause:**

The ingester is not in a ready state to accept writes or serve reads. The detail message indicates the specific reason, such as:

- The ingester is still starting up and joining the ring (`Starting`)
- The lifecycler is not ready (lifecycler error text)
- The ingester is waiting for the minimum ready duration after joining the ring

**Resolution:**

1. **Wait for startup to complete** - ingesters take time to join the ring and become ready.
1. **Check ring membership**:

   ```bash
   curl -s http://ingester:3100/ring
   ```

1. **Review logs** for startup errors.
1. **Adjust the minimum ready duration** if startup is too slow:

   ```yaml
   ingester:
     lifecycler:
       min_ready_duration: 15s
   ```

**Properties:**

- Enforced by: Ingester readiness check
- Retryable: Yes (after ingester becomes ready)
- HTTP status: 503 Service Unavailable
- Configurable per tenant: No
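
Orchestrators should route traffic based on the same check. As a sketch, a Kubernetes readiness probe for an ingester container could point at `/ready`; the port and timings below are illustrative values, not defaults:

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 3100
  initialDelaySeconds: 15
  periodSeconds: 10
```

With such a probe in place, an ingester that is still joining the ring is kept out of the Service endpoints until `/ready` returns 200.
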
### Error: No queriers connected to query frontend

**Error message:**

```text
Query Frontend not ready: not ready: number of queriers connected to query-frontend is 0
```

**Cause:**

The query frontend has no querier workers connected. Without queriers, the frontend cannot process any queries. This typically occurs when:

- Queriers have not yet started
- Queriers cannot reach the frontend
- There are gRPC connectivity issues between the queriers and the frontend

**Resolution:**

1. **Check that queriers are running** and healthy.
1. **Verify querier configuration** points to the correct frontend address:

   ```yaml
   frontend_worker:
     frontend_address: query-frontend:9095
   ```

1. **Check gRPC connectivity** between queriers and the frontend:

   ```bash
   # Test gRPC port connectivity
   nc -zv query-frontend 9095
   ```

1. **Review querier logs** for connection errors.

**Properties:**

- Enforced by: Query frontend (v1) readiness check
- Retryable: Yes (after queriers connect)
- HTTP status: 503 Service Unavailable
- Configurable per tenant: No
1681+
### Error: No schedulers connected to frontend worker
1682+
1683+
**Error message:**
1684+
1685+
```text
1686+
Query Frontend not ready: not ready: number of schedulers this worker is connected to is 0
1687+
```
1688+
1689+
**Cause:**
1690+
1691+
The query frontend worker has no active connections to any query scheduler. This prevents the frontend from dispatching queries.
1692+
1693+
**Resolution:**
1694+
1695+
1. **Check that query schedulers are running** and healthy.
1696+
1. **Verify scheduler address configuration**:
1697+
1698+
```yaml
1699+
frontend_worker:
1700+
scheduler_address: query-scheduler:9095
1701+
```
1702+
1703+
1. **Check gRPC connectivity** between the frontend and schedulers.
1704+
1. **Review query scheduler logs** for errors.
1705+
1706+
**Properties:**
1707+
1708+
- Enforced by: Query frontend (v2) readiness check
1709+
- Retryable: Yes (after schedulers connect)
1710+
- HTTP status: 503 Service Unavailable
1711+
- Configurable per tenant: No
1712+
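
The gRPC connectivity check can be done the same way as for the querier-to-frontend path, assuming the scheduler listens for gRPC on port 9095:

```bash
# Test that the scheduler's gRPC port is reachable from the host
# running the query frontend.
nc -zv query-scheduler 9095
```
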
## gRPC and message size errors

<!-- Additional content in next PRs. Just leaving the headings here for context and so that I can keep things in order if PRs merge out of sequence. -->

## TLS and certificate errors
15341718
