This FAQ consolidates helpful troubleshooting steps and answers to some common questions that Loggregator has received, so that operators can quickly diagnose Loggregator-related issues.
- TODO: How do I enable syslog forwarding for a job?
- TODO: How can I debug my Loggregator components?
- How do I get etcd data when it is in TLS mode?
- How do I disable UAA for Traffic Controller?
- What do the Doppler properties mean?
- What do the Metron properties mean?
- What do the Traffic Controller properties mean?
- Why is the DEA Logging Agent run as root?
- Why do I get this `can't forward message: loggregator client pool is empty` error?
Loggregator is a complex subcomponent of Cloud Foundry with many components of its own. The sections below describe how to troubleshoot Loggregator in case you are having problems seeing your logs.
Rough thoughts/ideas for further expansion. Topics to expand:
- Datadog
  - visualize metrics
  - Datadog Firehose Nozzle
  - Datadog Config OSS
- Number of connections opened by component

  ```
  lsof -c doppler-
  lsof -c trafficco...
  ```

- Pprof
  - Add SHA or release version from when this feature will be provided

  ```
  curl http://<IP>:{6060|6061}/debug/pprof/
  go tool pprof http://<IP>:{6060|6061}/debug/pprof/heap
  ```

  - Memory dump, goroutine dump, CPU profile
- Goroutine dump (see the sketch after this list)
  - SIGUSR1 signal to process
  - `--debug` flag to the process - not efficient because it requires a process restart
- Calls to CC and UAA are timing out (see the sketch after this list)
  - Check the access log in GoRouter to see if the requests to CC and UAA are making it through. If you don't see them, it could be an IaaS issue. (TODO: provide an AWS example.) Solution: switch from a NAT gateway to a NAT instance in AWS.
- etcd (see the health-check sketch after this list)
  - Check if Dopplers are advertising and Metrons are listening
  - Check the health of the etcd cluster
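
For the goroutine dump note above, a minimal sketch of triggering one with SIGUSR1 from a Doppler VM; the process name matched by `pgrep` and the assumption that the dump is written to the component's log under `/var/vcap/sys/log` are ours, not stated on this page:

```
# ask the running doppler process for a goroutine dump via SIGUSR1
kill -USR1 "$(pgrep doppler)"

# the dump should land in the component's log (path is an assumption)
tail -n 100 /var/vcap/sys/log/doppler/doppler.stdout.log
```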
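
For the CC/UAA timeout note above, a hedged sketch of checking the GoRouter access log; the router job name and log path are assumptions about a typical cf-release deployment:

```
# on a router VM, look for requests routed to CC (api.*) and UAA (uaa.*)
bosh ssh router_z1/0
grep -E 'api\.|uaa\.' /var/vcap/sys/log/gorouter/access.log | tail -n 20
```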
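
For the etcd checks above, a sketch of verifying cluster health with the standard etcd v2 tooling, assuming a non-TLS etcd (see the TLS section below for the certificate flags):

```
# from an etcd VM: check member health and leader statistics
/var/vcap/packages/etcd/etcdctl -C http://<your_etcd_ip>:4001 cluster-health
curl http://<your_etcd_ip>:4001/v2/stats/leader
```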
## How do I get etcd data when it is in TLS mode?

If your CF environment has etcd deployed in TLS mode, you can no longer simply curl the data out. Here are a few steps to get the data out for troubleshooting.
- SSH onto an etcd VM and change into the etcd package directory:

  ```
  bosh ssh etcd_z1/0
  cd /var/vcap/packages/etcd/
  ```

- To get the list of available keys:

  ```
  ./etcdctl \
  --cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
  --key-file /var/vcap/jobs/etcd/config/certs/client.key \
  --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
  -C https://etcd-z1-0.cf-etcd.service.cf.internal:4001 \
  ls doppler/meta --recursive
  ```

  You should see output similar to the output below:

  ```
  /doppler/meta/z1
  /doppler/meta/z1/doppler_z1
  /doppler/meta/z1/doppler_z1/e27e8ab6-e29c-446d-a0dd-c692c7d16dd1
  /doppler/meta/z1/doppler_z1/63af35d8-d233-422f-a389-e893f4d5b7ee
  /doppler/meta/z1/doppler_z1/3a45b944-24dc-4563-bbae-fc53d5bacc43
  /doppler/meta/z1/doppler_z1/51737ccd-5e14-4439-8dd1-c0e3ce2aca56
  ```
- To get the value of a key:

  ```
  ./etcdctl \
  --cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
  --key-file /var/vcap/jobs/etcd/config/certs/client.key \
  --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
  -C https://etcd-z1-0.cf-etcd.service.cf.internal:4001 \
  get /doppler/meta/z1/doppler_z1/e27e8ab6-e29c-446d-a0dd-c692c7d16dd1
  ```
Note: the value `https://etcd-z1-0.cf-etcd.service.cf.internal:4001` can be found in the `EtcdUrls` property in the config files, for example `/var/vcap/jobs/doppler/config/doppler.json`.
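
To double-check which etcd URLs a component was rendered with, you can grep its config; the exact JSON layout around `EtcdUrls` is an assumption here:

```
# print the EtcdUrls entry (plus a few surrounding lines) from Doppler's rendered config
grep -A 3 '"EtcdUrls"' /var/vcap/jobs/doppler/config/doppler.json
```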
## How do I disable UAA for Traffic Controller?

Traffic Controller has a property in its spec called `traffic_controller.disable_access_control`. By default this is false. It is not read from a config file but rather passed in as a flag to the Traffic Controller process. See here.

Setting this property to true makes the logAccessAuthorizer and the adminAuthorizer always allow access to the app logs and the firehose. This feature was originally created so that Loggregator could be used in Lattice.
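
As an illustration only, a hypothetical deployment manifest excerpt that sets this property; the surrounding manifest structure is assumed, not taken from this page:

```
# hypothetical BOSH manifest excerpt (YAML)
properties:
  traffic_controller:
    disable_access_control: true
```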
## Why is the DEA Logging Agent run as root?

The DEA Logging Agent runs as root because it needs to read the stdout and stderr unix sockets that Warden creates for the jailed container application.
## Why do I get this `can't forward message: loggregator client pool is empty` error?

This error message shows up in the Metron logs when Metron has no registered Dopplers in its client pool. It could be that Metron or Doppler cannot communicate with their key-value store, etcd.
- Look for the error message `Failed to connect to etcd` in the logs.
- Verify you can access etcd.
- Verify the etcd URLs in the Metron config, `/var/vcap/jobs/metron_agent/config/metron_agent.json`.
- Curl etcd to see if Doppler has advertised itself correctly:
  ```
  # Old Doppler endpoint
  curl http://<your_etcd_ip>:<port/4001>/v2/keys/healthstatus/doppler?recursive=true

  # New Doppler endpoint
  curl http://<your_etcd_ip>:<port/4001>/v2/keys/doppler/meta?recursive=true
  ```
The older endpoint will contain just the Doppler IP. The newer endpoint will contain JSON that may look like this:

```
{ "version": 1, "endpoints": ["udp://<doppler_ip>:<port>", "tls://<doppler_ip>:<port>"] }
```

If you see values being populated in either of the endpoints, then your Doppler and Metron can both see etcd and read/write to it.
- Look at the etcd key that Doppler is advertising. It should have the following structure:

  ```
  # Old
  /healthstatus/doppler/<zone>/<job_name>/<index>
  # New
  /doppler/meta/<zone>/<job_name>/<index>
  ```

  Compare each of these properties to the config within Metron - they should match.
We have come across scenarios where Doppler was in a different zone and was advertising `zone1`, whereas Metron was configured with the property `"Zone": "zone2"`. This makes Metron look for a different key, so it is unable to find the Doppler IP and protocol. A quick comparison is sketched below.
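
A minimal sketch of that comparison, assuming a non-TLS etcd endpoint (use the certificate flags from the TLS section above otherwise):

```
# zone Metron is configured with
grep '"Zone"' /var/vcap/jobs/metron_agent/config/metron_agent.json

# zone in the key Doppler is actually advertising
curl http://<your_etcd_ip>:<port/4001>/v2/keys/doppler/meta?recursive=true
```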
We also came across a situation where etcd got into a weird state and its process needed to be restarted. The tracker story is here and should be resolved. Basically: `killall etcd`.
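
After the kill, a hedged way to confirm the process came back (assuming etcd is monit-managed on the VM, which is standard for BOSH jobs, and a non-TLS endpoint):

```
# monit should restart etcd automatically; confirm it is running and healthy again
sudo /var/vcap/bosh/bin/monit summary
/var/vcap/packages/etcd/etcdctl -C http://<your_etcd_ip>:4001 cluster-health
```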