Troubleshooting

TGNMS runs as a series of containers deployed inside a Docker Swarm. To diagnose and debug issues inside the Swarm, SSH access to the hosts running the NMS is necessary.

Make sure you are logged into one of the Swarm hosts before running any of the steps below.

# example: ssh root@192.168.1.100
$ ssh <a swarm host>

Common Troubleshooting Steps

Find which services are running and which are broken. The docker binary is included in the installation of the NMS.

$ docker service ls
# Sample output, yours may look different
ID             NAME                              MODE         REPLICAS   IMAGE                                                                                          PORTS
y8plg8c1p9sr   chihaya_chihaya                   replicated   1/1        quay.io/jzelinskie/chihaya:v2.0.0-rc.2
7dyluyqmde9e   database_db                       replicated   1/1        mysql:5
w517noxowsld   e2e-lab_f8_d_api_service          replicated   1/1        ghcr.io/terragraph/e2e-controller:latest
kyqgyw3wl83u   e2e-lab_f8_d_e2e_controller       replicated   1/1        ghcr.io/terragraph/e2e-controller:latest
kof22ttbk9u7   e2e-lab_f8_d_nms_aggregator       replicated   1/1        ghcr.io/terragraph/e2e-controller:latest
snbdjn3loeh0   e2e-lab_f8_d_stats_agent          replicated   1/1        ghcr.io/terragraph/e2e-controller:latest
e8qnow4dx596   efk_elasticsearch                 global       3/3        docker.elastic.co/elasticsearch/elasticsearch:7.4.0
h31spsfpml2u   efk_es_exporter                   replicated   1/1        justwatch/elasticsearch_exporter:1.0.2
mal90x8raeu4   efk_fluentd                       replicated   1/1        ghcr.io/terragraph/fluentd:stable
zwe4iai65an7   efk_kibana                        replicated   1/1        docker.elastic.co/kibana/kibana:7.4.0
mh62145b6985   kafka_kafka                       global       3/3        ghcr.io/terragraph/kafka:stable
xvpe9v0j9i68   kafka_zoo1                        replicated   1/1        zookeeper:latest
qy4vaolmq064   kafka_zoo2                        replicated   1/1        zookeeper:latest
9ln8ld38gx85   kafka_zoo3                        replicated   1/1        zookeeper:latest
p27dsw42pd3z   keycloak_keycloak                 replicated   1/1        jboss/keycloak:7.0.0
yn6nfzh6n9pr   monitoring_cadvisor               global       3/3        google/cadvisor:latest
srq0xdlooff1   msa_analytics                     replicated   1/1        ghcr.io/terragraph/analytics:rc
uw2c97t89gsh   msa_default_routes_service        replicated   1/1        ghcr.io/terragraph/default_routes_service:rc
law793veyulo   msa_network_test                  replicated   1/1        ghcr.io/terragraph/network_test:rc
zp07zhcf30zx   msa_scan_service                  replicated   1/1        ghcr.io/terragraph/scan_service:rc
qxaih3zv3ila   msa_topology_service              replicated   1/1        ghcr.io/terragraph/topology_service:rc
njqhnfwqae1q   msa_weather_service               replicated   1/1        ghcr.io/terragraph/weather_service:rc
wkey83a0arce   nms_docs                          replicated   1/1        ghcr.io/terragraph/nms_docs:rc
w97mx4cb37oq   nms_grafana                       replicated   1/1        grafana/grafana:latest
fzepjjsm7wa0   nms_jupyter                       replicated   1/1        jupyter/scipy-notebook:latest
ulwa7g607ojl   nms_nms                           replicated   1/1        ghcr.io/terragraph/nmsv2:rc
daaczr27inz6   stats_alertmanager                replicated   1/1        prom/alertmanager:latest
3vxhh207gcp8   stats_alertmanager_configurer     replicated   1/1        facebookincubator/alertmanager-configurer:1.0.1
p9mqpdft4qpz   stats_prometheus                  replicated   1/1        prom/prometheus:latest
yqxe9vmb71yy   stats_prometheus_cache            replicated   1/1        facebookincubator/prometheus-edge-hub:1.1.0
taz5i5dyuhft   stats_prometheus_configurer       replicated   1/1        facebookincubator/prometheus-configurer:1.0.1
v3ovbudhpuho   stats_query_service               replicated   1/1        ghcr.io/terragraph/cpp_backends:rc
ljvv6z9m9y80   tg-alarms_alarms                  replicated   1/1        ghcr.io/terragraph/tg-alarms:rc

If a service does not show n/n under REPLICAS, it is likely having problems. To investigate further, check the service logs.

$ docker service logs nms_nms