==============
Stats Pipeline
==============

Summary
==============

.. figure:: _static/stats_diagram_highlevel.png
   :alt: Architecture diagram

The high-level flow is: Nodes → Kafka ← Query Service → Prometheus Cache ← Prometheus. Each arrow points away from the component that initiates the connection: nodes push to Kafka, Query Service pulls from Kafka and pushes to the Prometheus Cache, and Prometheus scrapes the cache.

Nodes
--------------

Nodes push stats to Kafka using the stats and hf_stats topics. Traffic is routed through the POP to the VMs hosting the Kafka services. Node-to-Kafka communication can be configured in one of two ways; in either case, the external IP/port of the Kafka VM must be used.

1. Using the NMS:

   .. figure:: _static/stats_kafka_quicksettings.png
      :alt: Kafka quicksettings

2. Manually on the node, through statsAgentParams in node_config.json::

      {
        "statsAgentParams": {
          "endpointParams": {
            "kafkaParams": {
              "config": {
                "brokerEndpointList": "PLAINTEXT://[2620:10d:c0bf::2000]:9093"
              }
            }
          }
        }
      }

Kafka
--------------

Kafka is used as a durable queue for stats, ensuring that no stats are dropped or processed more than once.

.. figure:: _static/stats_kafka_kafdrop.png

   A Kafka UI is hosted within the cluster at /kafdrop. This screenshot shows the default set of topics.

The main topics relevant to the stats pipeline are:

:stats: Low-frequency stats, sent by nodes every 30s by default
:hf_stats: High-frequency stats, sent by nodes every 1s by default
:link_stats: Link-specific stats forwarded from the stats topic by Query Service for calculating link health

Query Service
--------------

Query Service consumes node stats from Kafka and converts them to Prometheus metrics. It adds metadata from the topologies configured in the NMS before forwarding the metrics to the Prometheus Cache. Query Service runs in a loop with two primary tasks:

* Cache all of the topologies specified in MySQL by querying API-Service
* Transform AggrStats from Kafka into Prometheus metrics, adding metadata from the topology of the node referenced in each stat

Query Service reads node stats from the *stats* and *hf_stats* Kafka topics and first forwards link-metric stats onto the link_stats topic. The set of link-specific stats is determined by comparing each stat's key against the metric names stored in the link_metric table of the cxl database. Next, it formats the stats as Prometheus metrics. Stats arriving from the nodes are formatted as::

    # Aggregator.thrift::AggrStat
    {
      "key": "tgf.04:ce:14:fe:a5:3b.staPkt.rxOk",
      "timestamp": 1617215039,
      "value": 6442,
      "isCounter": false,
      "entity": "04:ce:14:fe:a5:7a"
    }

Query Service uses the MAC address stored in the "entity" field to look up the corresponding node in the NMS's topology. It then creates a Prometheus metric with the following format::

    metric_name{network="", ...} value

The labels present on all metrics:

* network - Name of the topology
* nodeMac - MAC address of the node
* radioMac - MAC address of the radio sector on the node
* nodeName - Name of the node, stored in the topology
* pop - Whether the node is a POP node
* siteName - Name of the site the node is at, stored in the topology
* intervalSec - The interval at which the stats were pushed to Kafka

The following labels are added if the metric is a link metric:

* linkName - Name of the link, stored in the topology
* linkDirection - A or Z; distinguishes which of the two sides of the link this metric originated from

The metric is then pushed to the Prometheus Cache. A sketch of this transformation appears after the Prometheus Cache description below.

Prometheus Cache
----------------

Since Prometheus utilizes a pull model, Query Service pushes stats to the Prometheus Cache, which is then scraped by Prometheus.
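To make the transformation concrete, the following is a minimal Python sketch of the stat-to-metric conversion described above. It is illustrative only: the real Query Service is a C++ service, and ``LINK_METRIC_NAMES``, ``TOPOLOGY``, and the metric-naming scheme below are stand-ins for the link_metric table, the cached API-Service topologies, and Query Service's actual naming rules::

    # Hypothetical stand-in for the link_metric table of the cxl database.
    LINK_METRIC_NAMES = {"stapkt.rxok"}

    # Hypothetical topology cache: node MAC -> metadata fetched via API-Service.
    TOPOLOGY = {
        "04:ce:14:fe:a5:7a": {
            "network": "test-network",
            "nodeName": "node-1",
            "pop": False,
            "siteName": "site-A",
        },
    }

    def to_prometheus(stat, interval_sec):
        """Render one AggrStat as a Prometheus exposition-format line."""
        # Keys look like "tgf.<radioMac>.<metricName>"; MACs use ':' as a
        # separator, so splitting on '.' cleanly isolates the radio MAC.
        _, radio_mac, *name_parts = stat["key"].split(".")
        short_name = ".".join(name_parts)

        node = TOPOLOGY[stat["entity"]]  # "entity" holds the node's MAC
        labels = {
            "network": node["network"],
            "nodeMac": stat["entity"],
            "radioMac": radio_mac,
            "nodeName": node["nodeName"],
            "pop": str(node["pop"]).lower(),
            "siteName": node["siteName"],
            "intervalSec": str(interval_sec),
        }
        # The real service would also detect link metrics here (by membership
        # in LINK_METRIC_NAMES), forward them to the link_stats topic, and
        # attach the linkName/linkDirection labels.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        metric_name = short_name.replace(".", "_").lower()
        return f"{metric_name}{{{label_str}}} {stat['value']}"

    stat = {
        "key": "tgf.04:ce:14:fe:a5:3b.staPkt.rxOk",
        "timestamp": 1617215039,
        "value": 6442,
        "isCounter": False,
        "entity": "04:ce:14:fe:a5:7a",
    }
    print(to_prometheus(stat, 30))
    # stapkt_rxok{intervalSec="30",...,radioMac="04:ce:14:fe:a5:3b"} 6442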
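The document only states that metrics are "pushed" to the cache. Assuming the cache accepts Prometheus exposition-format text over HTTP (the host, port, and path below are pure placeholders, not a documented API), the push step could look like this sketch::

    # Hypothetical push of exposition-format lines to the Prometheus Cache.
    # The URL is an assumption about the deployment, not the cache's real API.
    import requests

    lines = [
        'stapkt_rxok{network="test-network",nodeMac="04:ce:14:fe:a5:7a"} 6442',
    ]
    resp = requests.post(
        "http://prometheus_cache:9091/metrics",  # placeholder endpoint
        data="\n".join(lines) + "\n",
        timeout=5,
    )
    resp.raise_for_status()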
Prometheus
--------------

Prometheus is the time-series database which powers the TGNMS UI, Grafana, and various other services. The basic Prometheus GUI can be accessed at /prometheus.

Troubleshooting
===============

This section contains a series of troubleshooting steps. The main user-visible problem is a lack of MCS/SNR overlays in the NMS; this can be further confirmed by querying for the mcs stat in Prometheus.

A Bash function to shorten the commands::

    # Swarm only: look up the local Docker container name for a swarm service
    function svcname() { docker ps --format '{{.Names}}' --filter "label=com.docker.swarm.service.name=$1"; }

Nodes Troubleshooting
---------------------

Important log files:

* /var/log/stats_agent/current
* /var/log/e2e_minion/current

Check if the node can ping the E2E VM::

    ping6 2620:10d:c0bf::2000

Check if there is a route through the POP::

    # VPP routing
    vppctl show ip6 fib

    # Kernel routing
    ip -6 route show

Check if the node has Kafka configured correctly::

    cat /data/cfg/node_config.json | grep broker

Check if the node has an IPv6 address on the loopback interface::

    ip -6 address show dev lo

Check if the E2E IP/port combination is correct::

    cat /data/cfg/node_config.json | grep e2e-ctrl-url

Kafka Troubleshooting
---------------------

Check if messages are coming through on the stats or hf_stats topics::

    # K8s
    kubectl exec statefulset/kafka -- kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic hf_stats

    # Swarm
    docker exec -it $(svcname kafka_kafka) kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic hf_stats

Check if there are any consumer groups for the stats/hf_stats topics::

    # K8s
    kubectl exec statefulset/kafka -- kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --all-groups --all-topics --describe

    # Swarm
    docker exec -it $(svcname kafka_kafka) kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --all-groups --all-topics --describe

A healthy example of consumer groups::

    GROUP                 TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID           HOST                               CLIENT-ID
    health_service        link_stats  0          15335760        15335984        224  rdkafka-...           /fd00::fb33                        rdkafka

    GROUP                 TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID           HOST                               CLIENT-ID
    qs_node_stats_reader  stats       0          22770478        22770478        0    rdkafka-...           /fd00::fb33                        rdkafka
    qs_node_stats_reader  hf_stats    0          14910508        14910508        0    rdkafka-...           /fd00::fb33                        rdkafka

    GROUP                 TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID           HOST                               CLIENT-ID
    tg-alarm-service      events      0          6681            6681            0    tg-alarm-service-...  /fd00:cdee:0:0:fd43:e24:a84c:2b1a  tg-alarm-service-...

A small or steady LAG is normal; LAG that grows without bound means the consumer has stalled.

Check if Kafka is binding properly::

    # An example of the K8s pod's environment variables
    - name: KAFKA_ADVERTISED_LISTENERS
      value: "INSIDE://:9092,OUTSIDE://[2620:10d:c0bf::2000]:9093"
    - name: KAFKA_LISTENERS
      value: "INSIDE://:9092,OUTSIDE://:9093"

Download the Kafka debugging tools, following the instructions on https://kafka.apache.org/documentation, and then try to consume the Kafka topics from outside the Swarm/K8s cluster. This will tell you if Kafka is binding properly externally; a scripted alternative is sketched below.
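As a scripted alternative to the Kafka CLI tools, the topics can be consumed from outside the cluster with any Kafka client library. The sketch below uses the confluent-kafka Python package (the same librdkafka client that appears as ``rdkafka`` in the consumer-group listing above); the broker address is the external listener from the examples in this document, and the group ID is an arbitrary throwaway::

    # External connectivity check: consume a few messages from hf_stats using
    # confluent-kafka (pip install confluent-kafka). Run from OUTSIDE the
    # Swarm/K8s cluster to exercise the external advertised listener.
    from confluent_kafka import Consumer

    consumer = Consumer({
        # Must be the *external* advertised listener (OUTSIDE://...:9093).
        "bootstrap.servers": "[2620:10d:c0bf::2000]:9093",
        "group.id": "external-debug",   # arbitrary throwaway group
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["hf_stats"])

    try:
        for _ in range(10):
            msg = consumer.poll(5.0)
            if msg is None:
                print("no message within 5s -- check listeners/routing")
                continue
            if msg.error():
                print("consumer error:", msg.error())
                continue
            print(msg.topic(), msg.value()[:120])
    finally:
        consumer.close()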
Check if Zookeeper is healthy::

    # K8s
    kubectl logs -f statefulset/zookeeper

Check if Kafka is exposed properly::

    # K8s
    kubectl logs -f daemonset/nginx
    kubectl get configmap/stream-conf | grep kafka

    # Swarm
    docker service inspect kafka_kafka | grep -i port

Query Service Troubleshooting
-----------------------------

Check if Query Service can reach api-service::

    # K8s
    kubectl logs deploy/queryservice | grep CurlUtil

    # Example error
    CurlUtil.cpp:74] CURL request failed for http://e2e-test-network:8080/api/getTopology: Timeout was reached

Check the topology file in api-service::

    # K8s or Swarm
    cat /opt/terragraph/gfs/e2e/test-network/e2e_topology.conf

Does it have a "name"? Are there links, nodes, and sites?

Prometheus Troubleshooting
--------------------------

Check if the scrape targets are up by viewing /prometheus/targets in the browser:

.. figure:: _static/stats_prometheus_targets.png
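Stats flow can also be confirmed programmatically by querying the standard Prometheus HTTP API for the mcs stat mentioned at the start of this Troubleshooting section. A minimal Python sketch, assuming Prometheus is reachable at ``http://nms-host/prometheus`` (the host name is a placeholder for your deployment)::

    # Query the Prometheus HTTP API (/api/v1/query) for recent "mcs" samples.
    # BASE is a placeholder; adjust it to wherever /prometheus is exposed.
    import requests

    BASE = "http://nms-host/prometheus"
    resp = requests.get(f"{BASE}/api/v1/query", params={"query": "mcs"}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]

    if not result:
        print("No mcs samples -- work backwards: Prometheus Cache, Query Service, Kafka, nodes.")
    for series in result[:5]:
        labels = series["metric"]
        timestamp, value = series["value"]
        print(labels.get("linkName"), labels.get("linkDirection"), value)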