docs/pages/deployment/monitoring.rst from nuts-foundation/nuts-node

.. _nuts-node-monitoring:

Monitoring
##########

Health checks
*************

Status
======

The status endpoint checks that the service has been started. It can be used as a ``readiness probe``.
It does not provide any information on the individual modules running as part of the executable.
The main goal of the endpoint is to give a yes/no answer to whether the service is running:

.. code-block:: text

    GET /status

Returns an "OK" response body with status code ``200``.

.. note::

    The provided Docker containers are configured to perform this healthcheck out of the box.
    However, if the internal endpoints port (:8081) has been changed, the healthcheck will fail and Docker will mark the container as unhealthy.
    Override the default healthcheck to solve this.
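
As an illustration, a readiness check against this endpoint could look like the Python sketch below.
The base URL ``http://localhost:8081`` and the helper names ``is_running`` and ``probe_status`` are assumptions made for this example; adjust the host and port to your deployment.

.. code-block:: python

    import urllib.request


    def is_running(status_code: int, body: str) -> bool:
        """Interpret a /status response: 200 with an "OK" body means running."""
        return status_code == 200 and body.strip() == "OK"


    def probe_status(base_url: str = "http://localhost:8081", timeout: float = 2.0) -> bool:
        """Return True if GET /status answers 200 "OK", False otherwise."""
        try:
            with urllib.request.urlopen(f"{base_url}/status", timeout=timeout) as resp:
                return is_running(resp.status, resp.read().decode("utf-8", "replace"))
        except OSError:
            # Covers refused connections, timeouts and HTTP error responses
            # (urllib.error.URLError and HTTPError are subclasses of OSError).
            return False

Any failure to reach the endpoint is treated as "not running", which matches the yes/no nature of the check.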

Health
======

The health endpoint provides more fine-grained health checks on the Nuts node. It can be used as a ``liveness probe``.
It reports in a format compatible with `Spring Boot's Health Actuator <https://docs.spring.io/spring-boot/docs/2.0.x/actuator-api/html/#health>`__.
The endpoint is available over HTTP:

.. code-block:: text

    GET /health

Each component in the health check can have one of the statuses ``UP``, ``UNKNOWN``, or ``DOWN``.
The overall status is determined by the worst component status, so if one component is ``DOWN``, the overall system status is ``DOWN``.
The overall system statuses ``UP`` and ``UNKNOWN`` map to HTTP status code ``200``, and status ``DOWN`` maps to status code ``503``.
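
The aggregation and mapping rules above can be sketched in Python as follows.
This is not the node's actual implementation: the function names are hypothetical, and the severity order ``UP`` < ``UNKNOWN`` < ``DOWN`` is an assumption, since the text only states that a single ``DOWN`` component makes the whole system ``DOWN``.

.. code-block:: python

    # Assumed severity order; only "DOWN wins" is stated in the documentation.
    SEVERITY = {"UP": 0, "UNKNOWN": 1, "DOWN": 2}


    def overall_status(component_statuses):
        """Return the most severe status of a non-empty list of component statuses."""
        return max(component_statuses, key=lambda status: SEVERITY[status])


    def http_status_code(status: str) -> int:
        """Map an overall status to the HTTP status code: DOWN is 503, otherwise 200."""
        return 503 if status == "DOWN" else 200

For example, ``overall_status(["UP", "DOWN", "UNKNOWN"])`` yields ``"DOWN"``, which maps to status code ``503``.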

Example response when all checks succeeded (formatted for readability):

.. code-block:: json

    {
      "status": "UP",
      "details": {
        "crypto.filesystem": {
          "status": "UP"
        },
        "network.auth_config": {
          "status": "UP",
          "details": "no node DID"
        },
        "network.tls": {
          "status": "UP"
        }
      }
    }

Example response when one or more checks failed:

.. code-block:: json

    {
      "status": "DOWN",
      "details": {
        "network.tls": {
          "status": "DOWN",
          "details": "x509: certificate signed by unknown authority"
        }
      }
    }
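
A monitoring script might extract the failing checks from such a response.
The Python sketch below assumes the response shape shown in the examples above; the function name ``failing_components`` is hypothetical.

.. code-block:: python

    import json


    def failing_components(health_json: str) -> dict:
        """Return a mapping of component name to details for every DOWN component."""
        payload = json.loads(health_json)
        return {
            name: check.get("details", "")
            for name, check in payload.get("details", {}).items()
            if check.get("status") == "DOWN"
        }

Applied to the failing example above, this returns a single entry for ``network.tls`` with the certificate error as its value.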


Basic diagnostics
*****************

.. code-block:: text

    GET /status/diagnostics

.. note::

    This page is intended to be read by humans, not machines.
    All entries except ``status`` relate to V5 functionality (gRPC network, VDRv1 and VCRv1 APIs).

Returns the status of the various services in ``yaml`` format:

.. code-block:: text

    network:
        connections:
            connected_peers:
                - id: d38c6df5-63d2-4b2c-87f4-2e8bbfa5612f
                  address: nuts.nl:5555
                  nodedid: did:nuts:abc123
            connected_peers_count: 1
        state:
            dag_xor: 6aada4464e380db16d0316e597956fcdaeada0e8f6023be82eeb9c798e1815c6
            stored_database_size_bytes: 106496005
            transaction_count: 9001
    vcr:
        credential_count: 7
        issuer:
            issued_credentials_count: 0
            revoked_credentials_count: 0
        verifier:
            revocations_count: 18
    vdr:
        did_documents_count: 5
        conflicted_did_documents:
            total_count: 2
            owned_count: 0
    status:
        git_commit: d36837bae48b780bfb76134e85b506472fc207a6
        os_arch: linux/amd64
        software_version: master
        uptime: 4h14m12s

If the request sets the ``Accept`` HTTP header to ``application/json``, the diagnostics are returned in JSON format.
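
A minimal Python sketch of such a request is shown below.
The base URL ``http://localhost:8081`` and the function names are assumptions made for this example.

.. code-block:: python

    import json
    import urllib.request


    def diagnostics_request(base_url: str = "http://localhost:8081") -> urllib.request.Request:
        """Build a GET request for the diagnostics that asks for JSON output."""
        return urllib.request.Request(
            f"{base_url}/status/diagnostics",
            headers={"Accept": "application/json"},
        )


    def fetch_diagnostics(base_url: str = "http://localhost:8081", timeout: float = 5.0) -> dict:
        """Fetch the diagnostics and decode the JSON response body."""
        with urllib.request.urlopen(diagnostics_request(base_url), timeout=timeout) as resp:
            return json.load(resp)

Building the request separately from sending it makes it easy to inspect the URL and headers without a running node.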

Explanation of ambiguous/complex entries in the diagnostics:

* ``vcr.credential_count`` holds the total number of credentials known to the node (public VCs, and private VCs issued to a DID on the local node)
* ``vcr.issuer.issued_credentials_count`` holds the total number of credentials issued by the local node
* ``vcr.issuer.revoked_credentials_count`` holds the total number of revoked credentials issued by the local node
* ``vcr.verifier.revocations_count`` holds the total number of revoked credentials (public and private VCs)
* ``vdr.conflicted_did_documents.total_count`` holds the total number of DID documents that are conflicted (have parallel updates). This may indicate a stolen private key
* ``vdr.conflicted_did_documents.owned_count`` holds the number of conflicted DID documents you control as a node owner

Note: the ``network`` and ``vdr`` entries only apply to ``did:nuts``.
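
As an example of acting on these entries, the Python sketch below reads the conflicted DID document counts from already decoded diagnostics.
It assumes that the JSON output uses the same key names as the YAML example above, which is not confirmed here; the function names are hypothetical.

.. code-block:: python

    def conflicted_documents(diagnostics: dict) -> tuple:
        """Return (total_count, owned_count) of conflicted DID documents."""
        conflicts = diagnostics.get("vdr", {}).get("conflicted_did_documents", {})
        return conflicts.get("total_count", 0), conflicts.get("owned_count", 0)


    def needs_attention(diagnostics: dict) -> bool:
        """Flag diagnostics in which the node owner controls a conflicted document."""
        _, owned_count = conflicted_documents(diagnostics)
        return owned_count > 0

With the example values above (``total_count: 2``, ``owned_count: 0``), ``needs_attention`` returns ``False``.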

Metrics
*******

The Nuts service executable has built-in support for **Prometheus**. Prometheus is a time-series database which supports a wide variety of services. It also allows for exporting metrics to different visualization solutions like **Grafana**. See https://prometheus.io/ for more information on how to run Prometheus. The metrics are exposed at ``/metrics``.

Configuration
=============

For Prometheus to gather the metrics, a ``job`` has to be added to the ``prometheus.yml`` configuration file.
Below is a minimal configuration file that will only gather Nuts metrics:

.. code-block:: yaml

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
    # - "first_rules.yml"
    # - "second_rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'nuts'
        metrics_path: '/metrics'
        scrape_interval: 5s
        static_configs:
          - targets: ['127.0.0.1:8081']

Make sure to enter the IP/domain and port at which the Nuts node can be reached.

Exported metrics
================

The Nuts service executable exports the following metric namespaces:

* ``nuts_`` contains metrics related to the functioning of the Nuts node
* ``process_`` contains OS metrics related to the process
* ``go_`` contains Go metrics related to the process
* ``promhttp_`` contains metrics related to HTTP calls to the Nuts node's ``/metrics`` endpoint
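
To get a feel for what each namespace contributes, the output of ``/metrics`` can be grouped by prefix.
The Python sketch below counts sample lines per namespace in Prometheus' text exposition format; the function name is hypothetical, and metrics outside the four namespaces are counted as ``other``.

.. code-block:: python

    from collections import Counter

    NAMESPACES = ("nuts_", "process_", "go_", "promhttp_")


    def count_by_namespace(exposition_text: str) -> Counter:
        """Count sample lines per metric namespace, skipping comments and blank lines."""
        counts = Counter()
        for line in exposition_text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # HELP/TYPE comments and empty lines carry no samples
            # The metric name ends at the first "{" (labels) or whitespace (value).
            name = line.split("{", 1)[0].split(None, 1)[0]
            for prefix in NAMESPACES:
                if name.startswith(prefix):
                    counts[prefix] += 1
                    break
            else:
                counts["other"] += 1
        return counts

Note that this counts sample lines, so a metric with several label combinations is counted once per combination.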

CPU profiling
*************

It's possible to enable CPU profiling by passing the ``--cpuprofile=/some/location.dmp`` option.
This will write a CPU profile to the given location when the node shuts down.
The resulting file can be analyzed with Go tooling:

.. code-block:: shell

    go tool pprof /some/location.dmp

The tooling includes a help function to get you started. For a quick overview, use the ``web`` command inside the tooling.
It opens an SVG in a browser that gives an overview of what the node was doing.