docs/source/debugging.rst
Debugging
=========
.. testsetup:: *
import numpy as np
from libertem import api
from libertem.executor.inline import InlineJobExecutor
ctx = api.Context(executor=InlineJobExecutor())
data = np.random.random((16, 16, 32, 32)).astype(np.float32)
dataset = ctx.load("memory", data=data, sig_dims=2)
roi = np.random.choice([True, False], dataset.shape.nav)
There are different parts of LiberTEM which can be debugged with different tools and methods.
Debugging the Web GUI
---------------------
For debugging the GUI, you can use all standard debugging tools for web development. Most useful
in this context are the `Chrome DevTools <https://developer.chrome.com/docs/devtools/>`_
or `Firefox Developer Tools <https://firefox-source-docs.mozilla.org/devtools-user/index.html>`_, which can be
accessed by pressing F12. You can extend these with additional panels
`for React <https://react.dev/learn/react-developer-tools>`_
and `for Redux <https://github.com/reduxjs/redux-devtools>`_.
These tools can be used for inspecting all frontend-related processes, from network traffic
up to rendering behavior. Take note of the :code:`/api/events/` websocket connection, where all
asynchronous notification and results will be transferred.
Note that you should always debug using the development build of the GUI, using :code:`npm start`,
as described in :ref:`the contributing section <building the client>`. Otherwise the debugging
experience may be painful, for example worse error output from react, minified source and
minified component names, ...
Debugging the API server
------------------------
If the API server returns a server error (500), the detailed exception should be logged
in the output of :code:`libertem-server`. You can also try
`enabling the debug mode of tornado <https://www.tornadoweb.org/en/stable/guide/running.html#debug-mode-and-automatic-reloading>`_
(there is currently no command line flag for this, so you need to change
:py:mod:`libertem.web.server` accordingly.)
If an analysis based on the exception alone is inconclusive,
you can try to reproduce the problem using the Python API and follow the instructions below.
.. _`debugging udfs`:
Debugging UDFs or other Python code
-----------------------------------
If you are trying to write a UDF, or debug other Python parts of LiberTEM, you can
instruct LiberTEM to use simple single-threaded execution using the
:class:`~libertem.executor.inline.InlineJobExecutor`.
.. testsetup::
from libertem.udf.logsum import LogsumUDF
udf = LogsumUDF()
.. testcode::
from libertem import api as lt
ctx = lt.Context.make_with('inline')
ctx.run_udf(dataset=dataset, udf=udf)
You can then use all usual debugging facilities, including
`pdb <https://docs.python.org/3.12/library/pdb.html>`_ and
`the %pdb magic of ipython/Jupyter <https://ipython.org/ipython-doc/3/interactive/magics.html#magic-pdb>`_.
The :class:`libertem.executor.inline.InlineJobExecutor` uses a single CPU core
by default. It can be switched to GPU processing to test CuPy-enabled UDFs by
calling :meth:`libertem.common.backend.set_use_cuda` with the device ID to use.
:code:`libertem.common.backend.set_use_cpu(0)` switches back to CPU processing.
.. testsetup::
from libertem.udf.masks import ApplyMasksUDF
udf = ApplyMasksUDF(mask_factories=[lambda:np.ones(dataset.shape.sig)])
.. testcode::
from libertem import api as lt
from libertem.utils.devices import detect
from libertem.common.backend import set_use_cpu, set_use_cuda
ctx = lt.Context.make_with('inline')
d = detect()
if d['cudas'] and d['has_cupy']:
set_use_cuda(d['cudas'][0])
ctx.run_udf(dataset=dataset, udf=udf)
set_use_cpu(0)
If a problem is only reproducible using the default executor, you will have to follow the
`debugging instructions of dask-distributed <https://docs.dask.org/en/latest/debugging.html>`_.
As the API server can't use the synchronous :class:`~libertem.executor.inline.InlineJobExecutor`,
this is also the case when debugging problems that only occur in context of the API server.
Debugging failing test cases
----------------------------
When a test case fails, there are some options to find the root cause:
The :code:`--pdb` command line switch of pytest can be used to automatically
drop you into a :mod:`python:pdb` prompt in the failing test case, where you will either land
on the failing :code:`assert` statement, or the place in the code where an
exception was raised.
This does not help if the test case only fails in CI. Here, it may be easier to
use logging. Because we call pytest with the :code:`--log-level=DEBUG`
parameter, the failing test case output will have a section containing the
captured logging output.
You can sprinkle the code with `log.debug(...)` calls that output the relevant
variables. In some cases you may also leave the logging statements in the code
even after the problem is fixed, depending on the overhead.
.. _`tracing`:
Tracing using opentelemetry
---------------------------
.. versionadded:: 0.10.0
Tracing support using opentelemetry was added in version 0.10.0
Instead of sprinkling logging or print statements into your code,
it is also possible to diagnose issues or gain insight into the runtime
behavior of LiberTEM using opentelemetry tracing. This is also based on
adding instrumentation to the code, but follows a more structured approach.
Using tracing, instead of relatively unstructured "log lines",
rich and structured information can be logged as traces, which are organized
into spans. These traces can then be visualized, inspected, searched, ...
using different tools and databases, here for example using
`Jaeger <https://www.jaegertracing.io/>`_:
.. image:: ./images/jaeger-tracing-visualization.png
This becomes more interesting once your code goes beyond a single thread or
process, when it is important to see the temporal relation between different
events and functions executing concurrently. Crucially, it is possible to
gather traces in distributed systems, from different nodes.
For an overview of opentelemetry, please see `the official opentelemetry documentation
<https://opentelemetry.io/docs/specs/otel/overview/>`_ - here, we
will document the practical setup and usage. For the Python API docs, please
see the `opentelemetry Python API docs <https://opentelemetry-python.readthedocs.io/en/latest/>`_.
Getting tracing running
~~~~~~~~~~~~~~~~~~~~~~~
Some external services are needed to gather traces. We include
:code:`docker-compose` configuration for getting these up and running quickly
in the :code:`tracing` directory. Please note that this configuration
by default opens some ports, so be careful, as this may circumvent your
device's firewall!
To get these running, start :code:`docker-compose up` in said directory. This will
pull in all required docker images and start the required services, until they
are stopped using :code:`Ctrl+C`.
The Jaeger UI, as shown above, is then available on `localhost:16686
<http://localhost:16686>`_. An alternative UI, called Zipkin, is available on
`localhost:9411 <http://localhost:9411>`_. Both of these should now be viewable
by your browser.
The actual trace collection API endpoint is started on port 4317, but is only
used under the hood.
In your LiberTEM virtual environment, you need to install the :code:`tracing`
extra, for example via :code:`pip install -e .[tracing]`.
The Python code then needs to be told to enable tracing, and how to connect to
the trace collection API endpoint. The easiest way is to set environment variables,
for example, in a notebook:
.. code:: python
%env OTEL_ENABLE=1
%env OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
from libertem.common.tracing import maybe_setup_tracing
maybe_setup_tracing(service_name="notebook-main")
Or, for intrumenting the :code:`libertem-server`:
.. code:: bash
OTEL_ENABLE=1 OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 libertem-server
The same works for bringing up :code:`libertem-worker` processes:
.. code:: bash
OTEL_ENABLE=1 OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 libertem-worker
Be sure to change the endpoint URL to whatever is the correct one from the
perspective of the worker process in the distributed case.
Support for setting up tracing on workers is already integrated
in the Dask and pipelined executors. The inline executor doesn't need any
additional setup for tracing to work.
For enabling tracing across multiple Python processes in other scenarios,
possibly on multiple nodes, set the environment variables for each of these
processes, and also call the
:func:`~libertem.common.tracing.maybe_setup_tracing` function on each.
On Windows
~~~~~~~~~~
The easiest way to get the tracing services up and running is using
`Windows Subsystem for Linux <https://learn.microsoft.com/en-us/windows/wsl/install>`_
to install Linux and Docker. This allows to run the tracing services as described above.
Alternatively, `Docker Desktop for Windows <https://docs.docker.com/desktop/install/windows-install/>`_
could be an option.
Clients running natively on Windows can then connect to these services:
.. code:: bat
set OTEL_ENABLE=1
set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
libertem-server
Adding your own instrumentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, a minimal set of functions is already annotated with tracing
information, to be able to understand how UDFs are executed across multiple
processes. Adding tracing instrumentation to your code is similar to setting
up logging using the :code:`logging` module. At the top of your Python module,
you create and use a :code:`Tracer` object like this:
.. testsetup:: tracing
import time
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
.. testsetup:: nosetup
pass
.. testcode:: nosetup
import time
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def some_function():
with tracer.start_as_current_span("span-name"):
time.sleep(0.1) # do some real work here
some_function()
You can also add some more information to a span, for example events with attributes:
.. testcode:: tracing
def some_function():
with tracer.start_as_current_span("span-name") as span:
for i in range(16):
time.sleep(0.1) # do some real work here
span.add_event(f"work item done", {
"item_id": i, # you can add attributes to events
})
some_function()
Attributes can also be added to spans themselves:
.. testcode:: tracing
def some_function():
with tracer.start_as_current_span("span-name") as span:
time.sleep(0.1) # do some real work here
span.set_attributes({
"attribute-name": "attribute-value-here",
})
some_function()
Note that, while the tracing is quite lightweight, it is probably a good idea
to not add spans and events in the innermost loops of your processing, like
:code:`UDF.process_frame`, but spans for per-partition operations
should be fine. In the future, metrics could also be collected to gain
further insight into the performance characteristics.
For more details, please also see the `opentelemetry Python API docs <https://opentelemetry-python.readthedocs.io/en/latest/>`_.