PRD: TruLens Eval OpenTelemetry integrations

The goal of this work is to integrate OTEL conventions, SDK, and API into TruLens-Eval. Before enumerating the benefits of this goal, we first review OpenTelemetry and its features.

What is OpenTelemetry (OTEL)?

OTEL is a complex ecosystem of open-source designs and tools (part of the Cloud Native Computing Foundation) supporting observability in software systems. Observability here means the process of logging a system's inputs, outputs, or other relevant behaviour in support of debugging, monitoring, and other goals. Aspects of OTEL relevant to this work:

Tracing API -- a language-agnostic specification for recording computations ("tracing") and for the format of the resulting records ("spans"). Some components of the specification closely correspond to components of TruLens-Eval:
- Instrumentation concepts are identical in OTEL and TruLens-Eval.
- OTEL Span <-> RecordAppCall
- OTEL Trace(r) <-> Record. Note that trace is not a data structure in OTEL but rather a term referring to a collection of Spans.
Python API. Specification of the above Tracing API in Python. This includes some data definitions but is mostly composed of abstract classes.
Python SDK. Implementations of some of the API specifications for observing Python programs. Parts of these implementations can be directly incorporated into TruLens-Eval and, in some cases, can replace functionality currently present in the library.

A relevant component present in both the API and SDK is the Exporter, which is tasked with exporting spans to other tools.
Semantic Conventions and Semantic Conventions for Generative AI systems describe common types of computations observed as Spans and fields they are to populate to describe those computations.

Note that competing conventions for Span types and attributes exist, such as Arize's OpenInference specs.
(Relevance TBD) Metrics. OTEL also features metrics as part of its design. While metrics resemble feedback functions to some degree, they refer to results of run-time observations, a model that may not suit feedback functions due to their cost and latency. Feedback functions in TruLens-Eval are, in the general case, meant to be evaluated after and independently of an app's normal operation. For these reasons, use of metrics as part of this integration work is uncertain.
(Relevance TBD) Collector -- Collectors receive, process, and export observation data (i.e. spans). It may be possible to implement TruLens-Eval's database and feedback function evaluation as part of the standard collector or as a collector-like component compatible with the export API.
(Relevance TBD) Logs -- Logs (previously termed in OTEL as Events) are typical unstructured logs that can be recorded alongside the more structured spans. The use cases for logs are similar to those of logging in Python. While OTEL compatibility may allow TruLens-Eval to receive or export logs, this is a low-priority feature and its integration is presently not planned.

Benefits of integrations

Benefit 1 -- Allow use of the large ecosystem of OTEL-based tools and thereby support integration into existing workflows. Existing (non-LLM) observability workflows are likely already based on OTEL. TruLens-Eval-produced Spans (and possibly feedback function results) could be exported to tools such as Prometheus via Exporters. This easy first step towards LLM observability could serve as an introduction to the wider set of TruLens-Eval capabilities.
- Benefit 1.1 -- Allow use of existing query languages for exploring Spans. See for example Prometheus' PromQL and OpenSearch's Dashboard Query Language.
Benefit 2 -- Integrate information recorded by OTEL-compatible tools into TruLens-Eval and its feedback functions. Presenting TruLens-Eval-relevant Spans alongside, or as part of, Spans recorded by OTEL or other OTEL-compatible tools will give the user additional context about their app's behaviour. See the OTEL Instrumentation Registry for Python for a list of Python libraries that have an OTEL instrumentation. Also note the automatic instrumentation utility included in OTEL for Python. This additional behaviour could be a target of feedback functions.
Benefit 3 -- Expand the reach of observability across processes and networks (termed Distributed Tracing). OTEL Propagators allow span information to be transmitted across process and network boundaries to integrate information from remote computations. LLM apps invoke remote endpoints heavily, so this is expected to provide a large additional source of observability data and, potentially, feedback function targets.
Benefit 4 -- Guide technical approaches to observability. Presently TruLens-Eval features instrumentation and recording implemented with techniques laden with significant tech-debt (see TruLens-Eval Tech-Debt). Adopting standard approaches exemplified by OTEL can be expected to reduce this debt and make all aspects of TruLens-Eval more robust, more debuggable, and more extensible.
Benefit 5 -- Guide standards (i.e. OTEL Semantic Conventions for GenAI). OTEL semantic conventions give guidance as to what is to be recorded as part of a Span for various common types of computations and under which labels these pieces of data are to be recorded. Most relevant are the Semantic Conventions for Generative AI systems, which describe aspects of LLM requests. While this OTEL project is in the experimental stage, keeping up with its designs will make TruLens-Eval-based tracing more semantically meaningful to other tools. Further, these conventions can help guide the Span hierarchy/categorization in the library.
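
To make Benefit 3 concrete, the following is a minimal, standard-library-only sketch of what a propagator does: it serializes the identity of the current span into request headers on the client side and parses it back on the server side. It uses the W3C Trace Context `traceparent` header layout (`version-traceid-spanid-flags`). The `inject` and `extract` functions here are hypothetical stand-ins written for illustration, not the OTEL implementation; in the OTEL Python API the corresponding roles are played by `opentelemetry.propagate.inject` and `opentelemetry.propagate.extract`.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C Trace Context header layout: version-traceid-spanid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id: str, span_id: str, headers: Dict[str, str]) -> None:
    """Write the caller's span identity into outgoing request headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from incoming headers, or None."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


# Client side: a span is active while the app calls a remote endpoint.
trace_id = secrets.token_hex(16)       # 16 random bytes -> 32 hex characters
client_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
outgoing_headers: Dict[str, str] = {}
inject(trace_id, client_span_id, outgoing_headers)

# Server side: the remote process continues the same trace, using the
# client's span as the parent of whatever spans it creates.
remote_context = extract(outgoing_headers)
```

Spans created remotely under `remote_context` share the client's trace id, which is what would let them appear alongside locally recorded spans.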

Tracing Integration

We split OTEL integration into two related goals: integration of tracing methodology and integration of span standards. The tracing integration sub-goal is aimed at adopting OTEL language-specific methods for instrumenting and recording telemetry data. Examples of this methodology can be found in numerous packages from within the OTEL project and without. The relevant aspects of this methodology, and the differences between it and the current TruLens-Eval methodology, include:

Call stack walking (TruLens-Eval) vs. Context Variables (OTEL). OTEL instrumentation packages, at least the ones coming from within the OTEL project, do not rely on inspecting the call stack to collect the necessary information to create spans.
No notion of main/root method call in OTEL. (TBD) OTEL features a global root trace instead.
```
app1: Any # user's app
app2: Any # another

tru_app1: trulens_eval.App # trulens wrapper/recorder (App or a subclass)
tru_app2: trulens_eval.App # another

with tru_app1 as recorder1:
    _ = app1.invoke(...) # call span 1 -> root span 1 -> global root
    _ = app1.invoke(...) # call span 2 -> root span 2 -> global root

    _ = app2.invoke(...) # no spans recorded, not inside tru_app2 recording context

    with tru_app2 as recorder2:
        _ = app2.invoke(...) # call span 3 -> root span 3 -> global root
    # call span 3, root span 3 exported when tru_app2 recording context is closed

    _ = app1.invoke(...) # call span 4 -> root span 4 -> global root
# call spans 1,2, 4, root spans 1, 2, 4 exported when tru_app1 recording context is closed

# Note: whether or not global root span gets exported is determined by OTEL configuration
# outside of TruLens-Eval.
```
The approach in this integration work is thus:
- Context manager of App is used to delineate where recording/tracing should be enabled and when records/spans should be exported to downstream (database, feedback function evaluation, OTEL exporters).
- When creating a new span, if no TruLens-Eval-managed parent span exists, create a SpanRoot (see Data Integration below). Link the new span to SpanRoot as its parent. This corresponds to the creation of a new Record in the existing TruLens-Eval tracing implementation. Note that an OTEL parent span is assumed to always exist and thus this SpanRoot is linked to the global parent to achieve Benefit 1.
- Each instrumented method produces a SpanMethodCall whose parent is either the SpanRoot above or some other SpanMethodCall.
- When the TruLens context manager exits, additional logic is invoked to process each SpanRoot in the same manner as existing Records are processed (i.e. database outputs, feedback function evaluation or queuing).
Typically static instrumentation in OTEL vs. dynamic in TruLens-Eval. That is, OTEL instrumentation is not aware of live objects/instances of a particular component type such as a Retriever component within an App. This information, however, is crucial in TruLens-Eval for properly addressing feedback function inputs. Exceptions exist, such as the FastAPI instrumentation.
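
The difference between static and dynamic instrumentation can be illustrated with a small sketch. Static instrumentation would patch `Retriever.retrieve` once at the class level and could only report the class and method name. The dynamic variant below instead walks a live app object and wraps the method on each instance, so every recorded call carries the path of the component within the app. All names here (`Retriever`, `App`, `instrument_app`, the `calls` list, the `path` format) are hypothetical and chosen for illustration; this is not the TruLens-Eval instrumentation code.

```python
import functools
from typing import Any, Dict, List

calls: List[Dict[str, Any]] = []  # stand-in for recorded call spans


class Retriever:
    def __init__(self, index_name: str) -> None:
        self.index_name = index_name

    def retrieve(self, query: str) -> List[str]:
        return [f"{self.index_name}: {query}"]


class App:
    def __init__(self) -> None:
        self.retriever = Retriever("primary")
        self.fallback = Retriever("fallback")

    def invoke(self, query: str) -> List[str]:
        return self.retriever.retrieve(query) + self.fallback.retrieve(query)


def instrument_instance(obj: Any, method_name: str, path: str) -> None:
    """Wrap one method on one live object, remembering where the object lives."""
    original = getattr(obj, method_name)  # bound method of this instance

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = original(*args, **kwargs)
        calls.append({"path": path, "method": method_name, "rets": result})
        return result

    # The instance attribute shadows the class method for this object only.
    setattr(obj, method_name, wrapper)


def instrument_app(app: Any, root: str = "app") -> None:
    """Instrument every Retriever found among the app's attributes."""
    for attr, component in vars(app).items():
        if isinstance(component, Retriever):
            instrument_instance(component, "retrieve", f"{root}.{attr}")


app = App()
instrument_app(app)
app.invoke("what is OTEL?")

# A Retriever that is not part of the instrumented app is left untouched.
Retriever("other").retrieve("ignored")
```

The recorded `path` values distinguish the two retrievers even though they share a class, which is the kind of information a feedback function needs in order to select the output of one specific component.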

Data (Span) Integration

OTEL stores telemetry primarily in a Span container with an open-ended set of attribute values. TruLens-Eval stores similar information across Records which themselves contain RecordAppCall containers for each method invocation.
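
The relationship between the two layouts can be sketched as a round trip between a nested record and a flat list of spans linked by parent ids. The classes below are simplified, hypothetical stand-ins (they are not the actual `Record`, `RecordAppCall`, or OTEL `Span` definitions), and the sketch assumes each call stack occurs at most once per record. Arguments and return values are JSON-encoded because OTEL attribute values are limited to primitives and sequences of primitives.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Call:  # simplified stand-in for RecordAppCall
    stack: Tuple[str, ...]  # method names, outermost first, ending at this call
    args: Dict[str, Any]
    rets: Any


@dataclass
class Record:  # simplified stand-in for Record
    record_id: str
    calls: List[Call]


@dataclass
class Span:  # simplified stand-in for an OTEL span
    span_id: str
    parent_id: Optional[str]
    trace_id: str
    name: str
    attributes: Dict[str, str]


def record_to_spans(record: Record) -> List[Span]:
    """Flatten a record into a root span plus one span per call."""
    root = Span(f"{record.record_id}/root", None, record.record_id, "root", {})
    spans = [root]
    span_id_by_stack: Dict[Tuple[str, ...], str] = {(): root.span_id}
    # Visit shorter stacks first so that parents exist before their children.
    for i, call in enumerate(sorted(record.calls, key=lambda c: len(c.stack))):
        span_id = f"{record.record_id}/{i}"
        spans.append(Span(
            span_id=span_id,
            parent_id=span_id_by_stack[call.stack[:-1]],
            trace_id=record.record_id,
            name=call.stack[-1],
            attributes={"args": json.dumps(call.args),
                        "rets": json.dumps(call.rets)},
        ))
        span_id_by_stack[call.stack] = span_id
    return spans


def spans_to_record(spans: List[Span]) -> Record:
    """Rebuild the nested record by following parent links up to the root."""
    by_id = {span.span_id: span for span in spans}
    root = next(span for span in spans if span.parent_id is None)
    calls = []
    for span in spans:
        if span is root:
            continue
        names, node = [], span
        while node.parent_id is not None:
            names.append(node.name)
            node = by_id[node.parent_id]
        calls.append(Call(stack=tuple(reversed(names)),
                          args=json.loads(span.attributes["args"]),
                          rets=json.loads(span.attributes["rets"])))
    return Record(record_id=root.trace_id, calls=calls)


record = Record("r1", [
    Call(("invoke",), {"query": "q"}, "answer"),
    Call(("invoke", "retrieve"), {"query": "q"}, ["doc"]),
])
spans = record_to_spans(record)
restored = spans_to_record(spans)
```

A conversion of this kind is what allows spans to be turned back into Records for backwards compatibility during the transition.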

Span Organization

While OTEL does not offer a hierarchy or organization of spans (though note Semantic Conventions above), the use cases in TruLens-Eval require the library to understand what a recorded computation or span corresponds to from among a collection of known tasks involved in implementations of common app components (see TruLens-Eval Glossary). This understanding is both presented to the user and used as a foundation for evaluation with feedback functions. Each feedback function applies to components of different types. The organization is being prototyped in ongoing work in the feature/traces PR.

OTSpan (OTEL compatibility layer)
- Span
  - TransSpanRecord (temporary type during recording->tracing transition described in this work). Span associated with a Record and contains the existing Record container. This subclass will be eliminated after integration work is complete.
    - SpanRoot
    - SpanMethodCall -- Span associated with a method call.
      - TransSpanRecordAppCall (temporary) -- Method call span whose contents is the existing RecordAppCall container. This subclass will be eliminated after the integration work is complete.
        - SpanUntyped
        - SpanTyped
          - Categorized spans as per the Traces PRD.

Tasks

The integration work is split into two major tasks, which stage the changes to existing TruLens-Eval instrumentation and records while preserving compatibility of the library with existing use cases.

Task 1 -- OTEL-Compatible Tracing. Modify instrumentation and tracing to produce spans (and records for backwards compatibility) in a manner that matches OTEL tracing in Python. This will also realize Benefit 4.
- Sub Task 1.1 -- Recording to uncategorized spans. In this task, the existing Records and RecordAppCallMethod structures are conveyed by traced but uncategorized spans. These spans are then converted to Records for compatibility with users and the rest of the code base. This is part of the piotrm/record_as_spans PR.
- Sub Task 1.2 -- Compatibility Checkpoint: Traces include OTEL Spans produced by other libraries. A relevant example is the OTEL OpenAI instrumentor.
- Sub Task 1.3 -- Compatibility Checkpoint: Traces include OTEL Spans from across the network. TBD: Evaluation setup/tools.
Task 2 -- OTEL-Compatible Spans and Span categorization.
- Sub Task 2.1 -- Categorize pre-recorded Records to Spans. This should realize Benefit 5. This is part of the feature/traces PR.
- Sub Task 2.2 -- Spans for TruCustomApps and custom instrumentation. One approach for this is prototyped in feature/traces PR.
- Sub Task 2.3 -- Compatibility Checkpoint: Traces exported via OTEL and operational in existing tools. The OpenSearch for Trace analytics pipeline may be used for evaluating this checkpoint.
(Task 1, Task 2) -> Task 3 -- Integrate Tasks 1 and 2 into one coherent system with no Record intermediary. That is, trace computations via OTEL-compatible methodology directly into OTEL-compatible Spans.
- Sub Task 3.2 -- Direct OTEL-compatible tracing to OTEL-compatible Spans.
- Sub Task 3.3 -- Categorization of Spans. Note that categorization is intentionally separated from the above sub tasks as categorization may involve computational effort that might otherwise burden the instrumented apps.
(TBD, Task 3) -> Task 4. Compatibility with Record, RecordAppCall. The use of the existing data structures post-integration is to be determined. Conversions for backwards compatibility may be necessary to limit negative impact on existing users.
(TBD, Task 3) -> Task 5. Feedback functions as Metrics.

Timeline

2024-06-07
- Sub Task 2.2 (TruCustomApp -> OTEL Span)
2024-06-15
- Sub Task 1.1 (OTEL-like Tracing)
- Sub Task 2.1 (Record -> Span Categories)
2024-06-22
- Sub Task 1.2 (OTEL Spans -> TruLens-Eval)
- Sub Task 2.3 (TruLens-Eval Spans -> OTEL tools)
2024-06-29
- Sub Task 1.3 (OTEL Spans --> Network --> TruLens-Eval)
- Task 1
2024-?-?
- Task 2
- Task 3
- Task 4
- Task 5
