Skip to content

Commit

Permalink
chore(telemetry): use importlibs instead of pkg_resources (#3783)
Browse files Browse the repository at this point in the history
pkg_resources has known performance issues: pypa/setuptools#926. This PR replaces pkg_resources with importlib.metadata and uses this module to retrieve package names and versions. 

A further optimization was made to the importlib implementation which parses package metadata: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1. Benchmarks for this third optimization are also shown in the table below:

| benchmark                 | test case                 | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) |
|---------------------------|---------------------------|--------------------|:---------:|:--------:|:-------------:|:-------------:|:------------:|
| ddtracerun-auto_telemetry | pkg_resources (1.x branch)       | 15                 | 326       | 13       | 274           | 52            | 19.0         |
| ddtracerun-auto_telemetry | importlib                 | 15                 | 285       | 5        | 270           | 15            | 5.6          |
| ddtracerun-auto_telemetry | importlib with partial parsing | 15                 | 285       | 10       | 269           | 16            | 5.9          |
| ddtracerun-auto_telemetry | importlib                 | 30                 | 377       | 5        | 350           | 27            | 7.7         |
| ddtracerun-auto_telemetry | importlib with partial parsing | 30                 | 362       | 7       | 350           | 12            | 3.4            |
| ddtracerun-auto_telemetry | importlib                 | 45                 | 381       | 24       | 348           | 31            | 8.9          |
| ddtracerun-auto_telemetry | importlib with partial parsing | 45                 | 363       | 9        | 350           | 23            | 6.3          |
| ddtracerun-auto_telemetry | importlib                 | 313                 | 1050       | 79       | 991           | 59            | 5.9       |
| ddtracerun-auto_telemetry | importlib with partial parsing | 313                 | 911       | 28        | 905           | 6            | 0.6         |


|             benchmark             | test case                 | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) |
|:---------------------------------:|---------------------------|--------------------|:---------:|:--------:|:-------------:|:-------------:|:------------:|
| ddtracerun-auto_tracing_telemetry | pkg_resources (1.x)       | 15                 | 324       | 8        | 274           | 50            | 18.2         |
| ddtracerun-auto_tracing_telemetry | importlib                 | 15                 | 293       | 11       | 272           | 21            | 8.3          |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 15                 | 291       | 12       | 272           | 19            | 6.9          |
| ddtracerun-auto_tracing_telemetry | importlib                 | 30                 | 373       | 11       | 351           | 22            | 6.28           |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 30                 | 367       | 13       | 354           | 13             | 3.6          |
| ddtracerun-auto_tracing_telemetry | importlib                 | 45                 | 376       | 8        | 355           | 21            | 5.9          |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 45                 | 364       | 9        | 352           | 22            | 6.5          |
| ddtracerun-auto_tracing_telemetry | importlib                 | 313                 | 1010       | 80        | 960           | 50            | 5.2          |
| ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 313                 | 910       | 20        | 873           | 37      | 4.2          |

Note: redis, requests and urllib3 were included in test cases with 30 and 45 packages. These packages were patched by `ddtrace-run` and this increased the baseline by ~74ms but the overhead of telemetry observed remained consistent. The case with 313 packages patched gevent, pylons, SQLAlchemy, requests, flask, grpc, cassandra, botocore, and urllib3. This was to simulate the overhead of telemetry in a real world application with telemetry enabled. 


Findings from benchmarking sending telemetry events with different number of packages installed, patching integrations, and/or enabling tracing:
 - Using importlib instead of pkg_resources reduced the overhead of telemetry in half (~50ms to ~19ms)
 - The number of packages does not appear to correlate with the overhead of telemetry 
      - the benchmarks might've been too noisy to measure the difference accurately. 
 - creating a custom parser to retrieve package names and versions from PKG-INFO and METADATA files lead to notable performance gains with a large number of packages.
     - the difference appears to be within a standard deviation so more testing is required to accurately measure the difference. 
     - Iterating on this approach might lead to better results: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1
     - These performance gains seem to be minor. It might not be work developing and maintaining a metadata parser.

## Checklist
- [x] Library documentation is updated.
- [x] [Corp site](https://github.com/DataDog/documentation/) documentation is updated (link to the PR).

## Reviewer Checklist
- [ ] Title is accurate.
- [ ] Description motivates each change.
- [ ] No unnecessary changes were introduced in this PR.
- [ ] PR cannot be broken up into smaller PRs.
- [ ] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary.
- [ ] Tests provided or description of manual testing performed is included in the code or PR.
- [ ] Release note has been added for fixes and features, or else `changelog/no-changelog` label added.
- [ ] All relevant GitHub issues are correctly linked.
- [ ] Backports are identified and tagged with Mergifyio.
- [ ] Add to milestone.
  • Loading branch information
mabdinur authored Jun 22, 2022
1 parent 33b9ac3 commit db54098
Show file tree
Hide file tree
Showing 7 changed files with 83 additions and 7 deletions.
34 changes: 34 additions & 0 deletions ddtrace/internal/packages.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
import typing


Distribution = typing.NamedTuple("Distribution", [("name", str), ("version", str), ("path", str)])

_DISTRIBUTIONS = None # type: typing.Optional[typing.Set[Distribution]]


def get_distributions():
# type: () -> typing.Set[Distribution]
"""returns the name and version of all distributions in a python path"""
global _DISTRIBUTIONS
if _DISTRIBUTIONS is not None:
return _DISTRIBUTIONS

try:
import importlib.metadata as importlib_metadata
except ImportError:
import importlib_metadata # type: ignore[no-redef]

pkgs = set()
for dist in importlib_metadata.distributions():
# Get the root path of all files in a distribution
path = str(dist.locate_file(""))
# PKG-INFO and/or METADATA files are parsed when dist.metadata is accessed
# Optimization: we should avoid accessing dist.metadata more than once
metadata = dist.metadata
name = metadata["name"]
version = metadata["version"]
if name and version:
pkgs.add(Distribution(path=path, name=name, version=version))

_DISTRIBUTIONS = pkgs
return _DISTRIBUTIONS
9 changes: 9 additions & 0 deletions ddtrace/internal/telemetry/data.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
import platform
import sys
from typing import Dict
from typing import List
from typing import Tuple

from ddtrace.internal.compat import PY3
from ddtrace.internal.constants import DEFAULT_SERVICE_NAME
from ddtrace.internal.packages import get_distributions
from ddtrace.internal.runtime.container import get_container_info
from ddtrace.internal.utils.cache import cached

Expand Down Expand Up @@ -58,6 +60,13 @@ def _get_application(key):
}


def get_dependencies():
# type: () -> List[Dict[str, str]]
"""Returns a unique list of the names and versions of all installed packages"""
dependencies = {(dist.name, dist.version) for dist in get_distributions()}
return [{"name": name, "version": version} for name, version in dependencies]


def get_application(service, version, env):
"""Creates a dictionary to store application data using ddtrace configurations and the System-Specific module"""
# We cache the application dict to reduce overhead since service, version, or env configurations
Expand Down
7 changes: 2 additions & 5 deletions ddtrace/internal/telemetry/writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
from ..service import ServiceStatus
from ..utils.time import StopWatch
from .data import get_application
from .data import get_dependencies
from .data import get_host_info


Expand Down Expand Up @@ -187,12 +188,8 @@ def app_started_event(self):
if self._forked:
# app-started events should only be sent by the main process
return
# pkg_resources import is inlined for performance reasons
# This import is an expensive operation
import pkg_resources

payload = {
"dependencies": [{"name": pkg.project_name, "version": pkg.version} for pkg in pkg_resources.working_set],
"dependencies": get_dependencies(),
"integrations": self._flush_integrations_queue(),
"configurations": [],
}
Expand Down
1 change: 1 addition & 0 deletions setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,7 @@ def get_exts_for(name):
"attrs>=19.2.0",
"six>=1.12.0",
"typing_extensions; python_version<'3.8'",
"importlib_metadata; python_version<'3.8'",
]
+ bytecode,
extras_require={
Expand Down
26 changes: 26 additions & 0 deletions tests/internal/test_packages.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
import os

from ddtrace.internal.packages import get_distributions


def test_get_distributions():
"""use pkg_resources to validate package names and versions returned by get_distributions()"""
import pkg_resources

pkg_resources_ws = {pkg.project_name for pkg in pkg_resources.working_set}

importlib_pkgs = set()
for pkg in get_distributions():
assert pkg.name
assert pkg.version
assert os.path.exists(pkg.path)
# The package name in typing_extensions-4.x.x.dist-info/METADATA is set to `typing_extensions`
# this is inconsistent with the package name found in pkg_resources. The block below corrects this.
# The correct package name is typing-extensions.
if pkg.name == "typing_extensions" and "typing-extensions" in pkg_resources_ws:
importlib_pkgs.add("typing-extensions")
else:
importlib_pkgs.add(pkg.name)

# assert that pkg_resources and importlib.metadata return the same packages
assert pkg_resources_ws == importlib_pkgs
9 changes: 9 additions & 0 deletions tests/tracer/telemetry/test_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,13 @@
import pytest

from ddtrace.internal.compat import PY3
from ddtrace.internal.packages import get_distributions
from ddtrace.internal.runtime.container import CGroupInfo
from ddtrace.internal.telemetry.data import _format_version_info
from ddtrace.internal.telemetry.data import _get_container_id
from ddtrace.internal.telemetry.data import _get_os_version
from ddtrace.internal.telemetry.data import get_application
from ddtrace.internal.telemetry.data import get_dependencies
from ddtrace.internal.telemetry.data import get_host_info
from ddtrace.internal.telemetry.data import get_hostname
from ddtrace.internal.telemetry.data import get_version
Expand Down Expand Up @@ -143,3 +145,10 @@ def test_get_container_id_when_container_does_not_exists():
with mock.patch("ddtrace.internal.telemetry.data.get_container_info") as gci:
gci.return_value = None
assert _get_container_id() == ""


def test_get_dependencies():
"""asserts that get_dependencies and get_distributions return the same packages"""
pkgs_as_dicts = {(dep["name"], dep["version"]) for dep in get_dependencies()}
pkgs_as_distributions = {(dist.name, dist.version) for dist in get_distributions()}
assert pkgs_as_dicts == pkgs_as_distributions
4 changes: 2 additions & 2 deletions tests/tracer/telemetry/test_writer.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,10 @@

import httpretty
import mock
import pkg_resources
import pytest

from ddtrace.internal.telemetry.data import get_application
from ddtrace.internal.telemetry.data import get_dependencies
from ddtrace.internal.telemetry.data import get_host_info
from ddtrace.internal.telemetry.writer import TelemetryWriter
from ddtrace.internal.telemetry.writer import get_runtime_id
Expand Down Expand Up @@ -100,7 +100,7 @@ def test_app_started_event(mock_time, mock_send_request, telemetry_writer):
assert headers["DD-Telemetry-Request-Type"] == "app-started"
# validate request body
payload = {
"dependencies": [{"name": pkg.project_name, "version": pkg.version} for pkg in pkg_resources.working_set],
"dependencies": get_dependencies(),
"integrations": [
{
"name": "integration-t",
Expand Down

0 comments on commit db54098

Please sign in to comment.