⚠️ Experimental
An agent that reports metrics for EC2 instances or Titus containers.
- This build requires a C++11 compiler, some system libraries, and libatlasclient.
- To build the titus-agent:
sudo apt-get update
sudo apt-get install -y zlib1g-dev uuid-dev libblkid-dev libpcre3-dev libcap-dev
rm -rf build && mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DTITUS_AGENT=ON ..
make VERBOSE=1 -j4
./runtests && make DESTDIR=../root install
- To build the system agent:
Repeat the above commands, but omit -DTITUS_AGENT=ON from the cmake invocation.
Amount of processing time requested for the container. This value is computed based on the number of shares allocated when creating the job. Note that this is not a hard limit: if there is no contention, a job can use more than the requested capacity. However, a user should not rely on getting more than requested.
Unit: seconds/second
Amount of time spent processing code in the container. This metric would typically get used for one of two use-cases:
- Utilization: to see how close the job is to saturating its requested resources, divide the processing time by the processing capacity.
- Performance Regression: for comparative analysis, the sum can be used. Make sure that both systems being compared have the same amount of resources.
Unit: seconds/second
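The utilization use-case above amounts to a single division. A minimal sketch, assuming both values have already been fetched in seconds/second; the function name and the sample numbers are hypothetical and not part of the agent:

```python
def cpu_utilization(processing_time: float, processing_capacity: float) -> float:
    """Return the fraction of the requested CPU capacity that is in use.

    Both arguments are in seconds/second. The result can exceed 1.0,
    because the requested capacity is not a hard limit.
    """
    if processing_capacity <= 0:
        raise ValueError("processing capacity must be positive")
    return processing_time / processing_capacity


# Hypothetical sample: 1.5 s/s of work against 2.0 s/s requested.
print(f"{cpu_utilization(1.5, 2.0):.0%}")
```

A value that stays near or above 100% suggests the job is saturating what it requested and may be throttled once other jobs contend for the host.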
Number of shares configured for the job. The Titus scheduler treats each CPU core as 100 shares. Generally, the processing capacity is more relevant to the user, as it has been normalized to the same unit as the measured processing time.
Unit: num shares
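Because each CPU core is treated as 100 shares, a share count can be read as a number of cores. The sketch below assumes a plain division by 100; the text only says the processing capacity is computed "based on" the shares, so the exact formula used by the agent is an assumption, and the names are hypothetical:

```python
SHARES_PER_CORE = 100  # from the text: each CPU core is treated as 100 shares


def shares_to_cores(num_shares: int) -> float:
    """Convert a share count into an equivalent number of CPU cores.

    Assumption: the mapping is linear, i.e. shares / 100.
    """
    if num_shares < 0:
        raise ValueError("share count cannot be negative")
    return num_shares / SHARES_PER_CORE


# Hypothetical job configured with 250 shares.
print(shares_to_cores(250))
```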
Amount of time spent processing code in the container in either the system or user category.
Unit: seconds/second
Dimensions:
- id: category of usage, either system or user
Counter indicating an allocation failure occurred. Typically this will be seen when the application hits the memory limit.
Unit: failures/second
Memory limit for the cgroup.
Unit: bytes
Memory usage for the cgroup.
Unit: bytes
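The memory usage and memory limit are both reported in bytes, so they can be combined into a usage fraction. This is a minimal sketch with hypothetical names and sample values; the agent itself is not described as reporting this ratio:

```python
def memory_utilization(used_bytes: int, limit_bytes: int) -> float:
    """Return memory usage as a fraction of the cgroup memory limit."""
    if limit_bytes <= 0:
        raise ValueError("memory limit must be positive")
    return used_bytes / limit_bytes


# Hypothetical sample: 512 MiB used out of a 1 GiB limit.
used = 512 * 1024**2
limit = 1024**3
print(f"{memory_utilization(used, limit):.0%}")
```

A fraction approaching 1.0 is a hint that allocation failures may follow, since those are typically seen when the application hits the memory limit.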
Description from kernel.org:
Counter indicating the number of times that a process of the cgroup triggered
a "page fault" and a "major fault", respectively. A page fault happens when a
process accesses a part of its virtual memory space which is nonexistent or
protected. The former can happen if the process is buggy and tries to access
an invalid address (it will then be sent a SIGSEGV signal, typically killing
it with the famous Segmentation fault message). The latter can happen when the
process reads from a memory zone which has been swapped out, or which corresponds
to a mapped file: in that case, the kernel will load the page from disk, and let
the CPU complete the memory access. It can also happen when the process writes to
a copy-on-write memory zone: likewise, the kernel will preempt the process,
duplicate the memory page, and resume the write operation on the process's own copy
of the page. "Major" faults happen when the kernel actually has to read the data
from disk. When it just has to duplicate an existing page, or allocate an empty
page, it is a regular (or "minor") fault.
Unit: faults/second
Dimensions:
- id: either minor or major.
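A faults/second value can be derived from two readings of a monotonically increasing counter. The sketch below is a generic illustration with hypothetical names and values; how the agent samples the counter, and how it handles resets, is not described here, so clamping a negative delta to zero is an assumption:

```python
def rate_per_second(prev_count: int, curr_count: int, interval_seconds: float) -> float:
    """Convert two counter readings into a per-second rate.

    Assumption: a negative delta means the counter was reset (for example,
    the container restarted), and the interval is reported as 0.0.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = curr_count - prev_count
    return max(delta, 0) / interval_seconds


# Hypothetical readings of a major-fault counter taken 60 seconds apart.
print(rate_per_second(1200, 1230, 60.0))
```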
Amount of memory used by processes running in the cgroup.
Unit: bytes
Dimensions:
- id: how the processes are using the memory. Values are cache, rss, rss_huge, and mapped_file.
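The id values above match keys found in a cgroup v1 memory.stat file, which lists one "name value" pair per line. The sketch below parses such text and keeps only those four keys. That the agent reads this file is an assumption, and the sample contents and function names are hypothetical:

```python
# Hypothetical memory.stat contents (cgroup v1 style: "name value" per line).
SAMPLE_MEMORY_STAT = """\
cache 1048576
rss 4194304
rss_huge 2097152
mapped_file 524288
pgfault 1200
pgmajfault 3
"""

# The id values listed in the text.
WANTED_IDS = ("cache", "rss", "rss_huge", "mapped_file")


def parse_memory_stat(text: str) -> dict:
    """Parse "name value" lines into a dict, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def memory_by_id(text: str) -> dict:
    """Return bytes used per id, limited to the ids listed in the text."""
    stats = parse_memory_stat(text)
    return {key: stats[key] for key in WANTED_IDS if key in stats}


for usage_id, num_bytes in memory_by_id(SAMPLE_MEMORY_STAT).items():
    print(f"id={usage_id} bytes={num_bytes}")
```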