
Deadlock situation occurs when no log loss is allowed and output is blocked. #9706

Open
Ikonovich opened this issue Dec 10, 2024 · 0 comments


Ikonovich commented Dec 10, 2024

Bug Report

Describe the bug
Hey there, this issue appears to have been introduced in this commit: d935047

The issue: When an output is blocked (in my case, due to an internet outage), tasks are still created to handle chunks. Eventually this hits 2048 tasks, and no more can be created. Each of these tasks puts its chunk down when it fails and attempts to pull it back up on the retry.

During this time, the input pulls up chunks and tries to create tasks for them. Because the task limit has been reached, it can't create any, but it leaves these chunks in memory.

The engine continually retries the output tasks, hitting this line, but since there is no free chunk memory it can't pull any chunks up, and it proceeds to here to reschedule the task. It does this forever.

So:

  • The input consumes all of the chunk memory and can't create any tasks.
  • The output consumes all of the tasks but can't allocate memory to them.

This deadlocks the agent, resulting in constant task re-scheduling messages. Unblocking the output has no effect.
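
To make the cycle concrete, here is a minimal standalone sketch (not Fluent Bit code; TASK_LIMIT and CHUNKS_UP_LIMIT are placeholders standing in for the 2048-task cap and storage.max_chunks_up) of the condition the agent gets stuck in:

#include <stdio.h>

#define TASK_LIMIT      2048  /* stand-in for the engine's task cap  */
#define CHUNKS_UP_LIMIT 128   /* stand-in for storage.max_chunks_up  */

int main(void)
{
    int tasks_in_use = TASK_LIMIT;      /* every slot held by a blocked output task */
    int chunks_up    = CHUNKS_UP_LIMIT; /* every in-memory slot held by the input   */

    /* Output side: retried tasks need to bring a chunk up, but no slot is free. */
    int output_can_progress = (chunks_up < CHUNKS_UP_LIMIT);

    /* Input side: its chunks are up, but no task slot is free to flush them. */
    int input_can_progress = (tasks_in_use < TASK_LIMIT);

    if (!output_can_progress && !input_can_progress) {
        printf("deadlock: input holds all chunk memory, output holds all task slots\n");
    }
    return 0;
}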

I was able to resolve this issue by giving outputs a small virtual memory reserve that the input doesn't have access to, so that they can always bring at least one chunk up into memory. This breaks the deadlock, allows the resources to be cleared, and allows the operation to proceed.

Specifically, my solution was to give each output an allowance of chunks over the maximum memory allocation, and to replace the function here with a flb_input_chunk_set_up_for_output function, which propagates down to a function that replaces cio_file_up with the following for output usage:
(Note: MAX_OVER_LIMIT_OUTPUT_CHUNKS_UP needs to be at least 1. The larger it is, the more chunks the output can pull up over the memory limit.)

int cio_file_up_for_output(struct cio_chunk *ch, int *chunks_up_for_output)
{
    int ret;

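    // First attempt honors the chunk memory limit (CIO_TRUE). If it fails and this
    // output is still under its over-limit reserve, retry without the limit (CIO_FALSE).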
    ret = _cio_file_up(ch, CIO_TRUE);
    if (ret < 0 && (*chunks_up_for_output) < MAX_OVER_LIMIT_OUTPUT_CHUNKS_UP) {
        ret = _cio_file_up(ch, CIO_FALSE);
        if (ret == 0) {
            ch->is_up_for_output = 1;
            // Increment the value of chunks_up_for_output.
            (*chunks_up_for_output)++;
            // Let the chunk track the chunks_up_for_output pointer.
            ch->chunks_up_for_output = chunks_up_for_output;
        }
    }
    return ret;
}
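
For context, a rough sketch of how such a wrapper could propagate down from Fluent Bit into chunkio follows. This is illustrative only: the intermediate cio_chunk_up_for_output name is one I made up, and the backend check and exact field/constant names (ch->st->type, CIO_STORE_MEM, CIO_OK, ic->chunk) are assumptions about the tree, not the exact code.

/* Illustrative only: one way the output-specific path could thread down. */

/* chunkio layer: memory-backed chunks are always up; only file-backed
   chunks need the over-limit escape hatch */
int cio_chunk_up_for_output(struct cio_chunk *ch, int *chunks_up_for_output)
{
    if (ch->st->type == CIO_STORE_MEM) {
        return CIO_OK;
    }
    return cio_file_up_for_output(ch, chunks_up_for_output);
}

/* Fluent Bit layer: called from the output retry path in place of the
   existing set-up call */
int flb_input_chunk_set_up_for_output(struct flb_input_chunk *ic,
                                      int *chunks_up_for_output)
{
    return cio_chunk_up_for_output(ic->chunk, chunks_up_for_output);
}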

Later, when the chunk is put down, I added this section here in the cio_file_down function:

    // If the chunk is up for an output, mark it as NOT up for an output, decrement that
    // output's chunks_up counter, and clear that reference.
    if (ch->is_up_for_output == 1) {
        ch->is_up_for_output = 0;
        (*ch->chunks_up_for_output)--;
        ch->chunks_up_for_output = NULL;
    }
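
Both snippets assume two new fields on struct cio_chunk in chunkio. A sketch of those additions (the placement within the struct is illustrative):

struct cio_chunk {
    /* ... existing chunkio fields ... */

    /* set when this chunk was brought up through the output-only reserve */
    int  is_up_for_output;

    /* points at the owning output's over-limit counter so that
       cio_file_down() can decrement it when the chunk is put down */
    int *chunks_up_for_output;
};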

This is certainly not an ideal solution, but it did work and may be a useful reference. Using this fix, I was able to block the outputs of three instances running this configuration, have each produce over 100 GB of logs, unblock the outputs, and see full recovery, with all of the expected logs pushed to the log storage service.

To Reproduce
At minimum, the log agent must be configured so that it retries forever and does not allow logs to be lost due to output failures. In my case, I also had it configured with an unlimited filesystem buffer to prevent ANY log loss, but this is probably not necessary based on my understanding of the issue.

Steps to reproduce the problem:

  • Set up a tail input
  • Configure filesystem buffering with no maximum storage (It may be reproducible with limited storage, but this is my configuration)
  • Set an output reading from the tail input to retry_limit: false
  • Block the output through some mechanism. I did this by using a cloudwatch_logs output and using /etc/hosts to redirect the cloudwatch logs endpoint to a black hole IP address.
  • Push a significant volume of logs into the tail input. I set up a generator producing 20 GB/hour of log volume. This was sufficient to cause the problem in about 10-20 minutes at storage.max_chunks_up=128, but this may vary and a longer run may be required. If you push enough log volume, this will happen in all cases.
  • Unblock the output. The agent will not recover; all incoming logs will be written to the filesystem buffer and never flushed.

In the metrics, you will see that 2048 tasks have been created and that the number of chunks up is at your configured storage.max_chunks_up maximum.
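
For reference, a configuration along these lines would look roughly like the following. This is an illustrative sketch, not my exact configuration: the paths, tag, region, and log group are placeholders. The relevant pieces are the filesystem buffering, storage.max_chunks_up, and Retry_Limit False on the output.

[SERVICE]
    # filesystem buffering; no storage.total_limit_size is set on the
    # output, so the buffer is effectively unlimited
    storage.path           /var/log/flb-storage/
    storage.max_chunks_up  128

[INPUT]
    Name          tail
    Path          /var/log/app/*.log
    Tag           app.logs
    storage.type  filesystem

[OUTPUT]
    Name               cloudwatch_logs
    Match              app.logs
    region             us-east-1
    log_group_name     placeholder-group
    log_stream_prefix  app-
    auto_create_group  true
    Retry_Limit        False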

Expected behavior
Outputs should be able to process chunks and push them out, no matter what is happening to the input.

Actual Behavior
The log agent is unable to recover from the outage. The log repeatedly reports re-scheduled tasks:

[task] retry for task %i could not be re-scheduled

And no logs are ever pushed through the output.

Your Environment

  • Version used: 3.0.4
  • Configuration: Anonymized version of my FluentBit configuration
  • Environment name and version: Amazon Linux 2 on an m5.8xlarge EC2 instance with 300GB EBS volume. Instances never went over 10% CPU and memory utilization. Disk utilization steadily escalated as expected.
  • Server type and version: Amazon Linux 2 on an m5.4xlarge EC2 instance.
  • Operating System and version: Amazon Linux 2 on an m5.8xlarge EC2 instance.
  • Filters and plugins: I used a json parser on the output. This is not relevant to the issue.

Additional context

This issue significantly reduces the agent's ability to recover from network outages, and it requires manual intervention after extended outages.
