Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PacketBuffer: pool EMPTY (CON-1394) #1132

Closed
FMistrorigo opened this issue Oct 28, 2024 · 13 comments
Closed

PacketBuffer: pool EMPTY (CON-1394) #1132

FMistrorigo opened this issue Oct 28, 2024 · 13 comments

Comments

@FMistrorigo
Copy link

Description
I'm trying to get my product certified and the last test I can't pass is the TC-SC-3.6 test, which tries to establish different PASE Sessions (3) for each Fabric.

The problem i'm facing is this one:

E (94817) chip[CSL]: PacketBuffer: pool EMPTY.
E (94818) chip[SC]: Failed to allocate status report message
E (94819) chip[IN]: CASE Session establishment failed: b
E (94821) chip[EM]: OnMessageReceived failed, err = b

It looks like the Largest memory block is not big enough for the things it needs to do.
I logged the biggest memory block and found out that before the error I only have (INTERNAL) 1080 byte, even if the free heap is 43784.

Can you help me find a solution to this problem?
Is there a defragmentation algorithm I can use or something that might help solve this problem?

Environment

  • ESP-Matter Commit Id: 066791d (HEAD -> release/v1.2, origin/release/v1.2)
  • ESP-IDF Commit Id: a322e6bdad4b6675d4597fb2722eea2851ba88cb (HEAD, tag: v5.2.1)
  • SoC: ESP32S3

Additional Details
I'm doing the test using th-fall2023.

@github-actions github-actions bot changed the title PacketBuffer: pool EMPTY PacketBuffer: pool EMPTY (CON-1394) Oct 28, 2024
@shubhamdp
Copy link
Contributor

@FMistrorigo Looking into this.

@shubhamdp
Copy link
Contributor

@FMistrorigo I ran (src/python_testing/TC_SC_3_6.py + factory-reset) locally around 10 times to see if I can reproduce the issue locally, but I was unable to do so.

Could you please share:

  • DUT and TH logs.
  • Please run the idf.py size-components > /tmp/size-output.txt and attach the file [want to know how much heap is available for the application's operation]
  • Heap dump just after nvs flash, berfore/after esp_matter::start, and commissioning complete. If possible add a simple loop in the main task which just dumps the memory to console. You can use this code-block to dump memory
    static esp_err_t mem_dump_console_handler(int argc, char *argv[])
    {
    printf("\tDescription\tInternal\tSPIRAM\n");
    printf("Current Free Memory\t%d\t\t%d\n",
    heap_caps_get_free_size(MALLOC_CAP_8BIT) - heap_caps_get_free_size(MALLOC_CAP_SPIRAM),
    heap_caps_get_free_size(MALLOC_CAP_SPIRAM));
    printf("Largest Free Block\t%d\t\t%d\n", heap_caps_get_largest_free_block(MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL),
    heap_caps_get_largest_free_block(MALLOC_CAP_SPIRAM));
    printf("Min. Ever Free Size\t%d\t\t%d\n", heap_caps_get_minimum_free_size(MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL),
    heap_caps_get_minimum_free_size(MALLOC_CAP_SPIRAM));
    return ESP_OK;
    }
  • More details about the environment that the device is in.
  • Gather the mdns noise in the environment when running this test case. For linux machines you can use the avahi-browse and on macOS dns-sd -B.

@jonsmirl
Copy link
Contributor

This issue has lots of history:
#657
#555
#436
#434

I still see it. It is not bothering me because it is transient and will self clear after a minute.

My best guess? Packet transmissions are failing and that starts tying up retry buffers. After a while all of the buffers become used. Then wait a minute, the retries time out and the buffers become free. I have switched over to dynamic wifi buffers and pushed them into PSRAM. I have upto 32 TX and 32 RX.

If you are using a low memory chip like the C3 RAM can get fragmented to the point where wifi buffers can't be allocated.

@FMistrorigo
Copy link
Author

Hi @shubhamdp,

Before proceeding I would like to clarify the behavior of our device that we are finishing certifying.
We use a proprietary app to provide the wifi credentials via Bluetooth, once the device is connected with our cloud we start the Matter task, in this way we have a device that can be considered onNetwork.
About the HEAP memory, a lot of it is used for the freertos tasks but also for wifi and bluetooth (which do not coexist at the same time to reduce the use of RAM), we therefore find ourselves having to manage everything related to Matter with 54KB of RAM, this would also be possible but the problem probably lies in the fragmentation of the Heap which has as its largest memory block only 15KB. (Even after a reboot the largest block is only 31KB, as you can see in the LOGS)

That being said, I did as you asked and gathered all the information.

FAIL LOG on TH.txt
FAIL LOGS on DUT.txt
size-output.txt
FAIL LOG on DUT with heap monitor.txt

HEAP during run.txt -> Recap of the heap in different parts of the code.

For security reasons for my company, I cannot provide mdns traffic data.
The DUT was tested in different environments (here in my company, at home and by our ATL) , the problem persists, it's important to notice that the TC-SC-3.6 sometimes don't fail but passes without problems.

As said by @jonsmirl, looking at the log the problem could be in the retry buffers, but I'm not sure.

Thanks,
Fabio.

@jonsmirl
Copy link
Contributor

jonsmirl commented Oct 31, 2024

I am not an Espressif employee, but I have been working with ESP chips and Matter since the beginning. I would strongly recommend against shipping products which are sitting right on the edge of not being able to fit into a chip. Matter grows and grows over time. While you might be able to just squeeze everything in at the moment, after the next update you might not fit anymore. Then you will tell the programmers to make it fit anyway. And that is where you can end up in a lot of trouble because the effort needed to keep making it fit is going to get larger and larger and cost more and more to do. This path typically ends up costing more than just using a larger chip to begin with and initially shipping with some spare RAM and flash available. We are using the ESP32-S3-PICO-1-N8R2. 4MB flash is too close to the edge for us since it only allows a 1.8MB OTP partition. 8MB allows a 3.8MB image of which we are using 2MB so 1.8MB of room to grow. I got tired of spending major amounts of effort shoehorning into the S3 on chip RAM, adding 2MB of PSRAM gives us 1.8MB for future expansion, meanwhile I am using it to shadow the flash for increased performance.

We use the ESP32-S3-PICO-1-N8R2 is every product. That is a huge benefit because now we can have a single binary image which supports every product we make. When it boots it queries the product ID and then enables the correct support code -- that even includes some products which have LCDs and some which don't. This single image makes development far easier to do and saves more on coding costs than we lose by spending a little more on hardware.

The only things I wouldn't do this way are light bulbs and sensors. Light bulbs have no need for continual upgrades, instead they get replaced so it is reasonable to use right up the edge of the chip in a light bulb. And our sensors use another technology which is far more power efficient than Matter.

@FMistrorigo
Copy link
Author

Hi,
I have some updates. Over the past few days, we managed to free up approximately 60KB of heap memory with Matter started and Wi-Fi connected. Thanks to some debugging, we discovered that Matter does not allocate any new memory in runtime but consumes about 12KB of heap memory throughout its entire lifecycle.

The problem persists, we always get the error "chip[CSL]: PacketBuffer: pool EMPTY."

image

The problem happens both when I try to connect to wifi more than once via our app, and when I try to do the TC-SC-3.6 test.

I have already tried to increase CHIP_DEVICE_CONFIG_MAX_EVENT_QUEUE_SIZE (to 50 or even 100) from menuconfig but nothing changes.

I share with you our sdkconfig in the hope that the problem is there.
sdkconfig.txt

@jonsmirl
Copy link
Contributor

I have moved up to ESP IDF 5.2.3 since I saw some patches in this area in the git logs.

@jonsmirl
Copy link
Contributor

jonsmirl commented Nov 11, 2024

I am also using:
CONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM=32
CONFIG_ESP32_WIFI_DYNAMIC_TX_BUFFER_NUM=32
and I put those buffers in PSRAM
CONFIG_SPIRAM_TRY_ALLOCATE_WIFI_LWIP=y

I have a MB+ of free PSRAM, so plenty of room.

@FMistrorigo
Copy link
Author

Hi,

After a lot of time and a lot of debug i finally foud the problem and came out with a solution.
During the test i was having the "chip[CSL]: PacketBuffer: pool EMPTY.", this line was printed inside the function PacketBufferHandle::New(size_t aAvailableSize, uint16_t aReservedSize) only when lPacket is nullptr, this means that there was not enought space to allocate a pool with the function pbuf_alloc().
At this point I added more debug to check the free heap and the largest free block when the "PacketBuffer: pool Empty" error was thrown.

    if (lPacket == nullptr)
    {
        ChipLogError(chipSystemLayer, "\taAvailableSize=%d \taReservedSize=%d \tlAllocSize=%d  \tlBlockSize=%d", aAvailableSize, aReservedSize, lAllocSize, lBlockSize );
        ChipLogError(chipSystemLayer, "Failed to allocate: %d", lAllocSize);
        ChipLogError(chipSystemLayer, "Free HEAP before pool EMPTY: %d", (int)xPortGetFreeHeapSize());
        ChipLogError(chipSystemLayer, "Largest free block: %d", heap_caps_get_largest_free_block(MALLOC_CAP_INTERNAL));
        ChipLogError(chipSystemLayer, "PacketBuffer: pool EMPTY.");
        return PacketBufferHandle();
    }

I confirmed that the largest free block was smaller than the actual block the program was trying to allocate.

By following every single tip to reduce heap size (https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/performance/ram-usage.html), changing esp-idf version to 5.2.3 to get the latest fixes, minimizing the freertos tasks stack sizes, and also following the lwIP minimum RAM usage (https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/lwip.html#minimum-ram-usage) I was able to reach 80KB of free heap (with wifi connected and matter initialized).

The minimum free heap you need to have to pass every time the TC-SC-3.6 test is at least 70KB in my case (I used this function to find out what was the peak heap memory that the lwIP consumes: https://docs.espressif.com/projects/esp-idf/en/stable/esp32/api-guides/lwip.html#peak-buffer-usage), so now apart from exceptional cases of fragmentation, the problem is solved.

Thanks @jonsmirl for the patience and your suggestions.

Fabio.

@jonsmirl
Copy link
Contributor

jonsmirl commented Nov 13, 2024

I run this little task all of the time

static size_t s_prepopulated_num = 0;
static heap_task_totals_t s_totals_arr[MAX_TASK_NUM];
static heap_task_block_t s_block_arr[MAX_BLOCK_NUM];

static void esp_dump_per_task_heap_info(void)
{
    heap_task_info_params_t heap_info = {0};
    heap_info.caps[0] = MALLOC_CAP_INTERNAL;        // Gets heap with CAP_INTERNAL capabilities
    heap_info.mask[0] = MALLOC_CAP_INTERNAL;
    heap_info.caps[1] = MALLOC_CAP_SPIRAM;       // Gets heap info with CAP_SPIRAM capabilities
    heap_info.mask[1] = MALLOC_CAP_SPIRAM;
    heap_info.tasks = NULL;                     // Passing NULL captures heap info for all tasks
    heap_info.num_tasks = 0;
    heap_info.totals = s_totals_arr;            // Gets task wise allocation details
    heap_info.num_totals = &s_prepopulated_num;
    heap_info.max_totals = MAX_TASK_NUM;        // Maximum length of "s_totals_arr"
    heap_info.blocks = s_block_arr;             // Gets block wise allocation details. For each block, gets owner task, address and size
    heap_info.max_blocks = MAX_BLOCK_NUM;       // Maximum length of "s_block_arr"

    heap_caps_get_per_task_info(&heap_info);

    for (int i = 0 ; i < *heap_info.num_totals; i++) {
        if (heap_info.totals[i].task == (TaskHandle_t)0x580e) {
            printf("Invalid handle %p\n", heap_info.totals[i].task);
            continue;
        }
        printf("Task: %s %p -> MALLOC_CAP_INTERNAL: %d MALLOC_CAP_SPIRAM: %d Highwater %d\n",
                heap_info.totals[i].task ? pcTaskGetName(heap_info.totals[i].task) : "Pre-Scheduler allocs" ,
                heap_info.totals[i].task,
                heap_info.totals[i].size[0],    // Heap size with CAP_INTERNAL capabilities
                heap_info.totals[i].size[1],   // Heap size with CAP_SPIRAM capabilities
                heap_info.totals[i].task ? uxTaskGetStackHighWaterMark(heap_info.totals[i].task) : 0);
    }

    printf("\n\n");
}

void memTask(void *pvParameter)
{
    int dump = 0;

    while (1) {
        /* Delay 1 tick (assumes FreeRTOS tick is 10ms */
        vTaskDelay(pdMS_TO_TICKS(20000));

        dump++;
        ESP_LOGI(TAG, "Dump #%d Free internal %d biggest block %d total %d PSRAM free %d", dump,
            heap_caps_get_free_size(MALLOC_CAP_INTERNAL), 
            heap_caps_get_largest_free_block(MALLOC_CAP_INTERNAL), 
            xPortGetFreeHeapSize(), heap_caps_get_free_size(MALLOC_CAP_SPIRAM));
        esp_dump_per_task_heap_info();
        if (dump % 5 == 0) {
            ESP_LOGI(TAG, "Dump #%d Free %d", dump, xPortGetFreeHeapSize());
            heap_caps_print_heap_info(MALLOC_CAP_INTERNAL);
            ets_printf("----\n");
            heap_caps_print_heap_info(MALLOC_CAP_SPIRAM);
            ets_printf("==========\n");
        }
    }

    /* A task should NEVER return */
    vTaskDelete(NULL);
}

@jonsmirl
Copy link
Contributor

You could also switch over to static wifi buffers.

@dhrishi
Copy link
Collaborator

dhrishi commented Nov 13, 2024

@FMistrorigo With this commit, the RAM usage in the data model is reduced by around 4-5KB. Please check if you have this commit, if you are using the main branch.

@FMistrorigo
Copy link
Author

Hi @dhrishi,

I'm not using the main branch so I don't have the fix, but I solved the problem following the "minimizing RAM usage" page in the documentation and reaching the quantity of heap i needed to allocate every pool of lwIP in it's peak heap consumption.
For now the issue is solved, and we will be able to certify our product.

Thanks,
Fabio

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants