KAFKA-15776: Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout #14778

Merged (5 commits) on Jun 5, 2024

Conversation

kamalcph
Contributor

@kamalcph kamalcph commented Nov 16, 2023

KIP-1018

Test: Existing unit and integration tests

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kamalcph kamalcph marked this pull request as ready for review November 16, 2023 10:05
@kamalcph kamalcph added the tiered-storage Related to the Tiered Storage feature label Nov 16, 2023
@divijvaidya
Contributor

Could you please help me understand how this change works with fetch.max.wait.ms from a user perspective i.e. what happens when we are retrieving data from both local & remote in a single fetch call?

Also, wouldn't this change affect user clients? Asking because, prior to this change, users were expecting a guaranteed response within fetch.max.wait.ms = 500ms, but now they might not receive a response until the 40s request.timeout.ms. If users have configured their application timeouts according to fetch.max.wait.ms, this change will break their applications.

@kamalcph
Contributor Author

kamalcph commented Nov 20, 2023

Could you please help me understand how this change works with fetch.max.wait.ms from a user perspective i.e. what happens when we are retrieving data from both local & remote in a single fetch call?

The fetch.max.wait.ms timeout is applicable only when there is not enough data (fetch.min.bytes) to respond to the client. This is a special case: when we read data from both local and remote storage, the FETCH request has to wait for the tail latency, which is the combined latency of reading from both local and remote storage.

Note that we always read from only one remote partition, up to max.partition.fetch.bytes, even though there is bandwidth available in the FETCH response (fetch.max.bytes). The client rotates the partition order in the next FETCH request so that the next partitions are served.
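To make the rotation concrete, here is a minimal sketch of the behaviour described above. It is a simplified model and not the Kafka client or broker code: the class `RemotePartitionRotation`, the method `serveNext`, and the partition names are hypothetical, and it assumes exactly one remote partition is served per FETCH request.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical model: one remote partition is served per FETCH request,
// and the served partition moves to the back of the order for the next request.
final class RemotePartitionRotation {

    // Returns the partition served by this request and rotates the order.
    static String serveNext(Deque<String> fetchOrder) {
        String served = fetchOrder.removeFirst(); // head of the order is served
        fetchOrder.addLast(served);               // rotate so the others get a turn
        return served;
    }

    public static void main(String[] args) {
        Deque<String> order = new ArrayDeque<>(List.of("topic-0", "topic-1", "topic-2"));
        for (int request = 1; request <= 4; request++) {
            System.out.println("FETCH " + request + " serves " + serveNext(order));
        }
    }
}
```

Under this model, every partition that needs a remote read is eventually served, at the cost of one FETCH round trip per partition.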

Also, wouldn't this change affect user clients? Asking because, prior to this change, users were expecting a guaranteed response within fetch.max.wait.ms = 500ms, but now they might not receive a response until the 40s request.timeout.ms. If users have configured their application timeouts according to fetch.max.wait.ms, this change will break their applications.

fetch.max.wait.ms doesn't guarantee a response within this timeout. The client expires the request only when it exceeds request.timeout.ms, which is 30 seconds by default. The time taken to serve the FETCH request can be higher than fetch.max.wait.ms due to a slow hard disk, disk sector errors, and so on.
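As a rough illustration of how the two client-side settings relate, the sketch below holds the consumer defaults in a plain `java.util.Properties` object. The config keys and default values are the standard consumer ones; the class `FetchTimeoutSettings` and its helper methods are hypothetical and exist only for this example.

```java
import java.util.Properties;

// Hypothetical helper that records the consumer defaults discussed above.
final class FetchTimeoutSettings {

    static Properties consumerDefaults() {
        Properties props = new Properties();
        props.setProperty("fetch.min.bytes", "1");        // respond as soon as any data is available
        props.setProperty("fetch.max.wait.ms", "500");    // broker-side wait when fetch.min.bytes is not met
        props.setProperty("request.timeout.ms", "30000"); // client-side expiry of the whole request
        return props;
    }

    // Headroom between the client request timeout and the broker wait, in milliseconds.
    static long headroomMs(Properties props) {
        long waitMs = Long.parseLong(props.getProperty("fetch.max.wait.ms"));
        long requestTimeoutMs = Long.parseLong(props.getProperty("request.timeout.ms"));
        return requestTimeoutMs - waitMs;
    }

    public static void main(String[] args) {
        System.out.println("headroom ms: " + headroomMs(consumerDefaults()));
    }
}
```

The point of the comparison is that the client only gives up on a request at request.timeout.ms, so a fetch that takes longer than fetch.max.wait.ms is slow but not failed from the client's point of view.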

The FetchRequest.json doesn't expose the client-configured request timeout, so we are using the default server request timeout of 30 seconds. Otherwise, we can introduce one more config, fetch.remote.max.wait.ms, to define the delay timeout for DelayedRemoteFetch requests. We need to decide whether to keep this config on the client or on the server, since the server operator may need to tune it for all clients if the remote storage degrades and the latency to serve remote FETCH requests is high.

@showuon
Contributor

showuon commented Nov 21, 2023

I understand the problem you're trying to solve, but using the server default request timeout doesn't make sense to me. It will break the contract of the fetch protocol that fetch.max.wait.ms will not be exceeded if there is not sufficient data in the "local" log. I understand the remote read is something of a grey area as to whether "data exists or not", but we have to admit, some users might feel surprised when their fetch doesn't respond in fetch.max.wait.ms time. Ideally, we should introduce another config for this remote read wait, instead of reusing the request timeout.


This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Feb 20, 2024
@showuon
Contributor

showuon commented Apr 12, 2024

Correction: For this:

some users might feel surprised when their fetch doesn't respond in fetch.max.wait.ms time.

This is wrong. Even if the remote read has not completed yet, the fetch request will still return within fetch.max.wait.ms. It's just an empty response.

@github-actions github-actions bot removed the stale Stale PRs label Apr 13, 2024
@kamalcph kamalcph changed the title KAFKA-15776: Use the FETCH request timeout as the delay timeout for DelayedRemoteFetch [WIP] KAFKA-15776: Use the FETCH request timeout as the delay timeout for DelayedRemoteFetch May 21, 2024
kamalcph added 2 commits June 4, 2024 12:58
…eFetch timeout

KAFKA-15776: Use the FETCH request timeout as the delay timeout for DelayedRemoteFetch

DelayedRemoteFetch uses the `fetch.max.wait.ms` config as the delay timeout for DelayedRemoteFetchPurgatory. The purpose of `fetch.max.wait.ms` is to wait for the given amount of time when there is no data available to serve the FETCH request.

```
The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy the requirement given by fetch.min.bytes.
```

Using the same timeout in the DelayedRemoteFetchPurgatory can confuse users about how to configure the optimal value for each purpose. Moreover, the config is of LOW importance, so most users won't configure it and will use the default value of 500 ms.

Having a delay timeout of 500 ms in DelayedRemoteFetchPurgatory can lead to a higher number of expired delayed remote fetch requests when the remote storage has any degradation.
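
The effect of the delay timeout on expirations can be sketched with a small, deterministic model. This is an illustration under stated assumptions and not the broker's purgatory implementation: the class `RemoteFetchExpiryModel`, the method `countExpired`, and the latency samples are all hypothetical, and a request is treated as expired whenever its simulated remote read latency exceeds the delay timeout.

```java
import java.util.List;

// Hypothetical model: a delayed remote fetch expires when the remote read
// takes longer than the configured delay timeout.
final class RemoteFetchExpiryModel {

    static long countExpired(List<Long> remoteReadLatenciesMs, long delayTimeoutMs) {
        return remoteReadLatenciesMs.stream()
                .filter(latencyMs -> latencyMs > delayTimeoutMs)
                .count();
    }

    public static void main(String[] args) {
        // Made-up latencies for a degraded remote store, in milliseconds.
        List<Long> latenciesMs = List.of(120L, 450L, 800L, 2500L);
        System.out.println("expired at 500 ms:  " + countExpired(latenciesMs, 500L));
        System.out.println("expired at 3000 ms: " + countExpired(latenciesMs, 3000L));
    }
}
```

With these made-up samples, two of the four requests expire under a 500 ms timeout and none expire under a 3000 ms timeout, which is the motivation for making the remote fetch timeout configurable separately from `fetch.max.wait.ms`.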
@kamalcph kamalcph changed the title [WIP] KAFKA-15776: Use the FETCH request timeout as the delay timeout for DelayedRemoteFetch KAFKA-15776: Use the FETCH request timeout as the delay timeout for DelayedRemoteFetch Jun 4, 2024
@kamalcph
Contributor Author

kamalcph commented Jun 4, 2024

@showuon @clolov @jeqo

The patch is ready for review. PTAL.

Will open a separate PR to emit the RemoteLogReader FetchRateAndTimeMs metric.

Contributor

@showuon showuon left a comment

LGTM! Just a minor comment.

@showuon
Contributor

showuon commented Jun 5, 2024

@kamalcph, could we move the dynamic config change into another PR? I have some comments on it, but that is separate from the original changes.

Contributor

@showuon showuon left a comment

LGTM!

@kamalcph kamalcph changed the title KAFKA-15776: Use the FETCH request timeout as the delay timeout for DelayedRemoteFetch KAFKA-15776: Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout Jun 5, 2024
@showuon
Contributor

showuon commented Jun 5, 2024

Failed tests are unrelated.

@showuon showuon merged commit 02c794d into apache:trunk Jun 5, 2024
1 check failed
@kamalcph kamalcph deleted the KAFKA-15776 branch June 5, 2024 08:23
apourchet added a commit to apourchet/kafka that referenced this pull request Jun 6, 2024
commit ee834d9
Author: Antoine Pourchet <[email protected]>
Date:   Thu Jun 6 15:20:48 2024 -0600

    KAFKA-15045: (KIP-924 pt. 19) Update to new AssignmentConfigs (apache#16219)

    This PR updates all of the streams task assignment code to use the new AssignmentConfigs public class.

    Reviewers: Anna Sophie Blee-Goldman <[email protected]>

commit 8a2bc3a
Author: Bruno Cadonna <[email protected]>
Date:   Thu Jun 6 21:19:52 2024 +0200

    KAFKA-16903: Consider produce error of different task (apache#16222)

    A task does not know anything about a produce error thrown
    by a different task. That might lead to a InvalidTxnStateException
    when a task attempts to do a transactional operation on a producer
    that failed due to a different task.

    This commit stores the produce exception in the streams producer
    on completion of a send instead of the record collector since the
    record collector is on task level whereas the stream producer
    is on stream thread level. Since all tasks use the same streams
    producer the error should be correctly propagated across tasks
    of the same stream thread.

    For EOS alpha, this commit does not change anything because
    each task uses its own producer. The send error is still
    on task level but so is also the transaction.

    Reviewers: Matthias J. Sax <[email protected]>

commit 7d832cf
Author: David Jacot <[email protected]>
Date:   Thu Jun 6 21:19:20 2024 +0200

    KAFKA-14701; Move `PartitionAssignor` to new `group-coordinator-api` module (apache#16198)

    This patch moves the `PartitionAssignor` interface and all the related classes to a newly created `group-coordinator/api` module, following the pattern used by the storage and tools modules.

    Reviewers: Ritika Reddy <[email protected]>, Jeff Kim <[email protected]>, Chia-Ping Tsai <[email protected]>

commit 79ea7d6
Author: Mickael Maison <[email protected]>
Date:   Thu Jun 6 20:28:39 2024 +0200

    MINOR: Various cleanups in clients (apache#16193)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit a41f7a4
Author: Murali Basani <[email protected]>
Date:   Thu Jun 6 18:06:25 2024 +0200

    KAFKA-16884 Refactor RemoteLogManagerConfig with AbstractConfig (apache#16199)

    Reviewers: Greg Harris <[email protected]>, Kamal Chandraprakash <[email protected]>, Chia-Ping Tsai <[email protected]>

commit 0ed104c
Author: Kamal Chandraprakash <[email protected]>
Date:   Thu Jun 6 21:26:08 2024 +0530

    MINOR: Cleanup the storage module unit tests (apache#16202)

    - Use SystemTime instead of MockTime when time is not mocked
    - Use static assertions to reduce the line length
    - Fold the lines if it exceeds the limit
    - rename tp0 to tpId0 when it refers to TopicIdPartition

    Reviewers: Kuan-Po (Cooper) Tseng <[email protected]>, Chia-Ping Tsai <[email protected]>

commit f36a873
Author: Cy <[email protected]>
Date:   Thu Jun 6 08:46:49 2024 -0700

    MINOR: Added test for ClusterConfig#displayTags (apache#16110)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit 226f3c5
Author: Sanskar Jhajharia <[email protected]>
Date:   Thu Jun 6 18:48:23 2024 +0530

    MINOR: Code cleanup in metadata module (apache#16065)

    Reviewers: Mickael Maison <[email protected]>

commit ebe1e96
Author: Loïc GREFFIER <[email protected]>
Date:   Thu Jun 6 13:40:31 2024 +0200

    KAFKA-16448: Add ProcessingExceptionHandler interface and implementations (apache#16187)

    This PR is part of KAFKA-16448 which aims to bring a ProcessingExceptionHandler to Kafka Streams in order to deal with exceptions that occur during processing.

    This PR brings ProcessingExceptionHandler interface and default implementations.

    Co-authored-by: Dabz <[email protected]>
    Co-authored-by: sebastienviale <[email protected]>

    Reviewer: Bruno Cadonna <[email protected]>

commit b74b182
Author: Lianet Magrans <[email protected]>
Date:   Thu Jun 6 09:45:36 2024 +0200

    KAFKA-16786: Remove old assignment strategy usage in new consumer (apache#16214)

    Remove usage of the partition.assignment.strategy config in the new consumer. This config is deprecated with the new consumer protocol, so the AsyncKafkaConsumer should not use or validate the property.

    Reviewers: Lucas Brutschy <[email protected]>

commit f880ad6
Author: Alyssa Huang <[email protected]>
Date:   Wed Jun 5 23:30:49 2024 -0700

    KAFKA-16530: Fix high-watermark calculation to not assume the leader is in the voter set (apache#16079)

    1. Changing log message from error to info - We may expect the HW calculation to give us a smaller result than the current HW in the case of quorum reconfiguration. We will continue to not allow the HW to actually decrease.
    2. Logic for finding the updated LeaderEndOffset for updateReplicaState is changed as well. We do not assume the leader is in the voter set and check the observer states as well.
    3. updateLocalState now accepts an additional "lastVoterSet" param which allows us to update the leader state with the last known voters. any nodes in this set but not in voterStates will be added to voterStates and removed from observerStates, any nodes not in this set but in voterStates will be removed from voterStates and added to observerStates

    Reviewers: Luke Chen <[email protected]>, José Armando García Sancio <[email protected]>

commit 3835515
Author: Okada Haruki <[email protected]>
Date:   Thu Jun 6 15:10:13 2024 +0900

    KAFKA-16541 Fix potential leader-epoch checkpoint file corruption (apache#15993)

    A patch for KAFKA-15046 got rid of fsync on LeaderEpochFileCache#truncateFromStart/End for performance reason, but it turned out this could cause corrupted leader-epoch checkpoint file on ungraceful OS shutdown, i.e. OS shuts down in the middle when kernel is writing dirty pages back to the device.

    To address this problem, this PR makes below changes: (1) Revert LeaderEpochCheckpoint#write to always fsync
    (2) truncateFromStart/End now call LeaderEpochCheckpoint#write asynchronously on scheduler thread
    (3) UnifiedLog#maybeCreateLeaderEpochCache now loads epoch entries from checkpoint file only when current cache is absent

    Reviewers: Jun Rao <[email protected]>

commit 7763243
Author: Florin Akermann <[email protected]>
Date:   Thu Jun 6 00:22:31 2024 +0200

    KAFKA-12317: Update FK-left-join documentation (apache#15689)

    FK left-join was changed via KIP-962. This PR updates the docs accordingly.

    Reviewers: Ayoub Omari <[email protected]>, Matthias J. Sax <[email protected]>

commit 1134520
Author: Ayoub Omari <[email protected]>
Date:   Thu Jun 6 00:05:04 2024 +0200

    KAFKA-16573: Specify node and store where serdes are needed (apache#15790)

    Reviewers: Matthias J. Sax <[email protected]>, Bruno Cadonna <[email protected]>, Anna Sophie Blee-Goldman <[email protected]>

commit 896af1b
Author: Sanskar Jhajharia <[email protected]>
Date:   Thu Jun 6 01:46:59 2024 +0530

    MINOR: Raft module Cleanup (apache#16205)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit 0109a3f
Author: Antoine Pourchet <[email protected]>
Date:   Wed Jun 5 14:09:37 2024 -0600

    KAFKA-15045: (KIP-924 pt. 17) State store computation fixed (apache#16194)

    Fixed the calculation of the store name list based on the subtopology being accessed.

    Also added a new test to make sure this new functionality works as intended.

    Reviewers: Anna Sophie Blee-Goldman <[email protected]>

commit 52514a8
Author: Greg Harris <[email protected]>
Date:   Wed Jun 5 11:35:32 2024 -0700

    KAFKA-16858: Throw DataException from validateValue on array and map schemas without inner schemas (apache#16161)

    Signed-off-by: Greg Harris <[email protected]>
    Reviewers: Chris Egerton <[email protected]>

commit f2aafcc
Author: Sanskar Jhajharia <[email protected]>
Date:   Wed Jun 5 20:06:01 2024 +0530

    MINOR: Cleanups in Shell Module (apache#16204)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit bd9d68f
Author: Abhijeet Kumar <[email protected]>
Date:   Wed Jun 5 19:12:25 2024 +0530

    KAFKA-15265: Integrate RLMQuotaManager for throttling fetches from remote storage (apache#16071)

    Reviewers: Kamal Chandraprakash<[email protected]>, Luke Chen <[email protected]>, Satish Duggana <[email protected]>

commit 62e5cce
Author: gongxuanzhang <[email protected]>
Date:   Wed Jun 5 18:57:32 2024 +0800

    KAFKA-10787 Update spotless version and remove support JDK8 (apache#16176)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit 02c794d
Author: Kamal Chandraprakash <[email protected]>
Date:   Wed Jun 5 12:12:23 2024 +0530

    KAFKA-15776: Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout (apache#14778)

    KIP-1018, part1, Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout

    Reviewers: Luke Chen <[email protected]>

commit 7ddfa64
Author: Dongnuo Lyu <[email protected]>
Date:   Wed Jun 5 02:08:38 2024 -0400

    MINOR: Adjust validateOffsetCommit/Fetch in ConsumerGroup to ensure compatibility with classic protocol members (apache#16145)

    During online migration, there could be ConsumerGroup that has members that uses the classic protocol. In the current implementation, `STALE_MEMBER_EPOCH` could be thrown in ConsumerGroup offset fetch/commit validation but it's not supported by the classic protocol. Thus this patch changed `ConsumerGroup#validateOffsetCommit` and `ConsumerGroup#validateOffsetFetch` to ensure compatibility.

    Reviewers: Jeff Kim <[email protected]>, David Jacot <[email protected]>

commit 252c1ac
Author: Apoorv Mittal <[email protected]>
Date:   Wed Jun 5 05:55:24 2024 +0100

    KAFKA-16740: Adding skeleton code for Share Fetch and Acknowledge RPC (KIP-932) (apache#16184)

    The PR adds skeleton code for Share Fetch and Acknowledge RPCs. The changes include:

    1. Defining RPCs in KafkaApis.scala
    2. Added new SharePartitionManager class which handles the RPCs handling
    3. Added SharePartition class which manages in-memory record states and for fetched data.

    Reviewers: David Jacot <[email protected]>, Andrew Schofield <[email protected]>, Manikumar Reddy <[email protected]>

commit b89999b
Author: PoAn Yang <[email protected]>
Date:   Wed Jun 5 08:02:52 2024 +0800

    KAFKA-16483: Remove preAppendErrors from createPutCacheCallback (apache#16105)

    The method createPutCacheCallback has a input argument preAppendErrors. It is used to keep the "error" happens before appending. However, it is always empty. Also, the pre-append error is handled before createPutCacheCallback by calling responseCallback. Hence, we can remove preAppendErrors.

    Signed-off-by: PoAn Yang <[email protected]>

    Reviewers: Luke Chen <[email protected]>

commit 01e9918
Author: Kuan-Po (Cooper) Tseng <[email protected]>
Date:   Wed Jun 5 07:56:18 2024 +0800

    KAFKA-16814 KRaft broker cannot startup when `partition.metadata` is missing (apache#16165)

    When starting up kafka logManager, we'll check stray replicas to avoid some corner cases. But this check might cause broker unable to startup if partition.metadata is missing because when startup kafka, we load log from file, and the topicId of the log is coming from partition.metadata file. So, if partition.metadata is missing, the topicId will be None, and the LogManager#isStrayKraftReplica will fail with no topicID error.

    The partition.metadata missing could be some storage failure, or another possible path is unclean shutdown after topic is created in the replica, but before data is flushed into partition.metadata file. This is possible because we do the flush in async way here.

    When finding a log without topicID, we should treat it as a stray log and then delete it.

    Reviewers: Luke Chen <[email protected]>, Gaurav Narula <[email protected]>

commit d652f5c
Author: TingIāu "Ting" Kì <[email protected]>
Date:   Wed Jun 5 07:52:06 2024 +0800

    MINOR: Add topicIds and directoryIds to the return value of the toString method. (apache#16189)

    Add topicIds and directoryIds to the return value of the toString method.

    Reviewers: Luke Chen <[email protected]>

commit 7e0caad
Author: Igor Soarez <[email protected]>
Date:   Tue Jun 4 22:12:33 2024 +0100

    MINOR: Cleanup unused references in core (apache#16192)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit 9821aca
Author: PoAn Yang <[email protected]>
Date:   Wed Jun 5 05:09:04 2024 +0800

    MINOR: Upgrade gradle from 8.7 to 8.8 (apache#16190)

    Reviewers: Chia-Ping Tsai <[email protected]>

commit 9ceed8f
Author: Colin P. McCabe <[email protected]>
Date:   Tue Jun 4 14:04:59 2024 -0700

    KAFKA-16535: Implement AddVoter, RemoveVoter, UpdateVoter RPCs

    Implement the add voter, remove voter, and update voter RPCs for
    KIP-853. This is just adding the RPC handling; the current
    implementation in RaftManager just throws UnsupportedVersionException.

    Reviewers: Andrew Schofield <[email protected]>, José Armando García Sancio <[email protected]>

commit 8b3c77c
Author: TingIāu "Ting" Kì <[email protected]>
Date:   Wed Jun 5 04:21:20 2024 +0800

    KAFKA-15305 The background thread should try to process the remaining task until the shutdown timer is expired. (apache#16156)

    Reviewers: Lianet Magrans <[email protected]>, Chia-Ping Tsai <[email protected]>

commit cda2df5
Author: Kamal Chandraprakash <[email protected]>
Date:   Wed Jun 5 00:41:30 2024 +0530

    KAFKA-16882 Migrate RemoteLogSegmentLifecycleTest to ClusterInstance infra (apache#16180)

    - Removed the RemoteLogSegmentLifecycleManager
    - Removed the TopicBasedRemoteLogMetadataManagerWrapper, RemoteLogMetadataCacheWrapper, TopicBasedRemoteLogMetadataManagerHarness and TopicBasedRemoteLogMetadataManagerWrapperWithHarness

    Reviewers: Kuan-Po (Cooper) Tseng <[email protected]>, Chia-Ping Tsai <[email protected]>

commit 2b47798
Author: Chris Egerton <[email protected]>
Date:   Tue Jun 4 21:04:34 2024 +0200

    MINOR: Fix return tag on Javadocs for consumer group-related Admin methods (apache#16197)

    Reviewers: Greg Harris <[email protected]>, Chia-Ping Tsai <[email protected]>
TaiJuWu pushed a commit to TaiJuWu/kafka that referenced this pull request Jun 8, 2024
…edRemoteFetch timeout (apache#14778)

KIP-1018, part1, Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout

Reviewers: Luke Chen <[email protected]>
kamalcph added a commit to kamalcph/kafka that referenced this pull request Jun 11, 2024
…edRemoteFetch timeout (apache#14778)

KIP-1018, part1, Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout

Reviewers: Luke Chen <[email protected]>
satishd pushed a commit that referenced this pull request Jun 11, 2024
…edRemoteFetch timeout (#14778)

KIP-1018, part1, Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout

Reviewers: Luke Chen <[email protected]>
gongxuanzhang pushed a commit to gongxuanzhang/kafka that referenced this pull request Jun 12, 2024
…edRemoteFetch timeout (apache#14778)

KIP-1018, part1, Introduce remote.fetch.max.timeout.ms to configure DelayedRemoteFetch timeout

Reviewers: Luke Chen <[email protected]>
Labels
tiered-storage Related to the Tiered Storage feature

3 participants