
[BUG] When Fluid and Alluxio are configured to preheat a dataset, some files are not preheated successfully #4380

yizhouv5 opened this issue Oct 30, 2024 · 0 comments · Labels: bug

What is your environment (Kubernetes version, Fluid version, etc.)?
K8s: v1.29.7
Containerd: 1.7.22
OS: Ubuntu 22.04.3
fluid: v1.0.2-41eefb6
alluxio/alluxio-dev:2.9.0

Describe the bug
After the Fluid Dataset and AlluxioRuntime CR resources were created, and before the Fluid PVC was mounted into any Kubernetes pod, a DataLoad CR was created to preheat the dataset. Occasionally some files in the dataset fail to preheat, and the loader log only reports the failure without any further detail. Are there specific ways to locate the cause and fix this?

What you expect to happen:
Each file should be preheated successfully

How to reproduce it
1. Dataset and AlluxioRuntime CR YAML:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fluid-8b-preheat
spec:
  mounts:
    - mountPoint: local:///yunmai/llama3-ds-ckp
      name: fluid-8b-preheat
  accessModes:
    - ReadWriteMany
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fluid-8b-preheat
spec:
  replicas: 2  # number of Alluxio cache worker replicas to start
  data:
    replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 500Gi
        high: "0.95"
        low: "0.8"
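
Before creating the DataLoad, it can be worth confirming that the dataset is bound and the runtime workers are ready. A minimal check via Fluid's CRDs (the exact printed columns depend on the Fluid release, so treat the output format as an assumption):

# Dataset phase plus UFS total size / cached bytes / cache capacity columns
kubectl get dataset fluid-8b-preheat -n default

# Worker and fuse readiness of the Alluxio runtime
kubectl get alluxioruntime fluid-8b-preheat -n default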

2. DataLoad CR YAML:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: fluid-8b-preheat
spec:
  dataset:
    name: fluid-8b-preheat
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 2
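
To watch the preheat while it runs, the DataLoad status and the loader job's log can be checked as sketched below; the "-loader" job-name suffix follows Fluid's usual naming and is an assumption here, so adjust it if your job is named differently:

# DataLoad phase (Pending / Executing / Complete / Failed) and duration
kubectl get dataload fluid-8b-preheat -n default

# Log of the pod spawned by the loader job (job name assumed to be <dataload-name>-loader)
kubectl logs -n default -l job-name=fluid-8b-preheat-loader --tail=200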

3. Loader job log:

+ alluxio fs distributedLoad --replication 2 /
    Please wait for command submission to finish..
    Submitted successfully, jobControlId = 1730198234915
    Waiting for the command to finish ...
    Get command status information below:
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/.gitattributes
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/LICENSE
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/README.md
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/USE_POLICY.md
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/configuration.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/generation_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00001-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00002-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00003-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00004-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model.safetensors.index.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/special_tokens_map.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/configuration.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/generation_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/latest_checkpointed_iteration.txt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/model.safetensors.index.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_001/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_003/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_000/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_001/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_002/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_003/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/special_tokens_map.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.bin
    Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.idx
    Total completed file count is 31, failed file count is 2
    Finished running the command, jobControlId = 1730198234915
    Here are failed files:
    /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_002/model_optim_rng.pt,
    /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt,
    Check out ./logs/user/distributedLoad__failures.csv for full list of failed files.

real 1m13.203s
user 1m4.701s
sys 0m4.343s

+ echo -e 'distributedLoad on / ends'
+ (( i++ ))
distributedLoad on / ends
+ (( i<1 ))
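
The summary only names the two failed files; the per-file reasons go into the distributedLoad__failures.csv mentioned in the log, which is written where the distributedLoad client ran, i.e. inside the loader job's pod. A sketch of how one might dig further (all pod and container names below are assumptions based on Fluid's usual "<runtime>-master-0" / "<dataload>-loader" naming, not taken from this cluster):

# Read the failure CSV while the loader pod still exists
# (pod name is hypothetical; the relative CSV path is as printed in the log above)
kubectl exec -n default fluid-8b-preheat-loader-xxxxx -- \
  cat ./logs/user/distributedLoad__failures.csv

# Retry just one of the failed files with the same command the loader job used
kubectl exec -n default fluid-8b-preheat-master-0 -c alluxio-master -- \
  alluxio fs distributedLoad --replication 2 \
    /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt

# Check how much of that file is now held in Alluxio (the percentage column in the ls output)
kubectl exec -n default fluid-8b-preheat-master-0 -c alluxio-master -- \
  alluxio fs ls /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt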

4. Preheated dataset size:

du -sh *

34G llama3-ckpts
70G llama3-datasets
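
For scale: the preheated data is roughly 34G + 70G ≈ 104G, while the MEM tier configured above provides 500Gi per worker across 2 replicas, so cache capacity alone should not be the limiting factor. One way to confirm tier usage from the master (pod and container names are assumptions, as above):

# Total and per-worker capacity/usage of the configured cache tiers
kubectl exec -n default fluid-8b-preheat-master-0 -c alluxio-master -- \
  alluxio fsadmin report capacity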

Additional Information
