What is your environment (Kubernetes version, Fluid version, etc.)
K8s: v1.29.7
Containerd: 1.7.22
OS: Ubuntu 22.04.3
fluid: v1.0.2-41eefb6
alluxio/alluxio-dev:2.9.0
Describe the bug
After the Fluid Dataset and AlluxioRuntime CRs were created, and before the Fluid PVC was mounted into any Kubernetes container, a DataLoad CR was created to preheat the dataset. Occasionally, some files in the dataset fail to preheat. The log reports only that preheating failed for those files, with no indication of the cause. Is there a way to locate the root cause, and is there a known fix?
What you expect to happen:
Every file in the dataset should be preheated successfully.
How to reproduce it
1. Dataset and AlluxioRuntime CR YAML:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fluid-8b-preheat
spec:
  mounts:
    - name: fluid-8b-preheat
  accessModes:
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fluid-8b-preheat
spec:
  replicas: 2 # Number of Alluxio cache system Worker replicas to start.
  data:
    replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 500Gi
        high: "0.95"
        low: "0.8"
2. DataLoad CR YAML:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: fluid-8b-preheat
spec:
  dataset:
    name: fluid-8b-preheat
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 2
3. loader-job log:
Please wait for command submission to finish..
Submitted successfully, jobControlId = 1730198234915
Waiting for the command to finish ...
Get command status information below:
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/.gitattributes
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/LICENSE
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/README.md
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/USE_POLICY.md
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/configuration.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/generation_config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00001-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00002-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00003-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00004-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model.safetensors.index.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/special_tokens_map.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer_config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/configuration.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/generation_config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/latest_checkpointed_iteration.txt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/model.safetensors.index.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_001/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_003/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_000/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_001/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_002/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_003/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/special_tokens_map.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer_config.json
Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.bin
Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.idx
Total completed file count is 31, failed file count is 2
Finished running the command, jobControlId = 1730198234915
Here are failed files:
/fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_002/model_optim_rng.pt,
/fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt,
Check out ./logs/user/distributedLoad__failures.csv for full list of failed files.
real 1m13.203s
user 1m4.701s
sys 0m4.343s
distributedLoad on / ends
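Since the failure is intermittent, it may help to compare which files fail across runs. The following is a minimal sketch, not part of Fluid or Alluxio, that pulls the summary counts and the failed paths out of a loader-job log with the shape shown above. The function name `parse_load_log` is hypothetical, and the parsing assumes the log keeps the exact wording "Total completed file count is ..." and "Here are failed files:".

```python
import re

# Matches the summary line printed at the end of the load,
# e.g. "Total completed file count is 31, failed file count is 2".
SUMMARY_RE = re.compile(
    r"Total completed file count is (\d+), failed file count is (\d+)"
)


def parse_load_log(text):
    """Return (completed, failed_count, failed_paths) from a loader-job log.

    Hypothetical helper: it assumes the log format shown in this report.
    Counts are None if the summary line is missing.
    """
    completed = failed_count = None
    failed_paths = []
    in_failed_block = False
    for raw in text.splitlines():
        line = raw.strip()
        match = SUMMARY_RE.search(line)
        if match:
            completed = int(match.group(1))
            failed_count = int(match.group(2))
            continue
        if line.startswith("Here are failed files:"):
            in_failed_block = True
            continue
        if in_failed_block:
            if line.startswith("/"):
                # Failed paths are listed one per line with a trailing comma.
                failed_paths.append(line.rstrip(","))
            else:
                # The first non-path line ends the failed-file list.
                in_failed_block = False
    return completed, failed_count, failed_paths
```

Running this over the logs of several DataLoad attempts and diffing the returned path lists would show whether the same files fail every time or whether the failures move around.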
4. Size of the dataset being preheated:
du -sh *
34G llama3-ckpts
70G llama3-datasets
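As a rough sanity check on whether the configured cache is large enough, the sizes above can be compared against the tiered-store quota. This is a back-of-the-envelope sketch under stated assumptions: that the `quota` applies per worker, that `replicas: 2` in the DataLoad doubles the space needed, and that usable space is capped at the `high` watermark. The function name `cache_headroom_gib` is hypothetical. It checks only the configured quota, not the memory actually available in `/dev/shm` on each node.

```python
def cache_headroom_gib(dataset_gib, data_replicas, workers,
                       quota_gib_per_worker, high_watermark):
    """Return (needed, usable, headroom) in GiB.

    Assumptions (not confirmed by the report): quota is per worker,
    each data replica stores a full copy, and the cache can fill up
    to the high watermark.
    """
    needed = dataset_gib * data_replicas
    usable = workers * quota_gib_per_worker * high_watermark
    return needed, usable, usable - needed


# Values taken from this report: 34G + 70G of data, 2 replicas,
# 2 workers, 500Gi quota, high watermark 0.95.
needed, usable, headroom = cache_headroom_gib(34 + 70, 2, 2, 500, 0.95)
print(f"needed={needed} GiB, usable={usable:.0f} GiB, headroom={headroom:.0f} GiB")
```

Under these assumptions the data needs about 208 GiB against roughly 950 GiB of configured cache, so the configured quota alone does not appear to explain the two failed files.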
Additional Information