
[BUG] When Fluid and Alluxio are configured to preheat a dataset, some files are not preheated successfully #4380

yizhouv5 opened this issue Oct 30, 2024 · 0 comments · Labels: bug

What is your environment (Kubernetes version, Fluid version, etc.)?
K8s: v1.29.7
Containerd: 1.7.22
OS: Ubuntu 22.04.3
fluid: v1.0.2-41eefb6
alluxio/alluxio-dev:2.9.0

Describe the bug
After the Fluid Dataset and AlluxioRuntime CR resources were created, and before the Fluid PVC was mounted into any Kubernetes pod, a DataLoad CR was created to preheat the dataset. Occasionally some files in the dataset fail to preheat, and the loader log only reports the failure without any further detail. Are there specific ways to locate the cause and fix this?

What you expect to happen:
Each file should be preheated successfully

How to reproduce it
1. Dataset and AlluxioRuntime CR YAML:
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fluid-8b-preheat
spec:
  mounts:
    - mountPoint: local:///yunmai/llama3-ds-ckp
      name: fluid-8b-preheat
  accessModes:
    - ReadWriteMany
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fluid-8b-preheat
spec:
  replicas: 2  # number of Alluxio cache worker replicas to start
  data:
    replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 500Gi
        high: "0.95"
        low: "0.8"
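
Before creating the DataLoad, it can be worth confirming that the dataset is bound and the runtime workers are ready. A minimal check via Fluid's CRDs (the exact printed columns depend on the Fluid release, so treat the output format as an assumption):

# Dataset phase plus UFS total size / cached bytes / cache capacity columns
kubectl get dataset fluid-8b-preheat -n default

# Worker and fuse readiness of the Alluxio runtime
kubectl get alluxioruntime fluid-8b-preheat -n default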

2. DataLoad CR YAML:
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: fluid-8b-preheat
spec:
  dataset:
    name: fluid-8b-preheat
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 2
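
To watch the preheat while it runs, the DataLoad status and the loader job's log can be checked as sketched below; the "-loader" job-name suffix follows Fluid's usual naming and is an assumption here, so adjust it if your job is named differently:

# DataLoad phase (Pending / Executing / Complete / Failed) and duration
kubectl get dataload fluid-8b-preheat -n default

# Log of the pod spawned by the loader job (job name assumed to be <dataload-name>-loader)
kubectl logs -n default -l job-name=fluid-8b-preheat-loader --tail=200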

3. Loader job log:

+ alluxio fs distributedLoad --replication 2 /
    Please wait for command submission to finish..
    Submitted successfully, jobControlId = 1730198234915
    Waiting for the command to finish ...
    Get command status information below:
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/.gitattributes
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/LICENSE
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/README.md
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/USE_POLICY.md
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/configuration.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/generation_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00001-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00002-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00003-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00004-of-00004.safetensors
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model.safetensors.index.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/special_tokens_map.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/configuration.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/generation_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/latest_checkpointed_iteration.txt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/model.safetensors.index.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_001/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_003/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_000/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_001/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_002/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_003/model_optim_rng.pt
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/special_tokens_map.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer.json
    Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer_config.json
    Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.bin
    Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.idx
    Total completed file count is 31, failed file count is 2
    Finished running the command, jobControlId = 1730198234915
    Here are failed files:
    /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_002/model_optim_rng.pt,
    /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt,
    Check out ./logs/user/distributedLoad__failures.csv for full list of failed files.

real 1m13.203s
user 1m4.701s
sys 0m4.343s

+ echo -e 'distributedLoad on / ends'
+ (( i++ ))
distributedLoad on / ends
+ (( i<1 ))
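
The summary only names the two failed files; the per-file reasons go into the distributedLoad__failures.csv mentioned in the log, which is written where the distributedLoad client ran, i.e. inside the loader job's pod. A sketch of how one might dig further (all pod and container names below are assumptions based on Fluid's usual "<runtime>-master-0" / "<dataload>-loader" naming, not taken from this cluster):

# Read the failure CSV while the loader pod still exists
# (pod name is hypothetical; the relative CSV path is as printed in the log above)
kubectl exec -n default fluid-8b-preheat-loader-xxxxx -- \
  cat ./logs/user/distributedLoad__failures.csv

# Retry just one of the failed files with the same command the loader job used
kubectl exec -n default fluid-8b-preheat-master-0 -c alluxio-master -- \
  alluxio fs distributedLoad --replication 2 \
    /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt

# Check how much of that file is now held in Alluxio (the percentage column in the ls output)
kubectl exec -n default fluid-8b-preheat-master-0 -c alluxio-master -- \
  alluxio fs ls /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt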

4. Preheated dataset size:

du -sh *

34G llama3-ckpts
70G llama3-datasets
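
For scale: the preheated data is roughly 34G + 70G ≈ 104G, while the MEM tier configured above provides 500Gi per worker across 2 replicas, so cache capacity alone should not be the limiting factor. One way to confirm tier usage from the master (pod and container names are assumptions, as above):

# Total and per-worker capacity/usage of the configured cache tiers
kubectl exec -n default fluid-8b-preheat-master-0 -c alluxio-master -- \
  alluxio fsadmin report capacity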

Additional Information
