Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cylc clean while cat-log running #5359

Merged
merged 8 commits into from
Mar 28, 2023
Merged

Conversation

datamel
Copy link
Contributor

@datamel datamel commented Feb 9, 2023

cylc clean will fail when a cat-log process is keeping one of the log files open. This is a more pressing issue now that the log view is available in the gui.
This is related to an nfs issue and may not be possible to reproduce on all systems.

The solution involves

  • changing the cat-log command to include --follow=name, changed in the default.
  • changing rmtree behaviour, by retrying on error.

closes #5341

Note for debugging this issue, I changed the error message to be passed into the gui.

To test, open the gui and open a log file. With the log view still open, clean the workflow.
This will error on master but not on this branch.

In the gui, I think we should maybe close any open views on clean, perhaps even redirect to the dashboard? But this would be a separate issue.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders oliver-sanders added this to the cylc-8.1.2 milestone Feb 16, 2023
@oliver-sanders oliver-sanders added the bug Something is wrong :( label Feb 16, 2023
@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.2, cylc-8.1.3 Feb 16, 2023
@oliver-sanders oliver-sanders marked this pull request as draft February 16, 2023 12:52
@oliver-sanders
Copy link
Member

If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

Needs rebasing onto 8.1.x.

@datamel datamel changed the base branch from master to 8.1.x February 20, 2023 10:33
@MetRonnie
Copy link
Member

MetRonnie commented Mar 6, 2023

Note for debugging this issue, I changed the error message to be passed into the gui, so this (if reviewers are happy with the approach) will also close #5324.

Thanks, however this partially addresses that issue. We need to check for other instances of logging that should be moved into the exception message

@datamel
Copy link
Contributor Author

datamel commented Mar 7, 2023

Note for debugging this issue, I changed the error message to be passed into the gui, so this (if reviewers are happy with the approach) will also close #5324.

Thanks, however this partially addresses that issue. We need to check for other instances of logging that should be moved into the exception message

Okay, I've removed the close from the description.

@datamel datamel marked this pull request as ready for review March 7, 2023 13:47
cylc/flow/scripts/clean.py Outdated Show resolved Hide resolved
@MetRonnie
Copy link
Member

MetRonnie commented Mar 17, 2023

Ah, just realised a slight flaw with only testing for ENOTEMPTY - the second time it tries it, the error will be EBUSY instead, so it will only retry once before failing. (Assuming it's a "sillyrenamed" .nfsXXXX... file causing the problem)

image

Also annoying is how rmtree only reports the filename, not path. If it's a job log file that's being kept open by a running job, there could be hundreds of job log dirs and you don't know which one contains the .nfsXXXX....

Anyway, we can sort these out in a follow-up issue.

@oliver-sanders
Copy link
Member

oliver-sanders commented Mar 22, 2023

Ah, just realised a slight flaw with only testing for ENOTEMPTY - the second time it tries it, the error will be EBUSY instead

Dammit, confirmed with this script:

#!/usr/bin/env bash                                                      
                                                                         
set -eux                                  
                                          
TD="$HOME/cylc-run/foo/log/scheduler"    
mkdir -p "$TD"         
TF="$TD/log"           
                       
echo -e 'a\nb\nc' > "$TF"    
tree "$TD"             
                       
tail --follow=name -f "$TF" & PROC="$!"    
     
sleep 2    
     
# this will likely fail due to FS locking stuffs    
rm -r "$TD" && echo 'REMOVE PASSED' || echo 'REMOVE FAILED'    
rm -r "$TD" && echo 'REMOVE PASSED' || echo 'REMOVE FAILED'    
rm -r "$TD" && echo 'REMOVE PASSED' || echo 'REMOVE FAILED'                                                                        
                                             
# tidy up after                              
kill "$PROC"                                 
rm -r "$TD" || true     

Which reliably gives:

  • Directory not empty
  • Device or resource busy
  • Device or resource busy

Will play a little more but I think we need to add this code too.

@oliver-sanders
Copy link
Member

I used this script to double-check the premise that the tail process dies when the file is removed and to confirm the error message(s) returned:

#!/usr/bin/env bash                                                      
                                                                         
set -eu                                                            
                                                 
TD="$HOME/cylc-run/foo/log/scheduler"            
mkdir -p "$TD"                                   
TF="$TD/log"                                     
                                                 
echo -e 'a\nb\nc' > "$TF"                        
                                                 
# NOTE: for some reason the tail command doesn't die when launched
# as a background process                        
echo 'Start you tail process now:'               
echo "i.e. tail -f --follow=name '$TF'"          
sleep 5                                          
                                                 
# retry removal a few times                      
SLEEP=1                                          
for try in $(seq 1 10); do                       
    sleep "$SLEEP"                               
    echo "Try: $try"                             
    rm -r "$TD" || continue                      
    echo "Success after $((try * SLEEP)) seconds."
    break            
done                 
                     
# tidy up after JIC
rm -r "$TD" 2>/dev/null || true 

This is hard to test, would need NFS in a container and whatnot so commenting it here for the record, will link this comment into the docstring.

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works a charm for me.

$ cylc vip whatever
$ cylc cat-log whatever -m t
$ cylc clean -y whatever

@oliver-sanders
Copy link
Member

Managed to work out a functional test after a little bodging, can bump that and the changelog to #5432

Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested it out. Added a changelog entry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move error information out of printing/logging and into exception messages cat-log: use --follow=name
3 participants