Flush file output on program exit #13995
Ruby flushes the IO buffers on exit, so I think we should mirror what they do. I'm not sure exactly how they do it, whether through finalizers or some other mechanism.

```console
$ irb
> data = "stuff\n"
> file = File.open("test.txt", :mode => "a")
> File.size(file.path)
=> 0
> file.size
=> 0
> file.write(data)
=> 6
> File.size(file.path)
=> 0
> file.size
=> 6
> exit
$ cat test.txt
stuff
```

---
I think most GC languages would consider this to be working as intended. Finalizers are only for freeing memory from indirect allocations; Nim in particular was very vocal about it (~7 years ago, anyway). So why would you free memory right before freeing all the memory? Finalizers are also unreliable. Handles should just be closed explicitly (usually via a block version or in an `ensure` clause).

I think it's not good to delay the exit of all applications just to help the few users who forget to close their handles. Meanwhile Python, despite also having finalizers, always warns about unclosed files at the end of each unittest.
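As a rough illustration (not from the original comment; file name hypothetical), the two explicit-close patterns mentioned above look like this in Crystal:

```crystal
# Pattern 1: the block version. The file is closed (and its buffer
# flushed) automatically when the block exits, even via an exception.
File.open("test.txt", "a") do |file|
  file << "stuff\n"
end

# Pattern 2: manual open paired with `ensure`.
file = File.open("test.txt", "a")
begin
  file << "stuff\n"
ensure
  file.close # flushes the userspace buffer before releasing the fd
end
```

---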
This could be something only for debug mode - somehow print warnings about all unclosed handles at the end of the program.

---
In this particular case it is `file.size` that flushes the buffer:

```ruby
file = File.open("test.txt", :mode => "a")
file.write "abc"
File.size(file.path) # => 0
File.size(file.path) # => 0
file.size            # => 3
File.size(file.path) # => 3
```

Ruby does that to all writable `IO`s.
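For comparison, a minimal Crystal rendition of the same experiment (a sketch; file name hypothetical, `test.txt` assumed to start empty) shows the bytes reaching disk only on an explicit flush or close:

```crystal
file = File.open("test.txt", "a")
file << "stuff\n"
File.size(file.path) # => 0 -- still in the userspace buffer
file.flush
File.size(file.path) # => 6 -- now on disk
file.close
```

---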
Finalizers aren't just for freeing memory. They do that, but more generally they clean up after themselves. @Fryguy mentioned that Ruby flushes I/O buffers on exit, and I've also confirmed that multiple other languages flush output buffers on program exit. The following languages (with example code) all write data to disk despite using buffered I/O and never closing or flushing those buffers, explicitly or even implicitly by exceeding buffer capacity:

…

I agree that file descriptors should be closed explicitly, or the auto-closing block version used. "Should" as in it's the ideal thing to do, especially for long-lived processes. But losing data after it was "written", without an exceptional case (like a crash), shouldn't happen.

---

"Flushing buffers to disk" is now a different description than "using finalizers to flush buffers to disk"; surely C doesn't do that.

---
Ah, thanks for that detail @HertzDevil. I didn't realize that `size` itself was also doing the flushing. I think this is the more pertinent statement. I feel like we should follow Ruby here.

---
Not different, just a concrete example and the most important one I can come up with. If the general case I suggested of running all finalizers has unacceptable tradeoffs, that's fine, but I hope I've made the point that at least the special case of flushing buffers is the bare minimum feature request.

---
Since we started this discussion, the effect of IO finalizers has changed significantly: since #14882 they only close the file descriptor and no longer flush the buffer. See #14807 for details on why this is necessary.

For the record, there would also be no other reason to run the finalizers at exit, because their only effect is closing the file descriptor, which happens implicitly after exit anyway. In summary, finalizers are not a way forward for this.

---
Re: #13995 (comment)
The C and Go examples involve no user-space output buffering.

---
I've changed the title of the issue because folks seem to be focusing a lot more on the use of finalizers than I intended. I suggested finalizers originally only to leave room for other kinds of cleanup and because it was the only idea I had. The body of the original issue post is about the use case of flushing file buffers. With finalizers eliminated as a path toward safely flushing the buffers on exit as other programming languages do, what other options are there?

---
None. We don't keep track of IO objects.

---
Yeah, I don't see any feasible mechanism for this either. A theoretical solution could be collecting IO instances in a global list to close them in an `at_exit` handler. Another concept that might be useful for this would be a weak reference, so tracked instances could still be collected.

In the end I don't think this can be effectively handled in the Crystal runtime. IO integrity must be ensured in user code. Only there can you really know the dependencies of IO instances. I think it's a good idea in general to explicitly close or flush IO instances. Consider flushing important data at strategic locations. That does not only help with orderly exit behaviour, but also in other situations: a process can always end abruptly, without a proper exit or any means to handle anything.
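As a concrete sketch of ensuring IO integrity in user code (names hypothetical, not from the original comment): an application can flush the handful of long-lived handles it knows about from its own `at_exit` hook:

```crystal
# A long-lived handle opened at program start.
LOG_FILE = File.open("app.log", "a")

# User code knows which handles matter, so it can flush them
# itself on orderly termination.
at_exit do
  begin
    LOG_FILE.flush
  rescue IO::Error
    # Disk full or fd already closed; nothing more we can do here.
  end
end

LOG_FILE.puts "application started"
```

---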
I feel like a language that is otherwise as programmer-friendly as Crystal is (and even claims it in the tagline on the website) shouldn't require the programmer to take extra steps to avoid shooting themselves in the foot while doing something as basic as writing to a file on local disk. I close files explicitly (or use the block form).

PR incoming.

---
I think keeping a linked list of open `IO`s and flushing them at exit could work. I remain unconvinced where exactly this is an issue without calling `exit` explicitly.

---
@RX14 I thought about the linked list, but then we'd keep live references to all IO objects, the GC won't collect lost IO objects anymore, and we'll start leaking file descriptors again. We could use weak references, but then we can't use a "cheap" intrusive linked list anymore. Maybe we could manually allocate the nodes somewhere the GC won't scan...

@jgaskins there's no point flushing on exit when the GC collecting a lost IO wouldn't flush it either. This issue is intrinsically connected to the "IO finalizers don't flush" issue. Again: they don't flush because flushing involves runtime code that CAN'T run within the GC (details in #14807).

We're missing a mechanism to defer executing finalizers, so they wouldn't run while the GC has stopped the world, but after the GC resumes the world, and thus alongside the rest of the application (after running, the finalizer is removed, and the next GC collection will free the object). Only then could we start thinking of keeping track of open `IO`s.

---
Maybe an IO flusher (or generic deferred work) system fiber, with finalizers pushing to a queue to be handled whenever that fiber next gets scheduled? Then the `at_exit` problem is still there, which is huge. Designing a brand-new data structure to keep track of all open `IO::Buffered` instances, which doesn't leak memory and is performant under parallelism, isn't easy. And that's what's required to implement this.

I think the one use case for this feature is when you open a buffered IO as a constant or class variable which you expect to live throughout a long-lived daemon program. Short scripts can easily use the block form.
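A rough sketch of that deferred-work idea, purely hypothetical design rather than existing runtime behavior (today's finalizers run inside the GC and could not literally enqueue like this):

```crystal
# A queue of IOs whose flushing has been deferred to ordinary
# scheduled code instead of running inside the collector.
FLUSH_QUEUE = Channel(IO).new(64)

# The "IO flusher" fiber: drains the queue whenever it is scheduled.
spawn do
  while io = FLUSH_QUEUE.receive?
    io.flush rescue nil # a lost IO may already be in a bad state
  end
end

# In the proposal, a finalizer (or other cleanup site) would enqueue:
# FLUSH_QUEUE.send(some_io)
```

As noted above, this still leaves the `at_exit` problem: nothing guarantees the flusher fiber runs before the process terminates.

---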
The current behavior is not documented anywhere that I can find, even while looking for it. Even if it were, I haven't found any other programming language where writing to a file doesn't write to a file when the program exits cleanly, so it is still surprising behavior. There are a lot of scenarios where a program writes to one or more files and then exits. Many programs are not long-lived, and exiting is normal.

---
@jgaskins You make a good point that files should better err on the side of safety and ensure correct behaviour in the default case. If an implicit performance optimization is incompatible with that, it must be opt-in. So it sounds like a good idea to disable output buffering for files by default.

For example, running this Crystal program while following the output file with `tail -f` prints nothing for ten seconds and then both lines at once:

```crystal
File.open("out.txt", "w") do |file|
  file.puts "hello"
  sleep 10.seconds
  file.puts "world"
end
```

```console
$ crystal build write-out.cr
$ touch out.txt
$ tail -f out.txt &
$ ./write-out
$ # 10 second delay...
hello
world
```

Btw. Julia Evans talks about pipe buffering with this exact use case as a premise in a recent blog post. Uncoordinated buffering can lead to long delays, which are quite often undesired.
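For what it's worth, a per-handle opt-out already exists: `IO::Buffered#sync=` disables the userspace write buffer, so writes go straight to the file descriptor. A minimal variation of the program above:

```crystal
File.open("out.txt", "w") do |file|
  file.sync = true  # every write goes straight to the fd
  file.puts "hello" # visible to `tail -f` immediately
  sleep 10.seconds
  file.puts "world"
end
```

---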
Indeed. This was actually the original use case brought up in the forum post linked in the issue body. I proposed solving the problem of data being lost when the program exits because I assumed it would be an easier sell than solving the problem of data not being immediately consistent (and, admittedly, I'm sympathetic to eventual consistency) but, based on the discussion here, it seems like we can't solve either one without solving both.
Incredible post. She's got some fantastic (and timely!) content. Adding her blog to my RSS reader.
100% agreed with all of this. It's totally understandable why the current behavior was chosen with the use cases that were being considered at the time.

---
This issue has been mentioned on Crystal Forum. There might be relevant details there: https://forum.crystal-lang.org/t/using-wait-groups-with-http-server/7452/4

---
GNU libc flushes all open file descriptors when `exit` is called.

Almost all web servers allow setting a buffer size for log files; for example, if compression is enabled in nginx, it defaults to 64k. Without compression, it's recommended to use 4k or more in high-traffic situations, in order to significantly reduce the wait time for the writes. Especially if the filesystem has ….

Crystal's approach of "data is lost by default unless the developer takes non-trivial steps to prevent it" is very surprising. I am bitten by it often. I am curious what the ….
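Crystal exposes a similar per-handle knob, `IO::Buffered#buffer_size=`, which must be set before the buffer is first used. A sketch mirroring the 64k figure above (file name and log line hypothetical):

```crystal
log = File.open("access.log", "a")
log.buffer_size = 64 * 1024 # batch many small log lines per write(2)
log.puts "GET / 200"
# Exit will not flush this for you; close (or flush) explicitly.
log.close
```

---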
@plambert As already mentioned in #13995 (comment), file handles in glibc (or any libc) don't have userspace buffers. The documentation of ….

The Crystal runtime does call ….

---
@HertzDevil Sure. It does so only in non-recoverable error situations though. This does not constitute a normal termination and it might not be safe to do so.

---
Feature Request
When your program exits, certain objects may not be properly cleaned up. For example, userland I/O buffers may not be flushed, so data that you wrote to a `File` or `TCPSocket` may not actually make it.

I haven't personally run into this, but it came up on the forum today and I was surprised until I considered the implementation of `IO::Buffered`. I feel like you shouldn't have to know about that.

It used to be the case (up until about 10-20 years ago?) that operating systems wouldn't flush stream buffers on program exit, as well, and there was a common refrain of "… and make sure you call `close()` before exiting your program" because of things like this. Eventually, it just made sense for OSes to clean up file descriptors before exiting.

I'm assuming that calling `finalize` on all objects on exit is the simplest way to do this, however that may be my own naïveté talking. I don't know if that's something provided by BDW or any other GC implementation (maybe this function?). And even if it is feasible, I don't know if there's a way to do it that takes object dependency into account. For example, I would want an `OpenSSL::SSL::Socket::Client` to have its `finalize` method called before the `TCPSocket` underneath it, because doing it in the reverse order could cause problems: both have their own internal buffers, so if the `TCPSocket` is closed first, the `OpenSSL::SSL::Socket::Client` can't flush its buffer to the socket.

If that is not feasible, are there other ways to ensure this for specific object types?
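Absent runtime support for dependency-aware cleanup, that ordering has to be expressed manually today: close the TLS layer first so it can flush into the still-open TCP socket. A sketch with hypothetical host and payload:

```crystal
require "openssl"
require "socket"

tcp = TCPSocket.new("example.com", 443)
ssl = OpenSSL::SSL::Socket::Client.new(tcp, hostname: "example.com")
begin
  ssl << "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
ensure
  ssl.close # flush TLS records into the TCP socket first
  tcp.close # then release the underlying descriptor
end
```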