Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing .arrow files without closing Julia seems impossible in Windows #492

Closed
Tortar opened this issue Oct 20, 2023 · 18 comments
Closed

Comments

@Tortar
Copy link

Tortar commented Oct 20, 2023

Hi,

I'm a bit in a desperate situation since I'd like to resolve an issue in Agents.jl involving the Arrow.jl package on Windows but even if I tried hard I couldn't: it seems to me that in reality it could be a problem with Arrow.jl since all my (maybe poor) attempts to solve the problem failed. This is the PR trying to solve it: JuliaDynamics/Agents.jl#909.

In practice we have currently in Agents.jl here a test which performs:

            # Arrow, fails on Windows (see issue #826 (https://github.com/JuliaDynamics/Agents.jl/issues/826))
            if !(Sys.iswindows())
                offline_run!(model, 365 * 5;
                    when_model = each_year,
                    when = six_months,
                    backend = :arrow,
                    mdata = [:flag, :year],
                    adata = [(:weight, mean)],
                    writing_interval = 3
                )

                adata_saved = DataFrame(Arrow.Table("adata.arrow"))
                @test size(adata_saved) == (11, 2)
                @test propertynames(adata_saved) == [:step, :mean_weight]

                mdata_saved = DataFrame(Arrow.Table("mdata.arrow"))
                @test size(mdata_saved) == (6, 3)
                @test propertynames(mdata_saved) == [:step, :flag, :year]

                @test size(vcat(DataFrame.(Arrow.Stream("adata.arrow"))...)) == (11, 2)
                @test size(vcat(DataFrame.(Arrow.Stream("mdata.arrow"))...)) == (6, 3)

                rm("adata.arrow")
                rm("mdata.arrow")
                @test !isfile("adata.arrow")
                @test !isfile("mdata.arrow")
            end

On Linux and MacOS it works fine, but on Windows the operating system refuses to remove the .arrow files giving:

ERROR: IOError: unlink("adata.arrow"): permission denied (EACCES)
Stacktrace:
 [1] uv_error
   @ .\libuv.jl:100 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem .\file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem .\file.jl:283
 [4] rm(path::String)
   @ Base.Filesystem .\file.jl:273
 [5] top-level scope
   @ REPL[17]:1

As you can see for Windows it is disables since we didn't manage to make it work.

It's probably necessary to describe what is the relationship between the Agents.jl function offline_run! and Arrow.jl, it is rather simple, offline_run! calls multiple times this other function:

function Agents.writer_arrow(filename, data, append)
    if append
        Arrow.append(filename, data)
    else
        Arrow.write(filename, data; file = false)
    end
end

I'm trying multiple ways to solve the issue, but none worked. I will try to provide an MWE if necessary.

@Moelf
Copy link
Contributor

Moelf commented Oct 20, 2023

Arrow uses memory map which means as long as the memory map is not garbage collected, Windows can't delete, because windows filesystem is kinda annoying.

What you should do is to make sure reference is cleared and forced GC by running GC.gc() a few times

@Tortar
Copy link
Author

Tortar commented Oct 20, 2023

thanks @Moelf .

I tried this code, but removal still gives error

            adata_arrow = Arrow.Table("adata.arrow")
            adata_saved = DataFrame(adata_arrow)
            @test size(adata_saved) == (11, 2)
            @test propertynames(adata_saved) == [:step, :mean_weight]

            mdata_arrow = Arrow.Table("mdata.arrow")
            mdata_saved = DataFrame(mdata_arrow)
            @test size(mdata_saved) == (6, 3)
            @test propertynames(mdata_saved) == [:step, :flag, :year]

            adata_stream_arrow = Arrow.Stream("adata.arrow")
            mdata_stream_arrow = Arrow.Stream("mdata.arrow")
            @test size(vcat(DataFrame.(adata_stream_arrow)...)) == (11, 2)
            @test size(vcat(DataFrame.(mdata_stream_arrow)...)) == (6, 3)

            if Sys.iswindows()
                adata_arrow = nothing
                mdata_arrow = nothing
                mdata_stream_arrow = nothing
                adata_stream_arrow = nothing
                for _ in 1:5
                    GC.gc()
                end
            end
            rm("adata.arrow")
            rm("mdata.arrow")
            @test !isfile("adata.arrow")
            @test !isfile("mdata.arrow")

edit: yes actually there was a typo, will rerun tests

@Moelf
Copy link
Contributor

Moelf commented Oct 20, 2023

adata_arrow = Arrow.Table("adata.arrow")

this is also a reference holding adata.arrow

@Tortar
Copy link
Author

Tortar commented Oct 20, 2023

I fixes the typo in the above code and rerun test, still getting IOError: unlink("adata.arrow"): permission denied (EACCES), how should I clean up the reference? doesn't adata_arrow = nothing do it?

@Moelf
Copy link
Contributor

Moelf commented Oct 20, 2023

I'm out of ideas here.

@Tortar
Copy link
Author

Tortar commented Oct 21, 2023

thanks anyway for the help @Moelf .

To be able to bisect the issue more easily I created this MWE:

using Arrow, DataFrames

function writer_arrow(filename, data, append)
    if append
        Arrow.append(filename, data)
    else
        Arrow.write(filename, data; file = false)
    end
end

filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
    writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
GC.gc(); GC.gc()
rm("MWE.arrow")

which gives when runned in REPL:

julia> rm("MWE.arrow")
ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)
Stacktrace:
 [1] uv_error
   @ .\libuv.jl:100 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem .\file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem .\file.jl:283
 [4] rm(path::String)
   @ Base.Filesystem .\file.jl:273
 [5] top-level scope
   @ REPL[11]:1

Version info of my settings:

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 12 virtual cores

Hopefully this will help someone who can solve the issue :-)

@Tortar
Copy link
Author

Tortar commented Oct 21, 2023

ok strangely enough to me if you add data_saved = nothing the file can gets removed without problems. This sounds strange to me since data_saved is a DataFrame which even if created with an arrow table shouldn't be related to it, do you have any idea on how is this possible?

e.g. this works fine

using Arrow, DataFrames

function writer_arrow(filename, data, append)
    if append
        Arrow.append(filename, data)
    else
        Arrow.write(filename, data; file = false)
    end
end

filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
    writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
data_saved = nothing
GC.gc(); GC.gc()
rm("MWE.arrow")

@Tortar
Copy link
Author

Tortar commented Oct 21, 2023

ohh ok got it, the dataframe doesn't copy the content of the arrow table, just steal the reference, if one tries:

julia> deleteat!(data_saved, 2)
ERROR: ArgumentError: `Arrow.Primitive{Int64, Vector{Int64}}` does not support `deleteat!`; arrow data is by nature immutable
Stacktrace:
 [1] deleteat!(x::Arrow.Primitive{Int64, Vector{Int64}}, inds::Vector{Int64})
   @ Arrow C:\Users\bob\.julia\packages\Arrow\R2Rvz\src\arraytypes\arraytypes.jl:32
 [2] deleteat!(A::SentinelArrays.ChainedVector{Int64, Arrow.Primitive{Int64, Vector{Int64}}}, inds::Vector{Int64})
   @ SentinelArrays C:\Users\bob\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:560
 [3] _deleteat!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:919
 [4] deleteat!(df::DataFrame, inds::Int64)
   @ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:861
 [5] top-level scope
   @ REPL[33]:1

is this correct behaviour though?

It isn't even possible to add new rows to this dataframe since the underlying arrow table is immutable:

julia> append!(data_saved, data)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64, Vector{Int64}}, ::Int64)

Closest candidates are:
  resize!(::Vector, ::Integer)
   @ Base array.jl:1246
  resize!(::BitVector, ::Integer)
   @ Base bitarray.jl:814
  resize!(::SparseArrays.ReadOnly, ::Any)
   @ SparseArrays C:\Users\bob\AppData\Local\Programs\Julia-1.9.3\share\julia\stdlib\v1.9\SparseArrays\src\readonly.jl:33
  ...

....

@Tortar
Copy link
Author

Tortar commented Oct 21, 2023

All in all, I think this is probably a problem with DataFrames.jl.

edit: okay, actually it is specified in the DataFrames.jl docs so problem solved:

Another example is Arrow.Table, where arrow data is inherently immutable so columns can't be accidentally mutated anyway. To be able to mutate arrow data, columns must be materialized,
which can be accomplished via DataFrame(arrow_table, copycols=true).

@Tortar Tortar closed this as completed Oct 21, 2023
@Tortar
Copy link
Author

Tortar commented Oct 22, 2023

Reopening since even with this new information I couldn't get the code to work on Windows in the tests of Agents.jl and so...I got a new updated MWE:

using Arrow, DataFrames, Test
@testset "reproduce" begin
    data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
    Arrow.write("MWE.arrow", data; file = false)
    DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
    nothing
    for _ in 1:10 GC.gc() end
    rm("MWE.arrow")
end

this again throws "ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)", notice that strangely if I remove the if statement wrapping all the code everything works fine.

edit: using a testset to make it more realistic

@Tortar Tortar reopened this Oct 22, 2023
@Moelf
Copy link
Contributor

Moelf commented Oct 22, 2023

GC.gc()

try to add a few more GC.gc() call

@Tortar
Copy link
Author

Tortar commented Oct 22, 2023

updated the code with 10 GC runs, same error

@Moelf
Copy link
Contributor

Moelf commented Oct 22, 2023

maybe try isolate the workload into a function, call that function, and run GC outside? idk if you're testing this in REPL or not, but in REPL you would have ans referencing to the last line

@Tortar
Copy link
Author

Tortar commented Oct 22, 2023

I'm testing in REPL but ans is nothing in this case though, anyway changing it in

using Arrow, DataFrames, Test
function create_file()
    data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
    Arrow.write("MWE.arrow", data; file = false)
    DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
end
@testset "reproduce" begin
    create_file()
    GC.gc()
    rm("MWE.arrow")
end

removes the file fine.

But still I have problems understanding why that other version without the function call instead gives error, could this be a GC issue?

@Moelf
Copy link
Contributor

Moelf commented Oct 22, 2023

well, I think global scope and running GC.gc() in a for loop is not very clean conceptually.

@Tortar
Copy link
Author

Tortar commented Oct 22, 2023

I'm feeling that this problem is similar to https://discourse.julialang.org/t/how-to-force-an-object-to-be-freed-by-the-garbage-collector/77985. (also, even if I change for _ in 1:10 GC.gc() end with GC.gc(); GC.gc(); GC.gc(); ... the error is still there)

@Moelf
Copy link
Contributor

Moelf commented Oct 22, 2023

I think the general problem is GC interaction happens with compiled code, and the code we human write are not what actually happens line by line at runtime, so you shouldn't rely on compiler does something specific.

I agree compiler should account for GC effects but that's probably hard, meanwhile just remember to isolate by unit of individual function gets you very far already

@Tortar
Copy link
Author

Tortar commented Oct 22, 2023

thanks for all the help @Moelf, do you think we can close this one then since the problem seems unrelated to Arrow in particular?

@Tortar Tortar closed this as completed Oct 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants