-
This should "just work" for you:

```julia
using Dagger, DTables, DataFrames

# starting from a DataFrame to make things simple
x = DataFrame(; a=1:10_000, b=1:10_000)

# create 10 partitions
d = DTable(x, 1000)

results = []
for i in 1:(length(d.chunks) - 1)
    # run on the current + next partition and calculate the average of column :a
    out = Dagger.@spawn ((c1, c2) -> sum(c1.a .+ c2.a) / 2)(d.chunks[i], d.chunks[i + 1])
    # in an ideal world, the return is an object, not a scalar (eg, a DataFrame), hence the generic `push!`
    push!(results, out)
end
fetch.(results)
```
What you're trying to do looks a lot like what the DTables.jl internals look like, so feel free to have a look at how things are done there for more inspiration. We should probably handle windowed functions differently. I suggest you try OnlineStats with this to skip the manual work. Something like this: https://github.com/krynju/mgr-benchmarks/blob/6f536976fb78750e46f1a4f06b9239e6c443faea/dtable/scripts/dtable_full_scenario_stages.jl#L56-L62

Out of core (so disk caching) can be enabled through …
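As for the OnlineStats suggestion, here is a minimal sketch (an assumption about the shape of the linked code, not a copy of it; `df_i` and `df_next` are placeholders for the tables behind two neighbouring chunks):

```julia
using OnlineStats

# fit a running mean of column :a on each of the two neighbouring partitions
o1 = fit!(Mean(), df_i.a)
o2 = fit!(Mean(), df_next.a)

# merge the per-chunk statistics into one windowed (two-chunk) statistic
merge!(o1, o2)
value(o1)   # the windowed mean of :a
```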
-
One thing to beware of is that …
For multithreading, I would put a lock around it. For distributed, I would instead use something like …
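For reference, a minimal sketch of the lock idea (assuming the spawning loop itself were run on multiple threads; `d` and the closure come from the snippet above):

```julia
results = []
lk = ReentrantLock()

# guard the shared `results` vector while several threads push into it
Threads.@threads for i in 1:(length(d.chunks) - 1)
    out = Dagger.@spawn ((c1, c2) -> sum(c1.a .+ c2.a) / 2)(d.chunks[i], d.chunks[i + 1])
    lock(lk) do
        push!(results, out)
    end
end

fetch.(results)
```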
-
Oooops, there should be no … there. Fixed and edited the previous comment:

```julia
using Dagger, DTables, DataFrames

# starting from a DataFrame to make things simple
x = DataFrame(; a=1:10_000, b=1:10_000)

# create 10 partitions
d = DTable(x, 1000)

results = []
for i in 1:(length(d.chunks) - 1)
    # run on the current + next partition and calculate the average of column :a
    out = Dagger.@spawn ((c1, c2) -> sum(c1.a .+ c2.a) / 2)(d.chunks[i], d.chunks[i + 1])
    # in an ideal world, the return is an object, not a scalar (eg, a DataFrame), hence the generic `push!`
    push!(results, out)
end
fetch.(results)
```
-
Thank you both for your comments! The original MWE is working great now :) I'm ultimately looking for a …

Note: I'm using the latest DTables and Dagger versions on main (to have access to …).

My questions are below (hopefully they will be helpful for others as well):
Q1: Eg, "I want to sum up column `a`" (if `reduce` didn't exist...); a rough sketch is included at the end of this comment.
Q2: This code runs, but the files will be created in the … I suspect this is a bug that comes from here. I think the fix might be: …
I'm happy to open an issue/PR, but I wanted to confirm that I'm not using it wrong.
Q3: Eg, … (references the code in Q2)
Thank you :)
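Regarding Q1, a minimal sketch of what the manual, no-`reduce` route could look like (an assumption building on the chunk pattern from the replies above, not an official DTables API):

```julia
# `d` is the DTable from the snippets above; one Dagger task per chunk
# computes a partial sum of column :a, and the caller combines the partials
partials = [Dagger.@spawn (c -> sum(c.a))(chunk) for chunk in d.chunks]
total = sum(fetch.(partials))
```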
-
First of all, thank you again for creating these packages!
I have a use case where I need to process a large number of tables in a windowed query (think tables saved by the hour and I need two-hour averages).
I discussed with @jpsamaroo at JCon that it would be best to process it by simply operating on the chunks underpinning the DTable, but I've been struggling to produce a working example (see the MWE below).
I get an error about indexing: "ERROR: ArgumentError: invalid index: EagerThunk (running) of type Dagger.EagerThunk"
Or an error about Chunk: "ERROR: type Chunk has no field a"
I probably wrongly assumed that wrapping it in `Dagger.@spawn` would delay execution until it's ready to run (ie, that the inner function would never see the `Chunk`).
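As a guess (these are assumptions, not necessarily the original code), the two errors quoted above typically come from patterns like these:

```julia
# 1) "type Chunk has no field a": Dagger.@spawn only delays the outermost call,
#    so `d.chunks[1].a` is still evaluated eagerly on the raw Chunk wrapper
out = Dagger.@spawn sum(d.chunks[1].a .+ d.chunks[2].a)

# 2) "invalid index: EagerThunk (running)": using an unfetched task result as an index
t = Dagger.@spawn length(d.chunks)
d.chunks[t]
```

The closure pattern shown in the replies above avoids the first issue, because chunks passed as arguments to `Dagger.@spawn` are unwrapped before the task body runs.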
MWE:
Thank you!
EDIT:
The bonus question would be how to run it out-of-core, as presented by Julian (with the deserialize/serialize arguments)?
I couldn't find that API on the main branch / in the docs.
I suspect it might not be released yet, right?