Fix getindex and add groupvars and groupindices #1709

bkamins · 2019-02-01T13:32:55Z

I have added methods following the discussion in #1704.

@nalimilan unfortunately we cannot use groups field. The reason is that GroupedDataFrame can contain less levels than originally created as it can be subsetted and we do not update groups variable when subsetting GroupedDataFrame.

This is WIP, because we might not accept this design (if we do I will add tests).

The issue is that we should make sure that groups for subsetted is actually correct - @nalimilan we have a bug here. If we fix the bug the implementation might be faster by using groups field:

Here you have a bug reproduced:

julia> using DataFrames

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.83705  │ 0.103435 │ 0.887152 │ 0.850853 │
│ 2   │ 0.614734 │ 0.696163 │ 0.905968 │ 0.564249 │
│ 3   │ 0.824804 │ 0.273611 │ 0.257941 │ 0.483494 │

julia> gd = groupby(df, :x1)
GroupedDataFrame with 3 groups based on key: x1
First Group (1 row): x1 = 0.8370503031615726
│ Row │ x1      │ x2       │ x3       │ x4       │
│     │ Float64 │ Float64  │ Float64  │ Float64  │
├─────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.83705 │ 0.103435 │ 0.887152 │ 0.850853 │
⋮
Last Group (1 row): x1 = 0.8248041395500108
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.824804 │ 0.273611 │ 0.257941 │ 0.483494 │

julia> gd2 = gd[[3,2]]
GroupedDataFrame with 2 groups based on key: x1
First Group (1 row): x1 = 0.8248041395500108
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.824804 │ 0.273611 │ 0.257941 │ 0.483494 │
⋮
Last Group (1 row): x1 = 0.6147335928896147
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.614734 │ 0.696163 │ 0.905968 │ 0.564249 │

julia> combine(gd2, res=:x2=>sum)
2×2 DataFrame
│ Row │ x1       │ res      │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.824804 │ 0.103435 │
│ 2   │ 0.614734 │ 0.696163 │

nalimilan · 2019-02-01T13:49:35Z

That's problematic indeed... But shouldn't we fix the bug (by recomputing groups when subsetting), and then we can just use groups?

bkamins · 2019-02-01T14:31:17Z

But shouldn't we fix the bug

We should - that is why I made this WIP to show where the problem is. I will fix the bug in this PR ten and probably we should make a minor release soon.

bkamins · 2019-02-01T15:08:34Z

Pushed a commit as I have to go now, so it might fail the CI, but the design is there.
It fixes two things with getindex:

makes sure we correctly initialize groups field;
makes sure that we do not allow duplicate indices (which would lead to problems with groups field as it assumes that each value belongs to a unique group)

bkamins · 2019-02-01T15:09:14Z

I have also made groupby_checked even more strict just to be sure.

matthieugomez · 2019-02-03T19:26:07Z

src/groupeddataframe/grouping.jl

+a data frame `gd[i]` contains this row.
+Rows not present in any of `gd` groups have their index set to `missing`.
+"""
+groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing)


If the initial grouping variables do not allow missing values, maybe it would be better to return a vector of Integers, rather than Union{Missing, <: Integer}.

There are two issues here:

even if the initial grouping variables do not allow missing values still GroupedDataFrame may not map all rows of the parent, because it can be subsetted;

there is an issue of type stability - upfront we do not know if we will have missing in the return value the result would be an union of two possible types (not that bad under Julia 1.0, but still maybe it is better to guarantee one return type). Notice that this is probably the reason why replace also has this behavior). It is easy to change - I would just replace replace with a comprehension which will use the narrowest type possible.

But I am open to change it if the majority would prefer the "type unstable" option (actually this was the reason why I considered to leave 0 instead of missing in the first place).

I'm not familiar enough with the package to understand the first issue. For, the second, I was thinking that the value of skipmissing could be in the type of the grouped dataframe.
Just to be more precise: the behavior I'm envisioning is that, if groupby is called with skipmissing = false, groupindices returns a Vector{Int}. But if groupby is called with skipmissing =true, it returns a Vector{Union{Missing, T}) whether or not there are actual missing values in the grouping variable.

This proposal exactly hits the problem number 1, so let me elaborate on it.

If you create a GroupedDataFrame you can subset it, e.g. assume that you have a gd with three groups and then write something like gd2 = gd[2:3] this will create a new GroupedDataFrame with only two groups (the first one is dropped). This means that even if skipmissing=false you may have have some rows that behave as-if they were skipped due to skipmissing=true.

This part of code in this PR handles this case.

We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).

Actually, unfortunately, using a comprehension inference is only able to say it is an AbstractArray:

julia> g(x) = [ifelse(v == 0, missing, v) for v in x] g (generic function with 1 method) julia> h(x) = [v == 0 ? missing : v for v in x] h (generic function with 1 method) julia> x = [1,2,3] 3-element Array{Int64,1}: 1 2 3 julia> @code_warntype g(x) Body::AbstractArray 1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}, getfield(Main, Symbol("##11#12"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))} │ %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))})::AbstractArray └── return %2 julia> @code_warntype h(x) Body::AbstractArray 1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}, getfield(Main, Symbol("##13#14"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))} │ %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))})::AbstractArray └── return %2

so maybe we should stick to the implementation in replace?

Ah yes, that's JuliaLang/julia#30485. Maybe we can keep replace for now, but file an issue on the 1.0 milestone to remember we should switch before that if possible.

Opened #1713

src/groupeddataframe/grouping.jl

nalimilan · 2019-02-04T21:24:23Z

src/groupeddataframe/grouping.jl

+a data frame `gd[i]` contains this row.
+Rows not present in any of `gd` groups have their index set to `missing`.
+"""
+groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing)


We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).

src/groupeddataframe/grouping.jl

nalimilan · 2019-02-04T21:30:35Z

test/grouping.jl

+@testset "groupindices and groupvars" begin
+    df = DataFrame(A = [missing, :A, :B, :A, :B, missing], B = 1:6)
+    gd = groupby_checked(df, :A)
+    @test groupindices(gd) == [1, 2, 3, 2, 3, 1]


Also test type?

I have added @inferred test, which should pass with the current implementation (and fails with a comprehension).

Co-Authored-By: bkamins <[email protected]>

matthieugomez · 2019-02-05T20:59:27Z

Is it true that maximum(skipmissing(groupindices(df))) gives the number of groups? If so, I will add it to the documentation.

bkamins · 2019-02-05T21:18:09Z

Yes. But we have this contract documented more precisely already. If groupindices(df)[idx] is not missing then it is a number of a group of row idx in the parent data frame. What you state is a consequence of this, because we do not create empty groups (i.e. we have other conract - and this one is not documented AFACT, but is important and often used that each group has at least one row).

matthieugomez · 2019-02-05T22:57:21Z

Yes I guess that’s the second contract I was wondering about (and it sounds related to the subseting issue you were talking about)

…

On Tue, Feb 5, 2019 at 4:18 PM Bogumił Kamiński ***@***.***> wrote: Yes. But we have this contract documented more precisely already. If groupindices(df)[idx] is not missing then it is a number of a group of row idx in the parent data frame. What you state is a consequence of this, because we do not create empty groups (i.e. we have other conract - and this one is not documented AFACT, but is important and often used that each group has at least one row). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1709 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AF733e2X7mMIUwB39eq4mu_PngJS53GTks5vKfUTgaJpZM4aeUwv> .

bkamins · 2019-02-05T23:35:56Z

So this function specifies the contract on a GroupedDataFrame returned by groupby.

Then GroupedDataFrame in general does not have to satisfy this contract as it can be subsetted in which case the only difference are lines:

https://github.com/JuliaData/DataFrames.jl/blob/master/test/grouping.jl#L25, which changes to se[end][2] <= length(gd.idx)
https://github.com/JuliaData/DataFrames.jl/blob/master/test/grouping.jl#L32 which changes to se[i-1][2] < se[i][1]

(the general difference is that something returned by groupby is "packed" - has no jumps and fills the full length of gd.idx, but can start later as fist entries of gd.idx are reserved for possible missings; while subsetting just removes some start-end pairs, so we can have jumps and if the last pair is removed we do not reach the end of gd.idx)

These are internal things, but if there is something you feel is worth documenting (e.g. that groups must be non-empty, which is enforced by the fact that we have start-end pairs and start[i]<=end[i] by design) it would be great if you pushed a PR to the documentation.

bkamins · 2019-02-05T23:39:06Z

Ah - and the problem with maximum(skipmissing(groupindices(gd))) is that it would fail if gd were empty (which is possible), so this would be a contract valid only if length(gd)>0.

add groupvars and groupindices

029ea01

bkamins requested a review from nalimilan February 1, 2019 13:33

bkamins added bug grouping labels Feb 1, 2019

bkamins changed the title ~~WIP: add groupvars and groupindices~~ Fix getindex and add groupvars and groupindices Feb 1, 2019

fix getindex

c6ca17b

bkamins added 6 commits February 1, 2019 16:11

a small fix

7ce1621

fix typo

02015d1

fix wrong variable reference

d1df50a

fix test condition

92968e4

use missing in groupindices

3237656

fix comparison

b40d2e3

matthieugomez reviewed Feb 3, 2019

View reviewed changes

nalimilan reviewed Feb 4, 2019

View reviewed changes

nalimilan and others added 6 commits February 4, 2019 22:47

Update src/groupeddataframe/grouping.jl

e894f3d

Co-Authored-By: bkamins <[email protected]>

apply review comments

a6d7e00

fix inferred

1f75bdd

fix a typo

6968957

fix wrong test

f321bec

Use initial lower case

b323ba7

nalimilan approved these changes Feb 5, 2019

View reviewed changes

bkamins mentioned this pull request Feb 5, 2019

Replace replace with a comprehension #1713

Open

more strict groupby_checked

d16bb50

bkamins merged commit 359bf31 into JuliaData:master Feb 5, 2019

bkamins deleted the grouping_api branch February 5, 2019 10:48

Fix getindex and add groupvars and groupindices #1709

Fix getindex and add groupvars and groupindices #1709

Uh oh!

Conversation

bkamins commented Feb 1, 2019

Uh oh!

nalimilan commented Feb 1, 2019

Uh oh!

bkamins commented Feb 1, 2019

Uh oh!

bkamins commented Feb 1, 2019

Uh oh!

bkamins commented Feb 1, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matthieugomez Feb 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matthieugomez commented Feb 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bkamins commented Feb 5, 2019

Uh oh!

matthieugomez commented Feb 5, 2019 via email

Uh oh!

bkamins commented Feb 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bkamins commented Feb 5, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

matthieugomez Feb 3, 2019 •

edited

Loading

matthieugomez commented Feb 5, 2019 •

edited

Loading

bkamins commented Feb 5, 2019 •

edited

Loading