Skip to content

Conversation

@bkamins
Copy link
Member

@bkamins bkamins commented Feb 1, 2019

I have added methods following the discussion in #1704.

@nalimilan unfortunately we cannot use groups field. The reason is that GroupedDataFrame can contain less levels than originally created as it can be subsetted and we do not update groups variable when subsetting GroupedDataFrame.

This is WIP, because we might not accept this design (if we do I will add tests).

The issue is that we should make sure that groups for subsetted is actually correct - @nalimilan we have a bug here. If we fix the bug the implementation might be faster by using groups field:

Here you have a bug reproduced:

julia> using DataFrames

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.83705  │ 0.103435 │ 0.887152 │ 0.850853 │
│ 2   │ 0.614734 │ 0.696163 │ 0.905968 │ 0.564249 │
│ 3   │ 0.824804 │ 0.273611 │ 0.257941 │ 0.483494 │

julia> gd = groupby(df, :x1)
GroupedDataFrame with 3 groups based on key: x1
First Group (1 row): x1 = 0.8370503031615726
│ Row │ x1      │ x2       │ x3       │ x4       │
│     │ Float64 │ Float64  │ Float64  │ Float64  │
├─────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.83705 │ 0.103435 │ 0.887152 │ 0.850853 │
⋮
Last Group (1 row): x1 = 0.8248041395500108
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.824804 │ 0.273611 │ 0.257941 │ 0.483494 │

julia> gd2 = gd[[3,2]]
GroupedDataFrame with 2 groups based on key: x1
First Group (1 row): x1 = 0.8248041395500108
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.824804 │ 0.273611 │ 0.257941 │ 0.483494 │
⋮
Last Group (1 row): x1 = 0.6147335928896147
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.614734 │ 0.696163 │ 0.905968 │ 0.564249 │

julia> combine(gd2, res=:x2=>sum)
2×2 DataFrame
│ Row │ x1       │ res      │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.824804 │ 0.103435 │
│ 2   │ 0.614734 │ 0.696163 │

@nalimilan
Copy link
Member

That's problematic indeed... But shouldn't we fix the bug (by recomputing groups when subsetting), and then we can just use groups?

@bkamins
Copy link
Member Author

bkamins commented Feb 1, 2019

But shouldn't we fix the bug

We should - that is why I made this WIP to show where the problem is. I will fix the bug in this PR ten and probably we should make a minor release soon.

@bkamins bkamins changed the title WIP: add groupvars and groupindices Fix getindex and add groupvars and groupindices Feb 1, 2019
@bkamins
Copy link
Member Author

bkamins commented Feb 1, 2019

Pushed a commit as I have to go now, so it might fail the CI, but the design is there.
It fixes two things with getindex:

  • makes sure we correctly initialize groups field;
  • makes sure that we do not allow duplicate indices (which would lead to problems with groups field as it assumes that each value belongs to a unique group)

@bkamins
Copy link
Member Author

bkamins commented Feb 1, 2019

I have also made groupby_checked even more strict just to be sure.

a data frame `gd[i]` contains this row.
Rows not present in any of `gd` groups have their index set to `missing`.
"""
groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the initial grouping variables do not allow missing values, maybe it would be better to return a vector of Integers, rather than Union{Missing, <: Integer}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two issues here:

  1. even if the initial grouping variables do not allow missing values still GroupedDataFrame may not map all rows of the parent, because it can be subsetted;
  2. there is an issue of type stability - upfront we do not know if we will have missing in the return value the result would be an union of two possible types (not that bad under Julia 1.0, but still maybe it is better to guarantee one return type). Notice that this is probably the reason why replace also has this behavior). It is easy to change - I would just replace replace with a comprehension which will use the narrowest type possible.

But I am open to change it if the majority would prefer the "type unstable" option (actually this was the reason why I considered to leave 0 instead of missing in the first place).

Copy link
Contributor

@matthieugomez matthieugomez Feb 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar enough with the package to understand the first issue. For, the second, I was thinking that the value of skipmissing could be in the type of the grouped dataframe.
Just to be more precise: the behavior I'm envisioning is that, if groupby is called with skipmissing = false, groupindices returns a Vector{Int}. But if groupby is called with skipmissing =true, it returns a Vector{Union{Missing, T}) whether or not there are actual missing values in the grouping variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal exactly hits the problem number 1, so let me elaborate on it.

If you create a GroupedDataFrame you can subset it, e.g. assume that you have a gd with three groups and then write something like gd2 = gd[2:3] this will create a new GroupedDataFrame with only two groups (the first one is dropped). This means that even if skipmissing=false you may have have some rows that behave as-if they were skipped due to skipmissing=true.

This part of code in this PR handles this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, unfortunately, using a comprehension inference is only able to say it is an AbstractArray:

julia> g(x) = [ifelse(v == 0, missing, v) for v in x]
g (generic function with 1 method)

julia> h(x) = [v == 0 ? missing : v for v in x]
h (generic function with 1 method)

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> @code_warntype g(x)
Body::AbstractArray
1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}, getfield(Main, Symbol("##11#12"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}
│   %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))})::AbstractArray
└──      return %2

julia> @code_warntype h(x)
Body::AbstractArray
1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}, getfield(Main, Symbol("##13#14"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}
│   %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))})::AbstractArray
└──      return %2

so maybe we should stick to the implementation in replace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that's JuliaLang/julia#30485. Maybe we can keep replace for now, but file an issue on the 1.0 milestone to remember we should switch before that if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #1713

a data frame `gd[i]` contains this row.
Rows not present in any of `gd` groups have their index set to `missing`.
"""
groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).

@testset "groupindices and groupvars" begin
df = DataFrame(A = [missing, :A, :B, :A, :B, missing], B = 1:6)
gd = groupby_checked(df, :A)
@test groupindices(gd) == [1, 2, 3, 2, 3, 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added @inferred test, which should pass with the current implementation (and fails with a comprehension).

@bkamins bkamins merged commit 359bf31 into JuliaData:master Feb 5, 2019
@bkamins bkamins deleted the grouping_api branch February 5, 2019 10:48
@matthieugomez
Copy link
Contributor

matthieugomez commented Feb 5, 2019

Is it true that maximum(skipmissing(groupindices(df))) gives the number of groups? If so, I will add it to the documentation.

@bkamins
Copy link
Member Author

bkamins commented Feb 5, 2019

Yes. But we have this contract documented more precisely already. If groupindices(df)[idx] is not missing then it is a number of a group of row idx in the parent data frame. What you state is a consequence of this, because we do not create empty groups (i.e. we have other conract - and this one is not documented AFACT, but is important and often used that each group has at least one row).

@matthieugomez
Copy link
Contributor

matthieugomez commented Feb 5, 2019 via email

@bkamins
Copy link
Member Author

bkamins commented Feb 5, 2019

So this function specifies the contract on a GroupedDataFrame returned by groupby.

Then GroupedDataFrame in general does not have to satisfy this contract as it can be subsetted in which case the only difference are lines:

(the general difference is that something returned by groupby is "packed" - has no jumps and fills the full length of gd.idx, but can start later as fist entries of gd.idx are reserved for possible missings; while subsetting just removes some start-end pairs, so we can have jumps and if the last pair is removed we do not reach the end of gd.idx)

These are internal things, but if there is something you feel is worth documenting (e.g. that groups must be non-empty, which is enforced by the fact that we have start-end pairs and start[i]<=end[i] by design) it would be great if you pushed a PR to the documentation.

@bkamins
Copy link
Member Author

bkamins commented Feb 5, 2019

Ah - and the problem with maximum(skipmissing(groupindices(gd))) is that it would fail if gd were empty (which is possible), so this would be a contract valid only if length(gd)>0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants