-
Notifications
You must be signed in to change notification settings - Fork 374
Fix getindex and add groupvars and groupindices #1709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
That's problematic indeed... But shouldn't we fix the bug (by recomputing |
We should - that is why I made this WIP to show where the problem is. I will fix the bug in this PR ten and probably we should make a minor release soon. |
|
Pushed a commit as I have to go now, so it might fail the CI, but the design is there.
|
|
I have also made |
| a data frame `gd[i]` contains this row. | ||
| Rows not present in any of `gd` groups have their index set to `missing`. | ||
| """ | ||
| groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the initial grouping variables do not allow missing values, maybe it would be better to return a vector of Integers, rather than Union{Missing, <: Integer}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two issues here:
- even if the initial grouping variables do not allow missing values still
GroupedDataFramemay not map all rows of the parent, because it can be subsetted; - there is an issue of type stability - upfront we do not know if we will have
missingin the return value the result would be an union of two possible types (not that bad under Julia 1.0, but still maybe it is better to guarantee one return type). Notice that this is probably the reason whyreplacealso has this behavior). It is easy to change - I would just replacereplacewith a comprehension which will use the narrowest type possible.
But I am open to change it if the majority would prefer the "type unstable" option (actually this was the reason why I considered to leave 0 instead of missing in the first place).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar enough with the package to understand the first issue. For, the second, I was thinking that the value of skipmissing could be in the type of the grouped dataframe.
Just to be more precise: the behavior I'm envisioning is that, if groupby is called with skipmissing = false, groupindices returns a Vector{Int}. But if groupby is called with skipmissing =true, it returns a Vector{Union{Missing, T}) whether or not there are actual missing values in the grouping variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal exactly hits the problem number 1, so let me elaborate on it.
If you create a GroupedDataFrame you can subset it, e.g. assume that you have a gd with three groups and then write something like gd2 = gd[2:3] this will create a new GroupedDataFrame with only two groups (the first one is dropped). This means that even if skipmissing=false you may have have some rows that behave as-if they were skipped due to skipmissing=true.
This part of code in this PR handles this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, unfortunately, using a comprehension inference is only able to say it is an AbstractArray:
julia> g(x) = [ifelse(v == 0, missing, v) for v in x]
g (generic function with 1 method)
julia> h(x) = [v == 0 ? missing : v for v in x]
h (generic function with 1 method)
julia> x = [1,2,3]
3-element Array{Int64,1}:
1
2
3
julia> @code_warntype g(x)
Body::AbstractArray
1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}, getfield(Main, Symbol("##11#12"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}
│ %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))})::AbstractArray
└── return %2
julia> @code_warntype h(x)
Body::AbstractArray
1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}, getfield(Main, Symbol("##13#14"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}
│ %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))})::AbstractArray
└── return %2
so maybe we should stick to the implementation in replace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that's JuliaLang/julia#30485. Maybe we can keep replace for now, but file an issue on the 1.0 milestone to remember we should switch before that if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #1713
| a data frame `gd[i]` contains this row. | ||
| Rows not present in any of `gd` groups have their index set to `missing`. | ||
| """ | ||
| groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).
| @testset "groupindices and groupvars" begin | ||
| df = DataFrame(A = [missing, :A, :B, :A, :B, missing], B = 1:6) | ||
| gd = groupby_checked(df, :A) | ||
| @test groupindices(gd) == [1, 2, 3, 2, 3, 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added @inferred test, which should pass with the current implementation (and fails with a comprehension).
|
Is it true that |
|
Yes. But we have this contract documented more precisely already. If |
|
Yes I guess that’s the second contract I was wondering about (and it sounds
related to the subseting issue you were talking about)
…On Tue, Feb 5, 2019 at 4:18 PM Bogumił Kamiński ***@***.***> wrote:
Yes. But we have this contract documented more precisely already. If
groupindices(df)[idx] is not missing then it is a number of a group of
row idx in the parent data frame. What you state is a consequence of
this, because we do not create empty groups (i.e. we have other conract -
and this one is not documented AFACT, but is important and often used that
each group has at least one row).
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1709 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AF733e2X7mMIUwB39eq4mu_PngJS53GTks5vKfUTgaJpZM4aeUwv>
.
|
|
So this function specifies the contract on a Then
(the general difference is that something returned by These are internal things, but if there is something you feel is worth documenting (e.g. that groups must be non-empty, which is enforced by the fact that we have start-end pairs and |
|
Ah - and the problem with |
I have added methods following the discussion in #1704.
@nalimilan unfortunately we cannot use
groupsfield. The reason is thatGroupedDataFramecan contain less levels than originally created as it can be subsetted and we do not updategroupsvariable when subsettingGroupedDataFrame.This is WIP, because we might not accept this design (if we do I will add tests).
The issue is that we should make sure that
groupsfor subsetted is actually correct - @nalimilan we have a bug here. If we fix the bug the implementation might be faster by usinggroupsfield:Here you have a bug reproduced: