-
Notifications
You must be signed in to change notification settings - Fork 374
Fix getindex and add groupvars and groupindices #1709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
029ea01
add groupvars and groupindices
bkamins c6ca17b
fix getindex
bkamins 7ce1621
a small fix
bkamins 02015d1
fix typo
bkamins d1df50a
fix wrong variable reference
bkamins 92968e4
fix test condition
bkamins 3237656
use missing in groupindices
bkamins b40d2e3
fix comparison
bkamins e894f3d
Update src/groupeddataframe/grouping.jl
nalimilan a6d7e00
apply review comments
bkamins 1f75bdd
fix inferred
bkamins 6968957
fix a typo
bkamins f321bec
fix wrong test
bkamins b323ba7
Use initial lower case
nalimilan d16bb50
more strict groupby_checked
bkamins File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,6 +16,8 @@ by | |
| colwise | ||
| combine | ||
| groupby | ||
| groupindices | ||
| groupvars | ||
| join | ||
| map | ||
| melt | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,12 +6,16 @@ const ≅ = isequal | |
| function groupby_checked(df::AbstractDataFrame, keys, args...; kwargs...) | ||
| gd = groupby(df, keys, args...; kwargs...) | ||
|
|
||
| for i in 1:length(gd) | ||
| # checking that groups field is consistent with other fields | ||
| # (since == and isequal do not use it) | ||
| # and that idx is increasing per group | ||
| @assert findall(==(i), gd.groups) == gd.idx[gd.starts[i]:gd.ends[i]] | ||
| # checking that groups field is consistent with other fields | ||
| # (since == and isequal do not use it) | ||
| # and that idx is increasing per group | ||
| new_groups = zeros(Int, length(gd.groups)) | ||
| for idx in eachindex(gd.starts) | ||
| subidx = gd.idx[gd.starts[idx]:gd.ends[idx]] | ||
| @assert issorted(subidx) | ||
| new_groups[subidx] .= idx | ||
| end | ||
| @assert new_groups == gd.groups | ||
|
|
||
| if length(gd) > 0 | ||
| se = sort!(collect(zip(gd.starts, gd.ends))) | ||
|
|
@@ -24,15 +28,10 @@ function groupby_checked(df::AbstractDataFrame, keys, args...; kwargs...) | |
| for i in eachindex(se) | ||
| @assert se[i][1] <= se[i][2] | ||
| if i > 1 | ||
| # the blocks returned by groupby must be continuous | ||
| @assert se[i-1][2] + 1 == se[i][1] | ||
| end | ||
| end | ||
|
|
||
| # correct coverage of missings if dropped | ||
| @assert findall(==(0), gd.groups) == gd.idx[1:se[1][1]-1] | ||
| else | ||
| # a case when missings are dropped and nothing was left to group by | ||
| @assert all(==(0), gd.groups) | ||
| end | ||
|
|
||
| gd | ||
|
|
@@ -878,25 +877,35 @@ end | |
| else | ||
| @test_throws ArgumentError gd[true] | ||
| end | ||
| @test_throws ArgumentError gd[[1, 2, 1]] | ||
| @test_throws MethodError gd["a"] | ||
| gd2 = gd[[true, false, false, false]] | ||
| gd2 = gd[[false, true, false, false]] | ||
| @test length(gd2) == 1 | ||
| @test gd2[1] == gd[1] | ||
| @test gd2[1] == gd[2] | ||
| @test_throws BoundsError gd[[true, false]] | ||
| @test gd2.groups == [0, 1, 0, 0, 0, 1, 0, 0] | ||
| @test gd2.starts == [3] | ||
| @test gd2.ends == [4] | ||
| @test gd2.idx == gd.idx | ||
|
|
||
| gd3 = gd[:] | ||
| @test gd3 isa GroupedDataFrame | ||
| @test length(gd3) == 4 | ||
| @test gd3 == gd | ||
| for i in 1:4 | ||
| @test gd3[i] == gd[i] | ||
| end | ||
| gd4 = gd[[1,2]] | ||
| gd4 = gd[[2,1]] | ||
| @test gd4 isa GroupedDataFrame | ||
| @test length(gd4) == 2 | ||
| for i in 1:2 | ||
| @test gd4[i] == gd[i] | ||
| @test gd4[i] == gd[3-i] | ||
| end | ||
| @test_throws BoundsError gd[1:5] | ||
| @test gd4.groups == [2, 1, 0, 0, 2, 1, 0, 0] | ||
| @test gd4.starts == [3,1] | ||
| @test gd4.ends == [4,2] | ||
| @test gd4.idx == gd.idx | ||
| end | ||
|
|
||
| @testset "== and isequal" begin | ||
|
|
@@ -1074,10 +1083,16 @@ end | |
| @test sort(DataFrame(gd), :B) ≅ sort(df, :B) | ||
| @test eltypes(DataFrame(gd)) == [Union{Missing, Symbol}, Int] | ||
|
|
||
| gd2 = gd[[3,2]] | ||
| @test DataFrame(gd2) == df[[3,5,2,4], :] | ||
|
|
||
| gd = groupby_checked(df, :A, skipmissing=true) | ||
| @test sort(DataFrame(gd), :B) == | ||
| sort(dropmissing(df, disallowmissing=false), :B) | ||
| @test eltypes(DataFrame(gd)) == [Union{Missing, Symbol}, Int] | ||
|
|
||
| gd2 = gd[[2,1]] | ||
| @test DataFrame(gd2) == df[[3,5,2,4], :] | ||
| end | ||
|
|
||
| df = DataFrame(a=Int[], b=[], c=Union{Missing, String}[]) | ||
|
|
@@ -1091,4 +1106,25 @@ end | |
| @test eltypes(DataFrame(gd)) == [Union{Missing, Symbol}, Int] | ||
| end | ||
|
|
||
| @testset "groupindices and groupvars" begin | ||
| df = DataFrame(A = [missing, :A, :B, :A, :B, missing], B = 1:6) | ||
| gd = groupby_checked(df, :A) | ||
| @inferred groupindices(gd) | ||
| @test groupindices(gd) == [1, 2, 3, 2, 3, 1] | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also test type? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have added |
||
| @test groupvars(gd) == [:A] | ||
| gd2 = gd[[3,2]] | ||
| @inferred groupindices(gd2) | ||
| @test groupindices(gd2) ≅ [missing, 2, 1, 2, 1, missing] | ||
| @test groupvars(gd2) == [:A] | ||
|
|
||
| gd = groupby_checked(df, :A, skipmissing=true) | ||
| @inferred groupindices(gd) | ||
| @test groupindices(gd) ≅ [missing, 1, 2, 1, 2, missing] | ||
| @test groupvars(gd) == [:A] | ||
| gd2 = gd[[2,1]] | ||
| @inferred groupindices(gd2) | ||
| @test groupindices(gd2) ≅ [missing, 2, 1, 2, 1, missing] | ||
| @test groupvars(gd2) == [:A] | ||
| end | ||
|
|
||
| end # module | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the initial grouping variables do not allow missing values, maybe it would be better to return a vector of Integers, rather than Union{Missing, <: Integer}.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two issues here:
GroupedDataFramemay not map all rows of the parent, because it can be subsetted;missingin the return value the result would be an union of two possible types (not that bad under Julia 1.0, but still maybe it is better to guarantee one return type). Notice that this is probably the reason whyreplacealso has this behavior). It is easy to change - I would just replacereplacewith a comprehension which will use the narrowest type possible.But I am open to change it if the majority would prefer the "type unstable" option (actually this was the reason why I considered to leave
0instead ofmissingin the first place).Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar enough with the package to understand the first issue. For, the second, I was thinking that the value of
skipmissingcould be in the type of the grouped dataframe.Just to be more precise: the behavior I'm envisioning is that, if
groupbyis called withskipmissing = false,groupindicesreturns aVector{Int}. But ifgroupbyis called withskipmissing =true, it returns aVector{Union{Missing, T})whether or not there are actual missing values in the grouping variable.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal exactly hits the problem number 1, so let me elaborate on it.
If you create a
GroupedDataFrameyou can subset it, e.g. assume that you have agdwith three groups and then write something likegd2 = gd[2:3]this will create a newGroupedDataFramewith only two groups (the first one is dropped). This means that even ifskipmissing=falseyou may have have some rows that behave as-if they were skipped due toskipmissing=true.This part of code in this PR handles this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a type parameter indicating whether
gdhas been built withskipmissing=trueor is a subset. But that would just create a type instability earlier (in the case ofskipmissingat least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is aVector{<:Union{Int,Missing}}, which isn't too bad (and not worse thanVector{Union{Int,Missing}}. I guess that's strictly superior to usingreplace(which should maybe have used the same approach, not sure).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, unfortunately, using a comprehension inference is only able to say it is an
AbstractArray:so maybe we should stick to the implementation in
replace?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, that's JuliaLang/julia#30485. Maybe we can keep
replacefor now, but file an issue on the 1.0 milestone to remember we should switch before that if possible.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #1713