Skip to content
2 changes: 2 additions & 0 deletions docs/src/lib/functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@ by
colwise
combine
groupby
groupindices
groupvars
join
map
melt
Expand Down
2 changes: 2 additions & 0 deletions src/DataFrames.jl
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@ export AbstractDataFrame,
dropmissing!,
eltypes,
groupby,
groupindices,
groupvars,
insertcols!,
mapcols,
melt,
Expand Down
36 changes: 34 additions & 2 deletions src/groupeddataframe/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -104,8 +104,22 @@ Base.last(gd::GroupedDataFrame) = gd[end]

Base.getindex(gd::GroupedDataFrame, idx::Integer) =
view(gd.parent, gd.idx[gd.starts[idx]:gd.ends[idx]], :)
Base.getindex(gd::GroupedDataFrame, idxs::AbstractArray) =
GroupedDataFrame(gd.parent, gd.cols, gd.groups, gd.idx, gd.starts[idxs], gd.ends[idxs])

function Base.getindex(gd::GroupedDataFrame, idxs::AbstractArray)
new_starts = gd.starts[idxs]
new_ends = gd.ends[idxs]
if !allunique(new_starts)
throw(ArgumentError("duplicates in idxs argument are not allowed"))
end
new_groups = zeros(Int, length(gd.groups))
for idx in eachindex(new_starts)
@inbounds for j in new_starts[idx]:new_ends[idx]
new_groups[gd.idx[j]] = idx
end
end
GroupedDataFrame(gd.parent, gd.cols, new_groups, gd.idx, new_starts, new_ends)
end

Base.getindex(gd::GroupedDataFrame, idxs::Colon) =
GroupedDataFrame(gd.parent, gd.cols, gd.groups, gd.idx, gd.starts, gd.ends)

Expand Down Expand Up @@ -1125,3 +1139,21 @@ function DataFrame(gd::GroupedDataFrame)
resize!(idx, doff - 1)
parent(gd)[idx, :]
end

"""
groupindices(gd::GroupedDataFrame)

Return a vector of group indices for each row of `parent(gd)`.

Rows appearing in group `gd[i]` are attributed index `i`. Rows not present in
any group are attributed `missing` (this can happen if `skipmissing=true` was
passed when creating `gd`, or if `gd` is a subset from a larger `GroupedDataFrame`).
"""
groupindices(gd::GroupedDataFrame) = replace(gd.groups, 0=>missing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the initial grouping variables do not allow missing values, maybe it would be better to return a vector of Integers, rather than Union{Missing, <: Integer}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two issues here:

  1. even if the initial grouping variables do not allow missing values still GroupedDataFrame may not map all rows of the parent, because it can be subsetted;
  2. there is an issue of type stability - upfront we do not know if we will have missing in the return value the result would be an union of two possible types (not that bad under Julia 1.0, but still maybe it is better to guarantee one return type). Notice that this is probably the reason why replace also has this behavior). It is easy to change - I would just replace replace with a comprehension which will use the narrowest type possible.

But I am open to change it if the majority would prefer the "type unstable" option (actually this was the reason why I considered to leave 0 instead of missing in the first place).

Copy link
Contributor

@matthieugomez matthieugomez Feb 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar enough with the package to understand the first issue. For, the second, I was thinking that the value of skipmissing could be in the type of the grouped dataframe.
Just to be more precise: the behavior I'm envisioning is that, if groupby is called with skipmissing = false, groupindices returns a Vector{Int}. But if groupby is called with skipmissing =true, it returns a Vector{Union{Missing, T}) whether or not there are actual missing values in the grouping variable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal exactly hits the problem number 1, so let me elaborate on it.

If you create a GroupedDataFrame you can subset it, e.g. assume that you have a gd with three groups and then write something like gd2 = gd[2:3] this will create a new GroupedDataFrame with only two groups (the first one is dropped). This means that even if skipmissing=false you may have have some rows that behave as-if they were skipped due to skipmissing=true.

This part of code in this PR handles this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a type parameter indicating whether gd has been built with skipmissing=true or is a subset. But that would just create a type instability earlier (in the case of skipmissing at least), so that's not necessarily a good idea. OTOH, using a comprehension would allow inference to know that the result is a Vector{<:Union{Int,Missing}}, which isn't too bad (and not worse than Vector{Union{Int,Missing}}. I guess that's strictly superior to using replace (which should maybe have used the same approach, not sure).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, unfortunately, using a comprehension inference is only able to say it is an AbstractArray:

julia> g(x) = [ifelse(v == 0, missing, v) for v in x]
g (generic function with 1 method)

julia> h(x) = [v == 0 ? missing : v for v in x]
h (generic function with 1 method)

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> @code_warntype g(x)
Body::AbstractArray
1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}, getfield(Main, Symbol("##11#12"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))}
│   %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##11#12"))})::AbstractArray
└──      return %2

julia> @code_warntype h(x)
Body::AbstractArray
1 ─ %1 = %new(Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}, getfield(Main, Symbol("##13#14"))(), x)::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))}
│   %2 = invoke Base.collect(%1::Base.Generator{Array{Int64,1},getfield(Main, Symbol("##13#14"))})::AbstractArray
└──      return %2

so maybe we should stick to the implementation in replace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that's JuliaLang/julia#30485. Maybe we can keep replace for now, but file an issue on the 1.0 milestone to remember we should switch before that if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #1713


"""
groupvars(gd::GroupedDataFrame)

Return a vector of column names in `parent(gd)` used for grouping.
"""
groupvars(gd::GroupedDataFrame) = _names(gd)[gd.cols]
66 changes: 51 additions & 15 deletions test/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,16 @@ const ≅ = isequal
function groupby_checked(df::AbstractDataFrame, keys, args...; kwargs...)
gd = groupby(df, keys, args...; kwargs...)

for i in 1:length(gd)
# checking that groups field is consistent with other fields
# (since == and isequal do not use it)
# and that idx is increasing per group
@assert findall(==(i), gd.groups) == gd.idx[gd.starts[i]:gd.ends[i]]
# checking that groups field is consistent with other fields
# (since == and isequal do not use it)
# and that idx is increasing per group
new_groups = zeros(Int, length(gd.groups))
for idx in eachindex(gd.starts)
subidx = gd.idx[gd.starts[idx]:gd.ends[idx]]
@assert issorted(subidx)
new_groups[subidx] .= idx
end
@assert new_groups == gd.groups

if length(gd) > 0
se = sort!(collect(zip(gd.starts, gd.ends)))
Expand All @@ -24,15 +28,10 @@ function groupby_checked(df::AbstractDataFrame, keys, args...; kwargs...)
for i in eachindex(se)
@assert se[i][1] <= se[i][2]
if i > 1
# the blocks returned by groupby must be continuous
@assert se[i-1][2] + 1 == se[i][1]
end
end

# correct coverage of missings if dropped
@assert findall(==(0), gd.groups) == gd.idx[1:se[1][1]-1]
else
# a case when missings are dropped and nothing was left to group by
@assert all(==(0), gd.groups)
end

gd
Expand Down Expand Up @@ -878,25 +877,35 @@ end
else
@test_throws ArgumentError gd[true]
end
@test_throws ArgumentError gd[[1, 2, 1]]
@test_throws MethodError gd["a"]
gd2 = gd[[true, false, false, false]]
gd2 = gd[[false, true, false, false]]
@test length(gd2) == 1
@test gd2[1] == gd[1]
@test gd2[1] == gd[2]
@test_throws BoundsError gd[[true, false]]
@test gd2.groups == [0, 1, 0, 0, 0, 1, 0, 0]
@test gd2.starts == [3]
@test gd2.ends == [4]
@test gd2.idx == gd.idx

gd3 = gd[:]
@test gd3 isa GroupedDataFrame
@test length(gd3) == 4
@test gd3 == gd
for i in 1:4
@test gd3[i] == gd[i]
end
gd4 = gd[[1,2]]
gd4 = gd[[2,1]]
@test gd4 isa GroupedDataFrame
@test length(gd4) == 2
for i in 1:2
@test gd4[i] == gd[i]
@test gd4[i] == gd[3-i]
end
@test_throws BoundsError gd[1:5]
@test gd4.groups == [2, 1, 0, 0, 2, 1, 0, 0]
@test gd4.starts == [3,1]
@test gd4.ends == [4,2]
@test gd4.idx == gd.idx
end

@testset "== and isequal" begin
Expand Down Expand Up @@ -1074,10 +1083,16 @@ end
@test sort(DataFrame(gd), :B) ≅ sort(df, :B)
@test eltypes(DataFrame(gd)) == [Union{Missing, Symbol}, Int]

gd2 = gd[[3,2]]
@test DataFrame(gd2) == df[[3,5,2,4], :]

gd = groupby_checked(df, :A, skipmissing=true)
@test sort(DataFrame(gd), :B) ==
sort(dropmissing(df, disallowmissing=false), :B)
@test eltypes(DataFrame(gd)) == [Union{Missing, Symbol}, Int]

gd2 = gd[[2,1]]
@test DataFrame(gd2) == df[[3,5,2,4], :]
end

df = DataFrame(a=Int[], b=[], c=Union{Missing, String}[])
Expand All @@ -1091,4 +1106,25 @@ end
@test eltypes(DataFrame(gd)) == [Union{Missing, Symbol}, Int]
end

@testset "groupindices and groupvars" begin
df = DataFrame(A = [missing, :A, :B, :A, :B, missing], B = 1:6)
gd = groupby_checked(df, :A)
@inferred groupindices(gd)
@test groupindices(gd) == [1, 2, 3, 2, 3, 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also test type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added @inferred test, which should pass with the current implementation (and fails with a comprehension).

@test groupvars(gd) == [:A]
gd2 = gd[[3,2]]
@inferred groupindices(gd2)
@test groupindices(gd2) ≅ [missing, 2, 1, 2, 1, missing]
@test groupvars(gd2) == [:A]

gd = groupby_checked(df, :A, skipmissing=true)
@inferred groupindices(gd)
@test groupindices(gd) ≅ [missing, 1, 2, 1, 2, missing]
@test groupvars(gd) == [:A]
gd2 = gd[[2,1]]
@inferred groupindices(gd2)
@test groupindices(gd2) ≅ [missing, 2, 1, 2, 1, missing]
@test groupvars(gd2) == [:A]
end

end # module