-
Couldn't load subscription status.
- Fork 195
Description
With help from @nalimilan, back in February I contributed a re-write of corkendall that yielded quite useful speedups. (see this commit. But I need a version of corkendall that supports missing elements, which StatsBase.corkendall doesn't support at present. Hence my (public, but not registered) package KendallTau.
I would like to propose that corkendall in StatsBase be replaced by code copied from KendallTau, but I know that how best to handle missing values in StatsBase has been a topic of various discussions, including a more general approach than my proposal here so I'd like to know if a pull request on these lines might be accepted.
This is the approach I have taken:
- Defined types:
const RealOrMissingVector{T <: Real} = AbstractArray{<:Union{T, Missing},1}
const RealOrMissingMatrix{T <: Real} = AbstractArray{<:Union{T, Missing},2}
-
written an auxiliary function
skipmissingpairsthat takes a pair of arguments each of one of the above two types and does the needful to implement both a "pairwise" and "complete" handling of missings, i.e. a return a pair of vectors\matrices with the missings (or rows containing missings) removed and also type narrowed toRealVectororRealMatrix. -
amended the methods of
corkendallto have signatures such as:
function corkendall(X::Union{RealMatrix,RealOrMissingMatrix},
y::Union{RealVector,RealOrMissingVector};
skipmissing::Symbol=:undefined)
The proposed change would be non-breaking on existing code, which must be passing arguments of type RealVector and\or RealMatrix. Also when one or more of the arguments is of type RealOrMissing... then skipmissing defaults to :undefined thus forcing the user to decide whether :pairwise or :complete is the better approach for the data they have.
Other changes I have made:
-
Slightly improved consistency with
corspearman, for example errors when dimensions of arguments don't match now align between the two functions. Obviously this proposal will takecorkendall"ahead" ofcorspearman, but I think it would be straightforward to make equivalent changes tocorspearman, hopefully without code duplication. -
Some code refactoring that gives an approx 10% further speedup to corkendall, and approx 40% reduction in memory allocation. The trick was to reduce the number of calls to
permute!in all but the vector-vector case. -
Amended the tests to test behaviour against missing values. I also have a test suite testing corkendall against
corkendal_naivethat, for simplicity, does not use Knight's algorithm. That gives confidence that results fromcorkendallare correct for a wide variety of argument types and values but I don't propose to port that test suite to StatsBase - it's too complicated.
Could a member of StatsBase let me know their thoughts?