What to port from StatsBase #87

@nalimilan

This issue is to discuss which functions should be ported from StatsBase to Statistics (#2). Some functions would be better moved to a separate package:

  • statmodels.jl: should go to StatsAPI.jl

Most APIs have stood the test of time, so they are probably good enough, but I find some of them not completely satisfying:

  • hist.jl: I don't know this part of the code well enough to judge whether the API is OK. There have been proposals to move these to a separate package (see "Proposal: Move histograms to separate package", StatsBase.jl#650).
  • weights.jl: Weighted sum cannot be implemented via a weights keyword argument like other functions, since the function lives in Base (see "RFC: Add weights argument to sum", JuliaLang/julia#33310). We could either export wsum, or keep it internal and not support it for now.
  • counts.jl: counts sounds like too generic a term for a function that only allows counting integer values. countmap is more general and its name is explicit. That said, counts could easily be extended to allow any type of levels; its limitation is just that it returns a vector without names, so the mapping to the levels has to be done by hand, which isn't user-friendly. The APIs provided by FreqTables.jl are nicer to use, but they need NamedArrays.jl (or a similar package). Then there's the issue that countmap uses radix sort for performance with some types, but this needs SortingAlgorithms.jl, which isn't a stdlib (yet?).
  • deviation.jl: Do we really need all of these small convenience functions? counteq and countne don't really sound like statistical functions and I'm not sure how commonly they are used. sqL2dist, L2dist, L1dist and Linfdist have an uppercase letter in their names; these and the remaining functions are redundant with functions provided in Distances.jl. That only leaves psnr.
  • misc.jl: indexmap is just indexin so remove it. levelsmap and indicatormat sound a bit limited compared with what StatsModels provides. rle and inverse_rle are not really related to statistics.
  • scalarstats.jl: mean_and_var and mean_and_std have weird names, so I'm not sure whether we should keep them. zscore and zscore! are convenient but redundant with the (more general and more verbose) functions in transformations.jl.
  • transformations.jl: transform and transform! are names that are too generic. I propose overloading LinearAlgebra's normalize and normalize! instead, since that name is the commonly used term for such transformations. I wonder whether we really need reconstruct and reconstruct! (which could be called unnormalize if we keep them). I'm also not sure what the use is of allowing a separate fit operation before actually applying the transformation (I'd imagine one would always normalize the data immediately).
  • moments.jl: moment is redundant with specific functions so I'd drop it.
  • robust.jl: trimvar(x) could be var(trim(x)) if trim(x) returned a special iterator type to dispatch on.
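
To make the robust.jl point concrete, here is a minimal sketch of how dispatching on a wrapper type could let var(trim(x)) replace trimvar(x). The names Trimmed and trim_lazy are hypothetical and exist only for illustration. The formula assumes that trimvar estimates the variance of the trimmed mean from the Winsorized variance, which is how I understand its documentation; treat that as an assumption rather than a specification.

```julia
using Statistics

# Hypothetical lazy wrapper: records the data and the proportion to trim
# from each tail, without copying or sorting anything yet.
struct Trimmed{V<:AbstractVector{<:Real}}
    parent::V
    prop::Float64
end

# Hypothetical constructor, standing in for a `trim` that returns a
# dedicated type instead of a plain iterator or array.
function trim_lazy(x::AbstractVector{<:Real}; prop::Real=0.0)
    0 <= prop < 0.5 || throw(ArgumentError("prop must satisfy 0 <= prop < 0.5"))
    return Trimmed(x, Float64(prop))
end

# Because `Trimmed` is a distinct type, `var` can be specialized on it.
# Assumed formula: Winsorized variance divided by n * (1 - 2prop)^2.
function Statistics.var(t::Trimmed)
    x = t.parent
    n = length(x)
    n > 0 || throw(ArgumentError("x cannot be empty"))
    k = floor(Int, n * t.prop)       # number of values affected in each tail
    s = sort(x)
    lo, hi = s[k + 1], s[n - k]      # bounds of the retained range
    w = clamp.(x, lo, hi)            # Winsorize: pull the tails in to the bounds
    return var(w) / (n * (1 - 2 * t.prop)^2)
end

x = [1.0, 2.0, 3.0, 4.0, 100.0]
v = var(trim_lazy(x; prop=0.2))      # dispatches to the method above
```

The point of the wrapper is that the result is not the plain variance of the retained values, so a trim that returned an ordinary array could not support this spelling.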

See also my previous notes at JuliaLang/julia#27152 (comment).
