Improve statistics (umbrella issue)

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
This is an umbrella issue to gather all improvements regarding statistics.

**Describe the solution you'd like**
The list below should probably be better prioritized:
- [x] #962
- [ ] #992 
- [ ] better validate that the `column_statistics` vector is aligned with the schema `fields` vector (same size, same types...) when constructing the `ExecutionPlan` instance (e.g. #717)
- [ ] remove `total_byte_size` as we are not using it **OR** better estimate it when we have both a fixed size type and the `num_rows` for the output columns
- [ ] replace the `is_exact` field at the `Statistics` level with per-field information
- [ ] have more granularity in statistics than just `(value, is_exact)`: one possible solution is histograms (cf. [Spark CBOs](https://issues.apache.org/jira/browse/SPARK-16026))
- [ ] fix the way `LocalLimitExec` propagates its inexact statistics (requires more granular statistics)
- [ ] estimate statistics in CSV datasource
- [ ] estimate statistics in JSON datasource
- [ ] better estimate output statistics of hash_aggregate
- [ ] better estimate output statistics of hash_join
- [ ] better estimate output statistics of projection (requires #992)
- [ ] better estimate output statistics of window_agg
- [ ] better estimate output statistics of filters (requires more granular statistics, in particular histograms)
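
To make two of the items above more concrete (the alignment check and the `total_byte_size` estimate), here is a minimal sketch. It uses simplified, hypothetical stand-ins for the schema and statistics types, not the actual DataFusion or Arrow definitions: `DataType`, `value_type`, `validate` and `estimate_total_byte_size` are all names made up for illustration, and the byte size estimate ignores validity bitmaps and any other buffer overhead.

```rust
// Hypothetical, simplified stand-ins; NOT the real DataFusion/Arrow types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Float64,
    Utf8, // variable-width
}

impl DataType {
    /// Width in bytes for fixed-size types, `None` for variable-width ones.
    pub fn fixed_width(self) -> Option<usize> {
        match self {
            DataType::Int32 => Some(4),
            DataType::Int64 | DataType::Float64 => Some(8),
            DataType::Utf8 => None,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStatistics {
    /// Assumed here: the type of the min/max values, when they are known.
    pub value_type: Option<DataType>,
    pub null_count: Option<usize>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub column_statistics: Option<Vec<ColumnStatistics>>,
    pub is_exact: bool,
}

/// Checks that the column statistics line up with the schema fields:
/// same number of entries and, where a value type is recorded, same type.
pub fn validate(fields: &[DataType], stats: &Statistics) -> Result<(), String> {
    let cols = match &stats.column_statistics {
        Some(cols) => cols,
        None => return Ok(()), // nothing to validate
    };
    if cols.len() != fields.len() {
        return Err(format!(
            "expected {} column statistics, found {}",
            fields.len(),
            cols.len()
        ));
    }
    for (i, (field, col)) in fields.iter().zip(cols).enumerate() {
        if let Some(t) = col.value_type {
            if t != *field {
                return Err(format!(
                    "column {}: statistics typed {:?}, field typed {:?}",
                    i, t, field
                ));
            }
        }
    }
    Ok(())
}

/// Estimates the output size as `num_rows * sum(fixed widths)`.
/// Returns `None` if the row count is unknown, if any column is
/// variable-width, or if the multiplication overflows.
pub fn estimate_total_byte_size(fields: &[DataType], num_rows: Option<usize>) -> Option<usize> {
    let rows = num_rows?;
    let row_width = fields
        .iter()
        .map(|f| f.fixed_width())
        .sum::<Option<usize>>()?;
    rows.checked_mul(row_width)
}
```

With a schema of `[Int32, Int64]` and 10 rows, the estimate is 10 × (4 + 8) = 120 bytes; as soon as a `Utf8` column is present, this sketch gives up and returns `None` rather than guessing.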


**Additional context**
Statistics are usually sourced at the datasource level, then propagated through the plan tree according to the type of each node. They are used to choose between logically equivalent plans or plan configurations. The more propagation rules are implemented, the more information the optimizer has to make good decisions. At the same time, an overly complex abstraction that no optimization rule uses would bloat the code base and make it harder to maintain. For that reason, extensions of the statistics system should be driven by the addition of concrete optimization rules that require them.
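
As a rough illustration of propagation through the plan tree, the sketch below walks a toy plan and derives a row count estimate that carries its own exactness flag. Everything in it is hypothetical: `Plan`, `Estimate` and `estimate_rows` are illustrative names, not DataFusion types, and the rules (a filter keeps the input count as an inexact upper bound, a limit caps the count) are assumptions chosen for simplicity rather than what any existing operator does.

```rust
// Hypothetical toy model; NOT the DataFusion ExecutionPlan API.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Estimate {
    pub value: usize,
    /// Exactness is tracked per estimate instead of once per `Statistics`.
    pub is_exact: bool,
}

pub enum Plan {
    /// Datasource with a row count known exactly (e.g. from file metadata).
    Scan { rows: usize },
    /// Filter with unknown selectivity.
    Filter { input: Box<Plan> },
    /// Global limit returning at most `fetch` rows.
    Limit { input: Box<Plan>, fetch: usize },
}

/// Propagates the row count estimate bottom-up, one rule per node type.
pub fn estimate_rows(plan: &Plan) -> Estimate {
    match plan {
        // Statistics are sourced at the datasource level.
        Plan::Scan { rows } => Estimate {
            value: *rows,
            is_exact: true,
        },
        // Without histograms the selectivity is unknown: keep the input
        // count as an upper bound and mark it inexact.
        Plan::Filter { input } => {
            let child = estimate_rows(input);
            Estimate {
                value: child.value,
                is_exact: false,
            }
        }
        // A limit caps the count; the result is exact only if the input was.
        Plan::Limit { input, fetch } => {
            let child = estimate_rows(input);
            Estimate {
                value: child.value.min(*fetch),
                is_exact: child.is_exact,
            }
        }
    }
}
```

In this model, a limit of 10 over an exact scan of 100 rows yields an exact 10, while inserting a filter in between yields 10 flagged as inexact. A more granular representation such as histograms would let the filter rule produce something better than an upper bound.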


Improve statistics (umbrella issue) #997
