Top-K query optimization in sort uses substantial memory  #7149

@gruuya

Describe the bug

It seems that the Top-K query optimization only takes effect when a custom allocator (mimalloc/snmalloc) is in use, whereas in principle the choice of allocator shouldn't matter.

To Reproduce

  1. Grab and build bytehound: https://github.com/koute/bytehound
  2. Prepare some large-ish Parquet file, e.g. https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet:
$ du -h ~/supply-chains.parquet 
146M /home/ubuntu/supply-chains.parquet
  3. Remove the custom allocator and build:
diff --git a/datafusion-cli/src/main.rs b/datafusion-cli/src/main.rs
index aea499d60..a92957730 100644
--- a/datafusion-cli/src/main.rs
+++ b/datafusion-cli/src/main.rs
@@ -24,13 +24,13 @@ use datafusion_cli::catalog::DynamicFileCatalog;
 use datafusion_cli::{
     exec, print_format::PrintFormat, print_options::PrintOptions, DATAFUSION_CLI_VERSION,
 };
-use mimalloc::MiMalloc;
+// use mimalloc::MiMalloc;
 use std::env;
 use std::path::Path;
 use std::sync::Arc;

-#[global_allocator]
-static GLOBAL: MiMalloc = MiMalloc;
+// #[global_allocator]
+// static GLOBAL: MiMalloc = MiMalloc;

 #[derive(Debug, Parser, PartialEq)]
 #[clap(author, version, about, long_about= None)]
  4. Profile a Top-K query:
$ LD_PRELOAD=~/bytehound/target/release/libbytehound.so ./target/debug/datafusion-cli
DataFusion CLI v28.0.0
❯ CREATE EXTERNAL TABLE supply_chains STORED AS PARQUET LOCATION '/home/ubuntu/supply-chains.parquet';
0 rows in set. Query took 0.445 seconds.
❯ SELECT * FROM supply_chains ORDER BY flow_id DESC LIMIT 1;
...

The profile I get is:
[image: memory profile screenshot]

Expected behavior

With the custom allocator present, the memory profile I see looks like this:
[image: memory profile screenshot]
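
To illustrate the expectation: a Top-K sort only needs to retain K rows at a time, so peak memory should scale with K rather than with the input size, whichever allocator is in use. Below is a minimal Rust sketch of that idea. It is not DataFusion's actual implementation; `top_k_desc` is a hypothetical helper that works on plain `i64` sort keys instead of record batches.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper (not a DataFusion API): returns the `k` largest
/// values in descending order while holding at most `k` values in memory.
fn top_k_desc<I>(rows: I, k: usize) -> Vec<i64>
where
    I: IntoIterator<Item = i64>,
{
    if k == 0 {
        return Vec::new();
    }

    // `Reverse` turns the max-heap into a min-heap, so the smallest
    // retained value is always at the top and cheap to evict.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);

    for v in rows {
        if heap.len() < k {
            // Still filling the heap up to `k` entries.
            heap.push(Reverse(v));
        } else if let Some(mut smallest) = heap.peek_mut() {
            // Heap is full: replace the smallest retained value only if
            // the new value beats it; otherwise the row is dropped.
            if v > smallest.0 {
                *smallest = Reverse(v);
            }
        }
    }

    // Ascending order of `Reverse<i64>` is descending order of the values.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse(v)| v)
        .collect()
}
```

Calling `top_k_desc(0..1_000_000, 1)` streams a million keys but never stores more than one of them, which is the kind of flat memory profile I would expect for `ORDER BY flow_id DESC LIMIT 1`.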

Additional context

No response

Labels: bug (Something isn't working)
