Skip to content

Conversation

youngsofun
Copy link
Member

@youngsofun youngsofun commented Jun 6, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

add new trait and trait functions so I can replace them increasingly

  1. use row-based serializer
  2. HTTP handler return string fields
    serialization problem with "f32 as f64" #5863

Conclusions of prev bench results:

Compared with row base impl in this pr:

  1. Return vec is very slow
  2. Return something like StringColumn is faster but still slow(x 2) and has more memory overhead
  3. Return streamin_iter is a bit complex and a litter slower (x1.5 for i32 len 1024, better for longger column )

with row base impl

  1. holding &[T] in Serializer make it 30% faster(for i32 column), compared with passing ColumnRef each time and cast
  2. enum_dispath is 10% faster than dyn dispatch

@sundy-li if you vote for enum_dispath, I will delete the dyn dispatch code, and go on with other types

some notes about serializer:

  1. add ConstSerializer, it can be optimized a lot later by caching the serialized value
  2. writes string to MySQL writer for float (instead of writing float and let mysql-srv format it), to make float formats consistent (e.g. float in array use our own serializer,), its the only place need to change the test. see the 2nd commit.
    • lexical-core did not trim .0 of float by default unless use it with WriteFloatOptions, since we now use the same float str format in all output formats (need more effort to specialize for each format). it is better to keep .0 to make it easy for other applicaitons to detect the schema for json, csv .etc.

Changelog

  • Performance Improvement

Related Issues

Fixes #issue

@vercel
Copy link

vercel bot commented Jun 6, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
databend ⬜️ Ignored (Inspect) Jun 9, 2022 at 7:26AM (UTC)

@mergify
Copy link
Contributor

mergify bot commented Jun 6, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@youngsofun youngsofun marked this pull request as draft June 6, 2022 03:31
@youngsofun youngsofun mentioned this pull request Jun 6, 2022
5 tasks
@youngsofun
Copy link
Member Author

enum_dispath is not IDE friendly, both do not support associate const( can be used in quote), not sure if enum_dispath can be able to support it later, is there any other drawback on it, since it is only 10% faster?

@sundy-li

@youngsofun youngsofun force-pushed the output branch 5 times, most recently from 8020827 to 79ca832 Compare June 7, 2022 08:40
@youngsofun youngsofun marked this pull request as ready for review June 7, 2022 09:07
@BohuTANG
Copy link
Member

BohuTANG commented Jun 7, 2022

@mergify update

@mergify
Copy link
Contributor

mergify bot commented Jun 7, 2022

update

✅ Branch has been successfully updated

@sundy-li
Copy link
Member

sundy-li commented Jun 7, 2022

enum_dispath is not IDE friendly, both do not support associate const( can be used in quote), not sure if enum_dispath can be able to support it later, is there any other drawback on it, since it is only 10% faster?

Let's make life easier, enum_dispatch is not needed.

@youngsofun youngsofun marked this pull request as draft June 7, 2022 16:34
@youngsofun youngsofun force-pushed the output branch 2 times, most recently from 8060016 to 289fec2 Compare June 8, 2022 03:24
@youngsofun
Copy link
Member Author

enum_dispath is not IDE friendly, both do not support associate const( can be used in quote), not sure if enum_dispath can be able to support it later, is there any other drawback on it, since it is only 10% faster?

Let's make life easier, enum_dispatch is not needed.

the serializer is used in limited places, not too bothering, I prefer keep it for now

@youngsofun youngsofun force-pushed the output branch 2 times, most recently from c95a8a9 to fddffdc Compare June 8, 2022 05:49
@youngsofun youngsofun marked this pull request as ready for review June 8, 2022 06:41
@youngsofun
Copy link
Member Author

youngsofun commented Jun 8, 2022

@BohuTANG @sundy-li is it acceptable to see 1.0 instead of 1 for float in mysql client?

@sundy-li
Copy link
Member

sundy-li commented Jun 8, 2022

table to see 1.0 instead of 1 f

https://www.db-fiddle.com/f/abKfoifJD29jCXVgBNnZap/0

MySQL outputs 1, but I think we do not need to follow MySQL strictly.

@youngsofun
Copy link
Member Author

snowflake

image

@sundy-li sundy-li requested a review from zhang2014 June 8, 2022 07:51
@youngsofun
Copy link
Member Author

youngsofun commented Jun 8, 2022

I prefer always use 1.0

  1. uniform format, make it much easier to write tests for diff format and handlers
  2. easier to know the type.
  3. to use 1 for MySQL , need lexical_core::write_with_options, instead of lexical_core::write, maybe a little slow

trim_float is not used for any prepared FORMATS in lib lexical, including MySQL

https://github.com/Alexhuszagh/rust-lexical/blob/main/lexical-write-float/src/options.rs#L1373

table to see 1.0 instead of 1 f

https://www.db-fiddle.com/f/abKfoifJD29jCXVgBNnZap/0

MySQL outputs 1, but I think we do not need to follow MySQL strictly.

@youngsofun
Copy link
Member Author

have some problem to use WriteFloatOptions

N::Options == WriteFloatOptions only holds for f32 and f64

pub fn write_with_options<'a, N: ToLexicalWithOptions, const FORMAT: u128>(
    n: N,
    bytes: &'a mut [u8],
    options: &N::Options,
) -> &'a mut [u8] {
    n.to_lexical_with_options::<FORMAT>(bytes, options)

}
}

fn create_serializer_inner<'a>(&self, _col: &'a ColumnRef) -> Result<TypeSerializerImpl<'a>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used to for normal column( non-const)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_serializer_inner use for impl
create_serializer as interface

@youngsofun
Copy link
Member Author

youngsofun commented Jun 8, 2022

oops, SQL logic tests take effect, need to update expect the results for floats too
no need to change the code
will be done tomorrow morning

Ok(Arc::new(ArrayColumnData::create(inner_data, offsets)))
}

fn serialize_json(&self, format: &FormatSettings) -> Result<Vec<Value>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So row based api not works on serialize_json ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serializer can has API for both row based and col-based(clickhouse, json::Value)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format choose to use which

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn block_to_json_value(
    block: &DataBlock,
    format: &FormatSettings,
) -> Result<Vec<Vec<JsonValue>>> {
    let cols = block_to_json_value_columns(block, format)?;
    Ok(transpose(cols))
}

Since we need to transpose, so serialize_json is row based ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess column-based may be faster?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not zero alloc in the transpose function. So row based is faster in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not zero alloc in the transpose function. So row based is faster in this case.

make sense

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may I refactor it later in another pr later, this pr is large and need to change new-added tests case, hope it to merge early.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need add formats based on this

@youngsofun
Copy link
Member Author

@sundy-li @zhang2014 rebase only to include new test cases, the original commits are not changed

@youngsofun
Copy link
Member Author

@sundy-li @zhang2014 all tests passed (except a build), please review

@sundy-li
Copy link
Member

sundy-li commented Jun 9, 2022

/LGTM

@BohuTANG BohuTANG merged commit 04729d3 into databendlabs:main Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants