@alamb
Contributor

@alamb alamb commented Nov 18, 2024

Let's celebrate the accomplishment of getting to the top of the ClickBench leaderboard

Screenshot 2024-11-18 at 10 38 27 AM

Contributor Author

@alamb alamb left a comment

fyi @Weijun-H @dharanad @Lordworms, @goldmedal @wiedld, @tlm365 @my-vegetable-has-exploded @doupache, @jayzhan211, @xinlifoobar, @Kev1n8
@tshauck, @austin362667, @demetribu, @PsiACE, @devanbenz, @thinh2, @Omega359 @XiangpengHao, @ariesdevil, @tustvold , @RinChanNOWW, @a10y @Dandandan @viirya @itsjunetime, @eejbyfeldt and @Rachelint
@korowa @pmcgleenon

I mentioned you and your work in this blog post -- thank you again 🙏

For names, I copy/pasted whatever was publicly available on your GitHub profiles. If you would like different names / attributions (or none at all) please propose a change 🙏

Also, if you remember others who should be on this list, please let me know

a challenge!), and we have subsequently rallied to steadily improve the
performance release on release as shown in Figure 2.

[Mehmet Ozan Kabak]: https://www.linkedin.com/in/mehmet-ozan-kabak/)

@comphead
Contributor

That's the greatest news. Congrats!

@comphead
Contributor

For the ClickBench run for DataFusion, what is the batch_size being used? Is it 8192 by default?

@alamb
Contributor Author

alamb commented Nov 18, 2024

> For the ClickBench run for DataFusion, what is the batch_size being used? Is it 8192 by default?

I don't think the scripts change the default setting -- the scripts used are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

Here is the PR to update for 43.0.0: ClickHouse/ClickBench#251

alamb and others added 3 commits November 19, 2024 06:26
Co-authored-by: Alex Huang
Co-authored-by: Patrick McGleenon
Co-authored-by: Jay Zhan
@alamb
Contributor Author

alamb commented Nov 20, 2024

I would like to merge this PR and publish it tomorrow if there are no more comments. I can't do so without another committer clicking the approve button:

Screenshot 2024-11-20 at 9 58 21 AM


# Rallying The Community around Performance

In July, 2024 [Mehmet Ozan Kabak], CEO of [Synnada], [called on the community to
Contributor

Is this central to this post? I don't mean to "discredit" the call by any means, but I'm not sure the work described in this post was driven by this comment?

Contributor Author

I certainly was inspired by the comment to help me focus where I spent my time reviewing PRs and helping push them through, though it is a good point that this may imply it motivated others as well, when I don't really know what did.

Perhaps we could rephrase the motivation with something like this?

"Performance has long been a focus for DataFusion: one of the core benefits of DataFusion is its performance, which both excites contributors and attracts users. There seems to have been a renewed focus on performance recently, including a call in July 2024 from Mehmet ...."

Contributor

Sounds good to me

Contributor Author

I rephrased in 2507e67

@alamb
Contributor Author

alamb commented Nov 21, 2024

Thank you to everyone who reviewed this PR. I plan to merge / publish it later today unless there are any other comments

@adriangb
Contributor

Amazing work Andrew! You and all of the DataFusion contributors should be incredibly proud of this accomplishment.

@alamb
Contributor Author

alamb commented Nov 21, 2024

Let's get this published to the world

@alamb alamb merged commit f655415 into apache:main Nov 21, 2024
@alamb alamb deleted the alamb/clickbench_blog branch November 21, 2024 18:17
@alamb
Contributor Author

alamb commented Nov 21, 2024

https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/

Development

Successfully merging this pull request may close these issues.

Blog post: How DataFusion became the fastest engine for querying parquet (according to Clickbench)