
Conversation

@parthchandra
Contributor

What changes were proposed in this pull request?

Implements a vectorized version of the Parquet reader for the DELTA_BINARY_PACKED encoding.
This PR includes a previous PR for this issue, which passed the read request through to the Parquet implementation and was not vectorized. The current PR builds on top of that PR (hence both are included).
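For context, DELTA_BINARY_PACKED stores a first value in the page header plus blocks of bit-packed deltas, and every value is reconstructed from its predecessor. Below is a minimal, hedged sketch of just the reconstruction step; the real format also involves zigzag-varint headers and bit-packed miniblocks, which are omitted here, and the class name is illustrative, not the actual Spark reader:

```java
// Simplified sketch of DELTA_BINARY_PACKED value reconstruction.
// Real Parquet pages use zigzag-varint headers and bit-packed miniblocks;
// here the deltas are assumed to be already unpacked into an array.
public class DeltaBinaryPackedSketch {
  // value[i] = value[i-1] + minDelta + packedDelta[i-1]
  static long[] decode(long firstValue, long minDelta, long[] packedDeltas) {
    long[] out = new long[packedDeltas.length + 1];
    out[0] = firstValue;
    for (int i = 0; i < packedDeltas.length; i++) {
      out[i + 1] = out[i] + minDelta + packedDeltas[i];
    }
    return out;
  }

  public static void main(String[] args) {
    // The sequence 7, 5, 3, 1, 2, 3, 4, 5 (the example in the Parquet format
    // spec) encodes as firstValue=7, minDelta=-2, packed delta = delta - minDelta.
    long[] decoded = decode(7, -2, new long[] {0, 0, 0, 3, 3, 3, 3});
    System.out.println(java.util.Arrays.toString(decoded));
    // prints [7, 5, 3, 1, 2, 3, 4, 5]
  }
}
```

Storing deltas relative to a per-block minimum keeps every packed value non-negative, which is what makes fixed-width bit packing possible.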

Why are the changes needed?

Currently, Spark throws an exception when reading data with these encodings if the vectorized reader is enabled.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Additional unit tests for the encoding, for both long and integer types (mirroring the unit tests in the Parquet implementation).

@SparkQA

SparkQA commented Nov 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49330/

@SparkQA

SparkQA commented Nov 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49330/

@SparkQA

SparkQA commented Nov 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49334/

@SparkQA

SparkQA commented Nov 3, 2021

Test build #144861 has finished for PR 34471 at commit fc9683b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49334/

@SparkQA

SparkQA commented Nov 3, 2021

Test build #144864 has finished for PR 34471 at commit 10659d3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-36879][SQL]Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path [SPARK-36879][SQL] Support Parquet v2 data page encoding (DELTA_BINARY_PACKED) for the vectorized path Nov 3, 2021
@dbtsai
Member

dbtsai commented Nov 3, 2021

Jenkins, retest this please

@SparkQA

SparkQA commented Nov 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49350/

@SparkQA

SparkQA commented Nov 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49350/

@SparkQA

SparkQA commented Nov 3, 2021

Test build #144880 has finished for PR 34471 at commit 10659d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@sunchao sunchao left a comment

Thanks @parthchandra for working on this! I left some comments.

.gitignore Outdated
Member

nit: unrelated change?

Contributor Author

Removed

Member

+1 with @sunchao 's comment. Please remove this from this PR.

Member

ditto

Contributor Author

Done

Member

this doesn't seem to be used anywhere

Contributor Author

But not for long (there's a horrible pun in here somewhere).
I need this for the vectorized implementation of DeltaByteArrayReader (which I did not include, to make review easier).

Member

in that case can we put this together with the follow-up PR?

Contributor Author

I just knew you would say that :). Done.

Member

maybe add some comments for this? what are c, rowId and val for?

Contributor Author

done

Member

It seems better to have an abstract class inheriting ValuesReader and VectorizedValuesReader with this default behavior defined, rather than repeating the same thing in all the different value readers.

this can be done separately though.
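The suggestion above can be sketched roughly as follows; the interface and class names here are simplified stand-ins for illustration, not the actual ValuesReader/VectorizedValuesReader classes in Parquet and Spark:

```java
// Illustrative sketch of a shared base class that supplies throwing defaults,
// so each concrete reader overrides only the methods its encoding supports.
// All names here are simplified stand-ins, not the real Parquet/Spark classes.
public class ReaderBaseSketch {
  interface SketchValuesReader {
    int readInteger();
    long readLong();
  }

  abstract static class SketchReaderBase implements SketchValuesReader {
    public int readInteger() {
      throw new UnsupportedOperationException("readInteger");
    }
    public long readLong() {
      throw new UnsupportedOperationException("readLong");
    }
  }

  // A reader for 32-bit values only needs to override readInteger;
  // unsupported calls fail loudly via the inherited defaults.
  static class SketchIntReader extends SketchReaderBase {
    private final int[] values;
    private int pos = 0;
    SketchIntReader(int[] values) { this.values = values; }
    @Override public int readInteger() { return values[pos++]; }
  }

  public static void main(String[] args) {
    SketchValuesReader r = new SketchIntReader(new int[] {42});
    System.out.println(r.readInteger()); // prints 42
  }
}
```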

Contributor Author

Done

Member

why return 0 here?

Contributor Author

Why indeed. (Intellij generated code for unimplemented methods)

Contributor Author

Fixed

Member

ditto

Contributor Author

Fixed

Member

I think we'll need to implement these too.

Contributor Author

Oh dear. I implemented only the methods the original PR had implemented. On closer look, we also need support for the byte, short, date, timestamp, year-month interval, and day-time interval data types, which are stored as int32 or int64.
Perf note: rebased dates and timestamps appear to be a backward-compatibility fix and incur the penalty of checking whether the value needs to be rebased.

Member

I think this can be done more efficiently, for instance we don't need to unpack the bits anymore, and don't need to compute the original value from delta, etc.

Contributor Author

I think we do. The original unit tests interleave read and skip. To continue reading after a skip, we need to have read the previous value.
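The point being made here is that in a delta encoding every value is defined relative to its predecessor, so skip cannot simply advance a cursor past the packed bytes. A hedged sketch of that constraint (a simplified model for illustration, not the actual Spark reader):

```java
// Sketch of why skip() in a delta decoder must still walk the deltas:
// every value depends on its predecessor, so the running value has to be
// maintained even for skipped positions. Simplified model, not the real reader.
public class DeltaSkipSketch {
  private final long minDelta;
  private final long[] packedDeltas;
  private long current;
  private boolean first = true;
  private int deltaPos = 0;

  DeltaSkipSketch(long firstValue, long minDelta, long[] packedDeltas) {
    this.current = firstValue;
    this.minDelta = minDelta;
    this.packedDeltas = packedDeltas;
  }

  private void advance() {
    if (first) { first = false; return; } // first value comes from the header
    current += minDelta + packedDeltas[deltaPos++];
  }

  long read() { advance(); return current; }

  // skip() decodes exactly like read() but discards the values.
  void skip(int n) { for (int i = 0; i < n; i++) advance(); }

  public static void main(String[] args) {
    // Sequence 7, 5, 3, 1, 2, 3, 4, 5 with interleaved reads and skips.
    DeltaSkipSketch r = new DeltaSkipSketch(7, -2, new long[] {0, 0, 0, 3, 3, 3, 3});
    System.out.println(r.read()); // prints 7
    r.skip(2);                    // skips 5 and 3
    System.out.println(r.read()); // prints 1
  }
}
```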

@sunchao
Member

sunchao commented Nov 4, 2021

cc @sadikovi @viirya @dongjoon-hyun too

@parthchandra
Contributor Author

Thank you for the review @sunchao! Let me address the comments.

Member

Could you use two-space indentation like the other part of this file, @parthchandra ?

Contributor Author

Done

Member

The import order is a little strange. Could you group the java imports (lines 30 and 19) together as the first group?

Contributor Author

Done

Member

nit. Let's remove redundant empty line.

Contributor Author

Done

Member

Let's make these two lines into a single line.

- readValues(total, null, -1, (w, r, v) -> {
- });
+ readValues(total, null, -1, (w, r, v) -> {});

Contributor Author

Done

Member

Could you put this at the beginning before int remaining = total;?

Contributor Author

Done

@parthchandra parthchandra marked this pull request as draft November 8, 2021 22:28
@parthchandra
Contributor Author

Looks like it may take some time to address some of the review comments. Marking this PR as draft in the meantime.

@parthchandra parthchandra marked this pull request as ready for review November 11, 2021 23:32
@SparkQA

SparkQA commented Nov 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49592/

@SparkQA

SparkQA commented Nov 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49593/

@SparkQA

SparkQA commented Nov 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49592/

Member

@sunchao sunchao left a comment

Thanks a lot for updating this @parthchandra ! Overall it looks pretty good. I think we just need to address the issue with the benchmark and attach the result to the PR. You can find out how to get benchmark results using the GitHub workflow here.

Member

nit: maybe revise this message a bit, since "total value count is + valuesRead" looks a bit confusing.

Contributor Author

done

Member

This won't work yet because of the BooleanType added recently.

Error:

[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'sum(parquetv2table.id)' due to data type mismatch: function sum requires numeric or interval types, not boolean; line 1 pos 7;
[error] 'Aggregate [unresolvedalias(sum(id#40), None)]
[error] +- SubqueryAlias parquetv2table
[error]    +- View (`parquetV2Table`, [id#40])
[error]       +- Relation [id#40] parquet

Member

nit: remove?

Contributor Author

yes. Done

Member

nit: indentation

Contributor Author

OK.

Member

nit: these seem redundant

Contributor Author

Removed

Member

hmm why this change?

Contributor Author

Uh. Rebase issue. Fixed.

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50674/

@SparkQA

SparkQA commented Dec 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50674/

@SparkQA

SparkQA commented Dec 15, 2021

Test build #146200 has finished for PR 34471 at commit f472135.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member

sunchao commented Dec 20, 2021

@parthchandra could you address the unit tests failure?

@parthchandra
Contributor Author

The unit tests are failing in parts that I am not familiar with. Previously, re-running the tests had worked, but this time around the tests are failing every time. Can I get some help figuring out where the problem is?

@sunchao
Member

sunchao commented Dec 21, 2021

If you click the SparkQA test build link you should see the failed tests. For instance:

sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: "[no more values to read, total value count is 641]" did not equal "[No more values to read. Total values read:  641, total count: 641, trying to read 1 more.]"
	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDeltaEncodingSuite.$anonfun$new$21(ParquetDeltaEncodingSuite.scala:170)
	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)

@parthchandra
Contributor Author

Thank you @sunchao. I had gone through the log and failed to see the test(s) that had failed. One of the unit tests was checking for the error message that accompanied an exception, and as part of the review I had changed the error message! Updated the test. The tests should pass now.

@SparkQA

SparkQA commented Dec 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50967/

@SparkQA

SparkQA commented Dec 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50967/

@SparkQA

SparkQA commented Dec 22, 2021

Test build #146491 has finished for PR 34471 at commit 64cc82f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@sunchao sunchao left a comment

LGTM

@sunchao
Member

sunchao commented Jan 5, 2022

Committed to master branch, thanks @parthchandra !

saveAsCsvTable(testDf, dir.getCanonicalPath + "/csv")
saveAsJsonTable(testDf, dir.getCanonicalPath + "/json")
saveAsParquetTable(testDf, dir.getCanonicalPath + "/parquet")
saveAsParquetV2Table(testDf, dir.getCanonicalPath + "/parquetV2")
Contributor

@LuciferYang LuciferYang Jan 6, 2022

Maybe we should update the benchmark-result of DataSourceReadBenchmark @parthchandra

Contributor

@LuciferYang LuciferYang Jan 7, 2022

I found that there are still unsupported encodings in Data Page V2, such as RLE for Boolean. It seems it is not yet time to update the benchmark; please ignore my previous comment.

Contributor Author

@LuciferYang I was getting ready to set up a PR for the RLE/Boolean encoding and noticed that you have done so. Thank you!
Adding back the benchmark in a new PR.

private ByteBufferInputStream in;

// temporary buffers used by readByte, readShort, readInteger, and readLong
byte byteVal;
Contributor

Should these 4 fields be private?

withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> tsOutputType) {
val df = Seq.tabulate(N)(rowFunc).toDF("dict", "plain")
.select($"dict".cast(catalystType), $"plain".cast(catalystType))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> tsOutputType) {
Member

Something is wrong with the indentation here and below:

Seq.tabulate(N)(_ => Row(nonRebased)))
}
}
}
Member

Wrong indentation here too.
