-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-13390][SQL]Fix the issue due to Iterator.map().toSeq is not Serializable #11313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #51694 has finished for PR 11313 at commit
|
|
Maybe good to add a test which would have failed before this change? |
| val beanInfo = Introspector.getBeanInfo(beanClass) | ||
| val rows = SQLContext.beansToRows(data.asScala.iterator, beanInfo, attrSeq) | ||
| DataFrame(self, LocalRelation(attrSeq, rows.toSeq)) | ||
| DataFrame(self, LocalRelation(attrSeq, rows.toArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toArray could be kind of expensive for large local iterators - maybe we should only do this if necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you have to materialize it anyways in order to support multiple scans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah I know we need to materialize anyways (toSeq forces that too on Iterators) - but I meant more the cost of copying all the objects to an array (see https://github.com/scala/scala/blob/v2.11.1/src/library/scala/collection/TraversableOnce.scala#L269 ) so maybe only calling toArray if necessary would be worth while (but perhaps not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was the same before -- it was materialized into a Seq. I think Josh is suggesting this is inherently necessary before making a LocalRelation so there's no meaningful opportunity for lazy eval (?) If so then this seems fine.
|
Just found this happened to be fixed in #10511. I'm going to resubmit this patch against branch-1.6 since this is not necessary for master. |
|
See #11334 instead. |
What changes were proposed in this pull request?
scala.collection.Iterator's methods (e.g., map, filter) will return anAbstractIteratorwhich is not Serializable. E.g.,If we call something like
Iterator.map(...).toSeq, it will create aStreamthat contains a non-serializableAbstractIteratorfield and make theStreambe non-serializable.This PR uses
toArrayinstead oftoSeqto fix such issue indef createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame.How was the this patch tested?
Jenkins tests.