
Conversation

zhangjiajin
Contributor

Collect enough frequent prefixes before projection in PrefixSpan

@zhangjiajin
Contributor Author

@mengxr This is the new PR; please review it. Thanks.

@mengxr
Contributor

mengxr commented Jul 15, 2015

cc @feynmanliang

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37309 has finished for PR 7412 at commit 6560c69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

No need to parallelize if you remove the collect on L88 (will still need collect on L90 for now)

Contributor Author

OK

Contributor

nit: split**t**ed

Contributor

Actually, instead of `._1` and `._2` below, why not just assign as in L94 here as well?

Contributor Author

OK

@feynmanliang
Contributor

Made some suggestions; see how perf changes after them. Unfortunately, scanning the dataset to ensure suffixes are bounded will introduce a performance hit. I still think it's worth it, though, since it's certainly better than just failing.

It may be worthwhile to test that these changes prevent executor failure due to overload. One way to do that would be to use a large enough dataset and set spark.akka.frameSize small enough that the first method fails but the second passes.
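The idea under discussion can be sketched without Spark: partition prefix–suffix pairs by the total size of each prefix's projected database, so small projected databases can be mined locally while large ones stay distributed. This is a minimal sketch only; plain Scala collections stand in for RDDs, and the names `splitBySize` and its signature are hypothetical, not taken from the PR diff.

```scala
// Sketch of splitting prefix-suffix pairs by projected database size.
// Plain Seq stands in for an RDD; all names here are illustrative.
object SplitSketch {
  type Prefix = List[Int]
  type Suffix = List[Int]

  def splitBySize(
      pairs: Seq[(Prefix, Suffix)],
      maxLocalProjDBSize: Long): (Seq[(Prefix, Suffix)], Seq[(Prefix, Suffix)]) = {
    // Total projected database size per prefix, counted in suffix items.
    val sizeByPrefix: Map[Prefix, Long] =
      pairs.groupBy(_._1).map { case (p, ps) => p -> ps.map(_._2.length.toLong).sum }
    // Small projected databases can be mined locally; large ones stay distributed.
    pairs.partition { case (prefix, _) => sizeByPrefix(prefix) <= maxLocalProjDBSize }
  }
}
```

Computing `sizeByPrefix` is the extra scan of the dataset mentioned above; the cost buys bounded suffix sets per local task.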

@zhangjiajin
Contributor Author

@feynmanliang Regarding splitPrefixSuffixPairs: I compared the two methods. I found that your method's running time is longer than mine, and its result is not correct. I don't know why; please check it. Thank you.

@zhangjiajin
Contributor Author

@feynmanliang You are right, it is worth preventing executor failure; I very much agree. I tested it following your suggestion.
The performance results are not yet stable; I am trying to find the cause, which may be the environment. I will post the numbers once this is resolved.

@SparkQA

SparkQA commented Jul 27, 2015

Test build #38502 has finished for PR 7412 at commit 64271b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 29, 2015

Test build #38791 has finished for PR 7412 at commit ad23aa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Please leave a TODO to make it configurable with a better default value. 10000 may be too small.
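The requested TODO might look like the following. This is a sketch under assumptions: the threshold name maxLocalProjDBSize comes from a later comment in this thread, the enclosing object is invented for illustration, and 10000 is the default being questioned above.

```scala
// Illustrative only: a placeholder for the configurable threshold.
object PrefixSpanConfig {
  // TODO: make this configurable with a better default value;
  // 10000 items may be too small for real datasets.
  val maxLocalProjDBSize: Long = 10000L
}
```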

Contributor Author

OK

@mengxr
Contributor

mengxr commented Jul 29, 2015

@zhangjiajin @feynmanliang I made some minor comments. We can address the default value of maxLocalProjDBSize in a follow-up PR. To generate the final RDD, I would recommend using sc.parallel(localSmall, 1) ++ getPatternsInLocal instead of chaining multiple RDDs. Let's fix those and merge this PR, so we can unblock related PRs. Thanks!
