add DenseVector and SparseVector to mllib, and replace all Array[Double] with Vectors
#736
Conversation
|
Thank you for your pull request. An admin will review this request soon. |
|
You might take a look at Mahout's sparse vector support. It has substantial optimizations beyond what Colt does. For example:
|
|
Thanks for your advice! I have read that code in Mahout; my concern is that it would bring in a huge bulk of code, so I haven't ported it yet. Anyway, sparse vector optimization must be done, and I'm working on it :) |
|
Actually Mahout math purposely has limited dependencies.
|
|
Thanks @soulmachine for sending this out, and thanks @tdunning for the pointer to the mahout math package. It is definitely useful to have a math library that makes algorithms easier to write, but there are some advantages to having the external interface be Array[Double], like making it easier to call into the library from Java, etc. So there are some things to think about here. Things are a little busy this week as we are trying to get the 0.8 release out, but we will get back to you soon on this. |
|
When both vectors are dense, use JBLAS; otherwise, use optimization strategies similar to Mahout's. |
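A minimal sketch of that dispatch strategy, for illustration only: the `DenseVector`/`SparseVector` wrappers and `dot` are hypothetical stand-ins (not this PR's classes); only jblas's `DoubleMatrix.dot` is real API.

```scala
import org.jblas.DoubleMatrix

sealed trait Vector
case class DenseVector(values: Array[Double]) extends Vector
case class SparseVector(size: Int, indices: Array[Int], values: Array[Double]) extends Vector

def dot(a: Vector, b: Vector): Double = (a, b) match {
  // both dense: hand the raw arrays to JBLAS
  case (DenseVector(x), DenseVector(y)) =>
    new DoubleMatrix(x).dot(new DoubleMatrix(y))
  // one sparse operand: touch only the stored non-zeros, Mahout-style
  case (SparseVector(_, idx, vals), DenseVector(y)) =>
    var s = 0.0; var i = 0
    while (i < idx.length) { s += vals(i) * y(idx(i)); i += 1 }
    s
  case (x: DenseVector, y: SparseVector) => dot(y, x)
  // both sparse: two-pointer merge over sorted index arrays
  case (SparseVector(_, ia, va), SparseVector(_, ib, vb)) =>
    var i = 0; var j = 0; var s = 0.0
    while (i < ia.length && j < ib.length) {
      if (ia(i) == ib(j)) { s += va(i) * vb(j); i += 1; j += 1 }
      else if (ia(i) < ib(j)) i += 1
      else j += 1
    }
    s
}
```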
|
Yeah... we tried that in Mahout (BLAS for dense, hashmaps for sparse). For small vectors and matrices, this can cause significant slowdown because of the cost of the JNI. You often also have either runtime overhead, because the JNI has to pin the matrix to protect against the possibility of GC moving it, or considerable API complexity, because you have to remember to deallocate off-heap storage.

For small, simple operations like a small vector dot product, we saw 2x or more slowdown using JBLAS. For larger, complex operations like a 200x200 eigenvector decomposition, we saw up to 5-10x speedup. It isn't a one-size-fits-all sort of decision.

We have considered having a BLAS-oriented off-heap matrix type in Mahout and depending on reference queues to deallocate it. The speed advantage has not been big enough yet to matter for us, because of our normal focus on sparse matrices and the perceived portability hit. Having multiple kinds of matrices that may or may not use BLAS also brings the optimization questions into view for dense matrices, since you now have to ask whether it is worthwhile to copy a matrix off-heap in order to use BLAS. |
|
In the current version at least, JBLAS itself doesn't call into C unless you're doing one of the more expensive operations (beyond dot product or sum). So it might be fine as is. |
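To illustrate the boundary being described (taking the comment's characterization of jblas at face value; `dot`, `mmul`, and `DoubleMatrix.rand` are real jblas API):

```scala
import org.jblas.DoubleMatrix

val a = new DoubleMatrix(Array(1.0, 2.0, 3.0))
val b = new DoubleMatrix(Array(4.0, 5.0, 6.0))

a.dot(b)                      // light op: per the comment above, stays in the JVM, no JNI crossing

val m = DoubleMatrix.rand(200, 200)
m.mmul(m)                     // heavy op: dispatches to native BLAS (dgemm) through JNI
```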
|
It is definitely worth trying. |
|
@tdunning Thanks for your advice! We're working on benchmarks to find out which situations JBLAS and sparse vectors fit best. The underlying implementation is subject to change. @mateiz There are four situations:
I guess JBLAS can handle the first two situations well, and sparse vectors come in handy in the last two. We'll investigate JBLAS to confirm. |
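A sketch of the kind of micro-benchmark this suggests, comparing a jblas dot against a plain JVM loop; the harness and all names here are our illustrative assumptions, not from the PR:

```scala
import org.jblas.DoubleMatrix

// Crude timing harness; `sink` keeps the JIT from eliminating the work.
def time(reps: Int)(body: => Double): Double = {
  var sink = 0.0
  val t0 = System.nanoTime()
  var r = 0
  while (r < reps) { sink += body; r += 1 }
  val elapsed = (System.nanoTime() - t0) / 1e6
  println(f"$elapsed%.1f ms (checksum $sink%.1f)")
  elapsed
}

val n = 1000
val x = Array.fill(n)(math.random)
val y = Array.fill(n)(math.random)
val (jx, jy) = (new DoubleMatrix(x), new DoubleMatrix(y))

time(100000) { jx.dot(jy) }   // jblas path
time(100000) {                // plain hand-written JVM loop
  var s = 0.0; var i = 0
  while (i < n) { s += x(i) * y(i); i += 1 }
  s
}
```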
|
Thank you for your pull request. An admin will review this request soon. |
|
Hi, what is the point of the port? From an admittedly quick glance it looks like a lot of duplication of what Mahout has done - if so, then are we not running into issues of having to maintain this separate but very similar "math" codebase? And now mllib starts becoming a linear algebra / Matlab / Breeze / full Mahout-like library instead of just focusing on machine learning algorithms. Mahout ended up doing this work because of issues and lack of functionality with Java linear algebra libraries, esp. Colt. To me it seems like a lot of duplication of effort and maintenance. Perhaps I'm missing something in the overarching design and intentions? |
|
@MLnick Thanks for your comment! As you've pointed out, mahout-math is an ambitious project that fills the gap left by the lack of a good Java math library. Spark is intended for iterative algorithms such as machine learning algorithms. In the short term, supporting various kinds of machine learning models ASAP is much more important than a full-fledged math library. Mahout-math is a large library (over 150K lines of code) and is still in active development.

On the other hand, I think mllib only needs a small core library supporting vector and matrix operations, so we wrote this vector library for Spark mllib; its code base is only 1% the size of mahout-math's. The vector library has just been ported from mahout-math and looks very similar to it. We are actively refactoring the code to make it more Scala-like. Compared to Java, Scala is much more powerful and concise. With techniques like typeclasses and specialization, I believe we'll end up with a lean and mean library rather than a giant one. |
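For instance, a minimal sketch (our illustration, not this PR's code) of how a specialized typeclass lets one generic kernel run over primitive doubles without boxing:

```scala
// Typeclass for indexed access; @specialized so the Double instance
// compiles down to unboxed primitive calls.
trait VecOps[@specialized(Double) T, V] {
  def size(v: V): Int
  def apply(v: V, i: Int): T
}

object VecOps {
  implicit val denseDouble: VecOps[Double, Array[Double]] =
    new VecOps[Double, Array[Double]] {
      def size(v: Array[Double]): Int = v.length
      def apply(v: Array[Double], i: Int): Double = v(i)
    }
}

// One generic dot product, usable for any V with a VecOps instance.
def dot[V](a: V, b: V)(implicit ops: VecOps[Double, V]): Double = {
  val n = ops.size(a)
  var s = 0.0
  var i = 0
  while (i < n) { s += ops.apply(a, i) * ops.apply(b, i); i += 1 }
  s
}
```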
|
What would you think about contributing more Scala-isms to Mahout and using the resulting math library? Wouldn't it generally be better to join forces? Note also that there is an active project to provide a much more friendly Scala face to Mahout math. |
|
@soulmachine @mateiz @dlyubimov @dlwh Originally I understood the plan with MLlib was to have basic interfaces to the algorithms and stick to simple Array[Double] etc. for performance and Java/Python interop reasons. Now it seems there is a switch to using Vectors (which makes sense of course, in particular for sparse support, and is something I wanted from the beginning in my library). In addition there is mention of the power of Scala (typeclasses and therefore implicits, specialization, etc.).

Why then not use an existing library like Breeze and add to that, rather than build from the ground up? This is what I started to do in my Spark ML library. While there may be performance implications of all the implicit stuff, I found Breeze really nice to work with. For pure performance I understand going with primitive arrays, but if it's going to be Scala-fied anyway now, I am somewhat confused about the reasoning for not using Breeze. Or (as Ted mentions) the Scala DSL / interface to mahout-math that Dmitriy has been working on. Just wondering really, because the message I got originally seems somewhat at odds with all this work. It doesn't make a huge difference either way; all that is needed initially is good dense matrix/vector ops and decent sparse vectors. I'm just trying to understand the thinking here.

Overall it seems to me better to have a core linear algebra library that can be used across many different projects (and benefit from all those projects' devs) rather than rebuilding one from scratch. Note that I am not married to mahout-math or Breeze (though frankly I somewhat prefer Breeze at this point), but it would be really nice to get devs from Mahout on board with this project - they have a lot of expertise - and/or get David Hall of Breeze (from UC Berkeley!) involved looking at / helping out on the linear algebra stuff, and work together. |
|
Guys, for what it's worth, from the Berkeley side, we're still deciding how much math to put into MLlib. As @shivaram said, because we want a library that can be called from a wide variety of frontends (including Java, Python, Shark, MLbase, etc.), a low-level API based on arrays of doubles seems much easier. People can write interfaces on top that look more like math. So in this particular case, it might be better to base this library on Mahout Math and contribute it back to Mahout, or on Breeze, rather than duplicating effort.

It's also the case that the math within distributed ML algorithms like we have here is quite a bit simpler than what a lot of high-level math libraries try to enable. Things like Breeze are great if you want a replacement for Matlab, and we will certainly have one API like that on top of MLlib (as part of the MLbase project), but they're not that useful when writing MLlib itself, especially because we want high control over performance and memory management. |
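A sketch of the split being described -- a primitive-array core with a math-looking layer on top. All names and signatures here are hypothetical, not MLlib's actual API:

```scala
// Low-level core: primitive arrays in, primitive out -- trivially
// callable from Java and easy to bind from Python.
def predict(weights: Array[Double], features: Array[Double]): Double = {
  var s = 0.0; var i = 0
  while (i < weights.length) { s += weights(i) * features(i); i += 1 }
  s
}

// A nicer-looking wrapper can then sit on top for Scala users.
class LinearModel(val weights: Array[Double]) {
  def apply(x: Array[Double]): Double = predict(weights, x)
}
```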
|
The other thing I should add is that it's really daunting to commit to maintaining a new math library as part of Spark -- even at 10K lines, it's much more code than the rest of MLlib, just to simplify a few function calls. This is why contributing this to Mahout-Math or Breeze could be better. |
|
FWIW, I did a random forest implementation in Spark and initially used
Dr Andy Twigg |
|
So, I think the thing that makes the most sense is to have an additional lib downstream from Breeze and Spark/Mahout-Math that just adds Breeze operator support to whatever classes are used. What you have to do is (example for mahout-math's Vector):
And then most of the Breeze API should just work for that class. That includes sum, normalize, L-BFGS, Adagrad, etc. etc. -- David |
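A sketch of the glue being described, using Breeze's operator typeclasses (`OpAdd.Impl2` and `OpMulInner.Impl2` are real Breeze names, though from a somewhat later Breeze than this thread), with a hypothetical `MyVec` standing in for mahout-math's Vector:

```scala
import breeze.linalg.NumericOps
import breeze.linalg.operators.{OpAdd, OpMulInner}

// Stand-in for an external vector class (e.g. mahout-math's Vector).
// Extending NumericOps gives it Breeze's operator syntax.
class MyVec(val data: Array[Double]) extends NumericOps[MyVec] {
  def repr: MyVec = this
}

object MyVec {
  // Typeclass instance for element-wise addition: enables `a + b`
  implicit object AddImpl extends OpAdd.Impl2[MyVec, MyVec, MyVec] {
    def apply(a: MyVec, b: MyVec): MyVec = {
      val out = new Array[Double](a.data.length)
      var i = 0
      while (i < out.length) { out(i) = a.data(i) + b.data(i); i += 1 }
      new MyVec(out)
    }
  }
  // Typeclass instance for the inner product: enables `a dot b`
  implicit object DotImpl extends OpMulInner.Impl2[MyVec, MyVec, Double] {
    def apply(a: MyVec, b: MyVec): Double = {
      var s = 0.0; var i = 0
      while (i < a.data.length) { s += a.data(i) * b.data(i); i += 1 }
      s
    }
  }
}
```

With those instances in scope, `u + v` and `u dot v` work on `MyVec`, and generic Breeze code that only demands those typeclasses should work too.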
|
@dlwh it would be quite a nice idea to have a single linear algebra DSL that then has pluggable backends! @mateiz the original reasoning behind having MLlib use much more raw primitive inputs makes sense to me, far more so than building a math library from scratch.

In my view it's either stick to the first approach or go for a full-blown nice math DSL that has pluggable backends (and use either the Breeze DSL or Dmitriy's Mahout/Scala DSL as the standard frontend). Both have their merits (performance and interop for the former, ease of understanding and use for the latter). I can see something working like a wrapper library that provides (pretty trivial) implicit conversions from the raw Array[Double] representation.

Note one thing that is in my view an absolute requirement: SparseVector, since a large-scale linear learning algo will need it. |
|
Yup, I think something based on implicit conversions would be great for Scala APIs. I like that Breeze can do this to add new vector types already. |
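For example, a minimal sketch of that idea (all names hypothetical; the core API stays on raw arrays):

```scala
object VectorImplicits {
  import scala.language.implicitConversions

  // Thin syntax wrapper; Java/Python callers still see plain double[].
  class VectorSyntax(private val a: Array[Double]) {
    def +(b: Array[Double]): Array[Double] = {
      val out = new Array[Double](a.length)
      var i = 0
      while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
      out
    }
    def dot(b: Array[Double]): Double = {
      var s = 0.0; var i = 0
      while (i < a.length) { s += a(i) * b(i); i += 1 }
      s
    }
  }

  implicit def arrayToVectorSyntax(a: Array[Double]): VectorSyntax =
    new VectorSyntax(a)
}

// Scala callers: import VectorImplicits._ and then `x + y` and
// `x dot y` work directly on plain arrays.
```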
|
FWIW I think good in-core math is important. After looking at various in-core JVM stuff, my bottom line is that all existing implementations have gaps, as well as their strong sides. And IMO it is not much of a tragedy to pick one, fill in the gaps, and perhaps join forces, just like @tdunning suggests. It is even quite possible to do selective adoption, just like Mahout did with Colt, to address @soulmachine's "lean codebase" concern.

What would however be a real tragedy (to me, at least) is if Spark ML stuff adopts an in-core linalg architecture that is actually less consistent than any of these alone. Which so far seems to be where it all is headed (less MLI, it seems; I liked their way of thinking about the data frame abstraction) -- but then I will have to accept Matei's explanation of still picking a direction, etc.

I also want to stress the emerging importance of sparse in-core stuff in real life (Mahout has two types of sparse vectors and a cost-based optimizer selecting the proper algorithm for a given type of operation and the proper product structure). That's its strong side (along with general Matrix and Vector functional abstractions which could plug in a JBlas backend, and I'll probably eventually do that too) over anything else I looked at, and that's what I need at the moment. Not without some gaps, though, too.

@soulmachine @mateiz Re: Scala lean and mean: the Scala DSL and implicit conversions sure look nice, but they are pretty dangerous in loops. In my tests, unscrupulous use of implicits (which is inherent to a Scala DSL) produces up to two orders of magnitude slowdown in bulk numeric loops. So... the DSL is primarily for the matrix users (including blockwise distributed algorithm implementers), but not so much for in-core linalg building-block implementers. So are, sadly, "for" comprehensions, a.k.a. for loops (see the sketch below). My understanding is that it is still an issue even in 2.10. I am really tired of writing "while" loops by now; this is even more tedious than Java. The respective issue in Scala seems to be still at the general discussion stage; there's no patch or even a concrete direction on the horizon yet (or so I gathered from reading their JIRA). Which is why I prefer to actually work on Mahout in-core stuff in Java. With that caveat in mind, I am all for linalg DSLs, and it sure looks pretty in my distributed code. |
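The sketch referred to above -- the inner-loop divergence being described (the magnitude of the slowdown is the comment's claim, not something this snippet proves):

```scala
val n = 1000000
val x, y = Array.fill(n)(1.0)

// Idiomatic but slow in a hot loop: the `for` comprehension desugars
// to Range.foreach with a closure, with boxing from any DSL implicits
// layered on top.
def dotFor(): Double = {
  var s = 0.0
  for (i <- 0 until n) s += x(i) * y(i)
  s
}

// The hand-written while loop that implementers fall back to.
def dotWhile(): Double = {
  var s = 0.0; var i = 0
  while (i < n) { s += x(i) * y(i); i += 1 }
  s
}
```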
|
Regarding the size of the Mahout library, here are some actual line counts (for the actual Java classes): the math library itself is only about 30K lines of code. Mahout also has a templated primitive collections system; the original templates are comparatively small, but after automated expansion these templates have many more lines of code -- the underlying collections library is nearly 100K lines after expansion. The collections library has had one round of changes/fixes relatively recently, but has otherwise been very stable. My feeling is that the Mahout library isn't all that big. |
Iteration is a touchy point in Mahout itself. We have been unable to build an iterator that is simultaneously high-performance and usable in functional settings. The issue is that if we allocate a new carrier object for the index and value, we have good generality and 2x lower performance. If we re-use the carrier, then we get high performance, but you can't use the iterator with things like Google's Guava Iterables class. All suggestions to remedy this are welcome! (other than saying that Java should do better escape analysis)
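A sketch of the trade-off being described, in Scala for consistency with the rest of this thread; `Element` and `nonZeroIterator` are our hypothetical names:

```scala
// Mutable carrier for (index, value) pairs during sparse iteration.
final class Element(var index: Int = 0, var value: Double = 0.0)

// Reusing one carrier avoids a per-step allocation, but the yielded
// object is only valid until the next call to next() -- which breaks
// consumers (e.g. Guava's Iterables) that retain elements.
def nonZeroIterator(indices: Array[Int], values: Array[Double]): Iterator[Element] =
  new Iterator[Element] {
    private val carrier = new Element
    private var i = 0
    def hasNext: Boolean = i < indices.length
    def next(): Element = {
      carrier.index = indices(i); carrier.value = values(i); i += 1
      carrier
    }
  }

// The safe-but-slower alternative allocates a fresh Element per step:
// def next(): Element = { val e = new Element(indices(i), values(i)); i += 1; e }
```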
Why didn't you use mkString("[", ", ", "]") here? Just curious.
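(For reference, what that suggestion produces, as a REPL illustration:)

```scala
scala> Array(1.0, 2.0, 3.0).mkString("[", ", ", "]")
res0: String = [1.0, 2.0, 3.0]
```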
|
Hi guys, thanks for your attention to this pull request. After serious consideration of all your comments, we have decided to close it. We are sorry for our abruptness and the lack of prior communication about the solution to the vector abstraction problem.

@MLnick Thanks for pointing out our most serious problems and starting the whole discussion, from which many insightful opinions emerged. @tdunning mahout-math is an ambitious project which fills the gap left by the absence of a good linear algebra library on the JVM. Recently I spent a lot of time reading mahout-math's source code; I'd like to contribute to it when I have a more thorough understanding. @dlyubimov The Scala DSL over mahout-math is a good idea: its code base is an order of magnitude smaller than our library, but it achieves the same effect. It really shows the power of joining forces. @mateiz I agree that Spark should remain a concise and small code base.

I think it's time to close this pull request now. As for mllib's linear algebra library, I think mahout-math would be a good choice, since many people engaged in machine learning are familiar with Mahout, and @dlyubimov's Scala DSL is very thin and lightweight. |
|
Thanks @soulmachine -- we really appreciate the interest in MLlib and the effort you've put into this. Hopefully it's at least a good learning experience, and maybe something that will help improve @dlyubimov's project or Mahout Math. |
|
FWIW I actually think this PR was hugely valuable despite not going anywhere, simply for the discussion it encouraged and for highlighting the pretty clear benefits of collaboration across communities! Thanks!
|
SPARK-830: add DenseVector and SparseVector to mllib, and replace all Array[Double] with Vectors