add DenseVector and SparseVector to mllib, and replace all Array[Double] with Vectors
#736
Conversation
|
Thank you for your pull request. An admin will review this request soon. |
|
You might take a look at Mahout's sparse vector support. It has substantial optimizations beyond what Colt does. For example:
|
|
Thanks for your advice! I have read that code in Mahout; my concern is that it would bring in a huge bulk of code, so I haven't ported it yet. Anyway, sparse vector optimization must be done, and I'm working on it :) |
|
Actually Mahout math purposely has limited dependencies.
|
|
Thanks @soulmachine for sending this out, and thanks @tdunning for the pointer to the mahout math package. It is definitely useful to have a math library that makes algorithms easier to write, but there are some advantages to having the external interface be Array[Double], like making it easier to call into the library from Java, etc. So there are some things to think about here. Things are a little busy this week as we are trying to get the 0.8 release out, but we will get back to you soon on this. |
|
When both vectors are dense, use JBLAS; otherwise, use optimization strategies similar to Mahout's. |
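A minimal sketch of that dispatch strategy, for illustration only: the `DenseVector`/`SparseVector` wrappers and `dot` are hypothetical stand-ins (not this PR's classes); only jblas's `DoubleMatrix.dot` is real API.

```scala
import org.jblas.DoubleMatrix

sealed trait Vector
case class DenseVector(values: Array[Double]) extends Vector
case class SparseVector(size: Int, indices: Array[Int], values: Array[Double]) extends Vector

def dot(a: Vector, b: Vector): Double = (a, b) match {
  // both dense: hand the raw arrays to JBLAS
  case (DenseVector(x), DenseVector(y)) =>
    new DoubleMatrix(x).dot(new DoubleMatrix(y))
  // one sparse operand: touch only the stored non-zeros, Mahout-style
  case (SparseVector(_, idx, vals), DenseVector(y)) =>
    var s = 0.0; var i = 0
    while (i < idx.length) { s += vals(i) * y(idx(i)); i += 1 }
    s
  case (x: DenseVector, y: SparseVector) => dot(y, x)
  // both sparse: two-pointer merge over sorted index arrays
  case (SparseVector(_, ia, va), SparseVector(_, ib, vb)) =>
    var i = 0; var j = 0; var s = 0.0
    while (i < ia.length && j < ib.length) {
      if (ia(i) == ib(j)) { s += va(i) * vb(j); i += 1; j += 1 }
      else if (ia(i) < ib(j)) i += 1
      else j += 1
    }
    s
}
```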
|
Yeah... we tried that in Mahout (BLAS for dense, hashmaps for sparse). For small vectors and matrices, this can cause significant slowdown because of the cost of the JNI. You often also have either runtime overhead, because the JNI has to pin the matrix to protect against the possibility of GC moving it, or considerable API complexity, because you have to remember to deallocate off-heap storage.

For small, simple operations like a small vector dot product, we saw 2x or more slowdown using JBLAS. For larger, complex operations like a 200x200 eigenvector decomposition, we saw up to 5-10x speedup. It isn't a one-size-fits-all sort of decision.

We have considered having a BLAS-oriented off-heap matrix type in Mahout and depending on reference queues to deallocate it. The speed advantage has not been big enough yet to matter for us, because of our normal focus on sparse matrices and the perceived portability hit. Having multiple kinds of matrices that may or may not use BLAS also brings the optimization questions into view for dense matrices, since you now have to ask whether it is worthwhile to copy a matrix off-heap in order to use BLAS. |
|
In the current version at least, JBLAS itself doesn't call into C unless you're doing one of the more expensive operations (beyond dot product or sum). So it might be fine as is. |
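To illustrate the boundary being described (taking the comment's characterization of jblas at face value; `dot`, `mmul`, and `DoubleMatrix.rand` are real jblas API):

```scala
import org.jblas.DoubleMatrix

val a = new DoubleMatrix(Array(1.0, 2.0, 3.0))
val b = new DoubleMatrix(Array(4.0, 5.0, 6.0))

a.dot(b)                      // light op: per the comment above, stays in the JVM, no JNI crossing

val m = DoubleMatrix.rand(200, 200)
m.mmul(m)                     // heavy op: dispatches to native BLAS (dgemm) through JNI
```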
|
It is definitely worth trying. |
|
@tdunning Thanks for your advice! We're working on benchmarks to find out which situations JBLAS and sparse vectors fit best. The underlying implementation is subject to change. @mateiz There are four situations:
I guess JBLAS can handle the first two situations well, and sparse vectors come in handy in the last two. We'll investigate JBLAS to confirm. |
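A sketch of the kind of micro-benchmark this suggests, comparing a jblas dot against a plain JVM loop; the harness and all names here are our illustrative assumptions, not from the PR:

```scala
import org.jblas.DoubleMatrix

// Crude timing harness; `sink` keeps the JIT from eliminating the work.
def time(reps: Int)(body: => Double): Double = {
  var sink = 0.0
  val t0 = System.nanoTime()
  var r = 0
  while (r < reps) { sink += body; r += 1 }
  val elapsed = (System.nanoTime() - t0) / 1e6
  println(f"$elapsed%.1f ms (checksum $sink%.1f)")
  elapsed
}

val n = 1000
val x = Array.fill(n)(math.random)
val y = Array.fill(n)(math.random)
val (jx, jy) = (new DoubleMatrix(x), new DoubleMatrix(y))

time(100000) { jx.dot(jy) }   // jblas path
time(100000) {                // plain hand-written JVM loop
  var s = 0.0; var i = 0
  while (i < n) { s += x(i) * y(i); i += 1 }
  s
}
```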
|
Thank you for your pull request. An admin will review this request soon. |
|
Hi, what is the point of the port? From an admittedly quick glance it looks like a lot of duplication of what Mahout has done - if so, then are we not running into issues of having to maintain this separate but very similar "math" codebase? And now mllib starts becoming a linear algebra / Matlab / Breeze / full Mahout-like library instead of just focusing on machine learning algorithms. Mahout ended up doing this work because of issues and lack of functionality with Java linear algebra libraries, esp. Colt. To me it seems like a lot of duplication of effort and maintenance. Perhaps I'm missing something in the overarching design and intentions? |
|
@MLnick Thanks for your comment! As you've pointed out, mahout-math is an ambitious project that fills the gap left by the lack of a good Java math library. Spark is intended for iterative algorithms such as machine learning algorithms. In the short term, supporting various kinds of machine learning models ASAP is much more important than a full-fledged math library. Mahout-math is a large library (over 150K lines of code) and is still in active development.

On the other hand, I think mllib only needs a small core library supporting vector and matrix operations, so we wrote this vector library for Spark mllib; its code base is only 1% the size of mahout-math's. The vector library has just been ported from mahout-math and looks very similar to it. We are actively refactoring the code to make it more Scala-like. Compared to Java, Scala is much more powerful and concise. With techniques like typeclasses and specialization, I believe we'll end up with a lean and mean library rather than a giant one. |
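For instance, a minimal sketch (our illustration, not this PR's code) of how a specialized typeclass lets one generic kernel run over primitive doubles without boxing:

```scala
// Typeclass for indexed access; @specialized so the Double instance
// compiles down to unboxed primitive calls.
trait VecOps[@specialized(Double) T, V] {
  def size(v: V): Int
  def apply(v: V, i: Int): T
}

object VecOps {
  implicit val denseDouble: VecOps[Double, Array[Double]] =
    new VecOps[Double, Array[Double]] {
      def size(v: Array[Double]): Int = v.length
      def apply(v: Array[Double], i: Int): Double = v(i)
    }
}

// One generic dot product, usable for any V with a VecOps instance.
def dot[V](a: V, b: V)(implicit ops: VecOps[Double, V]): Double = {
  val n = ops.size(a)
  var s = 0.0
  var i = 0
  while (i < n) { s += ops.apply(a, i) * ops.apply(b, i); i += 1 }
  s
}
```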
|
What would you think about contributing more Scala-isms to Mahout and using the resulting math library? Wouldn't it generally be better to join forces? Note also that there is an active project to provide a much more friendly Scala face to Mahout math. |
|
@soulmachine @mateiz @dlyubimov @dlwh Originally I understood the plan with MLlib was to have basic interfaces to the algorithms and stick to simple Array[Double] etc. for performance and Java/Python interop reasons. Now it seems there is a switch to using Vectors (which makes sense of course, in particular for sparse support, and is something I wanted from the beginning in my library). In addition there is mention of the power of Scala (typeclasses and therefore implicits, specialization, etc.).

Why then not use an existing library like Breeze and add to that, rather than build from the ground up? This is what I started to do in my Spark ML library. While there may be performance implications of all the implicit stuff, I found Breeze really nice to work with. For pure performance I understand going with primitive arrays, but if it's going to be Scala-fied anyway now, I am somewhat confused about the reasoning for not using Breeze. Or (as Ted mentions) the Scala DSL / interface to mahout-math that Dmitriy has been working on. Just wondering really, because the message I got originally seems somewhat at odds with all this work. It doesn't make a huge difference either way; all that is needed initially is good dense matrix/vector ops and decent sparse vectors. I'm just trying to understand the thinking here.

Overall it seems to me better to have a core linear algebra library that can be used across many different projects (and benefit from all those projects' devs) rather than rebuilding one from scratch. Note that I am not married to mahout-math or Breeze (though frankly I somewhat prefer Breeze at this point), but it would be really nice to get devs from Mahout on board with this project - they have a lot of expertise - and/or get David Hall of Breeze (from UC Berkeley!) involved looking at / helping out on the linear algebra stuff, and work together. |
|
Guys, for what it's worth, from the Berkeley side, we're still deciding how much math to put into MLlib. As @shivaram said, because we want a library that can be called from a wide variety of frontends (including Java, Python, Shark, MLbase, etc.), a low-level API based on arrays of doubles seems much easier. People can write interfaces on top that look more like math. So in this particular case, it might be better to base this library on Mahout Math and contribute it back to Mahout, or on Breeze, rather than duplicating effort.

It's also the case that the math within distributed ML algorithms like we have here is quite a bit simpler than what a lot of high-level math libraries try to enable. Things like Breeze are great if you want a replacement for Matlab, and we will certainly have one API like that on top of MLlib (as part of the MLbase project), but they're not that useful when writing MLlib itself, especially because we want high control over performance and memory management. |
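A sketch of the split being described -- a primitive-array core with a math-looking layer on top. All names and signatures here are hypothetical, not MLlib's actual API:

```scala
// Low-level core: primitive arrays in, primitive out -- trivially
// callable from Java and easy to bind from Python.
def predict(weights: Array[Double], features: Array[Double]): Double = {
  var s = 0.0; var i = 0
  while (i < weights.length) { s += weights(i) * features(i); i += 1 }
  s
}

// A nicer-looking wrapper can then sit on top for Scala users.
class LinearModel(val weights: Array[Double]) {
  def apply(x: Array[Double]): Double = predict(weights, x)
}
```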
|
The other thing I should add is that it's really daunting to commit to maintaining a new math library as part of Spark -- even at 10K lines, it's much more code than the rest of MLlib, just to simplify a few function calls. This is why contributing this to Mahout-Math or Breeze could be better. |
|
FWIW, I did a random forest implementation in Spark and initially used
Dr Andy Twigg |
|
So, I think the thing that makes the most sense is to have an additional lib downstream from Breeze and Spark/Mahout-Math that just adds Breeze operator support to whatever classes are used. What you have to do is (example for mahout-math's Vector):
And then most of the Breeze API should just work for that class. That includes sum, normalize, L-BFGS, Adagrad, etc. etc. -- David |
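A sketch of the glue being described, using Breeze's operator typeclasses (`OpAdd.Impl2` and `OpMulInner.Impl2` are real Breeze names, though from a somewhat later Breeze than this thread), with a hypothetical `MyVec` standing in for mahout-math's Vector:

```scala
import breeze.linalg.NumericOps
import breeze.linalg.operators.{OpAdd, OpMulInner}

// Stand-in for an external vector class (e.g. mahout-math's Vector).
// Extending NumericOps gives it Breeze's operator syntax.
class MyVec(val data: Array[Double]) extends NumericOps[MyVec] {
  def repr: MyVec = this
}

object MyVec {
  // Typeclass instance for element-wise addition: enables `a + b`
  implicit object AddImpl extends OpAdd.Impl2[MyVec, MyVec, MyVec] {
    def apply(a: MyVec, b: MyVec): MyVec = {
      val out = new Array[Double](a.data.length)
      var i = 0
      while (i < out.length) { out(i) = a.data(i) + b.data(i); i += 1 }
      new MyVec(out)
    }
  }
  // Typeclass instance for the inner product: enables `a dot b`
  implicit object DotImpl extends OpMulInner.Impl2[MyVec, MyVec, Double] {
    def apply(a: MyVec, b: MyVec): Double = {
      var s = 0.0; var i = 0
      while (i < a.data.length) { s += a.data(i) * b.data(i); i += 1 }
      s
    }
  }
}
```

With those instances in scope, `u + v` and `u dot v` work on `MyVec`, and generic Breeze code that only demands those typeclasses should work too.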
|
@dlwh it would be quite a nice idea to have a single linear algebra DSL that then has pluggable backends! @mateiz the original reasoning behind having MLlib use much more raw primitive inputs makes sense to me, far more so than building a math library from scratch.

In my view it's either stick to the first approach or go for a full-blown nice math DSL that has pluggable backends (and use either the Breeze DSL or Dmitriy's Mahout/Scala DSL as the standard frontend). Both have their merits (performance and interop for the former, ease of understanding and use for the latter). I can see something working like a wrapper library that provides (pretty trivial) implicit conversions from the raw Array[Double] representation.

Note one thing that is in my view an absolute requirement: SparseVector, since a large-scale linear learning algo will need it. |
|
Yup, I think something based on implicit conversions would be great for Scala APIs. I like that Breeze can do this to add new vector types already. |
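For example, a minimal sketch of that idea (all names hypothetical; the core API stays on raw arrays):

```scala
object VectorImplicits {
  import scala.language.implicitConversions

  // Thin syntax wrapper; Java/Python callers still see plain double[].
  class VectorSyntax(private val a: Array[Double]) {
    def +(b: Array[Double]): Array[Double] = {
      val out = new Array[Double](a.length)
      var i = 0
      while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
      out
    }
    def dot(b: Array[Double]): Double = {
      var s = 0.0; var i = 0
      while (i < a.length) { s += a(i) * b(i); i += 1 }
      s
    }
  }

  implicit def arrayToVectorSyntax(a: Array[Double]): VectorSyntax =
    new VectorSyntax(a)
}

// Scala callers: import VectorImplicits._ and then `x + y` and
// `x dot y` work directly on plain arrays.
```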
|
FWIW I think good in-core math is important. After looking at various in-core JVM stuff, my bottom line is that all existing implementations have gaps, as well as their strong sides. And IMO it is not much of a tragedy to pick one, fill in the gaps, and perhaps join forces, just like @tdunning suggests. It is even quite possible to do selective adoption, just like Mahout did with Colt, to address @soulmachine's "lean codebase" concern.

What would however be a real tragedy (to me, at least) is if Spark ML stuff adopts an in-core linalg architecture that is actually less consistent than any of these alone. Which so far seems to be where it all is headed (less MLI, it seems; I liked their way of thinking about the data frame abstraction) -- but then I will have to accept Matei's explanation of still picking a direction, etc.

I also want to stress the emerging importance of sparse in-core stuff in real life (Mahout has two types of sparse vectors and a cost-based optimizer selecting the proper algorithm for a given type of operation and the proper product structure). That's its strong side (along with general Matrix and Vector functional abstractions which could plug in a JBlas backend, and I'll probably eventually do that too) over anything else I looked at, and that's what I need at the moment. Not without some gaps, though, too.

@soulmachine @mateiz Re: Scala lean and mean: the Scala DSL and implicit conversions sure look nice, but they are pretty dangerous in loops. In my tests, unscrupulous use of implicits (which is inherent to a Scala DSL) produces up to two orders of magnitude slowdown in bulk numeric loops. So... the DSL is primarily for the matrix users (including blockwise distributed algorithm implementers), but not so much for in-core linalg building-block implementers. So are, sadly, "for" comprehensions, a.k.a. for loops (see the sketch below). My understanding is that it is still an issue even in 2.10. I am really tired of writing "while" loops by now; this is even more tedious than Java. The respective issue in Scala seems to be still at the general discussion stage; there's no patch or even a concrete direction on the horizon yet (or so I gathered from reading their JIRA). Which is why I prefer to actually work on Mahout in-core stuff in Java. With that caveat in mind, I am all for linalg DSLs, and it sure looks pretty in my distributed code. |
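The sketch referred to above -- the inner-loop divergence being described (the magnitude of the slowdown is the comment's claim, not something this snippet proves):

```scala
val n = 1000000
val x, y = Array.fill(n)(1.0)

// Idiomatic but slow in a hot loop: the `for` comprehension desugars
// to Range.foreach with a closure, with boxing from any DSL implicits
// layered on top.
def dotFor(): Double = {
  var s = 0.0
  for (i <- 0 until n) s += x(i) * y(i)
  s
}

// The hand-written while loop that implementers fall back to.
def dotWhile(): Double = {
  var s = 0.0; var i = 0
  while (i < n) { s += x(i) * y(i); i += 1 }
  s
}
```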
|
Regarding the size of the Mahout library, here are some actual line counts (for the actual Java classes): the math library itself is only about 30K lines of code. Mahout also has a templated primitive collections system; the original templates are comparatively small, but after automated expansion these templates have many more lines of code -- the underlying collections library is nearly 100K lines after expansion. The collections library has had one round of changes/fixes relatively recently, but has otherwise been very stable. My feeling is that the Mahout library isn't all that big. |
Iteration is a touchy point in Mahout itself. We have been unable to build an iterator that is simultaneously high-performance and usable in functional settings. The issue is that if we allocate a new carrier object for the index and value, we have good generality and 2x lower performance. If we re-use the carrier, then we get high performance, but you can't use the iterator with things like Google's Guava Iterables class. All suggestions to remedy this are welcome! (other than saying that Java should do better escape analysis)
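A sketch of the trade-off being described, in Scala for consistency with the rest of this thread; `Element` and `nonZeroIterator` are our hypothetical names:

```scala
// Mutable carrier for (index, value) pairs during sparse iteration.
final class Element(var index: Int = 0, var value: Double = 0.0)

// Reusing one carrier avoids a per-step allocation, but the yielded
// object is only valid until the next call to next() -- which breaks
// consumers (e.g. Guava's Iterables) that retain elements.
def nonZeroIterator(indices: Array[Int], values: Array[Double]): Iterator[Element] =
  new Iterator[Element] {
    private val carrier = new Element
    private var i = 0
    def hasNext: Boolean = i < indices.length
    def next(): Element = {
      carrier.index = indices(i); carrier.value = values(i); i += 1
      carrier
    }
  }

// The safe-but-slower alternative allocates a fresh Element per step:
// def next(): Element = { val e = new Element(indices(i), values(i)); i += 1; e }
```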
Why didn't you use mkString("[", ", ", "]") here? Just curious.
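(For reference, what that suggestion produces, as a REPL illustration:)

```scala
scala> Array(1.0, 2.0, 3.0).mkString("[", ", ", "]")
res0: String = [1.0, 2.0, 3.0]
```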
|
Hi guys, thanks for your attention to this pull request. After serious consideration of all your comments, we have decided to close it. We are sorry for our abruptness and the lack of prior communication about the solution to the vector abstraction problem.

@MLnick Thanks for pointing out our most serious problems and starting the whole discussion, from which many insightful opinions emerged. @tdunning mahout-math is an ambitious project which fills the gap left by the absence of a good linear algebra library on the JVM. Recently I spent a lot of time reading mahout-math's source code; I'd like to contribute to it when I have a more thorough understanding. @dlyubimov The Scala DSL over mahout-math is a good idea: its code base is an order of magnitude smaller than our library, but it achieves the same effect. It really shows the power of joining forces. @mateiz I agree that Spark should remain a concise and small code base.

I think it's time to close this pull request now. As for mllib's linear algebra library, I think mahout-math would be a good choice, since many people engaged in machine learning are familiar with Mahout, and @dlyubimov's Scala DSL is very thin and lightweight. |
|
Thanks @soulmachine -- we really appreciate the interest in MLlib and the effort you've put into this. Hopefully it's at least a good learning experience, and maybe something that will help improve @dlyubimov's project or Mahout Math. |
|
FWIW I actually think this PR was hugely valuable despite not going anywhere, simply for the discussion it encouraged and for highlighting the pretty clear benefits of collaboration across communities! Thanks!
|
SPARK-830: add DenseVector and SparseVector to mllib, and replace all Array[Double] with Vectors