Tidy Android Instructions README.md #7016

Jeximo · 2024-04-30T23:40:29Z

This tidies the README's Android section, which still carried the CLBlast instructions.

Removed the outdated CLBlast instructions and simplified the Android CPU build instructions.

Remove CLBlast instructions(outdated), added OpenBlas.

Added apt install git, so that git clone works

slaren · 2024-04-30T23:56:04Z

Is OpenBLAS actually worth using in Android? For quantized models, it may be faster without it. Ultimately though, without the OpenCL instructions, this basically looks like "install termux and follow the normal build instructions for linux". So maybe it would be simpler that way.

Jeximo · 2024-05-01T00:14:33Z

> Is OpenBLAS actually worth using in Android?

I like leaving the decision of whether OpenBLAS is worth it to the user. I don't use it, but I also don't run large prompts (supposedly that's where it shines).

> this basically looks like "install termux and follow the normal build instructions for linux". So maybe it would be simpler that way.

Agreed.

Linked to Linux build instructions

Remove word "run"

teleprint-me · 2024-05-01T00:28:21Z

I build with OpenBLAS on Android, not that it matters. My chiming in is, unfortunately, anecdotal. Is the difference really negligible? It's more difficult to tell on the phone, if I'm being honest.

slaren · 2024-05-01T00:29:57Z

The easiest way to tell if OpenBLAS helps would be to run llama-bench and look at the pp performance. BLAS is only used for prompts with at least 32 tokens.

Jeximo · 2024-05-01T03:26:26Z

CPU is definitely faster with quants on my device:

OpenBlas:

| model                          |       size |     params | backend    |    threads |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 8B IQ4_XS - 4.25 bpw     |   3.64 GiB |     7.24 B | BLAS       |          4 |         32 | pp 512     |      1.00 ± 0.00 |

build: a8f9b076 (2775)

CPU:

| model                          |       size |     params | backend    |    threads |    n_batch | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 8B IQ4_XS - 4.25 bpw     |   3.64 GiB |     7.24 B | CPU        |          4 |         32 | pp 512     |      3.12 ± 0.06 |

build: a8f9b076 (2775)

teleprint-me · 2024-05-01T04:47:02Z

I had to update, fix the convert script by adding the hash, then upload the model I use, rebuild, and download the quant. Plus, I have a bunch of other scripts running, so I'll post once it's all set.

teleprint-me · 2024-05-01T14:05:38Z

CPU is much faster! Why is that?

~ $ ./llama.cpp/llama-bench -m models/stablelm-2-zephyr-1_6b.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | pp 512     |     10.19 ± 1.87 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | tg 128     |      2.17 ± 0.17 |

build: a8f9b076 (2775)

~ $ ./llama.cpp/llama-bench -m models/stablelm-2-zephyr-1_6b.gguf -t 8
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | pp 512     |     32.35 ± 2.42 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | tg 128     |      3.72 ± 0.79 |

build: a8f9b076 (2775)
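
To put a number on the gap, the two tables above can be compared row by row. This is a minimal sketch that assumes the llama-bench markdown output is pasted in as strings; `parse_bench_table` and `mean_tps` are hypothetical helper names, not part of llama.cpp.

```python
# Sketch: compare llama-bench markdown tables from two builds (BLAS vs CPU).
# parse_bench_table and mean_tps are illustrative helpers, not llama.cpp APIs.

BLAS_TABLE = """
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | pp 512     |     10.19 ± 1.87 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | BLAS       |          8 | tg 128     |      2.17 ± 0.17 |
"""

CPU_TABLE = """
| model                          |       size |     params | backend    |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------- | ---------------: |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | pp 512     |     32.35 ± 2.42 |
| stablelm 1B F16 (guessed)      |   3.06 GiB |     1.64 B | CPU        |          8 | tg 128     |      3.72 ± 0.79 |
"""


def parse_bench_table(text):
    """Parse a markdown table into a list of {column name: cell text} dicts."""
    header = None
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif all(set(cell) <= set("-:") for cell in cells):
            continue  # alignment/separator row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def mean_tps(row):
    """Extract the mean from a 't/s' cell such as '10.19 ± 1.87'."""
    return float(row["t/s"].split("±")[0])


blas = {row["test"]: mean_tps(row) for row in parse_bench_table(BLAS_TABLE)}
cpu = {row["test"]: mean_tps(row) for row in parse_bench_table(CPU_TABLE)}

speedups = {test: cpu[test] / blas[test] for test in blas}
for test, ratio in speedups.items():
    print(f"{test}: CPU build is {ratio:.2f}x the BLAS build")
# pp 512: CPU build is 3.17x the BLAS build
# tg 128: CPU build is 1.71x the BLAS build
```

The ratios ignore the ± spread, so they are only a rough guide for a single device.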

Jeximo · 2024-05-02T18:23:32Z

> CPU is much faster! Why is that?

I think libopenblas is not a full backend. Vulkan is the way forward for mobile GPU: #6395 (comment)

Jeximo · 2024-05-04T15:33:18Z

> This should only affect the load time of the model though, but performance during inference should be the same.

Thank you. I'll try various options and post results later.

Jeximo · 2024-05-05T16:44:03Z

Tested --no-mmap on a model loaded from ~/ vs. shared storage (Downloads). Performance is improved. It appears the slowdown is due to the combination of the Android SAF API and mmap.

Here's some quick numbers, loading from shared:

llama_print_timings:        load time =   26232.51 ms
llama_print_timings:      sample time =      19.78 ms /    33 runs   (    0.60 ms per token,  1668.01 tokens per second)
llama_print_timings: prompt eval time =  186348.29 ms /    51 tokens ( 3653.89 ms per token,     0.27 tokens per second)
llama_print_timings:        eval time =  248449.20 ms /    32 runs   ( 7764.04 ms per token,     0.13 tokens per second)
llama_print_timings:       total time =  443161.08 ms /    83 tokens

load from shared & --no-mmap:

llama_print_timings:        load time =   15297.21 ms
llama_print_timings:      sample time =      26.22 ms /    44 runs   (    0.60 ms per token,  1677.85 tokens per second)
llama_print_timings: prompt eval time =   54639.93 ms /    51 tokens ( 1071.37 ms per token,     0.93 tokens per second)
llama_print_timings:        eval time =   39760.87 ms /    43 runs   (  924.67 ms per token,     1.08 tokens per second)
llama_print_timings:       total time =   96297.49 ms /    94 tokens

load from ~/:

llama_print_timings:        load time =    6302.93 ms
llama_print_timings:      sample time =      32.26 ms /    54 runs   (    0.60 ms per token,  1673.85 tokens per second)
llama_print_timings: prompt eval time =   58406.42 ms /    51 tokens ( 1145.22 ms per token,     0.87 tokens per second)
llama_print_timings:        eval time =   48915.58 ms /    53 runs   (  922.94 ms per token,     1.08 tokens per second)
llama_print_timings:       total time =  108573.70 ms /   104 tokens

load from ~/ & --no-mmap:

llama_print_timings:        load time =    5184.56 ms
llama_print_timings:      sample time =      28.71 ms /    49 runs   (    0.59 ms per token,  1706.54 tokens per second)
llama_print_timings: prompt eval time =   46939.36 ms /    51 tokens (  920.38 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =   44217.39 ms /    48 runs   (  921.20 ms per token,     1.09 tokens per second)
llama_print_timings:       total time =   92946.78 ms /    99 tokens

Based on these figures, loading from ~/ with --no-mmap is the fastest combination. I used Meta-Llama-3-8B-Instruct-IQ3_M.gguf. I'll get a small model and run llama-bench later.
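
For anyone who wants to compare their own runs the same way, here is a rough sketch that pulls tokens per second out of the `llama_print_timings` lines. It only looks at the prompt eval and eval lines copied from the logs above; the regular expression and the `tokens_per_second` helper are my own illustration and assume the log format shown in this comment.

```python
import re

# Prompt eval and eval lines copied from the four runs above.
LOGS = {
    "shared, mmap": """
llama_print_timings: prompt eval time =  186348.29 ms /    51 tokens ( 3653.89 ms per token,     0.27 tokens per second)
llama_print_timings:        eval time =  248449.20 ms /    32 runs   ( 7764.04 ms per token,     0.13 tokens per second)
""",
    "shared, --no-mmap": """
llama_print_timings: prompt eval time =   54639.93 ms /    51 tokens ( 1071.37 ms per token,     0.93 tokens per second)
llama_print_timings:        eval time =   39760.87 ms /    43 runs   (  924.67 ms per token,     1.08 tokens per second)
""",
    "~/, mmap": """
llama_print_timings: prompt eval time =   58406.42 ms /    51 tokens ( 1145.22 ms per token,     0.87 tokens per second)
llama_print_timings:        eval time =   48915.58 ms /    53 runs   (  922.94 ms per token,     1.08 tokens per second)
""",
    "~/, --no-mmap": """
llama_print_timings: prompt eval time =   46939.36 ms /    51 tokens (  920.38 ms per token,     1.09 tokens per second)
llama_print_timings:        eval time =   44217.39 ms /    48 runs   (  921.20 ms per token,     1.09 tokens per second)
""",
}

# Captures the phase name, elapsed milliseconds, and token/run count.
TIMING = re.compile(
    r"llama_print_timings:\s+(prompt eval|eval) time =\s+([\d.]+) ms /\s+(\d+) (?:tokens|runs)"
)


def tokens_per_second(log):
    """Return {phase: tokens per second} recomputed from elapsed time and count."""
    return {
        phase: int(count) / (float(ms) / 1000.0)
        for phase, ms, count in TIMING.findall(log)
    }


tps = {name: tokens_per_second(log) for name, log in LOGS.items()}
baseline = tps["shared, mmap"]["prompt eval"]

for name, phases in tps.items():
    ratio = phases["prompt eval"] / baseline
    print(
        f"{name:<18} prompt eval {phases['prompt eval']:.2f} t/s "
        f"({ratio:.2f}x baseline), eval {phases['eval']:.2f} t/s"
    )
```

On these numbers, prompt eval for ~/ with --no-mmap comes out at roughly 4x the shared-storage run with mmap.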

* Tidy Android Instructions README.md

Remove CLBlast instructions(outdated), added OpenBlas.

* don't assume git is installed

Added apt install git, so that git clone works

* removed OpenBlas

Linked to Linux build instructions

* fix typo

Remove word "run"

* correct style

Co-authored-by: slaren <[email protected]>

* correct grammar

Co-authored-by: slaren <[email protected]>

* delete reference to Android API

* remove Fdroid reference, link directly to Termux

Fdroid is not required

Co-authored-by: slaren <[email protected]>

* Update README.md

Co-authored-by: slaren <[email protected]>

---------

Co-authored-by: slaren <[email protected]>

Jeximo · 2024-05-06T00:37:06Z

Tested with TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf using llama-bench. ./llama-bench -t 4 -p 512 -n 128 --mmap 0, --mmap 1

Load from shared, -m /data/data/com.termux/files/home/storage/downloads/TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf

| model                          |       size |     params | backend    |    threads |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | pp 512     |     22.82 ± 0.27 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | tg 128     |     11.68 ± 0.23 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | pp 512     |     22.30 ± 0.09 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | tg 128     |     11.53 ± 0.21 |

build: 628b2991 (2794)

Load from ~/, ~/TinyLlama-1.1B-Chat-v1.0-Q8_0.gguf

| model                          |       size |     params | backend    |    threads |       mmap | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ---------: | ---------: | ---------- | ---------------: |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | pp 512     |     22.59 ± 0.22 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          0 | tg 128     |     11.54 ± 0.08 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | pp 512     |     22.08 ± 0.08 |
| llama 1B Q8_0                  |   1.09 GiB |     1.10 B | CPU        |          4 |          1 | tg 128     |     11.28 ± 0.25 |

build: 628b2991 (2794)

The results are nearly identical. TinyLlama (1.09 GiB) is probably too small to show a difference in this test; even mmap made no difference. I'll leave benchmarking larger models to someone with a better device than mine.
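
One way to check this is the relative gap between the mmap 0 and mmap 1 rows. Here is a small sketch using the mean t/s values from the two tables above; the dictionary layout and the `relative_gap_percent` helper are only an illustration.

```python
# Mean t/s from the two llama-bench tables above, keyed by (location, test).
# Each value maps the mmap setting (0 or 1) to the reported mean.
RESULTS = {
    ("shared", "pp 512"): {0: 22.82, 1: 22.30},
    ("shared", "tg 128"): {0: 11.68, 1: 11.53},
    ("~/", "pp 512"): {0: 22.59, 1: 22.08},
    ("~/", "tg 128"): {0: 11.54, 1: 11.28},
}


def relative_gap_percent(no_mmap, with_mmap):
    """Percent by which the mmap 0 run leads (or trails) the mmap 1 run."""
    return (no_mmap - with_mmap) / with_mmap * 100.0


gaps = {key: relative_gap_percent(v[0], v[1]) for key, v in RESULTS.items()}

for (location, test), gap in gaps.items():
    print(f"{location:<6} {test}: mmap 0 is {gap:+.1f}% vs mmap 1")
```

Every gap comes out under 3%, which is tiny next to the differences seen with the 8B model.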

gustrd · 2024-05-11T23:59:24Z

Hey everyone,

As the original author of these README instructions, I have to admit that I now see how they might cause more confusion than clarity.

Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp, especially for certain model families like StableLM and OpenLlama (provided you're not offloading layers). In my experience, it has boosted prompt processing speed by roughly 40%.

However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output. This is disappointing, considering the untapped potential of the GPUs nestled within our smartphones.

If there's any way I can assist, I'd like to offer a few insights based on my experimentation:

- In my tests, OpenBLAS consistently outperforms noblas, particularly for prompts exceeding 256 tokens.
- The tip regarding disabling mmap on Android devices is a game-changer. I hadn't been aware of it previously, and it substantially accelerates prompt processing. I strongly advocate for emphasizing this point in the README.

Here's hoping that Vulkan proves to be a more robust solution than OpenGL.

gpokat · 2024-05-12T03:21:50Z

> Hey everyone,
>
> As the original author of these README instructions, I have to admit that I now see how they might cause more confusion than clarity.
>
> Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp, especially for certain model families like StableLM and OpenLlama (provided you're not offloading layers). In my experience, it has boosted prompt processing speed by roughly 40%.
>
> However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output. This is disappointing, considering the untapped potential of the GPUs nestled within our smartphones.
>
> If there's any way I can assist, I'd like to offer a few insights based on my experimentation:
>
> In my tests, OpenBLAS consistently outperforms noblas, particularly for prompts exceeding 256 tokens.
>
> The tip regarding disabling mmap on Android devices is a game-changer. I hadn't been aware of it previously, and it substantially accelerates prompt processing. I strongly advocate for emphasizing this point in the README.
>
> Here's hoping that Vulkan proves to be a more robust solution than OpenGL.

Did your CLBlast experience involve running the corresponding tuners to achieve speed for your device?
Just for reference: in my experience with CLBlast, nonsensical inference was fixed when I ran and applied the tuners. However, on a low-end Android device, inference speed was the same as CPU-only, without any loss in output.

gustrd · 2024-05-13T01:30:46Z

> > Hey everyone,
> >
> > As the original author of these README instructions, I have to admit that I now see how they might cause more confusion than clarity.
> >
> > Just to clarify for future users: I've personally found CLBlast to be quite effective when used with llama.cpp, especially for certain model families like StableLM and OpenLlama (provided you're not offloading layers). In my experience, it has boosted prompt processing speed by roughly 40%.
> >
> > However, it's important to note that while CLBlast does offer significant speed improvements, it's plagued by bugs. For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output. This is disappointing, considering the untapped potential of the GPUs nestled within our smartphones.
> >
> > If there's any way I can assist, I'd like to offer a few insights based on my experimentation:
> >
> > In my tests, OpenBLAS consistently outperforms noblas, particularly for prompts exceeding 256 tokens.
> >
> > The tip regarding disabling mmap on Android devices is a game-changer. I hadn't been aware of it previously, and it substantially accelerates prompt processing. I strongly advocate for emphasizing this point in the README.
> >
> > Here's hoping that Vulkan proves to be a more robust solution than OpenGL.
>
> Did your CLBlast experience involve running the corresponding tuners to achieve speed for your device? Just for reference: in my experience with CLBlast, nonsensical inference was fixed when I ran and applied the tuners. However, on a low-end Android device, inference speed was the same as CPU-only, without any loss in output.

No, I have not tried the tuners yet. Good idea, it's a nice experiment to do. Thanks for the idea!

shibe2 · 2024-05-13T15:16:08Z

> For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output.

Is this specific to Android builds, or can it be reproduced on PC too?

gustrd · 2024-05-13T15:19:22Z

> > For many model families, or even within the aforementioned subsets when offloading layers, it tends to produce nonsensical output.
>
> Is this specific to Android builds, or can it be reproduced on PC too?

As far as I know, it only happens on Android builds. All my tests were conducted with Adreno GPUs on Snapdragon SoCs.

Jeximo added 2 commits April 30, 2024 20:33

Tidy Android Instructions README.md

b115ad4

Remove CLBlast instructions(outdated), added OpenBlas.

don't assume git is installed

57a37f1

Added apt install git, so that git clone works

Jeximo added 2 commits April 30, 2024 21:18

removed OpenBlas

d2b4e1a

Linked to Linux build instructions

fix typo

c7032d3

Remove word "run"

slaren reviewed May 3, 2024

Jeximo and others added 3 commits May 3, 2024 09:53

correct style

868bb32

Co-authored-by: slaren <[email protected]>

correct grammar

784d08e

Co-authored-by: slaren <[email protected]>

delete reference to Android API

624a689

slaren reviewed May 4, 2024

slaren reviewed May 4, 2024

remove Fdroid reference, link directly to Termux

68e8732

Fdroid is not required Co-authored-by: slaren <[email protected]>

Update README.md

c3fa382

Co-authored-by: slaren <[email protected]>

slaren approved these changes May 4, 2024

slaren merged commit cf768b7 into ggml-org:master May 4, 2024

Jeximo deleted the patch-2 branch May 5, 2024 16:26

gustrd mentioned this pull request May 11, 2024

Android OpenCL question #5621

Closed

gustrd mentioned this pull request May 12, 2024

How to enable OpenCL with llama.cpp in Android App? #3694

Closed

This was referenced May 12, 2024

[User] Insert summary of your issue or enhancement.. LostRuins/koboldcpp#382

Open

Make in Termux (Android) LostRuins/koboldcpp#247

Closed

mofosyne added the "documentation" and "Review Complexity : Low" labels May 12, 2024
