Speedup C encoder up to 100x #256

Open
wants to merge 14 commits into
base: master

homm commented Sep 25, 2024

All changes are split into independent commits; some of them are optional.

In addition to the performance improvements, there are other changes:

  • Do not define M_PI in the sources; ensure it is defined by math.h.
  • Fixed the maximum number of components for the blurhash_encoder executable (in line with the blurHashForPixels function).
  • Improved the Makefile to avoid the heavy encode_stb recompilation on each change.

Benchmarks are in the comment below.

homm commented Oct 3, 2024

I've also implemented SSE and NEON optimizations in a separate branch. The last optimization, unrolling the loop in multiplyBasisFunction, actually works better, since it lets any compiler autovectorize the code effectively. Here are benchmarks for a 2000 × 1334 JPEG image on different systems:

Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz

                      GCC 13.2.1              Clang 17.0.6
Optimization          6 4        9 9          6 4        9 9
Master                3181 ms    11844 ms     3154 ms    11124 ms
sRGBToLinear_cache    381 ms     1507 ms      451 ms     1633 ms
cosX cache            82 ms      339 ms       88 ms      270 ms
Single pass           58 ms      177 ms       62 ms      207 ms
SSE (obsolete)        39 ms      114 ms       42 ms      144 ms
Unroll 4x             30 ms      80 ms        32 ms      85 ms

Apple M1 Pro

                      GCC 13.2.1             Clang 17.0.6           Clang 14.0.3
Optimization          6 4        9 9         6 4        9 9         6 4        9 9
Master                1177 ms    4076 ms     1156 ms    4005 ms     1268 ms    4302 ms
sRGBToLinear_cache    212 ms     826 ms      216 ms     839 ms      186 ms     653 ms
cosX cache            44 ms      150 ms      80 ms      271 ms      81 ms      271 ms
Single pass           20 ms      62 ms       32 ms      57 ms       29 ms      70 ms
NEON (obsolete)       27 ms      87 ms       25 ms      80 ms       25 ms      80 ms
Unroll 4x             16 ms      49 ms       15 ms      43 ms       15 ms      42 ms

* The results for M1 Pro were corrected, since the previous results were affected by a bug.

@homm changed the title from "Speedup C encoder by factor of 40" to "Speedup C encoder up to 100x" on Oct 11, 2024

homm commented Oct 11, 2024

@DagAgren Are you interested in these improvements?

homm commented Oct 24, 2024

I also improved decoder performance by about 14 times using the same techniques: caching cos values, caching linearTosRGB values, and unrolling loops. This raises decoding throughput from 6 Mpx/s to 86 Mpx/s on M1.

$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 49.532 ms

$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 3.573 ms

This also introduces a very minor change in the output: nothing that the human eye could notice, just different binary output.
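
The throughput figures follow from the two timings printed above for a 640 × 480 output image. A small sketch of the arithmetic, where megapixelsPerSecond is a hypothetical helper name and not part of the project:

```c
#include <assert.h>

/* Throughput in megapixels per second for one width x height image
   processed in time_ms milliseconds. */
static double megapixelsPerSecond(int width, int height, double time_ms) {
    assert(width > 0 && height > 0 && time_ms > 0.0);
    double pixels = (double)width * (double)height;
    return pixels / (time_ms / 1000.0) / 1e6;
}
```

With the timings above, megapixelsPerSecond(640, 480, 49.532) is about 6.2 and megapixelsPerSecond(640, 480, 3.573) is about 86.0, a ratio of roughly 13.9.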

The method I use to measure performance is the following:

diff --git forkSrcPrefix/C/encode_stb.c forkDstPrefix/C/encode_stb.c
index 811ca00006b45eaa829bfd267904ac0d0c647884..a95c6a2ff96ee7cdaa9d1b35ef28b063161cf01d 100644
--- forkSrcPrefix/C/encode_stb.c
+++ forkDstPrefix/C/encode_stb.c
@@ -4,6 +4,7 @@
 #include "stb_image.h"
 
 #include <stdio.h>
+#include <time.h>
 
 const char *blurHashForFile(int xComponents, int yComponents,const char *filename);
 
@@ -38,6 +39,14 @@ const char *blurHashForFile(int xComponents, int yComponents,const char *filenam
 
 	const char *hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
 
+	#define TIMES 30
+	clock_t start = clock();
+    for (int i = 0; i < TIMES; i++) {
+        hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+    }
+    double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+    printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
 	stbi_image_free(data);
 
 	return hash;
diff --git forkSrcPrefix/C/decode_stb.c forkDstPrefix/C/decode_stb.c
index dab164e1eaf1a7199a751a5e13f6da7099027bd2..3514f53e6f91dc41253429ea07e594893d536598 100644
--- forkSrcPrefix/C/decode_stb.c
+++ forkDstPrefix/C/decode_stb.c
@@ -3,6 +3,8 @@
 #define STB_IMAGE_WRITE_IMPLEMENTATION
 #include "stb_writer.h"
 
+#include <time.h>
+
 int main(int argc, char **argv) {
 	if(argc < 5) {
 		fprintf(stderr, "Usage: %s hash width height output_file [punch]\n", argv[0]);
@@ -34,6 +36,15 @@ int main(int argc, char **argv) {
 
 	freePixelArray(bytes);
 
+	#define TIMES 30
+	clock_t start = clock();
+    for (int i = 0; i < TIMES; i++) {
+    	uint8_t * tmpbytes = decode(hash, width, height, punch, nChannels);
+    	freePixelArray(tmpbytes);
+    }
+    double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+    printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
 	fprintf(stdout, "Decoded blurhash successfully, wrote PNG file %s\n", output_file);
 	return 0;
 }
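
The patch above times repeated calls with clock(), which reports processor time used by the program rather than wall-clock time. The same pattern can be wrapped in a reusable helper. This is a minimal sketch: averageCpuMs and countCalls are hypothetical names, and the untimed warm-up call is an addition that the patch itself does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Average processor time per call of fn(ctx), in milliseconds.
   One extra call is made first as a warm-up and is not timed. */
static double averageCpuMs(void (*fn)(void *), void *ctx, int times) {
    assert(fn != NULL && times > 0);
    fn(ctx); /* warm-up, excluded from the measurement */
    clock_t start = clock();
    for (int i = 0; i < times; i++) {
        fn(ctx);
    }
    clock_t end = clock();
    return (double)(end - start) * 1000.0 / CLOCKS_PER_SEC / times;
}

/* Example workload: counts how many times it was invoked. */
static void countCalls(void *ctx) {
    (*(int *)ctx)++;
}
```

A real workload would wrap the call being measured, for example blurHashForPixels or decode, in a function with the same void (*)(void *) shape and pass its arguments through ctx.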

homm commented Oct 30, 2024

@DagAgren How can I get your attention?

vellnes commented Dec 4, 2024

@DagAgren please take note of this.
We would be very grateful for this optimization of the algorithm.
