
LibraryScanner: Improve hashing of directory contents #2497

Merged: 7 commits merged into mixxxdj:master from libraryscannerhash on Feb 18, 2020
Conversation

@uklotzde (Contributor) commented on Feb 11, 2020:

  • Use a cryptographic SHA256 hash that can be calculated incrementally, instead of concatenating all file path strings (which requires many intermediate dynamic memory allocations) and finally hashing the resulting string with qHash()
  • Use 64-bit integers (by truncating the SHA256 bytes) instead of the 32-bit integers from qHash() (see the sketch after this list)
  • The hash code 0 is reserved and considered invalid. Tests verify the correct behavior.
  • We don't need a database migration. The stored hash values will be replaced with the next library rescan.
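
A minimal sketch of this scheme, for illustration only (the class and helper names here are hypothetical; the actual utility code in this PR, e.g. in src/util/cache.cpp, may differ in detail):

// Illustrative sketch only; not the exact code from this PR.
#include <QByteArray>
#include <QCryptographicHash>
#include <QString>

typedef quint64 cache_key_t;                // 64-bit hash stored in the library DB
constexpr cache_key_t kInvalidCacheKey = 0; // 0 is reserved and means "invalid"

class DirectoryHasher {
  public:
    DirectoryHasher() : m_hasher(QCryptographicHash::Sha256) {}

    // Feed one file path incrementally as UTF-8: no concatenation of all
    // paths into one string, and the hashed bytes are platform-independent.
    void addPath(const QString& filePath) {
        m_hasher.addData(filePath.toUtf8());
    }

    // Truncate the 32-byte SHA256 digest to 64 bits; remapping a zero result
    // is one possible way to ensure a valid scan never yields the invalid key.
    cache_key_t finish() const {
        const QByteArray digest = m_hasher.result();
        cache_key_t key = 0;
        for (int i = 0; i < 8; ++i) {
            key = (key << 8) | static_cast<quint8>(digest.at(i));
        }
        return (key != kInvalidCacheKey) ? key : ~kInvalidCacheKey;
    }

  private:
    QCryptographicHash m_hasher;
};

Whether the real code remaps a zero digest or handles it differently is an implementation detail; the contract verified by the tests is only that 0 is never a valid hash.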

Remark: The new cache_key_t typedef (a primitive type) and its utility functions could be reused for caching artwork images. The aoide branch already contains the code for hashing images. Unfortunately, this migration will require discarding and recalculating the hashes of all artwork images. But we need to do that at some point if we want to get rid of the crippled 16-bit integer hashes that cause many hash collisions and wrong artwork display.

Review comments (outdated, resolved) on: src/library/scanner/libraryscanner.cpp, src/util/cache.cpp
@uklotzde (Contributor, Author) commented:

macOS SCons build failures are unrelated. We need to get rid of SCons asap!

@Holzhaus (Member) commented:

#2498 should fix the SCons/OSX errors btw.

@Holzhaus (Member) left a review comment:

Code looks good, tests pass, library rescan didn't cause any issues. LGTM.

@Holzhaus requested a review from daschuer on February 15, 2020 01:28
@uklotzde added this to the 2.3.0 milestone on Feb 15, 2020
@@ -54,7 +56,7 @@ void RecursiveScanDirectoryTask::run() {
         if (currentFileInfo.isFile()) {
             const QString& fileName = currentFileInfo.fileName();
             if (supportedExtensionsRegex.indexIn(fileName) != -1) {
-                newHashStr.append(currentFile);
+                hasher.addData(currentFile.toUtf8());
Member commented:

Did you do a performance check for toUtf8() compared to using the raw QString bytes?
I can imagine that the latter is faster.

@uklotzde (Contributor, Author) commented:

The raw bytes of a QString are not platform-independent when considering endianness, but UTF-8 is. Does this matter?

@uklotzde (Contributor, Author) commented on Feb 16, 2020:

There is no function to obtain the raw bytes of a QString. Using toLatin1() may lose characters, and toLocal8Bit() is platform-dependent while the database is portable. Please clarify which option you have in mind here.

Member commented:

It was just the idea to check whether something like
hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()), currentFile.size() * sizeof(QChar))
is faster or not. This way there is twice the amount of data to hash, but no UTF-8 conversion.

I don't care about endianness, because that would only apply if the DB is on an external HD accessed by the same OS running on different CPU architectures. Thinking about it, an endianness-independent solution is not bad.
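
For reference, the two variants being compared would look roughly like this (illustrative sketch, not code from this PR):

// Variant 1: convert to UTF-8 first: portable bytes, but allocates a
// temporary QByteArray per path and spends time on the conversion.
hasher.addData(currentFile.toUtf8());

// Variant 2: feed the raw UTF-16 representation of the QString: no
// conversion, but twice the input size and host-endian byte order.
hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()),
        currentFile.size() * sizeof(QChar));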

@uklotzde (Contributor, Author) commented on Feb 17, 2020:

I don't think that the additional allocations really matter here, and I won't invest hours to figure out how to do a realistic performance test to prove this. The optimized solution would also produce non-portable data, in contrast to all other data in the database.

@Holzhaus (Member) commented on Feb 18, 2020:

> It was just the idea to check whether something like
> hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()), currentFile.size() * sizeof(QChar))
> is faster or not. This way there is twice the amount of data to hash, but no UTF-8 conversion.

Double the data should not be an issue. SHA-256 runtime does not increase 1:1 with the amount of input data. I'm on my phone, but you should see the same effect on desktop PCs:

$ openssl speed sha256
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          180714.92k   532643.80k  1088833.54k  1466255.70k  1633716.91k  1647209.13k

@uklotzde (Contributor, Author) commented:

The actual number of bytes that we feed into QCryptographicHash should not be an issue. But QString might need to allocate a temporary QByteArray for the UTF-8 representation, and the conversion itself also takes some time.

Yes, utf16() would be faster, but it is non-portable. If someone is able to prove that it is much faster and worth the drawback, please go ahead ;) I will not do it.

Member commented:

Just out of curiosity, I did the test. Here are the results:

10 x 30 characters
  6040 nsec hasher.addData(reinterpret_cast<const char*>(lorem.unicode()), lorem.size() * sizeof(QChar));
  8683 nsec hasher.addData(lorem.toUtf8());
 12836 nsec hasher.addData(lorem.toUtf8()); "ööööö.."

10 x 888 characters 
149882 nsec hasher.addData(reinterpret_cast<const char*>(lorem.unicode()), lorem.size() * sizeof(QChar));
 82302 nsec hasher.addData(lorem.toUtf8());
257479 nsec hasher.addData(lorem.toUtf8()); "ööööö.."

Conclusion:
As expected, there is a performance gain only for ASCII characters. Except in non-Latin character sets, two-byte UTF-8 characters are a minority.
For a typical file name of 30 characters, the unicode() version is way faster. If we append all strings first, the UTF-8 version catches up at some point.
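
A rough sketch of the kind of timing harness that could produce numbers like these (the function names, test strings, and iteration counts are assumptions, not the exact test that was run):

#include <QCryptographicHash>
#include <QElapsedTimer>
#include <QString>

// Hypothetical helpers: hash the same string repeatedly and report nanoseconds.
static qint64 timeUtf16Hashing(const QString& lorem, int iterations) {
    QElapsedTimer timer;
    timer.start();
    for (int i = 0; i < iterations; ++i) {
        QCryptographicHash hasher(QCryptographicHash::Sha256);
        hasher.addData(reinterpret_cast<const char*>(lorem.unicode()),
                lorem.size() * sizeof(QChar));
        hasher.result();
    }
    return timer.nsecsElapsed();
}

static qint64 timeUtf8Hashing(const QString& lorem, int iterations) {
    QElapsedTimer timer;
    timer.start();
    for (int i = 0; i < iterations; ++i) {
        QCryptographicHash hasher(QCryptographicHash::Sha256);
        hasher.addData(lorem.toUtf8());
        hasher.result();
    }
    return timer.nsecsElapsed();
}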

@uklotzde (Contributor, Author) commented:

Those numbers also need to be put in relation to the file system access. I guess the differences in overall performance are negligible. That's what I had in mind with a "realistic performance test".

Member commented:

Yes, of course. I was just interested in whether we have low-hanging fruit here.

@daschuer (Member) commented:

LGTM, thank you.

@daschuer merged commit 030e792 into mixxxdj:master on Feb 18, 2020
@uklotzde deleted the libraryscannerhash branch on February 21, 2020 10:33
3 participants