
Redesign of the Search/Channels feature #3615

Closed
ichorid opened this issue May 6, 2018 · 143 comments · Fixed by #7726

Comments

@ichorid
Contributor

ichorid commented May 6, 2018

Use cases.

We strive to provide the same kind of service that is provided by:

  • Highly organized repositories of torrents (trackers)
  • Repositories of scientific articles
  • Music streaming services, like Spotify
  • Video streaming services, like YouTube
  • Scanned book archives, like Google Books
  • Internet libraries

On fuzzy searches.

We are not trying to become a distributed Google. Google is good at finding the knowledge that is relevant to a few words; it is used when the user does not know exactly what he or she is looking for. Our user always knows what he or she wants. Therefore, we should disable metadata (or torrent contents) search by default and only search by torrent name. On the other hand, the user should be able to use metadata to refine the search results, e.g. "sort by date of creation" or "sort by tag/channel". This model already works perfectly for all the use cases listed above (except YouTube).

Data structurization.

We must be able to replicate the basic information organization structures that are used by our example use cases:

  • Tracker forum: a tree-like structure, traversable both up and down. When the user finds a torrent/channel, he or she should be able to see its immediate parent/child nodes, as well as the "root" of the hierarchy. Thus, nested channels.
  • Scientific article database: a single record in the database should be able to hold an arbitrary number of metadata fields, and each field must be usable as the basis for a hierarchy. For example, a top-level owner/channel "ScientificArticles" could provide "pseudochannels" for "authors", "journals", "years", "tags", etc. These essentially select, "just in time", the metadata field used to build the structure. Advanced users would be able to select several criteria at once to find the intersection of metadata sets. Thus, nested channels can fully exploit the features of the underlying relational database.
  • Music search: should be fast and simple. This can only be achieved by implementing a distributed, cloud-like search, with some data cached on the user's device. We can't expect a new user to download tens of gigabytes of search database, especially on mobile (which is where most music playing happens). One search query should take 5-30 seconds, and its results should include related info (e.g. the search for a single song should return that song and reveal a link to the whole album). To solve this problem, at some point we could introduce search mining: fast hosts producing highly relevant search results get credit.
  • Libraries and book archives fall somewhere between scientific articles and tracker forums.

It is important to note that a single instance of the system is not required to serve all of these features at once. Instead, it should provide building blocks, or modules, to compose the use cases as necessary.
For an average user, a single database could serve every type of content; users interested in building a huge scientific database, however, should be able to set up a separate database instance optimized for scientific articles.

Constraints on design choices

The design should consist of several completely independent parts:

  • Networking subsystem
    • Data transfer protocol
    • Trustworthy gossip protocol
  • Storage subsystem
    • Database/store
    • Cache management / mining algorithm
  • Search subsystem
    • Distributed search protocol
    • Distributed search algorithm
    • Data distribution algorithm
  • Metadata format

It is very important that none of these parts depends on a specific implementation of another. The system must be designed in such a way that exchanging one implementation of a part for another is trivial.
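
As an illustration of this independence requirement, here is a minimal sketch (Python; all class and method names are hypothetical, not Tribler's actual API) of how the subsystems could be expressed as abstract interfaces so that implementations remain swappable:

```python
from abc import ABC, abstractmethod
from typing import Iterable


class MetadataStore(ABC):
    """Storage subsystem: any database/cache backend may implement this."""

    @abstractmethod
    def put(self, serialized_entry: bytes) -> None: ...

    @abstractmethod
    def query(self, expression: str) -> Iterable[bytes]: ...


class GossipTransport(ABC):
    """Networking subsystem: data transfer + trustworthy gossip."""

    @abstractmethod
    def send(self, peer_id: bytes, payload: bytes) -> None: ...


class SearchProtocol(ABC):
    """Search subsystem: distributed search built only on the two interfaces above."""

    def __init__(self, store: MetadataStore, transport: GossipTransport):
        self.store = store          # any store implementation works here
        self.transport = transport  # any transport implementation works here

    @abstractmethod
    def search(self, expression: str) -> Iterable[bytes]: ...
```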

User experience requirements:

  • Search results should appear faster than the user can process them.

Implementation requirements:

  • Scalability (should only sync neighbors, and not the whole network)
@synctext
Member

synctext commented May 7, 2018

Mostly duplicate of #2455

@ichorid ichorid added Epic and removed Epic labels May 7, 2018
@ichorid ichorid self-assigned this May 30, 2018
@ichorid
Contributor Author

ichorid commented Jun 3, 2018

The frequency distribution of words in all natural languages follows Zipf's law. Speakers generally optimize the term (word, phrase, ...) they assign to a denotatum (the physical object/idea, ...) for efficiency of communication.
For example, when I ask someone where I can find "John Doe", the combination of words "John" + "Doe" uniquely identifies the person by that name, at least in the context of the conversation. On the other hand, if I ask where I can find a "blue table", this could mean either "a blue table" or "the blue table".

  • "A blue table" is a general search query, and the person would probably point me to a furniture store, or try to sell me a table (if he is in the furniture buisness). "A blue table" is a general search query, Google-style.
  • "The blue table" is an exact search, and the person would probably be confused if there is no special thing in her mind that is denoted as "the blue table". Otherwise, the person would point me to it directly. "The blue table" is a search for the exact infohash (or some special search token), or search in local context.

So, the best thing about natural languages is that humans use them to name things: movies, games, music, torrents... This means that when someone searches for a thing he already knows exists, he addresses it by terms that identify that thing uniquely enough in the database of human knowledge. And when someone just wants something in some genre, he names it by the name of the genre. Everything in between ("that movie with this guy with the cool beard") we leave to Google.
This means we could use the same search mechanism both for exact queries ("Bla-bla-avenger-name 2035") and for genre searches ("western 2035").

@ichorid
Contributor Author

ichorid commented Jun 3, 2018

FASD: A Fault-tolerant, Adaptive, Scalable, Distributed Search Engine (2003 master's thesis) - almost exactly what I came up with. No works by that author afterward, though 😞

Associative search in peer to peer networks: Harnessing latent semantics (2006)

A nice survey on the topic of decentralized search: "Survey of Research towards Robust Peer-to-Peer Networks: Search Methods"

Hmmm... This stuff seems to have been beaten to death in the first decade of the 2000s. Still, no one truly succeeded.

@ichorid
Contributor Author

ichorid commented Jun 4, 2018

Design details for the new Channels/AllChannel/Search subsystem

Class design

All three communities inherit from the base Metadata Community, which implements the ORM and the basic metadata query messages:

| Community | Messages | Message handling |
|---|---|---|
| 🏰 Metadata ("Chant") | (1): remote query for metadata based on some constraints; (2): answer to (1) with metadata from the local store | Can optionally include filters to indicate what not to send. |
| 🐍 AllChannel ("Snakes") | (1), (2) | Will ask for metadata elements with a high popularity score and/or top-level metadata channels. |
| 🐻 Channels ("Bears") | (1), (2) | Will ask for recently added metadata elements with specific tags. |
| 🐦 Search ("Birds") | (1), (2); (3): answer to (1) by introducing some other node | Could result in "long walks" according to some algorithm. |

The database interface is implemented as a separate Pony ORM-based Tribler module, MetadataStore. The base Metadata Community class implements the basic messages for asking a remote peer for metadata elements matching some criteria. Child communities differ mostly in how they react to metadata queries, thus creating different search domains. The channel data is downloaded in the form of a torrent stuffed with metadata in the Tribler Serializer (Python struct) format.
A metadata query is sent in the form of a simple logical expression based on tags and ranges. (TBD)

Metadata

Metadata entries are encapsulated in signed payloads and stored in a serialized format according to the following scheme:

  • Metadata type ID
  • The metadata author's public key
  • The pointer into this metadata author's TrustChain entry documenting its creation
  • Metadata fields ...
  • The gossip author's signature (signs everything above)

Basically, the signature and the author's public key form a _trustworthy gossip container_. Anything could be put in there, but we use it for packing various formats of metadata.

The primary metadata format for a torrent is described below:

| Field | Type | Required |
|---|---|---|
| ID | pointer | Yes |
| Torrent infohash | infohash | Yes |
| Torrent size | long int | Yes |
| Torrent date | date | Yes |
| Torrent title | string | No |
| Tags | string | No |
| Tracker info | string | No |

Notes on format:

  • "Tags" are search tokens, that can include, for example, hash of upper-level channels, tags for content type etc.
  • "Tags" format is "."-separated string which can include machine tags, Flickr-style.
  • "Author's signature" digitally signs the metadata entry. The entry is discarded if it is not properly signed. The author's public key is simultaneously used as a channel ID. (The signature uniquely identifies the metadata entry, and is used as a primary key for SQL storage - TBD when Pony ORM fixes python 2.7 buffer hashing problem in release 0.7.7)

Metadata size

For each metadata format, there are two versions: full and short.

  • The short version's serialized and signed size should not exceed 1000 bytes (for future compatibility with BEP44), so it can always be sent in a single IPv8 UDP packet.
  • The full version has no size restrictions. It can only be distributed in metadata torrent collections, and MUST have a short version distributed alongside it, for use in IPv8 messages.
  • The reason for this requirement is that some content types (e.g. scientific articles) can have very large metadata that must still be searchable and queryable remotely (e.g. by the Birds Community).

🐍 AllChannel protocol 🐍

It's like a "snakes and ladders" game: channel tags are used as "snakes" or "ladders" leading to other channels. This requires nested channel support.
When a node receives some channel metadata from another node, the receiver checks whether the received metadata contains a previously unseen tag leading to a higher-level channel in the channels hierarchy, and queries the peer for this channel. Therefore, "the universal concepts" are exchanged with the highest priority.

🐻 Channels protocol 🐻

"TasteBuddies" are established according to seeded torrents, seeded metadata directories (channels), search queries (potential privacy data leak?). Special walker regularly queries buddies for metadata updates.

🐦 Search protocol 🐦

The search engine will use the [Summary prefix tree](https://sci-hub.tw/10.1109/NCA.2017.8171372). The queries can naturally be done in parallel. All returned metadata entries are saved to the local store.

Metadata torrents

A metadata torrent (channel torrent) is formed from concatenated serialized metadata entries (.mdblob files). When updates are published to the channel, they first appear in the publisher's local metadata store (LMS) and are announced via the Channels protocol. When enough new torrents have been added to a channel, the new metadata entries are concatenated into a new .mdblob file, which is then added to the new version of the metadata torrent. This process is append-only, so users don't have to re-download the whole torrent again. The metadata torrent is split into "file-chunks" of a fixed size (1 MB) for efficiency of processing.
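
A rough sketch of the append-only update step (file naming and directory layout here are assumptions, not the real Channels implementation; creating the actual torrent with libtorrent is omitted):

```python
from pathlib import Path
from typing import List


def append_channel_update(channel_dir: Path, new_entries: List[bytes]) -> Path:
    """Write a batch of serialized metadata entries as the next .mdblob file.

    Existing .mdblob files are never modified, so subscribers who already have the
    old channel torrent only fetch the newly added pieces."""
    next_index = len(list(channel_dir.glob("*.mdblob")))
    blob_path = channel_dir / f"{next_index:09d}.mdblob"
    blob_path.write_bytes(b"".join(new_entries))
    return blob_path
```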

Content deletion

When the user wants to delete a torrent from their channel, a special DELETE entry is added to the channel contents. This entry follows the basic serialized metadata format (signature, timestamp, etc.) and adds a single delete_signature field that contains the signature of the original metadata entry to be deleted.

@ichorid
Contributor Author

ichorid commented Jun 5, 2018

Summary prefix tree: An over-DHT indexing data structure for efficient superset search (Nov 2017) - this is almost exactly what I came up with. The algorithm is able to efficiently look up queries consisting of supersets of keywords. It's based on a cross between Bloom filters and DHTs.
The problem is, the authors' writing style is not too good... It's only 5 pages, but it badly hurt my brain 😵

However, the idea is relatively simple:
text -> tokens -> Bloom filter -> prefix tree (trie) -> DHT.
The authors demonstrate that it works in experiments and scales to at least 10^6 indexed items.
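
A toy illustration of that pipeline (not the paper's actual construction; parameters and the superset-match rule are heavily simplified):

```python
import hashlib

BLOOM_BITS = 64  # toy filter size; the paper uses much larger filters


def bloom_mask(text: str, hashes: int = 3) -> int:
    """Hash every token into a few bit positions of a small Bloom filter."""
    mask = 0
    for token in text.lower().split():
        for i in range(hashes):
            digest = hashlib.sha1(f"{i}:{token}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:4], "big") % BLOOM_BITS)
    return mask


def bloom_key(mask: int) -> str:
    """Sorted positions of the set bits: the key inserted into the prefix tree (trie),
    which is in turn distributed over the DHT."""
    return ".".join(str(b) for b in range(BLOOM_BITS) if mask & (1 << b))


# Superset search: every bit set by the query tokens must also be set in the indexed item.
item = bloom_mask("ubuntu 22.04 desktop amd64")
query = bloom_mask("ubuntu 22.04")
assert query & item == query
```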

@xoriole
Contributor

xoriole commented Jun 8, 2018

  • Level 1: local search
  • Level 2: neighbor search, based on "trust magic"
  • Level 3: DHT long-tail content

The architecture is based on two components and their connection:

  • tags (fixed-length text fields)
  • info-hashes
  • moderators or endorsers

First release features (strictly limited):

  • create channel
    • add torrent
    • remove torrent
  • vote for channel

@ichorid
Contributor Author

ichorid commented Jun 9, 2018

Update procedure

  • An update is always requested and never pushed.
  • Channel update request = channel_public_key + last_update_time + local_channel_version

Initially, the metadata torrent is created by concatenating the list of metadata entries in serialized form. Updates are stored Git-style: each change is represented as a separate file (a sketch of the request handling follows the list below).

  • To add new MD entries to an existing MD torrent, a new file NNN.mdblob is added to the torrent. The file contains all MD entries from this update.
  • If, at some point, there are too many deleted entries in the MD torrent, the author can produce a "defragmented" version of the torrent and publish it.
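
A minimal sketch of the request/response side of this procedure (the `local_channel` object and its fields are hypothetical):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChannelUpdateRequest:
    channel_public_key: bytes
    last_update_time: int       # timestamp of our last data exchange
    local_channel_version: int  # channel version (mdblob sequence number) we already hold


def handle_update_request(request: ChannelUpdateRequest, local_channel) -> List[bytes]:
    """Return only the .mdblob batches the requester does not have yet; nothing is ever pushed."""
    if local_channel.version <= request.local_channel_version:
        return []
    return [blob for version, blob in local_channel.blobs
            if version > request.local_channel_version]
```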

@synctext
Member

synctext commented Jun 9, 2018

All of AllChannels can be removed if we add to each channel the ability to push 1k content items + channel votes. Votes are simply signed +1s by users. Votes are the responsibility of the channel: it is incentive-compatible to let channels work for their own promotion. Votes for channels are future-proof: they can grow towards votes for tags per torrent, torrent duplicates, torrent bundles, etc.

Cool?

@ichorid
Contributor Author

ichorid commented Jun 9, 2018

@synctext, the initial idea was exactly that: remove AllChannels completely. Information about new channels is disseminated by the Snakes gossip protocol:

  1. I connect to my friend and tell him: "Give me some popular content that you've got since moment X (our last data exchange)"
  2. My friend sends me some fresh metadata (containing both channel torrents and regular torrents).
  3. I analyze it for new tags and/or Public Keys (PKs).
    3.1 If there are some torrents with PKs and/or tags that I have not seen before, I send another request to the same friend, asking for the channel torrent with the corresponding PK/tag.
    3.2 If there is a metadata entry containing new channel torrent, I add it to my database.
  4. I download/update all new channel torrents and add everything from them to my local database.

It is important to note that this algorithm eventually leads to spreading root torrent collections, since tags point to the parent channel.
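
A condensed sketch of the receiver side of steps 1-4 (all helper objects - `peer`, `local_store` - and their methods are hypothetical):

```python
def on_fresh_metadata(peer, entries, local_store):
    """Handle a batch of fresh metadata received from a friend (Snakes exchange)."""
    for entry in entries:
        if local_store.is_known(entry.signature):
            continue
        local_store.add(entry)
        # Step 3.1: an unseen PK or tag points to a parent channel we do not have yet.
        for reference in (entry.public_key, *entry.tags):
            if not local_store.has_channel(reference):
                channel_md = peer.request_channel(reference)  # ask the same friend
                if channel_md is not None:
                    local_store.add(channel_md)               # step 3.2
    local_store.schedule_channel_downloads()                  # step 4: fetch/update channel torrents
```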

Opportunistic update of metadata

Depending on my relationship with the peer who sent me this MD: if he sent me outdated MD, I could push the newer version of this MD's parent channel to him. On the other hand, if I receive some MD that refers to a newer version of a channel I am already subscribed to, I could decide to do a preemptive update check on that channel, from the same friend.

@ichorid
Contributor Author

ichorid commented Jun 9, 2018

@synctext, the question is: where and how do we store the vote info?
It is very easy to store the votes on a person's TrustChain, but TrustChain is a completely decentralized store; it does not sum votes.
To count the votes on an item globally, the whole network would have to periodically run something like a prefix-sum algorithm. We definitely don't want that.

Of course, vote info could slowly travel the network in the form of a gossip. Trust algorithms ensure that no one would cheat on it.

@synctext
Member

synctext commented Jun 9, 2018

Yes, the ideas seem to converge. AllChannel now has a global broadcast of votes. Gossip of votes and filtering based on trust would be a clear, required leap forward.

@ichorid
Contributor Author

ichorid commented Jun 10, 2018

About the basics of the gossip protocol:
from a personal standpoint, there is no difference between distrust and distaste.

If a Sybil region votes for the things I like, and this vote helps me filter info - I trust this vote. If a proven real person I know votes for things I don't like - I distrust the vote. One good example is the "paid bloggers" phenomenon: these guys maintain high-profile blogs, but sometimes inject paid material. The people who are OK with it form the blogger's audience. Those who shun it just don't listen to the guy.
The same is especially true for political preferences, etc.

We can't create a perfectly objective view of reality for everyone, but we can give a person the tools to make her own view of reality consistent.

@synctext
Member

The If-Modified-Since primitive for content discovery makes a lot of sense.

> 1. I connect to my friend and tell him: "Give me some popular content that
> you've got since moment X (our last data exchange)"
> 2. My friend sends me some fresh metadata (containing both channel torrents
> and regular torrents). I analyze it for new tags and/or Public Keys (PKs).
> 3.1 If there are some torrents with an unknown PK/tag that I have not seen before, I send
> another request to the same friend, asking for the channel torrent with the corresponding PK/tag.
> 3.2 If there is a metadata entry containing a new channel torrent, I add it to my database.
> 4. I download/update all new channel torrents and add everything from them to my local database.

The only information you need to download a channel, sample torrents, and check votes is an info_hash. Libtorrent and Flatbuffers can do all the hard work. Every channel community is redundant. AllChannel is redundant. You only need to gossip the top-5 most-voted-upon channels and the last 5 random channels seen. When talking about a channel, you provide the "genesis hash and the current-version hash". That's it, I believe. Most-minimal-design, probably.

> Search community could initiate "metadata insertion promo walks" to facilitate clusterization of metadata. Unsolicited metadata pushes (spam) are mitigated by accepting metadata only from peers with a high trust score.

Sounds complex... In the past 12 years we have been live, we have gotten lots of spam. Features have been removed from production because they were not spam-resilient. Our trust and reputation functions are still very simplistic. The above idea does not sound like something that is ready for testing on forum.tribler.org by the next release deadline of 1 August.

> A search query is sent in the form of a simple logical expression based on tags and ranges.

Those are perfect attack vectors for a denial-of-service attack, right?

@devos50
Contributor

devos50 commented Jun 13, 2018

We had a pop-up developer meeting where we briefly discussed the basics and future of the existing AllChannel 2.0 mechanism (see #3489). This comment briefly describes what we came up with. Main goal: simplicity

First, we make a distinction between a channel producer (creator) and consumer (a user that browses through the channel).

The channel API (for channel consumers):

  • subscribe(infohash): subscribe to a channel and start downloading its content using libtorrent. This method takes an infohash that contains the channel information (metadata, magnet links etc).
  • unsubscribe(infohash): unsubscribe from a channel. This will remove the download from the libtorrent engine.
  • notify_channel_updated(new_infohash): callback when a channel has been updated by a channel producer. This will update the download in the libtorrent engine.
  • sample_channel(infohash): update the piece priority of a channel. This can be used to implement something like the preview channels we already have in Tribler.
  • get_channel_votes(infohash): returns a list of all cast votes on a specific channel.
  • check_channel_votes(infohash): filter the list, returned by get_channel_votes, and remove invalid (Sybil) votes.
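
A sketch of how this consumer API could look as a Python interface (method names are taken from the list above; signatures and return types are assumptions):

```python
from abc import ABC, abstractmethod
from typing import List


class ChannelConsumer(ABC):
    @abstractmethod
    def subscribe(self, infohash: bytes) -> None:
        """Start downloading the channel torrent with libtorrent."""

    @abstractmethod
    def unsubscribe(self, infohash: bytes) -> None:
        """Remove the channel download from the libtorrent engine."""

    @abstractmethod
    def notify_channel_updated(self, new_infohash: bytes) -> None:
        """Callback fired when the channel producer has published a new version."""

    @abstractmethod
    def sample_channel(self, infohash: bytes) -> None:
        """Raise piece priorities to get a quick preview of the channel."""

    @abstractmethod
    def get_channel_votes(self, infohash: bytes) -> List[bytes]:
        """Return all votes (half-blocks) cast on this channel."""

    @abstractmethod
    def check_channel_votes(self, infohash: bytes) -> List[bytes]:
        """Filter the get_channel_votes() output and drop invalid (Sybil) votes."""
```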

Voting on a channel
Users can cast their votes on a channel by creating a half-block in TrustChain (without any counter-signature). Note that this is not technically possible at the moment in TrustChain, and such blocks will be marked as invalid by the block validation logic. Votes that are present in the blockchains of users can be collected by the channel producer and included in the torrent content.

@ichorid
Contributor Author

ichorid commented Jun 13, 2018

Full Text Search

We will use FTS4/5 from SQLite. To take word morphology into account, we will use the Snowball stemmer. We will produce tokens for FTS ourselves (because we don't want to develop a wrapper and a binary version of Snowball for SQLite).
sqlitefts-python could be used as a wrapper to connect external stemmers to SQLite.

For the first versions of Chant, we will still use the Porter stemmer embedded in SQLite.

Tags and title terms share the same search space. When a torrent is added to the local database, the terms obtained from stemming the words of its metadata title are checked for duplicates against the tags.
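
A minimal sketch of the FTS idea using plain `sqlite3` and the built-in porter tokenizer (the real implementation goes through Pony ORM; this assumes the bundled SQLite was compiled with FTS5):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table over title + tags; the built-in porter tokenizer provides basic stemming.
db.execute("CREATE VIRTUAL TABLE metadata_fts USING fts5(title, tags, tokenize='porter')")
db.execute("INSERT INTO metadata_fts VALUES (?, ?)",
           ("Ubuntu 22.04 desktop installer", "linux.ubuntu.distribution"))
db.execute("INSERT INTO metadata_fts VALUES (?, ?)",
           ("Debian 12 netinst", "linux.debian"))

# "installers" stems to the same token as "installer", so only the Ubuntu row matches.
rows = db.execute("SELECT title FROM metadata_fts WHERE metadata_fts MATCH ?",
                  ("ubuntu installers",)).fetchall()
print(rows)  # [('Ubuntu 22.04 desktop installer',)]
```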

@synctext
Member

Significant overlap with nearly finished code #3660

@synctext
Member

Mental note... This sprint aims to make a big leap forward for metadata, similar to how IPv8 collapsed the code base dramatically. By trying out a new approach for metadata, we make a solid step forward. The sprint deadline is 1 August 2018, with a functional release on forum.tribler.org.

@ichorid
Contributor Author

ichorid commented Jun 14, 2018

Nested channels

  • Channels form a hierarchy, like a directory tree structure. At the top level, there is no common root (the "Forest of Trees" structure).
  • Each channel is tied to a public key (PK), which functions as a permanent ID (and could be used, for example, for voting on TrustChain or for distributing updates a la BEP46).
  • Each metadata entry (MD) has a single parent field that points to the MD's parent channel in the hierarchy (by providing the parent's PK).
  • A leaf MD's parent always equals the MD creator's PK.
  • If the user gets a new MD entry with a previously unseen parent PK, Chant automatically fetches the parent channel's MD from the same source that was used to obtain the child's MD. Then, Chant checks whether that channel really contains the original MD entry. If it does not, the source is considered untrustworthy.
  • The process continues recursively until it reaches a top-level channel MD ("the root of a tree in the forest").

Notes on security:

  • The recursive process of going all the way up cannot be abused, because Chant always asks the sender of the original MD entry for the whole chain of parent channel torrents and checks for incorrectness at each level. Therefore, it cannot be used for DDoS or amplification attacks.
  • The user won't see the association of a child channel with its parent channel until the parent channel is verified to really contain (approve) the child. This protects higher-level channels from unsolicited associations.
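
A minimal sketch of this recursive check (every helper - `source_peer`, `local_store`, the channel methods - is hypothetical):

```python
def verify_parent_chain(source_peer, md_entry, local_store, max_depth=32):
    """Walk up the channel hierarchy, fetching each parent from the peer that sent the
    original entry and checking that the parent really contains its child. A broken
    link marks the source as untrustworthy and the association is never shown."""
    child = md_entry
    for _ in range(max_depth):
        parent = (local_store.get_channel(child.parent_pk)
                  or source_peer.request_channel(child.parent_pk))
        if parent is None or not parent.contains(child):
            return False                 # unsolicited or fake association: reject
        local_store.add(parent)
        if parent.is_top_level():        # reached "the root of a tree in the forest"
            return True
        child = parent
    return False
```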

@qstokkink
Contributor

For reference, this also overlaps: #3484.

@synctext
Member

synctext commented Jun 15, 2018

As the design is becoming clearer, I propose investing some time in a dataset with 1 million real items and a performance analysis: build scientific proof for the Tribler stack, credibility, and a good quality assurance infrastructure.

  • build upon the completed foundations (3 commits of Quinten)
  • have a list of 1 million magnet links
  • operational code for turning those links into a libtorrent swarm
  • performance analysis of adding 1 magnet link to this existing swarm and re-seeding it (Jenkins job)
  • implement the Channel API as proposed above
  • determine how a 1-million-item channel published by one computer is downloaded by another computer (end-to-end performance experiment)

Open Science is a nice candidate dataset. For arXiv.org: the total number of submissions shown in their graph as of June 15th, 2018 (after 26.9 years) = 1,403,151. Outdated stats for Amazon S3 download: the complete set of PDFs is about 270GB, source files about 190GB, and about 40GB of additions/updates are made each month (2012-02).

The overengineering approach: "CORE's aim is to aggregate Open Access content from across the world. We provide seamless access to over 30 million metadata records and 3 million full-text items accessible through a website, API and downloadable Data Sets."

@ichorid
Contributor Author

ichorid commented Jun 16, 2018

Gossip DB schema

When the gossip is deserialized from a file or a network message, it goes through several checks:

  1. Filter by trust based on the PK: Chant compares the author's PK to the list of known good/bad PKs.
  2. Check for duplicate gossip already stored in the local database, based on the gossip's signature.
  3. Check integrity based on the PK and signature.

If the gossip passes these checks, it is processed and added to the local database. The database schema mimics the gossip format, but introduces some additional fields, like the timestamp of the moment of addition and a stemmed FTS index.
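
A compact sketch of this check pipeline (the trust list, database, and verification callable are passed in; their interfaces are assumptions):

```python
import time


def process_incoming_gossip(entry, trust_list, local_db, verify_signature) -> bool:
    """Run checks 1-3 in order; only entries passing all of them reach the database."""
    if trust_list.is_banned(entry.public_key):       # 1. trust filter on the author's PK
        return False
    if local_db.has_signature(entry.signature):      # 2. duplicate check by signature
        return False
    if not verify_signature(entry.public_key, entry.signature, entry.serialized_body):
        return False                                 # 3. integrity check
    # Local-only extra fields: addition timestamp and a stemmed index for FTS.
    local_db.add(entry, added_on=int(time.time()))
    return True
```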

@qstokkink
Contributor

A bit of a memory dump; this is the old way of checking for duplicates:

  • Store only correctly signed messages by mid and global_time.
  • Drop incoming messages with an already-known mid + global_time.

The system worked like this so that you could publish the same message twice. This is useful for messages that are not stored long-term.

That said, I don't think we need to support publishing the same message twice anymore and therefore your gossip design should work.

@synctext
Member

synctext commented Mar 1, 2021

Ahh, nice! Sequential inserts should be no problem. Minimal changes, excellent. The problem is: how can we best package the entire markdown-based Wikipedia (English, Spanish, German, etc.) into a single channel? How many .mdblobs would we have?
(screenshot)

@ichorid
Contributor Author

ichorid commented Mar 2, 2021

The torrent for the multistream (seekable) English Wikipedia article archive is about 17GB of bz2-compressed data. The full unpacked size is assumed to be about 80GB.
Say we store the stuff in the DB, Brotli-compressed, indexing article titles only. That's about 20GB of mdblobs on disk and about 30GB in the database (the 10GB overhead comes from indexes, titles, signatures, etc.). That is 10x our biggest channel experiments and 6x all the channel data out there, combined.

Now come the updates... It is hard to estimate how many pages change daily on English Wikipedia, but the number of daily edits is about 200k and the page churn is about 1k/day. So, assuming an average of about 10 edits per changed page per day, about 20k pages change daily. That's 600k pages changed per month. Again, I did not find any info on the distribution of page retention time, but from those numbers we can roughly estimate a monthly change rate of 10% of the 6M articles.

So, in the end, the whole English Wikipedia (text only) will take about 60GB on disk. It will take about 40 hours on a fast modern PC to upload it into the database with the current architecture. Wikipedia search results will be shown alongside other channel results. Updating Wikipedia will take about 1-2 hours per month on a fast PC.
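
A back-of-envelope check of the update figures above (the edits-per-page ratio is a rough assumption):

```python
daily_edits = 200_000                 # quoted figure for English Wikipedia
edits_per_changed_page = 10           # rough assumption
pages_changed_daily = daily_edits // edits_per_changed_page      # 20_000
pages_changed_monthly = pages_changed_daily * 30                 # 600_000
total_articles = 6_000_000

print(pages_changed_monthly / total_articles)  # 0.1 -> roughly 10% of articles change per month
```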

@synctext
Member

synctext commented Apr 9, 2021

ToDo note for the future: seedbox activity. Limit to a maximum of 2-3 weeks of effort. As a university, we want to showcase the strengths of our technology while minimising engineering time. Static dumps only ❗ Dynamic updates are too difficult.

@drew2a
Contributor

drew2a commented May 26, 2021

@drew2a
Contributor

drew2a commented May 26, 2021

Downloading arXiv's archive will cost us about $100 (based on information from https://github.com/mattbierbaum/arxiv-public-datasets#pdfs-aws-download-only).

@drew2a
Contributor

drew2a commented May 31, 2021

Seedbox activity is on pause now.

  • Cultural content:
    We are waiting for a server with enough free space to deploy the seeder.
  • Science demo, 1 channel with lots of arXiv papers (.PDF):
    The archive is not free.
  • Best of Wikipedia:
    Needs a lot of @ichorid time (at least 2 months).

@synctext
Member

synctext commented Oct 3, 2021

Channels are permissioned; that needs to change. Brainstorm target if the tag experiment is successful: a scientific journal without an editor-in-chief. Peer-review driven. Use markdown, pictures, and an unlimited number of PDF attachments with crowdsourcing magic. The exact magic is our university goal: permissionless science. Leaderless science.

Tag terms of both the journal and individual PDF files become the -exclusive- terms for discoverability. Human-created tag terms dramatically enhance scalability versus using every keyword from the abstract as the search space for all scientific papers. A stepping stone for a 'decentral Google' ;-)

@ichorid
Contributor Author

ichorid commented Oct 3, 2021

Permissionless crowdsourcing is incompatible with the Channels 2.0 architecture, which is based on torrents. The Channels 3.0 architecture fits permissionless crowdsourcing perfectly, though.

@devos50
Contributor

devos50 commented Sep 22, 2022

Early impression of content aggregation:

Schermafbeelding 2022-09-22 om 10 49 00

This is where my user journey died 😄

Schermafbeelding 2022-09-22 om 10 52 46

@synctext
Member

synctext commented Sep 22, 2022

great stuff 👏 Suggestions for next iteration:

  • Focus on both "linux" and "Ubuntu" search.
    • Linux is a tag given to some Ubuntu swarms plus Debian-Jessie
    • Unique ability of search results beyond filename-keyword string matching
    • Closing the semantic gap 🥇 :godmode: 🔬
  • only display 1 snippet with aggregation; we duplicate Big Tech Google's snippet and knowledge panel
    • not 6 items, 6 items, 3 items, 3 items.
    • only 1 snippet with 6 items, others are in traditional non-aggregated view.
    • all top-4 items in a snippet are clickable on search result page (really duplicate what TUDelft search gives us: 4)
    • beyond top-4 the user is forced to make the effort of another click on "other" or something
    • guide user into clicking on 1st item into that 1st aggregated view.
    • Therefore 1st item is most prominent on entire search result screen
    • Use spacing trick that Google also applies to make "Opleidingen", "bachelors", "bekijk alle masters" very prominent (see picture below)
    • Use highlighting of the "search term" inside swarm results using boldness. Duplicate Google:
      (screenshot)
    • possibly remove the aggregation title, replace with clickable swarms and highlighting????
  • what does a user expect that will happen when clicking on the bold "snippet header"
    • Just download already ⏩
    • Expand my view redraw entire screen 😕
  • Can you think about the ideal outcome from popularity community statistics? 98% correct, 98% reliable, 98% accurate download statistics for each swam for a 1MByte test sample. You know the expected download speed and swarm size. Explain this visually to any TikTok user please 🙏
  • Note the GUI has a bad coding smell. It's spaghetti code, hack built upon hack, built upon a hack. See the 'brain-dead refresh function which works for some reason'. Developers escape to the core and only clean up code there...
  • Google approach for inspiration. Really lots of information overload when they are certain that you are looking for this thing.
    (screenshot)

@devos50
Contributor

devos50 commented Sep 23, 2022

The next prototype we designed:

Schermafbeelding 2022-09-23 om 15 32 10

The number of snippets shown and the number of torrents per snippet can be configured through global variables (default number of snippets shown: 1, default number of torrents per snippet: 4). We decided to keep the information in the snippet to a minimum. A torrent that is included in a snippet is not presented in another row as a "regular" torrent.

@devos50
Contributor

devos50 commented Sep 26, 2022

Now that the GUI part of the snippet is mostly figured out, we had a brief discussion on the next step(s).

  • The first iteration(s) will focus on bundling torrents under similar content items. This requires changes to the underlying tags protocol and database structure.
  • At some point, we will most likely want to perform a remote search to obtain content items that are not in our local database yet. But we're not sure how to integrate this into the search panel, since searching is an asynchronous operation. We will tackle this problem later.
  • @drew2a will start working on a database format for content items. When a user conducts a search, this content database is searched first in order to build the snippet.

@synctext
Member

synctext commented Sep 30, 2022

The upcoming sprint is a good opportunity to do a deep dive into the Linux query only. So do something that does not scale: manually tune and tweak just a single query and make dedicated regular expressions for it. Work on it until it is really perfect. Then everybody will have learned something from actual swarms, turning dirty data into perfect search.
An attempt at doing 3 iterations deployed in Oct, Nov and Dec?

@devos50
Contributor

devos50 commented Oct 7, 2022

Sprint Update (Information Architecture for Tribler)

PR #7070 contains the first version of the work described on both this ticket and #6214. This is also a key step towards #7064. Below are some of the insights/design decisions during this sprint.

Last sprint we decided to focus on snippets that bundle torrents describing the same content. This sprint focused on building the storage backend. We decided to keep our design as simple as possible while not limiting ourselves too much for further iterations. In Tribler we will maintain a knowledge graph with content items, torrents and tags. See the visualisation below.

tribler_kg

We believe this structure is sufficient to improve search results by showing snippets, and to support further iterations in which we improve other aspects of Tribler. In the underlying database, this graph is stored as a collection of statements in the format (subject, predicate, object). We took inspiration from existing work on the semantic web, also to establish some shared language when building new algorithms on top of our infrastructure. For example, a content item - torrent relation is represented as the statement ("Ubuntu 22.04", "TORRENT", <infohash>). A torrent-tag relation is represented as the statement (<infohash>, "YEAR", "2013"). We have updated various names in the Tribler code base to reflect these changes in knowledge representation.

We do not allow arbitrary predicates in the knowledge graph. To represent knowledge, we use the Dublin Core (DC) Metadata format and integrate it with the knowledge graph visualised above. This format gives us a set of attributes used to describe torrents, including year, date and format. The predicate field in each statement can take one of the 15 values given by DC. We also have two special predicates, namely TORRENT to describe content item - torrent relations, and TAG for backwards compatibility.
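
A small sketch of how such statements could be represented (the enum is abbreviated and the class names are assumptions, not the actual Tribler KnowledgeComponent types):

```python
from dataclasses import dataclass
from enum import Enum, auto


class Predicate(Enum):
    TITLE = auto()
    CREATOR = auto()
    DATE = auto()
    FORMAT = auto()
    # ... the remaining Dublin Core elements are omitted here for brevity ...
    TORRENT = auto()   # special: content item -> torrent relation
    TAG = auto()       # special: backwards compatibility with the old tag database


@dataclass(frozen=True)
class Statement:
    subject: str        # a content item name or an infohash (hex)
    predicate: Predicate
    obj: str            # the object of the statement


statements = [
    Statement("Ubuntu 22.04", Predicate.TORRENT, "<infohash>"),
    Statement("<infohash>", Predicate.DATE, "2013"),
]
```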

In addition to the above, the tag generator has been updated to automatically generate tags for Ubuntu-related content.

There are a few points we can focus on during the next sprint:

  • Improve the snippet generation algorithm, e.g., a search for linux should also give Ubuntu content.
  • We should work towards a GUI design for editing torrent/content item metadata.
  • The networking backend should be updated.
  • Finishing Upgrade the TagComponent to the KnowledgeComponent #7070 (testing, migrating databases, etc).
  • Scalability/validation experiment on the DAS6.

One of the design decisions is whether to add an FTS index to the strings in the tuples. This would allow efficient searches within the tuples (which I suspect will be a key aspect of querying the knowledge graph) but adds more complexity. We need an experiment to verify the speed-ups.

@drew2a
Contributor

drew2a commented Oct 11, 2022

Automatic Content Item generation (Ubuntu, Debian, Linux Mint):

```python
import re
from re import Pattern

from tribler.core.components.tag.rules.tag_rules_base import Rule, RulesList

space = r'[-\._\s]'
two_digit_version = r'(\d{1,2}(?:\.\d{1,2})?)'


def pattern(linux_distribution: str) -> Pattern:
    return re.compile(f'{linux_distribution}{space}*{two_digit_version}', flags=re.IGNORECASE)


content_items_rules: RulesList = [
    Rule(patterns=[pattern('ubuntu')],
         actions=[lambda s: f'Ubuntu {s}']),
    Rule(patterns=[pattern('debian')],
         actions=[lambda s: f'Debian {s}']),
    Rule(patterns=[re.compile(f'linux{space}*mint{space}*{two_digit_version}', flags=re.IGNORECASE)],
         actions=[lambda s: f'Linux Mint {s}']),
]
```
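
A usage sketch of the rules above (it assumes only the `patterns`/`actions` fields shown in the snippet, not any other Tribler helper):

```python
from typing import Optional


def extract_content_item(title: str) -> Optional[str]:
    """Apply content_items_rules to a torrent title and return the generated content item."""
    for rule in content_items_rules:
        for regex in rule.patterns:
            match = regex.search(title)
            if match:
                value = match.group(1)
                for action in rule.actions:
                    value = action(value)
                return value
    return None


assert extract_content_item("ubuntu-22.04.1-desktop-amd64.iso") == "Ubuntu 22.04"
assert extract_content_item("Linux Mint 21 Cinnamon 64bit") == "Linux Mint 21"
```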

@synctext
Member

synctext commented Oct 11, 2022

Just a reminder to deeply understand what it means to be future proof

| Date | Tribler version | Description of Tribler's database |
|---|---|---|
| 20 Sep 2005 | initial import | Read the configuration file from disk |
| 29 June 2011 | v5.3.x | We ship with a full SQL database engine; CURRENT_MAIN_DB_VERSION = 7 |
| 4 Dec 2012 | v6.0 | Lots of upgrades in 1.5 years; CURRENT_MAIN_DB_VERSION = 17 |
| 25 June 2015 | v6.5.2 | Complex DB versioning support; TRIBLER_65PRE4_DB_VERSION = 28 |
| 11 June 2016 | v7.0.0 | Implemented upgrade process from FTS3 to FTS4 |
| 18 June 2020 | v7.5.0 | Everything gets changed with gigachannels; BETA_DB_VERSIONS = [0, 1, 2, 3, 4, 5] |
| 28 Oct 2021 | v7.11 | Tagging added: 1st metadata crowdsourcing; gui_test_mode else "tags.db" |

Lessons we never learned, but endured: 1) use mature libraries; 2) REPEAT: avoid the latest cool immature tooling; 3) never do a complete re-write; 4) have an explicit roadmap with intermediate goals; 5) let the dev team make the roadmap to ensure buy-in and add realism; 6) know what you are optimising exactly; 7) important final point: always remain stable; no SegFaults or technical debt.

@devos50
Contributor

devos50 commented Oct 11, 2022

As we are getting #7070 ready for merge, we made a few additional design decisions:

  • Initially we sorted the snippets by the number of torrents inside. We changed this to sort the snippets by the top-seeded torrent inside each snippet (see the sketch after this list). This should push popular, well-seeded content items higher up in the search results.
  • For now we generate snippets from the first 50 search results. Since the search results should already be ranked by relevance (which will improve further when "Improved ranking for search results and updated search UI without the artificial delay at the loading screen" #7025 is merged), these first 50 results should be sufficient to build meaningful snippets. Next sprint iterations could focus on improving the snippet generation algorithm and doing more active searches in the knowledge database.
  • The TORRENT resource type will be removed and we will use the TITLE relation instead.
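
A sketch of those two decisions combined (the result objects, the `seeders` attribute, and the `content_item_of` lookup are placeholders for whatever the real code uses):

```python
from collections import defaultdict

MAX_RESULTS_FOR_SNIPPETS = 50   # only the top-ranked search results are considered
MAX_SNIPPETS = 1                # defaults mentioned earlier in the thread
TORRENTS_PER_SNIPPET = 4


def build_snippets(search_results, content_item_of):
    """Group the first N results by content item; rank snippets by their best-seeded torrent."""
    groups = defaultdict(list)
    for result in search_results[:MAX_RESULTS_FOR_SNIPPETS]:
        item = content_item_of(result)        # e.g. "Ubuntu 22.04", or None if unknown
        if item is not None:
            groups[item].append(result)
    ranked = sorted(groups.items(),
                    key=lambda kv: max(t.seeders for t in kv[1]),
                    reverse=True)
    return [(item, sorted(torrents, key=lambda t: t.seeders, reverse=True)[:TORRENTS_PER_SNIPPET])
            for item, torrents in ranked[:MAX_SNIPPETS]]
```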

@devos50
Contributor

devos50 commented Oct 21, 2022

Sprint Update

#7070 has been reviewed and merged, and is ready for deployment with our 7.13 release 🎉. 7.13 is scheduled to be released on Nov 1.

Next steps:

  • Extend the GUI to allow editing of torrent metadata (not necessary for 7.13 release though). Draft PR: Extended GUI to allow editing of metadata #7099
  • Now that the fundamental layer for content management has been designed, we can discuss further extensions to the core algorithm, e.g., leveraging knowledge when searching (beyond snippets).
