Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adversarial search state-of-the-art #2547

Closed
synctext opened this issue Sep 20, 2016 · 32 comments
Closed

adversarial search state-of-the-art #2547

synctext opened this issue Sep 20, 2016 · 32 comments
Assignees

Comments

@synctext
Copy link
Member

synctext commented Sep 20, 2016

Keyword search within a self-organising system is a challenging unsolved problem.

Detecting and removing spam has proven to be extremely difficult. Creating a trustworthy search service, out of unreliable and possibly fraudulent resources is a challenge. A starting point is creating a web-of-trust or other feedback mechanism.

Existing work:

Web of trust for voting within Tribler:

@synctext
Copy link
Member Author

synctext commented Sep 29, 2016

Real world $2.44 million fraud with Amazon reviews/votes, thnx @pimveldhuisen
http://www.zdnet.com/article/exclusive-inside-a-million-dollar-amazon-kindle-catfishing-scam/

@jellelicht
Copy link

Most of these were quite useful for gaining some understanding on this topic, thanks @synctext and @pimveldhuisen.

I am currently looking into
Fighting peer-to-peer SPAM and decoys with object reputation, and to a lesser extent parts of P2P-Based Collaborative Spam Detection and Filtering.

It currently seems that lots of partial 'solutions' wrt adversarial search exist and have been researched, but most often they heavily depend on some form of centralisation or have another major drawback.

@jellelicht
Copy link

jellelicht commented Sep 30, 2016

Also, regarding an often-used WoT based system:
the gpg shortkey issue that came up recently is interesting, but I am not sure if focusing on WoTs is wise at this moment in time, seeing as these are implementation details when looking at the state of the art of adversarial search. WDYT, @synctext?

@synctext
Copy link
Member Author

@wordempire A lot of abuse, fraud and spam examples can be found in social media and e-commerce. So that is nice stuff to write about.

but most often they heavily depend on some form of centralisation or have another major drawback.

That is a perfect storyline! Anything more for self-organising systems or P2P? Stuff like, http://www.ece.umd.edu/~goergen/docs/sec-nwatch.pdf ..

Web-of-trust mechanisms can be a minority part of your report, halve, or the majority. Whatever makes the most interesting story. A list of partial, flawed, and fantasy WoT solutions would be ideal.

@synctext
Copy link
Member Author

synctext commented Oct 3, 2016

@synctext
Copy link
Member Author

synctext commented Oct 7, 2016

Fraud with search results with direct financial gain.

@jellelicht
Copy link

lee2006understanding: develops a model that looks at the link between user behavior/awareness and pollution of a p2p network.

@jellelicht
Copy link

yoshida2009controlling: shows that index poisoning is an effective way of dealing with copyright violations when looking at the Winny network for small sets of files. This approach has the potential to disrupt the network as a whole, which might or might not be desirable for an adversary.

@jellelicht
Copy link

  • Determine layers of operation and archetypes for each layer. For each type,
    refer to drawbacks and assumptions made to make it all work.
    1. Trust building/subversion

      Choosing whether to trust an authority or group of peers can be based on
      a variety of existing decision making processes. For example, this would
      be the perfect section to refer to the abuse happening w.r.t. twitter,
      amazon reviews and of course the gpg short key issue. Proxy measurements
      for trusts could also be reviewed here (e.g., if this person shares a
      lot, they might be trustworthy and that kind of thing).

    2. Index pollution/building

      Indexes have to be generated for content in p2p networks, but have been
      polluted for some older networks, such as the Gnutella network. This
      section can focus on whether existing approaches can handle a subset of
      users putting low quality material online. Find out why it happens
      (user-centric), how it can be prevented and perhaps how it could be
      leveraged to practically block access to a undesirable/illegal resource.

    3. Content poisoning

      Older p2p protocols used a very course checksum to verify entire files.
      It was quite to poison a download, therefore forcing the downloader to
      re-download the entire file. Depending on file size, this can be quite
      expensive. Look at the 'evolution' of systems, with at some point
      referring to the BitTorrent and its piecemeal hashing that partially
      alleviates this issue.

@synctext
Copy link
Member Author

synctext commented Oct 12, 2016

OK, + add 4th or 5th section.

start .tex in https://www.google.nl/search?q=ieeee+format format

https://scholar.google.com/scholar?q=dht+poisoning
https://scholar.google.com/scholar?q=link+farm
https://scholar.google.com/scholar?q=kazaa+pollution
Reddit HackerNews, upvote, shadow ban, etc. techniques
https://scholar.google.com/scholar?q=collaborative+spam+filtering
https://en.wikipedia.org/wiki/Stealth_banning
Honesty among drug dealers, 90% satisfaction level with drug deals: http://dl.acm.org/citation.cfm?id=2488408
https://scholar.google.com/scholar?q=explicit+feedback+spam+filtering
User feedback & moderation: http://www.sciencedirect.com/science/article/pii/S0308596108000955

the Tribler voting and spam prevention mechanism
control D, Dispersy, show votecast
sqlitebrowser ~/.Tribler/sqlite/tribler.sdb
browse _ChannelVotes table
create interesting plot

@jellelicht
Copy link

This was the user-study where the assumption that expert users can quickly assess whether something is spam is questioned: Lee, Uichin, et al. "Understanding Pollution Dynamics in P2P File Sharing." IPTPS. Vol. 6. 2006.

@synctext
Copy link
Member Author

synctext commented Jan 11, 2017

first warmup task: understand and plot key daya from AllChannel content discovery and voting mechanism.

Plot ideas:

  • of currently roughly 2710 channels in Tribler: show number of votes for each channel. X-axis consists of channels, sorted by number of votes.
  • of all 250.000 cast votes, try to determine the time of voting. Plot vote activity in time.
  • Examples of plotting usage

@jellelicht
Copy link

jellelicht commented Jan 24, 2017

I am currently still deciding on how to export all my thesis-related artifacts (no generated artifacts in repositories), but for now a preview of a plot from last week(in xkcd style so I won't accidentally include them in a report as-is):
bars

Also quick question: Is there any more recent work than Niels Zeilemaker's thesis from 2010 regarding the search strategy used in tribler nowadays? AFAIS, search is done by first looking in the local data, and then asking your TasteBuddies for more info, but I could of course be mistaken.

After spending some time thinking about the directions we could to go with this project, I would like to expand on the concept of trust and taking into account the possiblity of trustees being compromised. Trustees in this case could be something like "friends", people with similar voting behaviour or perhaps even something that can best be described as "moderators".

Some issues that I would have to research/address/decide on:

  • How to define and bootstrap "trust" (Perhaps I could look into using something similar to TrustChain, or piggyback on the existing records embedded in TrustChain)
  • How to handle the combination of content-based reputation (votes) and user-based reputation (weighing or even ignoring some votes)
  • If I want to base important decisions on someones votes, I somehow need to make sure Sybil attacks don't influence the decisions too much/at all.

@jellelicht
Copy link

I would also like to propose a different issue title, as "adversarial search" is usually used in the context of e.g. game related A.I. things. How about "Spam-resilient search in decentralized systems"

@jellelicht
Copy link

img_20170126_154624

@jellelicht
Copy link

Problem can be split in two parts:

  1. Prevent spam and spam-related meta-information from entering the network
  2. Prevent spam and spam-related meta-information from hindering the proper usage of the network

@synctext
Copy link
Member Author

Survey paper possible elements:

  • discuss proposed stuff + pictures/architecture in the past (Credence, socery by Yale,..)
  • deeply explain AllChannel
  • Measurements: understand and plot key data from AllChannel content discovery and voting mechanism.

@synctext
Copy link
Member Author

synctext commented Jul 19, 2017

ToDo:

  • start a 2-column IEEE format .tex paper
  • crawl and process allchannel votes, create a non-xkcd style picture
  • write down what you did and how it work. {Tribler created an operational system, we crawled this and analysed the distribution of magical votes, etc etc.}
  • say something intelligent about AllChannels. {1980 voting spam, 98% 2%, most active outliers, etc.}
  • read 15 years papers
  • Write a Problem Description: adversarial search in decentralized systems
  • Discuss 30+ related work www.seas.upenn.edu/~cse400/CSE400_2015_2016/reports/report_28.pdf

@jellelicht
Copy link

Draft version:
main.pdf

@synctext
Copy link
Member Author

synctext commented Aug 24, 2017

Draft feedback:

  • improve structure
  • start sections with clean main point.
  • Why, What, How
  • Scientific
  • Sybil :
    image

@jellelicht
Copy link

jellelicht commented Sep 13, 2017

draft v2
main.pdf

@jellelicht
Copy link

@synctext
Copy link
Member Author

synctext commented Sep 13, 2017

  • A system without these drawbacks that consistently classifies peers as honest or dishonest does not currently exist.
    Any web page, tweet, blog post, wikipedia edits, news article can be fake or real.
  • Define IR
  • solutions for dishonesty
  • no 3.B
  • More then [25]
  • Experiment with real distributed search code. Example
    We spend exactly 1 hour which each of these software packages and describe their maturity level. Include screenshot of each.
    • freenet
    • gnunet
    • YACY
    • Tribler
    • etc

@jellelicht
Copy link

main.pdf

@synctext
Copy link
Member Author

synctext commented Nov 7, 2017

kelong_sybil_overview

@jellelicht
Copy link

  • Influence of mass media on perception (shifting the 'normal' distribution e.g. Sybil region) effectively, why did this work as effective as it did?
  • What are the real-world costs associated with creating many 'troll accounts'?

@ghost
Copy link

ghost commented Nov 7, 2017

Given how many legitimate news organizations and people are routinely labelled 'troll' by their competitors (RT / AlJazeera / CNN / FOX are, even if they are biased on questions of russian/qatar/US-blue-team/US-red-team interest) - the question of 'why did this work as effectively as it did' has an underlying truth component of 'because what they were saying was just as true of a constructed narrative of social facts as the competing consensus was'. It's not the whole reason why they are successful but if we're thinking about search and mass media we should keep in mind that in addition to the mass media perception shifting going on from one player in the 'troll account' narrative, there is great (perhaps greater) mass media perception management going on from the other player as well. Some success by the other players may serve to balance out the bias of the network itself in the favour of the incumbents.

To phrase in the context of, say, Kelong Cong's paper 3.3...the 'honest region' does not include either the blue or red team and everyone associated with it, both meatspace and bot, to the extent that shared, necessary illusions involved in group membership are held.

@synctext
Copy link
Member Author

synctext commented Jun 6, 2018

@ichorid See this ticket of related work. Especially the 8000 fake Twitter accounts.

@ichorid
Copy link
Contributor

ichorid commented Jun 6, 2018

@synctext thanks, I'll take this stuff into account.

@ichorid
Copy link
Contributor

ichorid commented Jun 9, 2018

related #3615

@synctext
Copy link
Member Author

Broader vision, beyond keyword search. An extensive technical analysis of the threat model in troubled regions. Aid workers are exposed to difficult challenges, see On Enforcing the Digital Immunity of a Large Humanitarian Organization.

@qstokkink qstokkink added this to the Backlog milestone Nov 27, 2018
@synctext
Copy link
Member Author

Status update after a few years:

  • we avoid solving the general keyword search problem
  • items are bundled into channels, with 1 publisher
  • each channel has a number of votes by voters
  • each voter has a sybil-attack estimator through trustchain
  • count weighted votes to estimate channel + item relevance ranking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants