Rate limiting is very basic #6
Comments
ref: https://twitter.com/hrbrmstr/status/947064265621549058 This is a 👍 issue to bring up. Given that you're directing this at bug bounty hunters, the per-org bug bounty rules of engagement need to state that it's OK to ignore 🤖 has optional Slightly on-topic is Line 37 in 29116f8
-- switch to override this and use one provided by the user OR to short-name select from an internal select list of canned ones would be safer for users and you. https://github.com/hrbrmstr/splashr/blob/a7c5406264b91918e60e5abf692b51baf5ab2fb7/R/dsl.r#L422-L468 has a good set of ones to use for that purpose AND the added benefit is that folks can switch OS and from desktop to mobile, since many sites change behaviour with different UAs present.
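A rough Go sketch of the "short name or full override" idea described above. The flag name `-user-agent`, the short names, and the User-Agent strings themselves are illustrative assumptions, not meg's actual options or list.

```go
package main

import (
	"flag"
	"fmt"
)

// cannedUserAgents maps short names to full User-Agent strings.
// Both the names and the strings are illustrative placeholders.
var cannedUserAgents = map[string]string{
	"desktop": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
	"mobile":  "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
	"meg":     "Mozilla/5.0 (compatible; meg/0.1)", // hypothetical tool-specific UA
}

// defaultUserAgent is the short name used when the user supplies nothing.
const defaultUserAgent = "meg"

// resolveUserAgent returns the canned string when value is a known short
// name, the default when value is empty, and value itself otherwise.
func resolveUserAgent(value string) string {
	if value == "" {
		value = defaultUserAgent
	}
	if ua, ok := cannedUserAgents[value]; ok {
		return ua
	}
	return value
}

func main() {
	ua := flag.String("user-agent", "", "short name (desktop, mobile, meg) or a full User-Agent string")
	flag.Parse()
	fmt.Println(resolveUserAgent(*ua))
}
```

Anything that is not a known short name is passed through unchanged, so a user can still supply a complete string of their own.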
Completely off-topic: you're capturing tons of juicy data. Consider using HAR format for the output (https://github.com/CyrusBiotechnology/go-har may help with that). It's slightly better than WARC (IMO) for storing some technical info along with the headers and payload but WARC support might also be nice (https://github.com/slyrz/warc may help) since there are at-scale tools to enable working with that.
Thanks for the heads up and super detailed reply! I'm working on rate limiting at the moment. The tool has been for my own use up until now and I've been putting it off, largely because I've been trying to come up with some kind of requeuing mechanism to keep the pipeline of requests as full as possible and avoid stalling goroutines. Given your input I think I'll implement something simpler but less efficient as an interim measure. Do you think a default 5s delay with an override would be OK for now? I think I'm a bit away from fetching and parsing 🤖 yet. WRT the user agent: yeah; that's a leftover from the only-for-my-use days. I've dropped that in 4418a31. I'll add a 'meg' mozilla-like user agent soon and give people the option of overriding if they want to. WRT HAR format: I'll look into that. The original idea for this tool was to make it easy for me to use standard tools like grep to filter and find things in the output. Is HAR fairly amenable to that?
@hrbrmstr I've added some basic rate limiting in fffe252 - and a warning about reducing it in the readme. 'basic' might be a bit of an understatement, but it's an improvement. I'll give some more thought to it over the weekend. I'm going to leave this issue open until I'm comfortable about the solution - maybe even until it fetches, parses and respects the robots.txt. Thanks again!
If I can carve out some cycles in wk2 of Jan (I'm on a pretty big project until then) I'll gladly lend a hand. This is a super nice and focused alternative to generating URL strings for
w/r/t HAR: aye. It's lovely JSON, so you can even use
(replying as I read, apologies :-) def 5s is, IMO, 100% 👍 https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/
It'd be a really good idea to rate limit per domain (or maybe per IP) to prevent hammering hosts when there aren't many prefixes.