goscrape

goscrape is an extensible structured scraper for Go. What does "structured scraper" mean? In this case, it means that you define what you want to extract from a page in a structured, hierarchical manner, and then goscrape takes care of pagination, splitting the input page, and calling the code to extract chunks of data. However, goscrape is extensible, allowing you to customize nearly every step of this process.

The architecture of goscrape is roughly as follows:

A single request to start scraping (from a given URL) is called a scrape.
Each scrape consists of some number of pages.
Inside each page, there are one or more blocks: logical subcomponents that the page is split into. By default, there's a single block that consists of the page's <body> element, but you can change this fairly easily.
Inside each block, you define some number of pieces of data that you wish to extract. Each piece consists of a name, a selector, and what data to extract from the current block.
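To make the hierarchy above concrete, here is a minimal, self-contained sketch of the scrape → pages → blocks → pieces flow. It does not use goscrape itself: the Piece, Config, ScrapePages, and demoConfig names are hypothetical stand-ins invented for illustration, pages are plain strings rather than fetched HTML, and selection and extraction are folded into a single function instead of a selector plus an extractor.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Piece names one value to pull out of a block (hypothetical type).
// A real piece also carries a selector; this sketch skips that step.
type Piece struct {
	Name    string
	Extract func(block string) string
}

// Config holds the two decisions described above: how a page is divided
// into blocks, and which pieces are extracted from each block.
type Config struct {
	DividePage func(page string) []string
	Pieces     []Piece
}

// ScrapePages walks the hierarchy: scrape -> pages -> blocks -> pieces.
// It returns one slice per page, holding one map per block.
func ScrapePages(cfg Config, pages []string) [][]map[string]string {
	results := make([][]map[string]string, 0, len(pages))
	for _, page := range pages {
		blocks := cfg.DividePage(page)
		pageResults := make([]map[string]string, 0, len(blocks))
		for _, block := range blocks {
			row := make(map[string]string, len(cfg.Pieces))
			for _, p := range cfg.Pieces {
				row[p.Name] = p.Extract(block)
			}
			pageResults = append(pageResults, row)
		}
		results = append(results, pageResults)
	}
	return results
}

// demoConfig treats each line of a page as a block of the form
// "title | byline" and extracts the two fields as pieces.
func demoConfig() Config {
	part := func(i int) func(string) string {
		return func(block string) string {
			fields := strings.SplitN(block, "|", 2)
			if i >= len(fields) {
				return ""
			}
			return strings.TrimSpace(fields[i])
		}
	}
	return Config{
		DividePage: func(page string) []string {
			return strings.Split(strings.TrimSpace(page), "\n")
		},
		Pieces: []Piece{
			{Name: "title", Extract: part(0)},
			{Name: "byline", Extract: part(1)},
		},
	}
}

func main() {
	// Two "pages": the first has two blocks, the second has one.
	pages := []string{
		"First story | Alice\nSecond story | Bob",
		"Third story | Carol",
	}
	results := ScrapePages(demoConfig(), pages)
	if err := json.NewEncoder(os.Stdout).Encode(results); err != nil {
		fmt.Fprintf(os.Stderr, "Error encoding results: %s\n", err)
		os.Exit(1)
	}
}
```

The nesting of the result (pages, then blocks, then named pieces) is the point of the sketch. How goscrape itself represents results may differ.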

This all sounds rather complicated, but in practice it's quite simple. Here's a short example of how to get a list of all the latest news articles from Wired and dump them as JSON to the screen:

package main

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/andrew-d/goscrape"
	"github.com/andrew-d/goscrape/extract"
)

func main() {
	config := &scrape.ScrapeConfig{
		DividePage: scrape.DividePageBySelector("#latest-news li"),

		Pieces: []scrape.Piece{
			{Name: "title", Selector: "h5.exchange-sm", Extractor: extract.Text{}},
			{Name: "byline", Selector: "span.byline", Extractor: extract.Text{}},
			{Name: "link", Selector: "a", Extractor: extract.Attr{Attr: "href"}},
		},
	}

	scraper, err := scrape.New(config)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating scraper: %s\n", err)
		os.Exit(1)
	}

	results, err := scraper.Scrape("http://www.wired.com")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error scraping: %s\n", err)
		os.Exit(1)
	}

	json.NewEncoder(os.Stdout).Encode(results)
}

As you can see, the entire example, including proper error handling, only takes 36 lines of code - short and sweet.

For more usage examples, see the examples directory.

Roadmap

Here's the rough roadmap of things that I'd like to add. If you have a feature request, please let me know by opening an issue!

Allow deduplication of Pieces (a custom callback?)
Improve parallelization (scrape multiple pages at a time, but maintain order)

License

MIT
