Web Scraping

Sun 23 August 2015
  • scrapinghub/portia is a tool that lets you visually scrape websites without any programming knowledge. With Portia you annotate a web page to identify the data you wish to extract, and Portia uses these annotations to scrape data from similar pages.

  • Frontera is a crawl frontier framework, the part of a crawling system that decides the logic and policies to follow when a crawler is visiting websites (what pages should be crawled next, priorities and ordering, how often pages are revisited, etc). Documentation: http://frontera.readthedocs.org
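
    As a sketch of how Frontera plugs into a Scrapy project: the Scrapy settings point at Frontera's scheduler and middlewares, and a separate Frontera settings module selects the backend that implements the crawling policy. The module paths below follow the Frontera quick-start documentation and may differ between versions; the project and module names are placeholders.

        # settings.py (sketch): hand crawl ordering over to Frontera
        SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

        SPIDER_MIDDLEWARES = {
            'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
        }
        DOWNLOADER_MIDDLEWARES = {
            'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
        }

        # Points at a module holding Frontera's own settings, e.g.
        #   BACKEND = 'frontera.contrib.backends.memory.FIFO'
        # which decides what to crawl next and in which order.
        FRONTERA_SETTINGS = 'myproject.frontera_settings'  # placeholder module path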

  • scrapy/scrapyd is a service for running Scrapy spiders. It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API. The documentation (including installation and usage) can be found at: http://scrapyd.readthedocs.org/
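
    For instance, once a project has been deployed, spiders can be scheduled and monitored through Scrapyd's JSON API. The sketch below uses the requests library against a Scrapyd instance assumed to be listening on the default port 6800, with placeholder project and spider names.

        import requests

        SCRAPYD = "http://localhost:6800"  # default Scrapyd address (assumes a local install)

        # Schedule a run of spider "myspider" in project "myproject" (placeholder names).
        job = requests.post(SCRAPYD + "/schedule.json",
                            data={"project": "myproject", "spider": "myspider"}).json()
        print(job)   # e.g. {"status": "ok", "jobid": "..."}

        # List pending, running and finished jobs for the project.
        jobs = requests.get(SCRAPYD + "/listjobs.json",
                            params={"project": "myproject"}).json()
        print(jobs)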

  • scrapy/scrapyd-client Scrapyd-client is a client for Scrapyd. It provides the scrapyd-deploy utility, which allows you to deploy your projects to a Scrapyd server.

  • scrapy/parsel Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors. Documentation: https://parsel.readthedocs.org.
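
    A minimal example (the HTML snippet is made up for illustration):

        from parsel import Selector

        html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"
        sel = Selector(text=html)

        sel.css("h1::text").extract_first()                       # 'Title'  (CSS selector)
        sel.xpath("//p[@class='intro']/text()").extract_first()   # 'Hello'  (XPath)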

  • scrapy/queuelib Queuelib is a collection of persistent (disk-based) queues for Python. Its goals are speed and simplicity. It was originally part of the Scrapy framework and was later split out into its own library.
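
    For example, a FIFO disk queue persists pushed items (as bytes) to a file so they survive process restarts; the on-disk path below is arbitrary:

        from queuelib import FifoDiskQueue

        q = FifoDiskQueue("queuefile")   # arbitrary on-disk path
        q.push(b"a")
        q.push(b"b")
        q.pop()    # b'a'  (first in, first out)
        q.close()  # flushes remaining items to disk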

  • scrapy/scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
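
    Usage roughly follows the project's README: train the scraper with one example page and the values you expect it to find there, then apply it to structurally similar pages. The URLs and field values below are illustrative.

        from scrapely import Scraper

        s = Scraper()
        # Train on one example page by pointing at the values it should learn to locate.
        s.train("http://pypi.python.org/pypi/w3lib/1.1",
                {"name": "w3lib 1.1", "author": "Scrapy project"})
        # Extract the same fields from a similar page.
        print(s.scrape("http://pypi.python.org/pypi/Django/1.3"))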

  • scrapy/w3lib Python library of web-related functions (a short usage sketch follows the list):

    • remove comments or tags from HTML snippets
    • extract the base URL from HTML snippets
    • translate entities in HTML strings
    • convert raw HTTP headers to dicts and vice-versa
    • construct HTTP auth headers
    • convert HTML pages to unicode
    • sanitize URLs (like browsers do)
    • extract arguments from URLs
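
    A few of these, for illustration (the functions live in the w3lib.html, w3lib.http and w3lib.url modules):

        from w3lib.html import remove_tags, replace_entities
        from w3lib.http import basic_auth_header
        from w3lib.url import safe_url_string, url_query_parameter

        remove_tags("<p>Hello <b>world</b></p>")                  # 'Hello world'
        replace_entities("Price: &pound;100")                     # 'Price: \xa3100'
        basic_auth_header("user", "pass")                         # b'Basic dXNlcjpwYXNz'
        safe_url_string("http://example.com/some path")           # percent-encodes unsafe characters
        url_query_parameter("http://example.com/?id=200", "id")   # '200'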

Projects based on Scrapy