- scrapinghub/portia is a tool that lets you visually scrape websites without any programming knowledge. With Portia you annotate a web page to identify the data you wish to extract, and Portia learns from these annotations how to scrape similar pages.
- Frontera is a crawl frontier framework, the part of a crawling system that decides the logic and policies to follow when a crawler is visiting websites (which pages should be crawled next, priorities and ordering, how often pages are revisited, etc.). Documentation: http://frontera.readthedocs.org
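To give a feel for how it plugs in, here is a minimal sketch of wiring Frontera into a Scrapy project, following the Scrapy integration described in the Frontera docs (the settings-module path `myproject.frontera_settings` is a placeholder):

```python
# settings.py of a Scrapy project (sketch; class paths follow Frontera's
# documented Scrapy integration, 'myproject.frontera_settings' is hypothetical)

# Let Frontera decide what gets crawled next instead of Scrapy's own scheduler.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Module holding Frontera's own settings (backend choice, revisit policy, ...),
# e.g. BACKEND = 'frontera.contrib.backends.memory.FIFO'
FRONTERA_SETTINGS = 'myproject.frontera_settings'
```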
- scrapy/scrapyd is a service for running Scrapy spiders. It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API. The documentation (including installation and usage) can be found at: http://scrapyd.readthedocs.org/
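The JSON API is plain HTTP; for example, scheduling a spider run and listing jobs looks like this (a sketch using requests against a local Scrapyd; 'myproject' and 'myspider' stand in for something you have actually deployed):

```python
import requests

# Scrapyd listens on http://localhost:6800 by default.
# Schedule a run of a spider from an already-deployed project.
resp = requests.post("http://localhost:6800/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
print(resp.json())  # {"status": "ok", "jobid": "..."} on success

# Check on pending/running/finished jobs for that project.
jobs = requests.get("http://localhost:6800/listjobs.json",
                    params={"project": "myproject"})
print(jobs.json())
```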
- scrapy/scrapyd-client Scrapyd-client is a client for Scrapyd. It provides the scrapyd-deploy utility, which lets you deploy your project to a Scrapyd server (scrapyd-deploy <target> -p <project>).
- scrapy/parsel Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors. Documentation: https://parsel.readthedocs.org.
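Parsel's core API is small; a minimal, self-contained example:

```python
from parsel import Selector

html = u"""
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/item/1">First</a></li>
    <li class="item"><a href="/item/2">Second</a></li>
  </ul>
</body></html>
"""

sel = Selector(text=html)
print(sel.css("h1::text").extract_first())                 # 'Products'
print(sel.css("li.item a::text").extract())                # ['First', 'Second']
print(sel.xpath("//li[@class='item']/a/@href").extract())  # ['/item/1', '/item/2']
```

The same Selector accepts both query languages, so you can mix CSS for the easy parts and XPath where you need its extra power.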
- scrapy/queuelib Queuelib is a collection of persistent (disk-based) queues for Python. Its goals are speed and simplicity. It was originally part of the Scrapy framework and was split out into its own library.
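Usage mirrors an in-memory queue, except that items (bytes) survive process restarts; a short sketch along the lines of queuelib's README:

```python
from queuelib import FifoDiskQueue

q = FifoDiskQueue("queuefile")  # backing file path is arbitrary
q.push(b"a")
q.push(b"b")
print(q.pop())  # b'a' -- FIFO order
q.close()       # flush, so a later FifoDiskQueue("queuefile") resumes with b'b'
```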
- scrapy/scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, Scrapely constructs a parser for all similar pages.
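The train/scrape flow looks roughly like this (a sketch modeled on the example in scrapely's README; the PyPI URLs and field values only work while those pages keep that layout):

```python
from scrapely import Scraper

s = Scraper()

# Train on one example page by telling scrapely what data it contains.
train_url = 'http://pypi.python.org/pypi/w3lib/1.1'
s.train(train_url, {'name': 'w3lib 1.1', 'author': 'Scrapy project'})

# Scrape a structurally similar page; scrapely finds the matching fields.
target_url = 'http://pypi.python.org/pypi/Django/1.3'
print(s.scrape(target_url))
```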
- scrapy/w3lib Python library of web-related functions (a few are demonstrated after this list):
  - remove comments or tags from HTML snippets
  - extract the base URL from HTML snippets
  - translate entities in HTML strings
  - convert raw HTTP headers to dicts and vice versa
  - construct HTTP auth headers
  - convert HTML pages to Unicode
  - sanitize URLs (as browsers do)
  - extract arguments from URLs
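A quick tour of a few of these helpers (outputs shown as comments; exact bytes/str return types vary a little between w3lib versions):

```python
from w3lib.html import remove_tags, remove_comments, replace_entities
from w3lib.http import basic_auth_header
from w3lib.url import safe_url_string, url_query_parameter

print(remove_tags(u'<p>Hello <b>world</b></p>'))          # 'Hello world'
print(remove_comments(u'text <!-- note --> more'))        # 'text  more'
print(replace_entities(u'Hello &amp; goodbye'))           # 'Hello & goodbye'
print(basic_auth_header('user', 'pass'))                  # 'Basic dXNlcjpwYXNz'
print(safe_url_string(u'http://example.com/a b'))         # 'http://example.com/a%20b'
print(url_query_parameter('http://example.com/?id=200', 'id'))  # '200'
```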