Web Scraping

Sun 23 August 2015
  • scrapinghub/portia is a tool that lets you visually scrape websites without any programming knowledge. With Portia you annotate a web page to identify the data you wish to extract, and Portia uses these annotations to scrape data from similar pages.

  • Frontera is a crawl frontier framework, the part of a crawling system that decides the logic and policies to follow when a crawler is visiting websites (what pages should be crawled next, priorities and ordering, how often pages are revisited, etc). Documentation: http://frontera.readthedocs.org
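
    As a sketch of how Frontera plugs into a Scrapy project: the Scrapy settings point at Frontera's scheduler and middlewares, and a separate Frontera settings module selects the backend that implements the crawling policy. The module paths below follow the Frontera quick-start documentation and may differ between versions; the project and module names are placeholders.

        # settings.py (sketch): hand crawl ordering over to Frontera
        SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

        SPIDER_MIDDLEWARES = {
            'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
        }
        DOWNLOADER_MIDDLEWARES = {
            'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
        }

        # Points at a module holding Frontera's own settings, e.g.
        #   BACKEND = 'frontera.contrib.backends.memory.FIFO'
        # which decides what to crawl next and in which order.
        FRONTERA_SETTINGS = 'myproject.frontera_settings'  # placeholder module path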

  • scrapy/scrapyd is a service for running Scrapy spiders. It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API. The documentation (including installation and usage) can be found at: http://scrapyd.readthedocs.org/
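
    For instance, once a project has been deployed, spiders can be scheduled and monitored through Scrapyd's JSON API. The sketch below uses the requests library against a Scrapyd instance assumed to be listening on the default port 6800, with placeholder project and spider names.

        import requests

        SCRAPYD = "http://localhost:6800"  # default Scrapyd address (assumes a local install)

        # Schedule a run of spider "myspider" in project "myproject" (placeholder names).
        job = requests.post(SCRAPYD + "/schedule.json",
                            data={"project": "myproject", "spider": "myspider"}).json()
        print(job)   # e.g. {"status": "ok", "jobid": "..."}

        # List pending, running and finished jobs for the project.
        jobs = requests.get(SCRAPYD + "/listjobs.json",
                            params={"project": "myproject"}).json()
        print(jobs)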

  • scrapy/scrapyd-client Scrapyd-client is a client for Scrapyd. It provides the scrapyd-deploy utility, which allows you to deploy your projects to a Scrapyd server.

  • scrapy/parsel Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors. Documentation: https://parsel.readthedocs.org.
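
    A minimal example (the HTML snippet is made up for illustration):

        from parsel import Selector

        html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"
        sel = Selector(text=html)

        sel.css("h1::text").extract_first()                       # 'Title'  (CSS selector)
        sel.xpath("//p[@class='intro']/text()").extract_first()   # 'Hello'  (XPath)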

  • scrapy/queuelib Queuelib is a collection of persistent (disk-based) queues for Python. Its goals are speed and simplicity. It was originally part of the Scrapy framework and was later split out into its own library.
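
    For example, a FIFO disk queue persists pushed items (as bytes) to a file so they survive process restarts; the on-disk path below is arbitrary:

        from queuelib import FifoDiskQueue

        q = FifoDiskQueue("queuefile")   # arbitrary on-disk path
        q.push(b"a")
        q.push(b"b")
        q.pop()    # b'a'  (first in, first out)
        q.close()  # flushes remaining items to disk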

  • scrapy/scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
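
    Usage roughly follows the project's README: train the scraper with one example page and the values you expect it to find there, then apply it to structurally similar pages. The URLs and field values below are illustrative.

        from scrapely import Scraper

        s = Scraper()
        # Train on one example page by pointing at the values it should learn to locate.
        s.train("http://pypi.python.org/pypi/w3lib/1.1",
                {"name": "w3lib 1.1", "author": "Scrapy project"})
        # Extract the same fields from a similar page.
        print(s.scrape("http://pypi.python.org/pypi/Django/1.3"))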

  • scrapy/w3lib Python library of web-related functions (a short usage sketch follows the list):

    • remove comments or tags from HTML snippets
    • extract the base URL from HTML snippets
    • translate entities in HTML strings
    • convert raw HTTP headers to dicts and vice-versa
    • construct HTTP auth headers
    • convert HTML pages to unicode
    • sanitize URLs (like browsers do)
    • extract arguments from URLs
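
    A few of these, for illustration (the functions live in the w3lib.html, w3lib.http and w3lib.url modules):

        from w3lib.html import remove_tags, replace_entities
        from w3lib.http import basic_auth_header
        from w3lib.url import safe_url_string, url_query_parameter

        remove_tags("<p>Hello <b>world</b></p>")                  # 'Hello world'
        replace_entities("Price: &pound;100")                     # 'Price: \xa3100'
        basic_auth_header("user", "pass")                         # b'Basic dXNlcjpwYXNz'
        safe_url_string("http://example.com/some path")           # percent-encodes unsafe characters
        url_query_parameter("http://example.com/?id=200", "id")   # '200'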

Projects based on Scrapy