Big Data Ingestion Links

Sun 20 September 2015
  • Amazon Kinesis: real-time processing of streaming data at massive scale
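
A minimal sketch of pushing a record onto a Kinesis stream with the boto3 client; the stream name "clickstream", the region, and the event fields are placeholders.

```python
import json

import boto3  # AWS SDK for Python

# Minimal sketch: put one event onto a Kinesis stream.
# The stream name "clickstream" and the region are placeholders.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": 42, "action": "page_view", "ts": "2015-09-20T12:00:00Z"}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),  # record payload (blob)
    PartitionKey=str(event["user_id"]),      # determines the target shard
)
```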

  • Apache Chukwa: data collection system for monitoring large distributed systems

  • Apache Flume: distributed service for collecting, aggregating, and moving large amounts of log data

  • Apache Samza: stream processing framework built on Kafka and YARN
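
Since Samza jobs typically read their input from Kafka topics, here is a minimal sketch of publishing events for such a job using the kafka-python client; the broker address and the topic name "page-views" are assumptions.

```python
import json

from kafka import KafkaProducer  # kafka-python client

# Minimal sketch: publish JSON events to a Kafka topic that a Samza job
# (or any other stream processor) could consume.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user_id": 42, "url": "/pricing"})
producer.flush()  # block until buffered records are sent
```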

  • Apache Sqoop: tool to transfer data between Hadoop and a structured datastore
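
Sqoop is driven from the command line; a hedged sketch of wrapping a typical table import, where the JDBC URL, credentials, table name, and target directory are all placeholders:

```python
import subprocess

# Minimal sketch: import a relational table into HDFS with Sqoop.
# JDBC URL, credentials, table name, and target directory are placeholders.
subprocess.check_call([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
])
```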

  • Apache UIMA: framework for building Unstructured Information Management applications, software systems that analyze large volumes of unstructured information to discover knowledge relevant to an end user

  • Cloudera Morphlines: framework that helps with ETL into Solr, HBase, and HDFS

  • Facebook Scribe: streamed log data aggregator

  • Fluentd: tool to collect events and logs
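
A minimal sketch of emitting a structured event to a local Fluentd agent with the fluent-logger library; the tag "app", the host, and port 24224 (the usual forward-input port) are assumptions.

```python
from fluent import sender  # pip install fluent-logger

# Minimal sketch: send one structured event to a local Fluentd agent.
# Tag "app", host, and port 24224 are assumptions about the agent's config.
logger = sender.FluentSender("app", host="localhost", port=24224)
logger.emit("login", {"user": "alice", "status": "ok"})
logger.close()
```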

  • Google Photon: geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency

  • Heka: open source stream processing software system

  • HIHO: framework for connecting disparate data sources with Hadoop

  • LinkedIn Databus: stream of change capture events for a database

  • LinkedIn Kamikaze: utility package for compressing sorted integer arrays

  • LinkedIn White Elephant: log aggregator and dashboard

  • Logstash: a tool for managing events and logs
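
Logstash accepts events from many inputs; a minimal sketch of shipping one JSON event over TCP, assuming a Logstash pipeline configured with a tcp input and the json_lines codec listening on port 5000:

```python
import json
import socket

# Minimal sketch: ship one JSON event to Logstash over TCP.
# Assumes a tcp input with the json_lines codec listening on port 5000.
event = {"message": "user signup", "service": "web", "level": "info"}
with socket.create_connection(("localhost", 5000)) as sock:
    sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
```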

  • Netflix Suro: data pipeline service, based on Apache Chukwa, for collecting, aggregating, and dispatching large volumes of application events including log data

  • Pinterest Secor: service implementing Kafka log persistence

  • Record Breaker: automatic structure inference for text-formatted data

  • TIBCO Enterprise Message Service: standards-based messaging middleware

  • Twitter Zipkin: distributed tracing system that gathers timing data for the disparate services at Twitter

  • Vibe Data Stream: streaming data collection for real-time Big Data analytics

  • LinkedIn Camus: LinkedIn's Kafka-to-HDFS pipeline. Camus is being phased out and replaced by Gobblin; anyone using or interested in Camus should take a look at Gobblin.

  • LinkedIn Gobblin: universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of sources (databases, REST APIs, FTP/SFTP servers, filers, etc.) onto Hadoop. Gobblin handles the routine tasks required by all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, and data publishing. It ingests data from different sources within the same execution framework and manages the metadata of all sources in one place. Combined with features such as auto-scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, this makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework; a conceptual sketch of the routine it automates follows below.
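
Gobblin jobs are configured declaratively rather than coded by hand; purely as a conceptual illustration (not Gobblin's API, and with all names invented for the sketch), the skeleton below shows the per-source routine it automates: partition the work, extract, convert, check quality, publish, and track state for incremental runs.

```python
# Conceptual sketch only, not Gobblin's API: the per-source routine that a
# framework like Gobblin automates. All names here are illustrative.
def run_ingestion_job(source, convert, quality_checks, publisher, state_store):
    watermark = state_store.load_watermark(source.name)         # resume point
    for partition in source.partitions(since=watermark):        # task partitioning
        try:
            records = [convert(r) for r in source.extract(partition)]
            if all(check(records) for check in quality_checks): # quality checking
                publisher.publish(source.name, records)         # data publishing
                state_store.save_watermark(source.name,
                                           partition.high_watermark)  # state mgmt
        except Exception as err:                                 # error handling
            state_store.mark_failed(source.name, partition, err)
```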

  • Apache NiFi: supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

    • Web-based user interface: seamless experience between design, control, feedback, and monitoring
    • Highly configurable: loss tolerant vs guaranteed delivery, low latency vs high throughput, dynamic prioritization, flow can be modified at runtime, back pressure
    • Data Provenance: track dataflow from beginning to end
    • Designed for extension: build your own processors and more, enables rapid development and effective testing
    • Secure: SSL, SSH, HTTPS, encrypted content, pluggable role-based authentication/authorization
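
Data usually enters a NiFi flow through an ingress processor configured in the web UI; as a hedged sketch, assuming a flow whose entry point is a ListenHTTP processor on port 8081 with base path "contentListener" (both are assumptions about the flow), a producer could push records like this:

```python
import json

import requests

# Minimal sketch: POST one record to a NiFi ListenHTTP ingress processor.
# Port 8081 and base path "contentListener" are assumptions about the flow.
payload = {"sensor": "temp-01", "value": 21.7}
resp = requests.post(
    "http://localhost:8081/contentListener",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()  # non-2xx means the flow did not accept the record
```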