- AddThis Hydra: distributed data processing and storage system originally developed at AddThis
- Akela: Mozilla's utility library for Hadoop, HBase, Pig, etc.
- AMPLab SIMR: run Spark on Hadoop MapReduce v1
- AMPLab Succinct: Enabling Queries on Compressed Data
- Apache Crunch: Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into a relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
- Apache DataFu: collection of user-defined functions for Hadoop and Pig developed by LinkedIn
- Apache Flink: high-performance runtime with automatic program optimization
- Apache Gora: framework for in-memory data model and persistence
- Apache Hama: BSP (Bulk Synchronous Parallel) computing framework
- Apache MapReduce: programming model for processing large data sets with a parallel, distributed algorithm on a cluster
- Apache Pig: high level language to express data analysis programs for Hadoop
- Apache S4: framework for stream processing, implementation of S4
- Apache Spark: framework for in-memory cluster computing
- Apache Spark Streaming: framework for stream processing, part of Spark
- Apache Storm: framework for stream processing by Twitter, also runs on YARN
- Apache Tez: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN
- Apache Twill: abstraction over YARN that reduces the complexity of developing distributed applications
- Cascalog: data processing and querying library
- Cheetah: High Performance, Custom Data Warehouse on Top of MapReduce
- Concurrent Cascading: framework for data management/analytics on Hadoop
- Damballa Parkour: MapReduce library for Clojure
- Datasalt Pangool: alternative MapReduce paradigm
- DataTorrent StrAM: real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance
- DistributedR: scalable high-performance platform for the R language
- eBay Oink: REST-based interface for Pig execution
- Facebook Corona: Hadoop enhancement which removes single point of failure
- Facebook Peregrine: MapReduce framework
- Facebook Scuba: distributed in-memory datastore
- Geotrellis: geographic data processing engine for high performance applications
- GIS Tools for Hadoop: Big Data Spatial Analytics for the Hadoop Framework
- Google Dataflow: create data pipelines to ingest, transform, and analyze data
- Google MapReduce: MapReduce framework
- Google MillWheel: fault tolerant stream processing framework
- HParser: data parsing transformation environment optimized for Hadoop
- IBM Streams: advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources
- JAQL: declarative programming language for working with structured, semi-structured and unstructured data
- Kite: set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem
- Kryo: Java serialization and cloning: fast, efficient, automatic
- Lipstick: Pig workflow visualization tool
- Metamarkets Druid: framework for real-time analysis of large datasets
- Netflix Aegisthus: bulk data pipeline out of Cassandra; implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family
- Netflix Lipstick: Pig Visualization framework
- Netflix Mantis: Event Stream Processing System
- Netflix PigPen: map-reduce for Clojure which compiles to Apache Pig
- Netflix STAASH: language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
- Netflix Zeno: Netflix's In-Memory Data Propagation Framework
- Nextflow: Dataflow oriented toolkit for parallel and distributed computational pipelines
- Nokia Disco: MapReduce framework developed by Nokia
- PigPen: PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it
- Pinterest Pinlater: asynchronous job execution system
- Pydoop: Python MapReduce and HDFS API for Hadoop
- ScaleOut hServer: fast, scalable in-memory data grid for Hadoop
- SeqPig: simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
- SigmoidAnalytics Spork: Pig on Apache Spark
- SpatialHadoop: MapReduce extension to Apache Hadoop designed specifically to work with spatial data
- Spring for Apache Hadoop: unified configuration model and easy-to-use APIs for using HDFS, MapReduce, Pig, and Hive
- SQLStream Blaze: stream processing platform
- Stratio Streaming: the union of a real-time messaging bus with a complex event processing engine using Spark Streaming
- Stratosphere: general purpose cluster computing framework
- Streamdrill: useful for counting activities of event streams over different time windows and finding the most active one
- Teradata QueryGrid: data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop
- TIBCO ActiveSpaces: in-memory data grid
- Torch: Scientific computing for LuaJIT
- Twitter Scalding: Scala library for MapReduce jobs, built on Cascading
- Twitter Summingbird: Streaming MapReduce with Scalding and Storm, by Twitter
- Twitter TSAR: TimeSeries AggregatoR by Twitter
BigData Distributed Programming links
Tue 28 July 2015