BigData Distributed Programming links - ownport.github.io/notes

AddThis Hydra: distributed data processing and storage system originally developed at AddThis
Akela: Mozilla's utility library for Hadoop, HBase, Pig, etc.
AMPLab SIMR: run Spark on Hadoop MapReduce v1
AMPLab Succinct: Enabling Queries on Compressed Data
Apache Crunch: Java library that provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
Apache DataFu: collection of user-defined functions for Hadoop and Pig developed by LinkedIn
Apache Flink: high-performance runtime with automatic program optimization
Apache Gora: framework for in-memory data model and persistence
Apache Hama: BSP (Bulk Synchronous Parallel) computing framework
Apache MapReduce: programming model for processing large data sets with a parallel, distributed algorithm on a cluster
Apache Pig: high level language to express data analysis programs for Hadoop
Apache S4: framework for stream processing, implementation of S4 (Simple Scalable Streaming System)
Apache Spark: framework for in-memory cluster computing
Apache Spark Streaming: framework for stream processing, part of Spark
Apache Storm: framework for stream processing from Twitter; also runs on YARN
Apache Tez: application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN
Apache Twill: abstraction over YARN that reduces the complexity of developing distributed applications
Cascalog: data processing and querying library
Cheetah: High Performance, Custom Data Warehouse on Top of MapReduce
Concurrent Cascading: framework for data management/analytics on Hadoop
Damballa Parkour: MapReduce library for Clojure
Datasalt Pangool: alternative MapReduce paradigm
DataTorrent StrAM: real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations with as little blocking as possible and minimal overhead and performance impact
DistributedR: scalable high-performance platform for the R language
eBay Oink: REST-based interface for Pig execution
Facebook Corona: Hadoop enhancement which removes single point of failure
Facebook Peregrine: MapReduce framework
Facebook Scuba: distributed in-memory datastore
Geotrellis: geographic data processing engine for high performance applications
GIS Tools for Hadoop: Big Data Spatial Analytics for the Hadoop Framework
Google Dataflow: create data pipelines to help ingest, transform, and analyze data
Google MapReduce: MapReduce framework
Google MillWheel: fault tolerant stream processing framework
HParser: data parsing transformation environment optimized for Hadoop
IBM Streams: advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources
JAQL: declarative programming language for working with structured, semi-structured and unstructured data
Kite: set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem
Kryo: Java serialization and cloning: fast, efficient, automatic
Lipstick: Pig workflow visualization tool
Metamarkets Druid: framework for real-time analysis of large datasets
Netflix Aegisthus: bulk data pipeline out of Cassandra; implements a reader for the SSTable format and provides a MapReduce program to create a compacted snapshot of the data contained in a column family
Netflix Lipstick: Pig Visualization framework
Netflix Mantis: Event Stream Processing System
Netflix PigPen: map-reduce for Clojure that compiles to Apache Pig
Netflix STAASH: language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems
Netflix Zeno: Netflix's In-Memory Data Propagation Framework
Nextflow: Dataflow oriented toolkit for parallel and distributed computational pipelines
Nokia Disco: MapReduce framework developed by Nokia
PigPen: PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it
Pinterest Pinlater: asynchronous job execution system
Pydoop: Python MapReduce and HDFS API for Hadoop
ScaleOut hServer: fast, scalable in-memory data grid for Hadoop
SeqPig: simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
SigmoidAnalytics Spork: Pig on Apache Spark
SpatialHadoop: MapReduce extension to Apache Hadoop designed specifically to work with spatial data
Spring for Apache Hadoop: unified configuration model and easy-to-use APIs for using HDFS, MapReduce, Pig, and Hive
SQLStream Blaze: stream processing platform
Stratio Streaming: the union of a real-time messaging bus with a complex event processing engine using Spark Streaming
Stratosphere: general purpose cluster computing framework
Streamdrill: useful for counting activities in event streams over different time windows and finding the most active ones
Teradata QueryGrid: data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop
TIBCO ActiveSpaces: in-memory data grid
Torch: Scientific computing for LuaJIT
Twitter Scalding: Scala library for MapReduce jobs, built on Cascading
Twitter Summingbird: streaming MapReduce with Scalding and Storm
Twitter TSAR: TimeSeries AggregatoR