ownport.github.io/notes

BigData Applications links

Tue 28 July 2015
Adobe Spindle: Next-generation web analytics processing with Scala, Spark, and Parquet
Apache Kiji: framework to collect and analyze data in real-time, based on HBase
Apache Nutch: open source web crawler
Apache OODT: capturing, processing and sharing of data for NASA's scientific archives
Apache Tika: content analysis toolkit
Domino: Run ...

BigData Benchmarking links

Tue 28 July 2015
Apache Hadoop Benchmarking: micro-benchmarks for testing Hadoop performance
Berkeley SWIM Benchmark: real-world big data workload benchmark
Big-Bench: Big Bench Workload Development
Hive-benchmarks: some benchmarking queries for Apache Hive
Hive-testbench: Testbench for experimenting with Apache Hive at any data scale.
Intel HiBench: a Hadoop benchmark suite
Netflix Inviso: performance focused Big ...

BigData Columnar Databases links

Tue 28 July 2015
Amazon Redshift: data warehouse service, based on PostgreSQL
C-Store: column oriented DBMS
Google BigQuery: framework for interactive analysis, implementation of Dremel
Google Dremel: framework for interactive analysis
MonetDB: column store database
Parquet: columnar storage format for Hadoop
Pivotal Greenplum: purpose-built, dedicated analytic data warehouse
Vertica: is designed ...

BigData Data Warehouse links

Tue 28 July 2015
Google Mesa: highly scalable analytic data warehousing system
IBM BigInsights: data processing, warehousing and analytics
Microsoft Cosmos: Microsoft's internal BigData analysis platform

BigData Distributed Filesystem links

Tue 28 July 2015
Apache HDFS: a way to store large files across multiple machines
BeeGFS: formerly FhGFS, parallel distributed file system
Ceph Filesystem: software storage platform designed
Disco DDFS: distributed filesystem
Facebook Haystack: object storage system
Google Colossus: distributed filesystem (GFS2)
Google GFS: distributed filesystem
Google Megastore: scalable, highly available storage
GridGain: GGFS ...

BigData Distributed Programming links

Tue 28 July 2015
AddThis Hydra: distributed data processing and storage system originally developed at AddThis
Akela: Mozilla's utility library for Hadoop, HBase, Pig, etc.
AMPLab SIMR: run Spark on Hadoop MapReduce v1
AMPLab Succinct: Enabling Queries on Compressed Data
Apache Crunch: Java library that provides a framework for writing, testing, and running MapReduce ...

BigData Embedded Databases links

Tue 28 July 2015
Actian PSQL: ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications
BerkeleyDB: a software library that provides a high-performance embedded database for key/value data
HamsterDB: transactional key-value database
HanoiDB: Erlang LSM BTree Storage
LevelDB: a fast key-value storage library written at Google that provides an ordered mapping ...

BigData Frameworks links

Tue 28 July 2015
Apache Hadoop: framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)

BigData Graph Data Model links

Tue 28 July 2015
Apache Giraph: implementation of Pregel, based on Hadoop
Apache Spark Bagel: implementation of Pregel, part of Spark
ArangoDB: multi-model distributed database
Facebook TAO: the distributed data store widely used at Facebook to store and serve the social graph
Faunus: Hadoop-based graph analytics engine for analyzing ...

BigData Integrated Development Environments links

Tue 28 July 2015
RStudio: IDE for R