ownport.github.io/notes

Apache HBase Introduction

Thu 10 March 2016
Original article: https://learnhbase.wordpress.com/ Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed ...

Oozie Datasets

Thu 10 March 2016
Datasets. Lecture notes from Apache Oozie Essentials. A dataset is a collection of data identified by a logical name. For example, a press release dataset can be defined as follows:

{nameNode}/learn_oozie/ch04/input/pressrelease/${YEAR ...
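
To make the idea of a dataset URI template concrete, here is a minimal Python sketch of how `${...}` placeholders might be filled in for one dataset instance. The template path, the `resolve_uri` helper, and the `MONTH` and `DAY` placeholders are hypothetical and chosen only for illustration. They do not reproduce the truncated path in the excerpt above, and this is not how Oozie itself performs the substitution.

```python
from datetime import date
from string import Template

# Hypothetical URI template using ${NAME} placeholders (not the path from the notes).
URI_TEMPLATE = Template("${nameNode}/data/example/${YEAR}/${MONTH}/${DAY}")


def resolve_uri(template: Template, name_node: str, day: date) -> str:
    """Fill the placeholders of a dataset URI template for one calendar day."""
    return template.substitute(
        nameNode=name_node,
        YEAR=f"{day.year:04d}",
        MONTH=f"{day.month:02d}",
        DAY=f"{day.day:02d}",
    )


if __name__ == "__main__":
    # Assumed NameNode address, for illustration only.
    print(resolve_uri(URI_TEMPLATE, "hdfs://localhost:8020", date(2016, 3, 10)))
```

Each distinct date yields a distinct directory, which is what lets a single logical dataset name stand for a series of data instances.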

BigData Data Ingestion links

Sun 20 September 2015
Amazon Kinesis: real-time processing of streaming data at massive scale Apache Chukwa: data collection system Apache Flume: service to manage large amounts of log data Apache Samza: stream processing framework, based on Kafka and YARN Apache Sqoop: tool to transfer data between Hadoop and a structured datastore Apache UIMA: Unstructured ...

Hadoop Streaming

Tue 15 September 2015
Hadoop Streaming Made Simple using Joins and Keys with Python

BigData Data Visualization links

Mon 14 September 2015
Arbor: graph visualization library using web workers and jQuery Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data ...

BigData Business Intelligence links

Mon 14 September 2015
ActivePivot: Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing Adatao: business intelligence and data science platform Apama analytics: platform for streaming analytics and intelligent automated action Atigeo xPatterns: data analytics platform BIME Analytics: business intelligence platform in the cloud Chartio: lean business intelligence platform to ...

BigData System Deployment links

Mon 14 September 2015
Ankush: A big data cluster management tool that creates and manages clusters of different technologies. Apache Ambari: operational framework for Hadoop management Apache Bigtop: system deployment framework for the Hadoop ecosystem Apache Helix: cluster management framework Apache Mesos: cluster manager Apache Slider: a YARN application to deploy existing distributed ...

The R-Hadoop technology stack (notes)

Sun 13 September 2015
R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis: It’s completely free: SAS and SPSS are expensive to get started with, and you often need to buy ...

BigData Document Data Model links

Sun 23 August 2015
Actian Versant: commercial object-oriented database management system Crate Data: an open source, massively scalable data store that requires zero administration Facebook Apollo: Facebook’s Paxos-like NoSQL database jumboDB: document-oriented datastore over Hadoop LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store MarkLogic: schema-agnostic enterprise NoSQL database technology Microsoft DocumentDB ...

BigData Applications links

Tue 28 July 2015
Adobe Spindle: Next-generation web analytics processing with Scala, Spark, and Parquet Apache Kiji: framework to collect and analyze data in real-time, based on HBase Apache Nutch: open source web crawler Apache OODT: capturing, processing and sharing of data for NASA's scientific archives Apache Tika: content analysis toolkit Domino: Run ...