Apache HBase Introduction

Thu 10 March 2016
Original article: https://learnhbase.wordpress.com/ Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed ...

Oozie Datasets

Thu 10 March 2016
Datasets Lecture notes, Apache Oozie Essentials A Dataset is a collection of data, which is identified by some logical name. For example, the press release can be defined as follows: {nameNode}/learn_oozie/ch04/input/pressrelease/${YEAR ...

BigData Data Ingestion links

Sun 20 September 2015
Amazon Kinesis: real-time processing of streaming data at massive scale Apache Chukwa: data collection system Apache Flume: service to manage large amounts of log data Apache Samza: stream processing framework, based on Kafka and YARN Apache Sqoop: tool to transfer data between Hadoop and a structured datastore Apache UIMA: Unstructured ...

BigData Data Visualization links

Mon 14 September 2015
Arbor: graph visualization library using web workers and jQuery Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data ...

BigData Testing

Mon 14 September 2015
Slides: Testing Big Data: Automated Testing of Hadoop with QuerySurge

BigData Business Intelligence links

Mon 14 September 2015
ActivePivot: Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing Adatao: business intelligence and data science platform Apama analytics: platform for streaming analytics and intelligent automated action Atigeo xPatterns: data analytics platform BIME Analytics: business intelligence platform in the cloud Chartio: lean business intelligence platform to ...

BigData System Deployment links

Mon 14 September 2015
Ankush: A big data cluster management tool that creates and manages clusters of different technologies. Apache Ambari: operational framework for Hadoop management Apache Bigtop: system deployment framework for the Hadoop ecosystem Apache Helix: cluster management framework Apache Mesos: cluster manager Apache Slider: a YARN application to deploy existing distributed ...

The R-Hadoop technology stack (notes)

Sun 13 September 2015
R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis: It’s completely free: SAS and SPSS are expensive to get started with, and you often need to buy ...

Amazon Web Services in Plain English (notes)

Sat 12 September 2015
Read original page: Amazon Web Services in Plain English Base Services EC2 (Amazon Virtual Servers): Host the bits of things you think of as a computer. It's handwavy, but EC2 instances are similar to the virtual private servers you'd get at Linode, DigitalOcean or Rackspace IAM (Users, Keys ...

BigData Document Data Model links

Sun 23 August 2015
Actian Versant: commercial object-oriented database management system Crate Data: an open source, massively scalable data store that requires zero administration Facebook Apollo: Facebook’s Paxos-like NoSQL database jumboDB: document oriented datastore over Hadoop LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store MarkLogic: Schema-agnostic Enterprise NoSQL database technology Microsoft DocumentDB ...