Apache HBase Introduction
Thu 10 March 2016
Original article: https://learnhbase.wordpress.com/
Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables (billions of rows X millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed ...
Oozie Datasets
Thu 10 March 2016
Datasets. Lecture notes, Apache Oozie Essentials.
A Dataset is a collection of data identified by a logical name. For example, the press release can be defined as follows:
BigData Data Ingestion links
Sun 20 September 2015
Amazon Kinesis: real-time processing of streaming data at massive scale; Apache Chukwa: data collection system; Apache Flume: service to manage large amounts of log data; Apache Samza: stream processing framework, based on Kafka and YARN; Apache Sqoop: tool to transfer data between Hadoop and a structured datastore; Apache UIMA: Unstructured ...
Hadoop Streaming
Tue 15 September 2015
Hadoop Streaming Made Simple using Joins and Keys with Python
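The linked post's own code is not reproduced on this page, so the following is only a minimal sketch of how a key-based join is commonly done in a Hadoop Streaming style with Python. It assumes two hypothetical inputs of `key,payload` lines; the function names (`map_records`, `reduce_join`, `run_local`) and the `L`/`R` tags are illustrative, not taken from the article. On a real cluster the mapper and reducer would be separate scripts reading stdin and writing stdout; here the shuffle step is simulated locally with `sorted()`.

```python
from itertools import groupby


def map_records(lines, tag):
    """Mapper: turn each 'key,payload' line into 'key<TAB>tag<TAB>payload'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, payload = line.split(",", 1)
        yield f"{key}\t{tag}\t{payload}"


def reduce_join(sorted_lines):
    """Reducer: input is sorted by key; pair each 'L' payload with each 'R' payload."""
    parsed = (line.rstrip("\n").split("\t", 2) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda rec: rec[0]):
        left, right = [], []
        for _, tag, payload in group:
            (left if tag == "L" else right).append(payload)
        # Inner join: keys missing from either side produce no output.
        for left_payload in left:
            for right_payload in right:
                yield f"{key}\t{left_payload}\t{right_payload}"


def run_local(left_lines, right_lines):
    """Simulate the shuffle phase by sorting the combined mapper output."""
    mapped = list(map_records(left_lines, "L")) + list(map_records(right_lines, "R"))
    return list(reduce_join(sorted(mapped)))


if __name__ == "__main__":
    users = ["1,alice", "2,bob"]              # hypothetical left input
    orders = ["1,book", "1,pen", "3,lamp"]    # hypothetical right input
    for row in run_local(users, orders):
        print(row)
```

Tagging each record with its source in the mapper is what lets a single reducer tell the two inputs apart once the framework has grouped them by key.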
BigData Data Visualization links
Mon 14 September 2015
Arbor: graph visualization library using web workers and jQuery. Bokeh: a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data ...
BigData Business Intelligence links
Mon 14 September 2015
ActivePivot: Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing; Adatao: business intelligence and data science platform; Apama analytics: platform for streaming analytics and intelligent automated action; Atigeo xPatterns: data analytics platform; BIME Analytics: business intelligence platform in the cloud; Chartio: lean business intelligence platform to ...
BigData System Deployment links
Mon 14 September 2015
Ankush: a big data cluster management tool that creates and manages clusters of different technologies; Apache Ambari: operational framework for Hadoop management; Apache Bigtop: system deployment framework for the Hadoop ecosystem; Apache Helix: cluster management framework; Apache Mesos: cluster manager; Apache Slider: a YARN application to deploy existing distributed ...
The R-Hadoop technology stack (notes)
Sun 13 September 2015
R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis: It’s completely free: SAS and SPSS are expensive to get started with, and you often need to buy ...
BigData Document Data Model links
Sun 23 August 2015
Actian Versant: commercial object-oriented database management system; Crate Data: open source, massively scalable data store that requires zero administration; Facebook Apollo: Facebook’s Paxos-like NoSQL database; jumboDB: document-oriented datastore over Hadoop; LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store; MarkLogic: schema-agnostic enterprise NoSQL database technology; Microsoft DocumentDB ...
BigData Applications links
Tue 28 July 2015
Adobe Spindle: next-generation web analytics processing with Scala, Spark, and Parquet; Apache Kiji: framework to collect and analyze data in real-time, based on HBase; Apache Nutch: open source web crawler; Apache OODT: capturing, processing and sharing of data for NASA's scientific archives; Apache Tika: content analysis toolkit; Domino: Run ...