Apache HBase Introduction
Thu 10 March 2016
Original article: https://learnhbase.wordpress.com/ Use Apache HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables (billions of rows X millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed ...
Oozie Datasets
Thu 10 March 2016
Datasets. Lecture notes, Apache Oozie Essentials. A Dataset is a collection of data identified by a logical name. For example, the press release can be defined as follows:
BigData Data Ingestion links
Sun 20 September 2015
Amazon Kinesis: real-time processing of streaming data at massive scale; Apache Chukwa: data collection system; Apache Flume: service to manage large amounts of log data; Apache Samza: stream processing framework, based on Kafka and YARN; Apache Sqoop: tool to transfer data between Hadoop and a structured datastore; Apache UIMA: Unstructured ...
BigData Data Visualization links
Mon 14 September 2015
Arbor: graph visualization library using web workers and jQuery; Bokeh: a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data ...
BigData Testing
Mon 14 September 2015
Slides: Testing Big Data: Automated Testing of Hadoop with QuerySurge
BigData Business Intelligence links
Mon 14 September 2015
ActivePivot: Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing; Adatao: business intelligence and data science platform; Apama analytics: platform for streaming analytics and intelligent automated action; Atigeo xPatterns: data analytics platform; BIME Analytics: business intelligence platform in the cloud; Chartio: lean business intelligence platform to ...
BigData System Deployment links
Mon 14 September 2015
Ankush: a big data cluster management tool that creates and manages clusters of different technologies; Apache Ambari: operational framework for Hadoop management; Apache Bigtop: system deployment framework for the Hadoop ecosystem; Apache Helix: cluster management framework; Apache Mesos: cluster manager; Apache Slider: a YARN application to deploy existing distributed ...
The R-Hadoop technology stack (notes)
Sun 13 September 2015
R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis: It’s completely free: SAS and SPSS are expensive to get started with, and you often need to buy ...
Amazon Web Services in Plain English (notes)
Sat 12 September 2015
Read original page: Amazon Web Services in Plain English. Base Services. EC2 (Amazon Virtual Servers): Host the bits of things you think of as a computer. It's handwavy, but EC2 instances are similar to the virtual private servers you'd get at Linode, DigitalOcean or Rackspace. IAM (Users, Keys ...
BigData Document Data Model links
Sun 23 August 2015
Actian Versant: commercial object-oriented database management systems; Crate Data: an open source, massively scalable data store that requires zero administration; Facebook Apollo: Facebook’s Paxos-like NoSQL database; jumboDB: document-oriented datastore over Hadoop; LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store; MarkLogic: schema-agnostic enterprise NoSQL database technology; Microsoft DocumentDB ...