ownport.github.io/notes

BigData Data Ingestion links

Sun 20 September 2015
Amazon Kinesis: real-time processing of streaming data at massive scale; Apache Chukwa: data collection system; Apache Flume: service to manage large amounts of log data; Apache Samza: stream processing framework, based on Kafka and YARN; Apache Sqoop: tool to transfer data between Hadoop and a structured datastore; Apache UIMA: Unstructured ...

Hadoop Streaming

Tue 15 September 2015
Hadoop Streaming Made Simple using Joins and Keys with Python

BigData Data Visualization links

Mon 14 September 2015
Arbor: graph visualization library using web workers and jQuery. Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data ...

Spark Links

Mon 14 September 2015
Articles: A Docker Image for Graph Analytics on Neo4j with Apache Spark GraphX

Hive useful Links

Mon 14 September 2015
Articles: Using GenericUDFs to return multiple values in Apache Hive

BigData Testing

Mon 14 September 2015
Slides: Testing Big Data: Automated Testing of Hadoop with QuerySurge

BigData Business Intelligence links

Mon 14 September 2015
ActivePivot: Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing; Adatao: business intelligence and data science platform; Apama analytics: platform for streaming analytics and intelligent automated action; Atigeo xPatterns: data analytics platform; BIME Analytics: business intelligence platform in the cloud; Chartio: lean business intelligence platform to ...

BigData System Deployment links

Mon 14 September 2015
Ankush: a big data cluster management tool that creates and manages clusters of different technologies; Apache Ambari: operational framework for Hadoop management; Apache Bigtop: system deployment framework for the Hadoop ecosystem; Apache Helix: cluster management framework; Apache Mesos: cluster manager; Apache Slider: a YARN application to deploy existing distributed ...

The R-Hadoop technology stack (notes)

Sun 13 September 2015
R is a free, open-source statistical programming language originally based on the S programming language. Here are a few reasons why R is a great place to start for data analysis: It’s completely free: SAS and SPSS are expensive to get started with, and you often need to buy ...

BigData Document Data Model links

Sun 23 August 2015
Actian Versant: commercial object-oriented database management system; Crate Data: an open source, massively scalable data store that requires zero administration; Facebook Apollo: Facebook’s Paxos-like NoSQL database; jumboDB: document-oriented datastore over Hadoop; LinkedIn Espresso: horizontally scalable document-oriented NoSQL data store; MarkLogic: schema-agnostic enterprise NoSQL database technology; Microsoft DocumentDB ...