Big Data Analysis Tools/Resources: An Annotated Bibliography

1.     Chris Stucchio’s blog

This blog makes the list primarily for one article, “Don’t use Hadoop – your data isn’t that big”, which offers a guide to deciding whether your data really qualifies as big data, based on its size, purpose, and expected future growth, and, given that decision, which suite of tools best suits your needs. The remaining posts deal with developer-level problems in big data, along with an assortment of economics and other news-related items.

Keywords: big data, medium data, kind of large data

Audience: dataset owners considering big data tools, big data beginners

2.     Infochimps

Infochimps provides extensive resources for those interested in big data analysis. These include whitepapers (detailed overviews of topics in big data analysis), case studies (other companies’ successful use of Infochimps’ big data products), and video webinars on big data topics. Some of these are barely veiled promotional pitches for Infochimps’ products and services; others, however, provide useful information for advanced big data users.

Keywords: big data

Audience: intermediate to advanced

3.     Hadoop

The Apache Hadoop software library is a free, open-source, Java-based programming framework that allows for the processing of huge datasets across clusters of computers. Hadoop is designed to scale easily from a single server to thousands of machines, and its distributed structure gives it a high degree of fault tolerance.

Keywords: Hadoop, big data, scalability

Audience: professional or academic big data users with server capacity
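
For readers new to MapReduce, the programming model most closely associated with Hadoop, the following is a minimal single-machine sketch in Python. It is not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps on many machines in parallel rather than in a loop.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence on a list of text documents."""
    pairs = []
    # On a cluster, each document (or block) would be mapped on a different node.
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data is big", "data about data"]))
```

The point of the split is that map_phase needs only one document at a time and reduce_phase needs only one key at a time, which is what lets the work be spread across machines.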

4.     Big Data University

Big Data University is an online education portal run by a collective of big data users and enthusiasts, academics, data management companies, and professionals. It has an impressive range of free and fee-based courses, focused on big data topics. Among these offerings is a course on Hadoop fundamentals. The courses are mostly around 4-6 hours long, and include a test of competence at the end.

Keywords: Hadoop, MapReduce, Pig, big data, open courseware

Audience: big data users of all levels and occupational types

5.     The Cloud Avenue

The Cloud Avenue is the blog of Praveen Sripati, a big data specialist. The posts focus on introductory topics in big data, in particular Hadoop. Sripati sometimes answers reader-submitted questions, and provides guides for big data beginners to make a low-budget or no-budget start in big data using cloud-based services. A good place to start is to search for the tag “getting started”.

Keywords: Hadoop, big data, blog, how-to, guide, Flume, Hive

Audience: big data beginners, Hadoop beginners

6.     Udacity

Udacity has an impressive range of course offerings related to big data. All have a cost attached, but it is possible to access the course materials (meaning the course experience minus the personal coaching and certification) free of charge. The courses are shorter than MOOCs, with units composed of roughly 20 video lectures of 1-2 minutes each. The material is well presented, using a live whiteboard feed, and integration with the course materials is effective. Udacity is a good option for someone who doesn’t have the time to invest in a full-length MOOC.

Keywords: online courseware, video lectures, big data

Audience: big data beginners

7.     Cloudera

Cloudera is a for-profit data company offering enterprise-level data storage services. It offers a range of paid data workshops and training courses, but the site also hosts free materials. Finding these requires a bit of searching through paid and promotional materials; there are, however, some treasures to be found, such as a one-hour webinar on applying big data insights to enterprise- and industry-level datasets.

Keywords: big data, Hadoop, Cloudera, webinar

Audience: advanced to professional big data users

8.     Hadoop Wizard

Hadoop Wizard is a compendium of Hadoop-related materials, including tutorials, blog posts with advice on Hadoop topics, a list of Hadoop experts, and a list of online and in-person Hadoop training events. Many of these materials point to other Hadoop learning resources, making this a great gateway for those interested in developing an advanced knowledge of Hadoop. Posts are searchable and organized by tags.

Keywords: Hadoop, tutorials

Audience: data scientists

9.     A DBA’s Journey Into The Cloud

This blog is the story of a database administrator, George Trujillo, as he learns to use Hadoop. One way to read it is to start at the first post and follow the issues Trujillo encounters as he works his way to a deeper understanding of Hadoop. Another is to search by keyword. Useful posts include “How to Learn Hadoop” and “Choosing MySQL or Oracle for your Hadoop Repositories”.

Keywords: Hadoop, database administration, MySQL, SQL, Oracle

Audience: intermediate to professional users of Hadoop

10.  Johns Hopkins/Coursera “Computing for Data Analysis”

Computing for Data Analysis focuses on the R programming language. The course explains how to program, read data, create graphs and display information, and perform statistical analyses, all using R. The course has all of the benefits of Coursera’s other MOOCs: it is free and offers useful readings, video lectures, a community of learners, ongoing assessment, and the opportunity to receive a certificate of proficiency.

Keywords: MOOC, R, big data, data visualization, data analysis

Audience: big data beginners

11.  Microsoft Virtual Academy “Getting Started with Microsoft Big Data”

This short course from Microsoft offers an introduction to big data. It is a module-based course covering MapReduce, Hive, and .NET. The 5 modules are each composed of around 20 minutes of video lectures with PowerPoint slides, a free e-book, and an optional self-assessment task at the end of each module.

Keywords: Microsoft, MapReduce, Hive

Audience: big data beginners interested in Microsoft products

12.  Orange

Orange is an open source data mining and visualization tool. It works with a highly intuitive GUI, capable of suggesting appropriate tools for a given dataset, and is supported by the capacity to program algorithms in Python. All basic statistical analysis tasks are supported. Additionally, the open-source license means that many specialized or discipline-specific data analysis tasks are supported by downloadable third-party widgets. The Orange website is also host to an active community of users, who maintain a helpful forum containing hundreds of topics.

Keywords: data mining, data visualization, big data, open-source, statistics

Audience: statisticians, economists, market researchers, sociologists

13.  WEKA Data Mining Software

The Waikato Environment for Knowledge Analysis (WEKA) is a data mining workbench developed by the Machine Learning Group at the University of Waikato, New Zealand. The website of the Machine Learning Group also contains a selection of resources related to data analysis and data mining. WEKA is better suited to mid-sized datasets, as the machine learning algorithms are extremely demanding and struggle with truly large datasets. There are, however, a number of downloadable packages for big datasets, programmed with basic, non-dataset-specific tasks such as “map” and “reduce”. A pair of high-quality MOOCs on “Data mining with WEKA”, run every few months by the University of Waikato, support the WEKA software.

Keywords: big data, machine learning, MOOC, tutorial

Audience: mid to large-sized dataset users

14.  Lucene

Lucene is a free, big data-capable search library from Apache. It is able to index up to 150GB per hour, and its RAM overhead is extremely light at less than 1MB. Lucene offers fielded searching, date-range searching, ranked searching, and multi-index searching. It supports phrase queries, wildcard queries, proximity queries, and range queries, among others. Lucene is capable of full-text indexing of database objects and standard document types: TXT, PDF, HTML, DOC, etc.

Keywords: big data search, Apache, ranked searching, multi-index searching, index

Audience: advanced to professional users of large databases
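
The term and phrase queries mentioned in the Lucene entry rest on an inverted index, a structure that maps each term to the documents and positions where it occurs. The sketch below illustrates that idea in plain Python. It is not Lucene's Java API: the function names (build_index, term_query, phrase_query) are hypothetical, and splitting on whitespace is a simplifying assumption in place of a real analyzer.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def term_query(index, term):
    """Return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


def phrase_query(index, phrase):
    """Return the sorted ids of documents where the terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    for doc_id, positions in index.get(terms[0], {}).items():
        for start in positions:
            # Every later term must sit at the expected offset from the first.
            if all(start + offset in index.get(term, {}).get(doc_id, [])
                   for offset, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


if __name__ == "__main__":
    docs = {1: "big data search", 2: "search big archives", 3: "data about big data"}
    idx = build_index(docs)
    print(term_query(idx, "big"))
    print(phrase_query(idx, "big data"))
```

Because lookups go from term to documents rather than scanning every document, query cost depends mostly on how many documents contain the term, not on the size of the whole collection.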

15.  DB4objects

DB4objects is an object database engine. Object databases store data directly as objects, without mapping those objects to relational tables. DB4objects is open source and supported by a huge community of users. The program comes with an interactive tutorial, which will aid new users. One of DB4objects’ most important features is its support for Native Queries, in which users can query the database using a single line of Java code, without needing to switch to SQL or other string-based APIs.

Keywords: object database engine, storage, object-oriented database, native query, open-source

Audience: database users looking for object-oriented database, commercial database users looking for open-source database clients

16.  DZone Big Data

DZone’s Big Data page contains a library of user-submitted articles related to big data. The articles are most useful for their introductions to big data tools, tutorials addressing specific problems in big data, and big data news. DZone hosts articles suitable for beginner, intermediate, and advanced users. Topics can be searched by keyword or browsed chronologically, but the site’s organization does not accommodate the diversity of topics.

Keywords: big data, tutorials, Hadoop, R, MapReduce

Audience: beginner to advanced users of a variety of big data tools

17.  Presto

Presto is Facebook’s SQL-on-Hadoop query engine, released as open source in late 2013. It competes with other big data query engines, in particular the popular Hive framework. Presto’s advantage over Hive is ANSI SQL compatibility, which makes it easy to integrate with popular data toolkits. This is a resource for serious data users with terabytes to exabytes of data to manage. Presto has already been picked up by Airbnb and Dropbox, and seems to promise faster, cheaper data warehousing and querying than existing products.

Keywords: big data, Facebook, Hadoop, Hive, Airbnb, Dropbox, SQL, ANSI-SQL

Audience: professional big data users with server capacity

18.  NumPy

NumPy is a scientific computing package for Python. It adds support for multi-dimensional arrays and matrices, and includes a huge library of mathematical functions. NumPy is useful for Python programmers who need to compute complex mathematical functions or work in machine learning.

Keywords: Python, machine learning, big data

Audience: Python users, machine-learning developers

19.  Blaze

Blaze is a tool specifically for Python developers working with big data, in particular those already working with NumPy. Blaze is an alternative to NumPy with a few improvements. It can perform out-of-core computations on datasets that exceed system memory, and it supports features commonly needed by big data users, such as missing values and labeled arrays.

Keywords: Python, NumPy, big data

Audience: Python users, NumPy users
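
The out-of-core idea in the Blaze entry, computing over data that does not fit in memory, can be illustrated with a chunked calculation. The following is a sketch in plain Python, not Blaze's API: the function name chunked_mean and the column name "reading" are hypothetical, and an in-memory file stands in for a dataset too large to load at once.

```python
import csv
import io


def chunked_mean(rows, column, chunk_size=1000):
    """Compute the mean of one column, holding at most chunk_size values in memory."""
    total = 0.0
    count = 0
    chunk = []
    for row in rows:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:
            # Fold the full chunk into the running totals, then discard it.
            total += sum(chunk)
            count += len(chunk)
            chunk = []
    # Fold in the final, possibly partial chunk.
    total += sum(chunk)
    count += len(chunk)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


if __name__ == "__main__":
    data = io.StringIO("reading\n1.0\n2.0\n3.0\n4.0\n")
    print(chunked_mean(csv.DictReader(data), "reading", chunk_size=2))
```

Only the running total, the count, and one chunk are ever held at the same time, so memory use stays flat however long the input is.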

20.  Neo4j Spatial

Neo4j is the most popular open-source graph database. It is popular for its reliability and the variety of drivers and libraries available. Neo4j Spatial is one of the most interesting of these libraries. It enables spatial operations on data, including the option to query data within specified regions. Neo4j offers language drivers providing compatibility with basically all popular programming languages.

Keywords: graph database, geographic, GIS, cartographic

Audience: social scientists, big data users, geographers
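
Querying data by region, as described in the Neo4j Spatial entry, comes down at its simplest to a bounding-box test. The sketch below shows that test in plain Python. It is not the Neo4j Spatial API or a Cypher query: the Place class, the within_bounding_box function, and the sample coordinates are all hypothetical, and the box is assumed not to cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """A named point with latitude and longitude in decimal degrees."""
    name: str
    lat: float
    lon: float


def within_bounding_box(places, south, west, north, east):
    """Return the places whose coordinates fall inside the box (edges included)."""
    return [p for p in places
            if south <= p.lat <= north and west <= p.lon <= east]


if __name__ == "__main__":
    # Made-up sample points, for illustration only.
    sample = [Place("a", 10.0, 20.0), Place("b", 50.0, 20.0), Place("c", 12.0, 25.0)]
    for place in within_bounding_box(sample, south=5, west=15, north=15, east=30):
        print(place.name)
```

A spatial library adds an index on top of this test so that it does not have to be run against every stored point.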

21.  Kristof Kovacs’ comparison of NoSQL databases

This resource is for those who have already decided a NoSQL database is suitable for them but are not sure which one to use. Kovacs compares the following NoSQL databases: Cassandra, MongoDB, CouchDB, Redis, Riak, Couchbase (ex-Membase), Hypertable, ElasticSearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, Neo4j and HBase. The table compares programming language, type of license, and protocol for each database. Kovacs also summarizes the important features of each database, an archetypal project for which each would be most suitable, and a list of generic tasks at which each is superior.

Keywords: NoSQL databases, comparison, big data

Audience: beginner NoSQL database users

22.  Talend Open Studio

Talend Open Studio offers a suite of big data products for data integration, data management, and data quality. The products include Talend Big Data, Talend Data Integration, Master Data Management, and Data Quality. Talend Open Studio is open source, released under the Apache license. The company also offers a premium package, Talend Enterprise, with additional features. These products are designed for ease of use and are operated through a GUI, without the need to write code. This can make the nature of the operations being performed somewhat opaque.

Keywords: NoSQL, big data, data quality

Audience: corporate users of big data

23.  BigML

BigML is a user-friendly machine-learning tool. Given a large enough data input, whether unstructured source data or a structured dataset, BigML works to unpack the relationships between variables and produce a predictive model. Users can then request a prediction through the GUI by filling in a number of input fields, or use BigML to generate automatic predictions. BigML is free for small to medium-sized datasets, though more intensive users will need to sign up for a subscription plan.

Keywords: GUI, user-friendly, machine learning

Audience: big data beginners

24.  Statwing

Statwing is a more traditional statistics program. It runs in the browser and is a sort of user-friendly, partially free version of popular statistical packages such as Stata or SPSS. Users upload data and select the variables for their analysis. Statwing then provides a range of descriptive and analytical statistics for these variables, including a plain-English description of the relationships between them. In addition, the output includes a number of graphs describing those relationships. Statwing’s free service supports datasets up to 25MB, which will be publicly visible to all users, making it unsuitable for proprietary data.

Keywords: statistics, user-friendly

Audience: students, independent researchers

25.  Apache Mahout

Mahout is an Apache project for machine-learning tasks. It allows applications to analyse large sets of data. Mahout takes advantage of Hadoop’s power to solve complex machine-learning problems by breaking them up into parallel tasks. Mahout offers three main machine-learning functions: recommendation, classification, and clustering.

Keywords: Apache, Hadoop, machine learning

Audience: data scientists, corporations
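
Of the three functions listed in the Mahout entry, clustering is the easiest to show in a few lines. The sketch below runs k-means, a well-known clustering algorithm, on 2-D points in plain Python. It is not Mahout's API and runs on a single machine: the function name kmeans and its parameters are hypothetical, and the starting centroids are supplied by the caller rather than chosen automatically.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster 2-D points around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroid)
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    final_centroids, final_clusters = kmeans(pts, [(0, 0), (10, 10)])
    print(final_centroids)
```

The assignment step treats every point independently, which is the kind of work that can be split into parallel tasks across a cluster.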


