Big Data Analysis Tools/Resources
1. Chris Stucchio’s blog
This blog makes the list primarily for one article, “Don’t use Hadoop – your data isn’t that big”, which offers a guide to deciding whether your data really qualifies as big data, based on its size, purpose, and expected future growth, and then to choosing the suite of tools best suited to your needs. The remaining posts deal with developer-level problems in big data, along with an assortment of economics and other news-related items.
Keywords: big data, medium data, kind of large data
Audience: dataset owners considering big data tools, big data beginners
2. Infochimps
Infochimps provides extensive resources for those interested in big data analysis. These include whitepapers (detailed overviews of topics in big data analysis), case studies (accounts of other companies’ successful use of Infochimps’ big data products), and video webinars on big data topics. Some of these are not even thinly veiled promotional pitches for Infochimps’ products and services; others, however, provide useful information for advanced big data users.
Keywords: big data
Audience: intermediate to advanced users
3. Apache Hadoop
The Apache Hadoop software library is a free, open-source, Java-based programming framework that allows for the processing of huge datasets across clusters of computers. Hadoop is designed to scale easily from a single server to thousands of machines, and its distributed structure gives it a high degree of fault tolerance.
Keywords: Hadoop, big data, scalability
Audience: professional or academic big data users with server capacity
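The programming model most closely associated with Hadoop is MapReduce. As a rough illustration of the idea only, and not of Hadoop's own Java API, the following Python sketch counts words in three phases on a single machine. The function names and the sample documents are hypothetical; on a real cluster the map calls would run on different nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key to get the final counts."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    # Here the map calls simply run one after another on one machine.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))

documents = ["big data is big", "data tools for big data"]
counts = word_count(documents)
print(counts)
```

Because each map call sees only its own document and each reduce call sees only one key, both phases can be spread across many machines without changing the result.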
4. Big Data University
Big Data University is an online education portal run by a collective of big data users and enthusiasts, academics, data management companies, and professionals. It has an impressive range of free and fee-based courses, focused on big data topics. Among these offerings is a course on Hadoop fundamentals. The courses are mostly around 4-6 hours long, and include a test of competence at the end.
Keywords: Hadoop, MapReduce, Pig, big data, open courseware
Audience: big data users of all levels and occupational types
5. The Cloud Avenue
The Cloud Avenue is the blog of Praveen Sripati, a big data specialist. The posts focus on introductory topics in big data, in particular Hadoop. Sripati sometimes answers reader-submitted questions, and provides guides that help big data beginners make a low-budget or no-budget start using cloud-based services. A good place to start is to search for the tag “getting started”: http://www.thecloudavenue.com/search/label/gettingstarted
Keywords: Hadoop, big data, blog, how-to, guide, Flume, Hive
Audience: big data beginners, Hadoop beginners
6. Udacity
Udacity has an impressive range of course offerings related to big data. All have a cost attached, but it is possible to access the course materials (meaning the course experience minus the personal coaching and certification) free of charge. The courses are shorter than MOOCs, with units composed of roughly 20 video lectures of 1-2 minutes each. The material is well presented, using a live whiteboard feed, and integration with the course materials is effective. Udacity is a good option for someone who doesn’t have the time to invest in a full-length MOOC.
Keywords: online courseware, video lectures, big data
Audience: big data beginners
7. Cloudera
Cloudera is a for-profit data company offering enterprise-level data storage services. Cloudera offers a range of paid data workshops and training courses, but the site also hosts a range of free materials. Finding these requires a bit of searching through paid and promotional materials; there are, however, some treasures to be found, such as a one-hour webinar on implementing big data insights in your own enterprise- or industry-level datasets.
Keywords: big data, Hadoop, Cloudera, webinar
Audience: advanced to professional big data users
8. Hadoop Wizard
Hadoop Wizard is a compendium of Hadoop related materials, including tutorials, blog posts with advice about Hadoop topics, a list of Hadoop experts, and a list of online and in-person Hadoop training events. Many of these materials point to other Hadoop learning resources, making this a great gateway for those interested in developing an advanced knowledge of Hadoop. Posts are searchable and organized by tags.
Keywords: Hadoop, tutorials
Audience: data scientists
9. A DBA’s Journey Into The Cloud
This blog is the story of a database administrator, George Trujillo, as he learns to use Hadoop. One way to read it would be to start at the first post and discover the issues encountered by Trujillo as he works his way to a deeper understanding of Hadoop. Another would be to search according to keywords. Useful posts include “How to Learn Hadoop”: http://cloud-dba-journey.blogspot.mx/2013/02/how-to-learn-hadoop.html, and “Choosing MySQL or Oracle for your Hadoop Repositories”: http://cloud-dba-journey.blogspot.mx/2014/01/choosing-mysql-or-oracle-for-your.html.
Keywords: Hadoop, database administration, MySQL, SQL, Oracle
Audience: intermediate-professional users of Hadoop
10. Johns Hopkins/Coursera “Computing for Data Analysis”
Computing for Data Analysis focuses on the R programming language. The course explains how to program, read data, create graphs and display information, and perform statistical analyses, all using R. The course has all of the benefits of Coursera’s other MOOCs: it is free and offers useful readings, video lectures, a community of learners, ongoing assessment, and the opportunity to receive a certificate of proficiency.
Keywords: MOOC, R, big data, data visualization, data analysis
Audience: big data beginners
11. Microsoft Virtual Academy “Getting Started with Microsoft Big Data”
This short course from Microsoft offers an introduction to big data. The module-based course covers MapReduce, Hive, and .NET. Each of the 5 modules consists of around 20 minutes of video lectures with PowerPoint slides and a free e-book, with the option to test your knowledge in a self-assessment task at the end of the module.
Keywords: Microsoft, MapReduce, Hive
Audience: big data beginners interested in Microsoft products
12. Orange
Orange is an open-source data mining and visualization tool. It works with a highly intuitive GUI, capable of suggesting appropriate tools for a given dataset, and is backed by the ability to program algorithms in Python. All basic statistical analysis tasks are supported. Additionally, the open-source license means that many specialized or discipline-specific data analysis tasks are supported by downloadable third-party widgets. The Orange website also hosts an active community of users, who maintain a helpful forum containing hundreds of topics.
Keywords: data mining, data visualization, big data, open-source, statistics
Audience: statisticians, economists, market researchers, sociologists
13. WEKA Data Mining Software
The Waikato Environment for Knowledge Analysis (WEKA) is a data mining workbench developed by the Machine Learning Group at the University of Waikato, New Zealand. The website of the Machine Learning Group also contains a selection of resources related to data analysis and data mining. WEKA is better suited to mid-sized datasets, as its machine learning algorithms are extremely demanding and struggle with truly large datasets. There are, however, a number of downloadable packages for big datasets, which provide basic, dataset-independent tasks such as “map” and “reduce”. The software is supported by a pair of high-quality MOOCs on “Data mining with WEKA”, run every few months by the University of Waikato.
Keywords: big data, machine learning, MOOC, tutorial
Audience: mid to large-sized dataset users
14. Apache Lucene
Lucene is a free, big data-capable search library from Apache. It is able to index up to 150GB per hour, and its RAM overhead is extremely light at less than 1MB. Lucene offers fielded searching, date-range searching, ranked searching, and multi-index searching. It supports phrase queries, wildcard queries, proximity queries, and range queries, among others. Lucene is capable of full-text indexing of database objects and standard document types: TXT, PDF, HTML, DOC, etc.
Keywords: big data search, Apache, ranked searching, multi-index searching, index
Audience: advanced to professional users of large databases
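To make the query types named above more concrete, here is a toy inverted index in Python. It is a simplified sketch of the general technique behind full-text search, not Lucene's API or its actual data structures; the class name, method names, and sample documents are all hypothetical.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index: maps each term to the documents and positions where it occurs."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: [positions]}

    def add(self, doc_id, text):
        for position, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(position)

    def term_query(self, term):
        """Documents containing the term, ranked by how often it occurs."""
        hits = self.postings.get(term.lower(), {})
        return sorted(hits, key=lambda doc_id: (-len(hits[doc_id]), doc_id))

    def wildcard_query(self, prefix):
        """Documents containing any term that starts with the prefix (like 'dat*')."""
        prefix = prefix.lower()
        docs = set()
        for term, hits in self.postings.items():
            if term.startswith(prefix):
                docs.update(hits)
        return sorted(docs)

    def phrase_query(self, phrase):
        """Documents where the words of the phrase appear consecutively, in order."""
        terms = phrase.lower().split()
        if not terms:
            return []
        matches = []
        for doc_id, positions in self.postings.get(terms[0], {}).items():
            for start in positions:
                if all(start + offset in self.postings.get(term, {}).get(doc_id, [])
                       for offset, term in enumerate(terms)):
                    matches.append(doc_id)
                    break
        return sorted(matches)

index = TinyIndex()
index.add(1, "big data search with big indexes")
index.add(2, "data tools for data search")
index.add(3, "database search")

ranked_hits = index.term_query("data")
wildcard_hits = index.wildcard_query("dat")
phrase_hits = index.phrase_query("data search")
print(ranked_hits, wildcard_hits, phrase_hits)
```

Storing word positions alongside document IDs is what makes phrase and proximity matching possible; an index that recorded only which documents contain a term could answer the term and wildcard queries but not the phrase query.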
15. Db4objects
Db4objects is an object database engine. Object databases store data directly as objects, without mapping them onto relational tables. Db4objects is open-source and supported by a large community of users. The program comes with an interactive tutorial, which will aid new users. One of Db4objects’ most important features is its support for Native Queries, in which users can query the database using a single line of Java code, without needing to switch to SQL or other string-based APIs.
Keywords: object database engine, storage, object-oriented database, native query, open-source
Audience: database users looking for object-oriented database, commercial database users looking for open-source database clients
16. DZone Big Data
DZone’s Big Data page contains a library of user-submitted articles related to big data. The articles are most useful for their introductions to big data tools, tutorials addressing specific problems in big data, or their big data-related news. DZone hosts articles suitable for beginner, intermediate, and advanced users. Topics can be searched by keyword or browsed chronologically, but the site’s organization does not do justice to the diversity of topics.
Keywords: big data, tutorials, Hadoop, R, MapReduce
Audience: beginner to advanced users of a variety of big data tools
17. Presto
Presto is Facebook’s SQL query engine for Hadoop, released as an open-source distribution only in late 2013. It competes with other big data query engines, in particular the popular Hive framework. Its advantage over Hive is that Presto is ANSI-SQL compatible, making it easy to integrate with popular data toolkits. This is a resource for serious data users with terabytes to exabytes of data to manage. Presto has already been picked up by Airbnb and Dropbox, and seems to promise faster, cheaper data warehousing and querying than existing products.
Keywords: big data, Facebook, Hadoop, Hive, Airbnb, Dropbox, SQL, ANSI-SQL
Audience: professional big data users with server capacity
18. NumPy
NumPy is a scientific computing package for Python. It adds support for multi-dimensional arrays and matrices to Python, and includes a huge library of mathematical functions. NumPy is useful for Python programmers who need to compute complex mathematical functions or work in machine learning.
Keywords: Python, machine learning, big data
Audience: Python users, machine-learning developers
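A short example may help show what multi-dimensional arrays and NumPy's mathematical functions look like in practice. The sample values and variable names below are made up for illustration.

```python
import numpy as np

# A 2-D array (matrix): rows are observations, columns are variables.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(data.shape)                # dimensions of the array

col_means = data.mean(axis=0)    # mean of each column, with no explicit loop
centered = data - col_means      # broadcasting subtracts the means from every row
gram = centered.T @ centered     # matrix product of the centered data with itself
roots = np.sqrt(data)            # element-wise mathematical function

print(col_means)
print(gram)
```

The operations apply to whole arrays at once, which is both shorter to write and typically much faster than looping over Python lists.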
19. Blaze
Blaze is a tool specifically for Python developers working with big data, in particular those already working with NumPy. Blaze is an alternative to NumPy with a few improvements: it can perform out-of-core computations on large datasets that exceed the system memory. Blaze also supports common features needed by big data users, such as missing values and labeled arrays.
Keywords: Python, NumPy, big data
Audience: Python users, NumPy users
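"Out-of-core" means that a computation processes data in pieces small enough to fit in memory rather than loading the whole dataset at once. The sketch below shows the idea with the Python standard library only; it does not use Blaze or reproduce its API, and the function name, chunk size, and sample file are hypothetical.

```python
import csv
import os
import tempfile

def chunked_mean(path, column, chunk_size=1000):
    """Mean of one CSV column, holding at most chunk_size values in memory at a time."""
    total, count = 0.0, 0
    chunk = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            chunk.append(float(row[column]))
            if len(chunk) == chunk_size:
                total += sum(chunk)
                count += len(chunk)
                chunk = []
    # Fold in the final, possibly partial chunk.
    total += sum(chunk)
    count += len(chunk)
    return total / count if count else float("nan")

# Small sample file standing in for a dataset too large for memory.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["value"])
    writer.writerows([[i] for i in range(1, 10001)])
    sample_path = tmp.name

result = chunked_mean(sample_path, "value", chunk_size=512)
print(result)
os.remove(sample_path)
```

Only a running total and count are kept between chunks, so memory use depends on the chunk size rather than on the size of the file.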
20. Neo4j Spatial
Neo4j is the most popular open-source graph database. It is popular for its reliability and the variety of drivers and libraries available. Neo4j Spatial is one of the most interesting of these libraries. It enables spatial operations on data, including the option to query data within specified regions. Neo4j also offers language drivers providing compatibility with virtually all popular programming languages.
Keywords: graph database, geographic, GIS, cartographic
Audience: social scientists, big data users, geographers
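Querying data by region can be pictured as a bounding-box test. The following Python sketch shows the concept with a plain linear scan; it is not Neo4j Spatial's API, which works against a graph database and a spatial index. The class names are hypothetical and the coordinates are approximate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Place:
    name: str
    lat: float
    lon: float

@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, place):
        """True if the place lies inside the box, edges included."""
        return (self.min_lat <= place.lat <= self.max_lat
                and self.min_lon <= place.lon <= self.max_lon)

def places_within(places, box):
    """Names of all places inside the box (a linear scan over every place)."""
    return [place.name for place in places if box.contains(place)]

# Approximate coordinates, for illustration only.
places = [
    Place("Hamilton", -37.79, 175.28),
    Place("Auckland", -36.85, 174.76),
    Place("Wellington", -41.29, 174.78),
]
north = BoundingBox(min_lat=-38.5, min_lon=174.0, max_lat=-36.0, max_lon=176.0)
found = places_within(places, north)
print(found)
```

A spatial library earns its keep by avoiding the scan: an index lets it skip most of the data that cannot possibly fall inside the requested region.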
21. Kristof Kovacs’ comparison of NoSQL databases
This resource is for those who have already decided that a NoSQL database is suitable for them but are not sure which one to use. Kovacs compares the following NoSQL databases: Cassandra, MongoDB, CouchDB, Redis, Riak, Couchbase (ex-Membase), Hypertable, ElasticSearch, Accumulo, VoltDB, Kyoto Tycoon, Scalaris, Neo4j and HBase. The table compares programming language, type of license, and protocol for each of the NoSQL databases. Kovacs also summarizes the important features of each database, an archetypal project for which each would be most suitable, and a list of generic tasks at which each excels.
Keywords: NoSQL databases, comparison, big data
Audience: beginner NoSQL database users
22. Talend Open Studio
Talend Open Studio offers a suite of big data products to facilitate data integration, data management, and data quality. The products include Talend Big Data, Talend Data Integration, Master Data Management, and Data Quality. Talend Open Studio is open source and released under the Apache license. The company also offers a premium package with additional features, called Talend Enterprise. These products are designed for ease of use, and are operated through a GUI, without the need to write code. This can make the nature of the operations being performed somewhat opaque.
Keywords: NoSQL, big data, data quality
Audience: corporate users of big data
23. BigML
BigML is a user-friendly machine-learning tool. Given enough input data, either as an unstructured source or as a structured dataset, BigML works to unpack the relationships between variables and produce a predictive model. Users can then request a prediction through the GUI by filling in a number of input fields, or have BigML generate predictions automatically. BigML is free for small to medium-sized datasets, though more intensive users will need to sign up for a subscription plan.
Keywords: GUI, user-friendly, machine learning
Audience: big data beginners
24. Statwing
Statwing is a more traditional statistics program. It runs in the browser and is a sort of user-friendly, partially free version of popular statistical packages such as Stata or SPSS. Users upload data and select the variables for their analysis. Statwing then provides a range of descriptive and analytical statistics for these variables, including a plain-English description of the relationships between them. In addition, the output includes a number of graphs describing those relationships. Statwing’s free service supports datasets up to 25MB, but these are publicly visible to all users, making it unsuitable for proprietary data.
Keywords: statistics, user-friendly
Audience: students, independent researchers
25. Apache Mahout
Mahout is an Apache project for machine-learning tasks. It allows applications to analyze large sets of data. Mahout takes advantage of Hadoop’s power to solve complex machine-learning problems by breaking them up into parallel tasks. Mahout offers three main machine-learning functions: recommendation, classification, and clustering.
Keywords: Apache, Hadoop, machine learning
Audience: data scientists, corporations