Big Data Analysis Tools/Resources: An Annotated Bibliography

1.     Chris Stucchio’s blog

This blog makes this list primarily for one article, “Don’t use Hadoop – your data isn’t that big”, which provides a guide for deciding whether your data really qualifies as big data, based on its size, purpose, and expected future growth, and, given that decision, which suite of tools best suits your needs. The remaining blog posts deal with developer-level problems in big data, as well as an assortment of economics and other news-related items.

Keywords: big data, medium data, kind of large data

Audience: dataset owners considering big data tools, big data beginners

2.     Infochimps

Infochimps provides extensive resources for those interested in big data analysis. These include whitepapers (detailed overviews of topics in big data analysis), case studies (other companies’ successful use of Infochimps’ big data products), and video webinars about topics in big data. Some of these are barely veiled promotional pitches for Infochimps’ products and services; others, however, provide useful information for advanced big data users.

Keywords: big data

Audience: intermediate to advanced big data users

SQL Programming Resources: An Annotated Bibliography

1.     Coursera/Stanford “Introduction to Databases”

One of the biggest names in Massive Open Online Courses (MOOCs), Coursera has a huge catalogue of university-level courses. This offering from Stanford University devotes 9 units to SQL programming, in addition to providing a comprehensive introduction to databases. It was one of the first courses offered by Coursera and remains one of its most popular. The SQL units are introductory, making this an excellent place to begin.

The course includes all of the best features of a Coursera MOOC: helpful readings, a community of learners, ongoing assessment and the option to receive a certification upon successful completion of the course.

Keywords: MOOC, Stanford, Coursera, video lecture, exam, assessment

Audience: beginner SQL users

2.     The SchemaVerse

SchemaVerse offers a gamified SQL learning experience. The experience is built around an online galactic conquest game implemented within a PostgreSQL database, in which the player controls an army of spaceships through SQL commands. The game consists of sending ships to conquer new planets, mine resources, and defend conquered planets against other players.

For those who find entirely self-directed learning a challenge, the game/tutorial provides a good introduction to the basics of SQL, supported by an excellent tutorial page. It would be difficult to progress much beyond an intermediate level with SchemaVerse, as the available commands are limited and somewhat repetitive.

Keywords: gamification, SQL, tutorial, basic, intermediate

Audience: beginner SQL users, non-autodidacts

‘Big Data’ Public Databases: An Annotated Bibliography

1.     Kin Lane’s Federal Dataset Tool

Many of the following listings refer to US Federal Government datasets. These are some of the biggest public datasets available. Unfortunately, much of this data is messy, published without much regard to its consumption and use.

This project, from Kin Lane, is both an index of a huge number of Federal Government datasets, and a way to access versions of these datasets on GitHub as these are cleaned and repurposed by other users – including descriptions of the alterations made. This tool requires a GitHub account.

Keywords: big data, public data, Open Data Policy-Managing Information, GitHub

Audience: social scientists, economists, advocacy groups

2.     IPUMS project (Minnesota Population Centre)

The Integrated Public Use Microdata Series (IPUMS) is an enormous database of individual-level microdata. The data in IPUMS USA is drawn from the United States census, the American Community Survey and the Current Population Survey. IPUMS International includes census data from 73 countries, harmonized to allow comparisons across different times and places.

Combined, these datasets comprise a truly massive resource of census-type information, carefully prepared for ready comparison and interrogation. The website has a built-in SQL selector, a useful FAQ covering the basic questions about the data and its use, and a forum covering more technical questions.

Keywords: IPUMS, Minnesota Population Centre, big data, microdata, census

Audience: population survey researchers, marketers, social scientists, political scientists

Database Training Resources: An Annotated Bibliography

1.     Coursera/Stanford “Introduction to Databases”

Introduction to Databases from Stanford University was one of the first Massive Open Online Courses (MOOCs) offered by Coursera in 2011, and has remained consistently popular. It covers database design and the use of database management systems for applications, beginning with the fundamental theory of database design, including the relational model and SQL, and moving on to contemporary issues in database management, including JSON and NoSQL systems. The course uses PostgreSQL, SQLite, and MySQL, and consists of video lectures, assignments, and exams. Discussion forums and the possibility of local meet-ups support learning.

Keywords: MOOCs, Stanford, database, SQL, NoSQL, PostgreSQL, SQLite, MySQL

Audience: beginner – intermediate database users

2.     Coursera/University of Washington “Introduction to Data Science”

This MOOC is called “Introduction to Data Science”, but the first of its two major units is devoted to databases. This unit includes an introduction to MapReduce and Hadoop, as well as an SQL programming assignment. The course uses the same kinds of materials as “Introduction to Databases” above (video lectures, assignments, and exams) and has the same support system. As with “Introduction to Databases”, the added benefit of completing “Introduction to Data Science”, over and above the learning, is the opportunity to earn a formal certificate recognizing the knowledge acquired, which may be useful to those hoping to apply their knowledge of databases professionally.

Keywords: MOOC, University of Washington, video lectures, assignments, exams, peer-support

Audience: beginner-intermediate database users

Data Visualization Resources: An Annotated Bibliography

1.     The St Louis Federal Reserve

The St Louis Federal Reserve’s Federal Reserve Economic Data (FRED) service is perhaps the most comprehensive repository of economic time-series data. It also offers an in-browser, cross-platform data visualization tool. The time series are collected from a huge variety of US government sources, as well as a number of international organizations such as the OECD and World Bank. The FRED tools, including charts, graphs and maps, are extremely simple and user-friendly. They lack the flashy design of other data visualization kits, but preserve a consistent and legible style across all platforms.

Keywords: Bureau of Labor Statistics

Audience: economists, political scientists, advocacy groups, journalists

2.     Google Charts

Google Charts is a simple browser-based data visualization utility designed specifically for the web, including data sourcing and display. The charts render reliably across browsers, making Google Charts a good choice for projects where browser compatibility is a priority. Google Charts is tightly integrated with Google Spreadsheets, and a chart updates dynamically when its source data changes. The statistical processing available in Charts is basic, making it a poor choice for complex analyses. The styling is also basic, in keeping with Google’s minimalist aesthetic, and cannot be customized with CSS.

Keywords: Google, html

Audience: students, teachers, advocacy groups

Game Design and Choices of Creation

Concerning my self-imposed goals to write and produce games: human decision-making is largely fueled by seeking out novel or familiar stimuli, as well as avoiding previous pain points and repeating pleasurable experiences. We will need to keep this paradigm in mind: people will avoid previously painful experiences, repeat pleasurable ones, and are both pushed and pulled by existing behavior patterns and the opportunities to have new life experiences. Maybe my decision-making and goal-setting are just fueled by the thought that I haven’t had painful experiences, or perhaps my threshold for cerebral adventure is higher than most people’s.

Writing is a private event made public: when you write, you’re putting your words into context for an audience. Game creation is also a private choice that you can make public, in much the same manner as a writer publishes a book. I’ve got one game set fairly finished, and another game is in production and prep for the first prototype printing. These started as ideas and had to be fostered into reality, brought forth one conceptualized structure at a time until the framework was present; only then could the cards be developed and put into print. From there, many other iterations and changes have to take place before a finished product can be sold.

As it is, I will probably be putting my first game on Indiegogo or Kickstarter in order to raise funds (kind of like a pre-release of the game, but without a corporate sponsor). Afterwards, I’ll have more time to work on the second game in development. When you create something of lasting value, it is like an errant child: sometimes it circles back and you realize what you could have done differently. That’s part of the sacrifice of releasing your creations into the world. Once you’ve let go, stop grasping.

Jaro-Winkler in Oracle and textual fuzzy matching

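There is a little-known (and hence heavily under-utilized) feature in Oracle 11g and up: the Jaro-Winkler algorithm, along with its companion algorithm, Edit Distance. Jaro-Winkler scores how similar ‘String A’ is to ‘String B’ based on the characters they share, how many of those characters are out of order, and whether the two strings begin with the same prefix. Edit Distance counts the single-character changes needed to transform ‘String A’ into ‘String B’.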
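
To make the scoring concrete, here is a minimal Python sketch of the textbook Jaro-Winkler calculation. It is an illustration only, not Oracle's implementation, so its results may differ from what the database returns. The function names are hypothetical, and the 0.1 prefix scale and 4-character prefix cap are the commonly cited defaults, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as a match only if equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity plus a bonus for a shared prefix (up to 4 characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


score = jaro_winkler("MARTHA", "MARHTA")
print(round(score, 4))     # 0.9611 on the normalized 0-1 scale
print(round(score * 100))  # 96 on a 0-100 scale
```

The two swapped letters cost one transposition, and the shared prefix "MAR" pulls the score back up, which is why near-identical strings with matching beginnings score so high.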

You can find the official Oracle documentation here. I implemented it using Oracle's built-in UTL_MATCH package, which exposes these algorithms as SQL-callable functions such as UTL_MATCH.JARO_WINKLER_SIMILARITY and UTL_MATCH.EDIT_DISTANCE_SIMILARITY.

A vitally important feature of these text-similarity functions is that they let you measure difference on both a normalized (0-1) and a scalar (0-100) scale: UTL_MATCH.JARO_WINKLER returns a value between 0 and 1, while UTL_MATCH.JARO_WINKLER_SIMILARITY and UTL_MATCH.EDIT_DISTANCE_SIMILARITY return an integer between 0 and 100. By close examination, you can see how much difference each kind of string change produces. I used this to match diagnoses from Medicare CMS data to our internal data, but the functions are versatile and not confined to any specific application (any text will work).
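
As a rough illustration of how a raw edit count can be turned into a 0-100 score, here is a small Python sketch of Levenshtein edit distance. The function names are hypothetical, and the scaling formula (100 times one minus the distance divided by the longer string's length) is an assumption about how such a score is commonly normalized, not a statement of what Oracle does internally.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def edit_distance_similarity(s1: str, s2: str) -> int:
    """Assumed scaling: 100 * (1 - distance / longer length), rounded to an integer."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100  # two empty strings are treated as identical
    return round(100 * (1 - edit_distance(s1, s2) / longest))


print(edit_distance("kitten", "sitting"))             # 3
print(edit_distance_similarity("kitten", "sitting"))  # 57
```

A raw distance only means something relative to string length, which is why a scaled score is easier to threshold when matching records of varying length.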

Note: in my testing, the Jaro-Winkler function could not compare strings starting with ‘0’ to strings not starting with ‘0’, although Edit Distance could. I found this empirically, and I have not seen it documented anywhere.

Example: DX: ‘0100’ compared to DX: ‘100’ returned about 95 with Edit Distance, but 0 with Jaro-Winkler.

