Big Public Databases
1. Kin Lane’s Federal Dataset Tool
Many of the following listings refer to US Federal Government datasets. These are some of the biggest public datasets available. Unfortunately, much of this data is messy, published with little regard for how it will be consumed and used.
This project, from Kin Lane, is both an index of a huge number of Federal Government datasets, and a way to access versions of these datasets on GitHub as these are cleaned and repurposed by other users – including descriptions of the alterations made. This tool requires a GitHub account.
Keywords: big data, public data, Open Data Policy-Managing Information, GitHub
Audience: social scientists, economists, advocacy groups
2. IPUMS project (Minnesota Population Center)
The Integrated Public Use Microdata Series (IPUMS) is an enormous database of individual-level microdata. The data in IPUMS USA is drawn from the United States census, the American Community Survey and the Current Population Survey. IPUMS International includes census data from 73 countries, harmonized to allow comparisons across different times and places.
Combined, these datasets comprise a truly massive resource of census-type information, carefully prepared for ready comparison and interrogation. The website has a built-in SQL selector, a useful FAQ covering the basic questions about the data and its use, and a forum covering more technical questions.
Keywords: IPUMS, Minnesota Population Center, big data, microdata, census
Audience: population survey researchers, marketers, social scientists, political scientists
3. The American Economic Association
The American Economic Association (AEA) maintains a list of links to large, publicly available datasets to assist its members with research projects. These datasets are mostly from US government sources, including the Federal Reserve banks, the Bureau of Labor Statistics, Medicare and Medicaid, the Department of Agriculture's Economic Research Service and many more. Several major longitudinal datasets are available, including the Panel Study of Income Dynamics and the National Longitudinal Surveys of Young Men and Older Men. US government sources generally support data queries in a wide range of formats.
The list also includes links to a wide variety of non-US sources, including datasets provided by national governments, policy organizations, private consultancies, financial market associations, and international organizations such as the World Bank and the UN.
Keywords: American Economic Association, big data, government data, development indicators
Audience: economists, social scientists, political scientists
4. “Airline on-time performance” dataset (American Statistical Association)
The Airline on-time performance dataset was constructed for the 2009 American Statistical Association challenge, which asked entrants to interrogate the causes of delayed flights. The files contain data on US domestic flights between 1987 and 2008. The data comprises 123 million observations across 29 variables. Each dataset is described in detail. Given its depth and size, this dataset is useful for machine learning exercises and for practicing statistical analyses, or indeed for discovering the causes of airline delays, if that happens to be your driving question.
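With this many observations, a streaming pass that keeps only running totals is often more practical than loading a whole file into memory. The sketch below is a minimal illustration of that idea, not a reference workflow. The sample rows are made up, and the column names (UniqueCarrier, ArrDelay) and the "NA" missing-value marker are assumptions that should be checked against the variable descriptions published with the files.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample standing in for one downloaded file. The column names
# and the "NA" missing-value marker are assumptions, not confirmed here.
SAMPLE = """Year,Month,UniqueCarrier,ArrDelay
2008,1,WN,10
2008,1,WN,NA
2008,1,AA,30
2008,2,AA,-10
2008,2,WN,20
"""


def mean_arrival_delay(rows):
    """Stream rows once, keeping only a running sum and count per carrier."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        delay = row["ArrDelay"]
        if delay in ("", "NA"):
            continue  # skip flights with no recorded arrival delay
        entry = totals[row["UniqueCarrier"]]
        entry[0] += float(delay)
        entry[1] += 1
    return {carrier: total / count for carrier, (total, count) in totals.items()}


# For a real file, pass csv.DictReader(open(path, newline="")) instead.
result = mean_arrival_delay(csv.DictReader(io.StringIO(SAMPLE)))
print(result)
```

Because only one sum and one count are stored per carrier, memory use stays flat however many rows are read.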
Keywords: airline, weather, American Statistical Association, machine learning, big data
Audience: statisticians, data scientists, machine-learning specialists
5. World Bank datasets
The World Bank data portal contains datasets related to agriculture and rural development, debt, trade, financial information, health, infrastructure and more for 214 countries (though with some information missing). These datasets are broken down by theme, e.g. income share of lowest 20%, agricultural land (% of total land area). The datasets are available in Excel, XML, and CSV formats.
The portal also contains a substantial library of microdata sets, likely to be of great interest to data scientists interested in research questions related to human development and sociology.
Keywords: World Bank, United Nations, microdata, development, sociology
Audience: economists, development economists, sociologists, advocacy groups, NGOs
6. Amazon Web Services
Amazon Web Services hosts 55 large public datasets, including the Common Crawl Corpus, a web crawl archive of over 5 billion web pages; the 1000 Genomes Project, containing the genome data of 2600 people; and 16 months of Wikipedia Traffic Statistics. Some of these datasets are tied to Amazon’s EC2 or Elastic MapReduce services, both of which are pay-for-use. EC2 has a generous free trial option that should be sufficient for a small number of projects per month.
Keywords: machine learning, public datasets, big data
Audience: data scientists, machine-learning specialists
7. Infochimps
Infochimps offers an extensive data marketplace. Only some of the datasets featured are free, and it is not possible to filter search results to see only free datasets. Some of the datasets are hosted offsite, but many are hosted by Infochimps and can be downloaded straight from the site. The datasets are hosted in a variety of formats, depending on the original supplier.
Keywords: big data, public datasets
Audience: researchers, advocacy groups, academics, statisticians, corporate (there’s something here for everyone)
8. DataMarket
DataMarket is an Iceland-based data startup offering a library of approximately 45,000 datasets. The search and index function on the website is the clearest and most intuitive of all the big data libraries featured in this list, with datasets grouped by country of origin, topic area, and data provider. The other great feature of DataMarket is that all its datasets are natively hosted, meaning they can be downloaded directly in XLS, CSV, or TSV formats.
Keywords: big data, public, XLS, CSV, TSV
Audience: researchers, advocacy groups, academics, statisticians, corporate
9. Knoema
Knoema is a data library comprising over 100 million time series that offers the most user-friendly experience of any of the data warehouses featured here. It has a huge, highly intuitive library of datasets, grouped by topic, source and region of origin. Knoema’s best feature is its data browser/visualizer, which allows you to filter and select data, or to visualize data using a comprehensive range of charts.
Knoema is also designed for sharing. It is possible to post data visualizations to your profile, and then share these with other Knoema users or on the web. Supported data export formats are XLS, CSV, and SDMX.
Keywords: big data, public, visualization, XLS, CSV, SDMX
Audience: researchers, advocacy groups, academics, statisticians, corporate
10. Google Public Data
Google Public Data hosts a library of 128 public data sets. These can be explored in tabular and visual form online, as well as exported in
The best feature of Google Public Data is the ability to upload and explore your own dataset using the visualization engine. Google Public Data does not currently support data downloads, though these can be found by following the link to the original data provider.
Keywords: public data, big data
Audience: students, teachers, economists, social scientists
11. Quandl
Quandl is a 100% open library of big data sets, with a particular focus on financial market information. The remaining data sets are mostly census-type data from a limited range of countries. The site is completely minimalist, leaving nothing between you and your data. Data can be browsed using a search bar, or through a limited range of subjects (a notable weakness).
Quandl incorporates an elegant data visualizer, perhaps the best of any of the services featured here, which combines chart and table views of your data. Quandl also supports a truly impressive range of data formats: XLS, CSV, JSON, XML, and R.
Keywords: big data, library, searchable, XLS, CSV, JSON, XML, R
Audience: economists, financial industry workers, advocacy groups, statisticians
12. Figshare
Figshare is an open data library specifically for researchers and academics. The site offers 1 GB of private data storage, unlimited public data storage, and the opportunity to browse all other public data, completely free. Premium plans increase data storage space and allow greater collaboration on a single account.
The data found on Figshare is research data: the datasets used by academics for research projects and studies. As a result, its quality varies widely. In keeping with this, the data browser is organized by academic subject, and the datasets are often named for the research papers to which they contributed. The formats available depend on the originating researchers.
Keywords: collaboration, genetics, medical, biology, big data, research, fact checking
Audience: academics, students, medical researchers
13. Datahub
Datahub is a data management platform provided by the Open Knowledge Foundation. The indexing system used by Datahub is its most distinctive feature, with the option to filter according to data format and license. This will be of particular interest to those searching for less common data formats.
The datasets have a strong open data flavor to them, thanks to the Open Knowledge Foundation’s profile. They mainly contain demographic information, though there are datasets from several NGOs and civil society organizations.
Keywords: open data, big data, census data, CSV, XML, API/SPARQL, ASPX
Audience: advocacy groups, academics, students, social scientists
14. Open Science Data Cloud
The Open Science Data Cloud contains some truly big data, with several datasets in the terabyte range. The datasets are organized by keywords, which mostly refer to academic disciplines, and cover topics from demographics, to biology, to geology, to linguistics.
Because many of the Open Science Data Cloud datasets are so large, it is necessary to download a client for UDT, a high-speed data transfer protocol.
Keywords: big data, UDT, UDR, UDP
Audience: data scientists, machine-learning specialists
15. Datamob
Datamob is a New York-based startup aiming to harness the disruptive potential of public data. The datasets are not actually hosted on the Datamob site; the site itself only links to the original content providers. The data browsing system is visually attractive, although the fact that datasets are only indexed alphabetically is a major limitation. The data on Datamob is mostly related to social causes and activism.
For those interested in communicating through data, Datamob also hosts apps that present data drawn from Datamob sources, giving these an immediate platform.
Keywords: social causes, public data, activism
Audience: NGOs, students, activists, public relations professionals
16. Open Data (Socrata)
Open Data is a project of Socrata, an open data organization focused on increasing government transparency. The data hosted has a notable civil society/governance theme.
The Open Data browser leaves a little to be desired. Keywords are presented in the form of a word-cloud, and categories include Business, Education, Fun, Government and Personal.
Keywords: civil society, activism, open data, governance
Audience: NGOs, civil society organizations, students
17. Thinknum
Thinknum is a library of financial data, aimed at helping finance professionals build and test models. The site groups together a diverse array of financial data under a single user interface.
Data is grouped according to company name. Analysis is probably best carried out on Thinknum’s own website, but it is also possible to download the data. Data is retrieved through an HTTP call system, explained on the Thinknum website, that takes a little getting used to.
Keywords: financial data, open data
Audience: financial industry workers, economists
18. Wikipedia Open Data
While this page neither hosts nor indexes datasets, it is an invaluable resource for anyone searching for large, open public datasets for big data research. The page contains an introduction to the open data movement, as well as a list of links to the open data resources of 38 governments.
In addition, the subheading “Organizations Promoting Open Data” lists many institutions that maintain their own open data directories.
Keywords: Wikipedia, Wikimedia, big data, social data, public data, open data
Audience: students, teachers, economists, social scientists
19. Open Data Catalogs
DataCatalogs is a registry of open data catalogs from around the world, curated by data experts. The catalog includes open data organizations ranging from local, to state, to national governments, in addition to other open data organizations.
The major weakness of the Open Data Catalogs is its lack of an index, except for a few unhelpful tags. This means if you don’t know exactly what you’re looking for, you will have to wade through a lot of local government listings until you find a treasure.
Keywords: registry, open data, government data
Audience: data scientists, machine learning specialists, urban planners, economists
20. Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Australia’s CSIRO offers a huge open database of research data from its thousands of research projects. These datasets are organized exhaustively by theme (many themes contain only one result) and concern applied scientific research subjects. The datasets are well described, including all the details of the original research project from which they are drawn, and the available data formats depend on the format of the original research data.
Keywords: science, open data, big data
Audience: biologists, chemists, physicists, geologists
21. Stanford Large Network Dataset Collection
The Stanford Large Network Dataset Collection is a treasure trove for big data researchers interested in network data. Networks are organized according to type, nodes and edges. The datasets are mainly sourced from online networks (social networks, online communities, online reviews and so on), though there are also datasets for road networks and scientific collaboration networks.
The files are enormous, and the website does not provide any capacity to analyze them online. These will therefore most likely be suitable for people working in a university or professional capacity.
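Since analysis has to happen on your own machine, a first step is often a single streaming pass over a downloaded file. The sketch below counts how many edges touch each node. It assumes a plain-text edge list with one whitespace-separated pair of node IDs per line and "#" comment lines; this layout is an assumption to verify against each dataset's description, and the sample data is made up.

```python
import io
from collections import Counter

# Made-up edge list; the "#" comment header and the two-column layout are
# assumptions about the file format, not taken from a specific dataset.
SAMPLE = """# FromNodeId ToNodeId
0\t1
0\t2
1\t2
3\t0
"""


def degree_counts(lines):
    """Count edges touching each node, reading one line at a time."""
    degrees = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        src, dst = line.split()[:2]
        degrees[src] += 1
        degrees[dst] += 1
    return degrees


# For a real download, pass an open file handle instead of the sample.
degrees = degree_counts(io.StringIO(SAMPLE))
print(degrees.most_common(2))
```

This treats every edge as undirected; for a directed network you would keep separate in-degree and out-degree counters.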
Keywords: networks, social networks, nodes, edges, graphs
Audience: data scientists, academics, students
22. GRANIT (State of New Hampshire)
New Hampshire’s GRANIT is just one of many US state clearinghouses for GIS data. Other databases can be found by searching for “GIS clearinghouse” and looking for the various US state names among the results. Many are also listed by Rice University’s library at the previous link.
Together these clearinghouses represent a truly enormous repository of GIS data, which can be used for a variety of GIS projects.
Keywords: GIS data, public data, big data
Audience: geographers, social scientists, conservationists, advocacy groups
23. University College Dublin Dynamics Lab
The Social Network Analysis Interactive Dataset Library from University College Dublin contains links to data from more than two hundred networks, which will be of interest to social scientists and other social network researchers.
The data can be explored through an interactive visualizer, which displays data according to date of publication, data type and number of datasets. There is also a reference table, which seems to be more reliable.
The datasets are hosted offsite and data formatting is dependent on the host.
Keywords: network, public data, visualizer
Audience: social scientists, sociologists
24. NASA Shuttle Radar Topography Mission
NASA’s Shuttle Radar Topography Mission (SRTM) contains high-resolution digital topographic information about virtually the entire surface of the planet. Areas of missing data include very high peaks and very deep valleys, though these represent less than 0.2% of total data.
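To see how much of a given tile is missing, you can count the void samples directly. The sketch below assumes the commonly distributed raw tile layout of 16-bit big-endian signed integers with -32768 marking a void. Both the layout and the sentinel are assumptions to confirm against the documentation for the product you download, and the tile here is synthetic.

```python
import struct

VOID = -32768  # assumed sentinel value for missing elevation samples


def void_fraction(raw: bytes) -> float:
    """Return the share of 16-bit big-endian samples equal to VOID."""
    count = len(raw) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f">{count}h", raw[: count * 2])
    return sum(1 for s in samples if s == VOID) / count


# Synthetic four-sample "tile" with one void, built only for illustration.
tile = struct.pack(">4h", 100, 250, VOID, 300)
fraction = void_fraction(tile)
print(fraction)
```

For a real tile, read the file in binary mode and pass its bytes to `void_fraction`.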
This data is a great back-end for all sorts of cartographic projects. It is the dataset of choice for Digital Elevation Model data. One example is the following OpenCycle map of world cycle paths.
Keywords: GIS, geographic data, NASA, OpenMaps, topography
Audience: geographers, GIS specialists, cartographers
25. Google Ngram
Google’s Ngrams dataset contains data on text from millions of books scanned by Google. It contains information on letter combinations, words and phrases. This can be used to track the presence or popularity of words or phrases over time, or geographical variations in language usage.
The Ngram viewer is optimized for phrase-based querying, but the entire dataset is available for download.
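Working with the downloaded files means aggregating counts yourself. The sketch below totals the yearly occurrences of one phrase, assuming each line holds an n-gram, a year, a match count and a volume count separated by tabs. That column layout is an assumption (it differs between releases of the dataset, so check the format notes for the version you download), and the sample lines are made up.

```python
import io
from collections import defaultdict

# Made-up lines; the assumed columns are ngram, year, match_count, volume_count.
SAMPLE = (
    "data\t1990\t120\t40\n"
    "data\t1991\t150\t45\n"
    "big data\t1991\t3\t2\n"
    "data\t1991\t10\t5\n"
)


def yearly_counts(lines, target):
    """Sum match counts per year for one exact n-gram, one line at a time."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if ngram == target:
            counts[int(year)] += int(match_count)
    return dict(counts)


# For a real download, pass an open (decompressed) file handle instead.
counts = yearly_counts(io.StringIO(SAMPLE), "data")
print(counts)
```

To compare popularity across years you would normally divide these totals by the total number of words published in each year, since the number of scanned books grows over time.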
Keywords: Google, big data, linguistic data
Audience: linguists, journalists, historians