Open Data technologies repository

CKAN (Comprehensive Knowledge Archive Network) Dataset cataloguing solution that aims to make data accessible by providing tools to publish, share, find and use datasets. Developer/maintainer: Open Knowledge Foundation. Used by: Many governments and organisations, among them the Open Data portals of the cities of Amsterdam, Copenhagen and Berlin. Functionalities: Functionalities for publishing include the ability to:
  • enter metadata on datasets via a website form, API or bulk spreadsheet import
  • harvest datasets from other portals: CSW servers, other CKAN instances, ArcGIS, …
  • publish these metadata publicly or privately to authorized organisations
  • optionally store the data themselves in a data store
  • theme the look and feel of the portal to reflect the organisation's own identity
  • extend the portal with additional features. There are more than 60 extensions available.
Functionalities for reuse include the ability to:
  • search and discover datasets. The system offers search on metadata, full text search, fuzzy matching and faceted search. Geospatial search and discovery are also available.
  • broadcast data to social media. There is an integration with Twitter, Facebook and Google+
  • get updates on dataset changes using an RSS/Atom feed
  • visualize the data managed in the data store as a table, graphic, map and/or image depending on the nature and content of the data
  • investigate the history of edits and versions of a dataset
  • exploit the metadata and data via APIs
Technical constraints: The software can be easily installed on 64-bit Ubuntu. It can also be built from source for other Unix flavours. There are dependencies on Python, Java, PostgreSQL, Solr and other technologies. License: Open Source (GNU Affero GPL v3.0). Contact Website: Download:
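As an illustration of the API-based reuse listed above, the following minimal Python sketch builds a dataset search request and reads the dataset titles from a response. It assumes the CKAN Action API (version 3) and its `package_search` action; the portal address `https://demo.example.org` and the sample response are hypothetical placeholders, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_search_url(portal, query, rows=10):
    """Build a dataset search URL for a CKAN portal (Action API v3 assumed)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_text):
    """Return the titles of the datasets listed in a package_search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("the portal reported a failed request")
    return [d.get("title", d.get("name")) for d in payload["result"]["results"]]


# Hypothetical response, shaped like a package_search answer.
sample = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{"name": "air-quality", "title": "Air quality"}]},
})

print(build_search_url("https://demo.example.org/", "air quality", rows=5))
print(dataset_titles(sample))
```

In practice the URL would be fetched with an HTTP client and the response body passed to `dataset_titles`.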
OpenDataSoft Platform designed specifically for non-technical business users to share, publish and reuse structured data. It is more than a data catalogue solution, because the platform also manages the data themselves, which enables additional functionalities such as showing the data as tables, maps and graphics, converting them to different output formats and offering APIs for app developers. Developer/maintainer: OpenDataSoft. Used by: National, regional and local administrations, as well as companies and organisations in the domains of transport, energy & environment, agriculture, chemical, tourism and media. Functionalities: Functionalities for publishing include the ability to:
  • load the data in the data store. During the load processing the data can be pre-processed, optimised, enriched and configured for optimal visualization,
  • enter metadata on the dataset via a website form and API,
  • harvest data from external APIs,
  • publish the data in other formats than uploaded,
  • publish the data publicly or privately according to access control rules based on users, groups and roles management,
  • monitor the use of the datasets.
Functionalities for reuse include the ability to:
  • search and discover datasets. The system offers search on metadata, full text search, fuzzy matching and faceted search, as well as search based on geographical coordinates,
  • filter data within the dataset,
  • visualize data as tables, maps, graphics, calendar, images depending on type and content of data,
  • download data in the format chosen,
  • subscribe to a dataset,
  • broadcast to social media such as Twitter, LinkedIn, Facebook and Google+,
  • comment on datasets and data and post reuse proposals,
  • exploit the datasets and data via API.
Technical constraints: The software is offered as a service (SaaS). Its pay-as-you-use subscription fee is based on data volume, usage (number of UI/API queries) and SLAs. License: Commercial. Contact Website:
Socrata Open Data Cloud-based solution for managing and publishing data, offering much more than cataloguing services. Developer/maintainer: Socrata. Used by: City of New York, Melbourne, Bath (UK), Chicago, San Francisco, New Orleans, Boston, Las Vegas, Dallas. Functionalities: Functionalities for publishing include the ability to:
  • load the data in the datastore. During the load processing the data can be pre-processed and configured,
  • sync data with the master dataset,
  • enter metadata on the dataset,
  • edit the data and create snapshots,
  • create additional views and visualisations,
  • publish the data in other formats than uploaded,
  • publish the data publicly or privately.
Functionalities for reuse include the ability to:
  • search and discover datasets by search and metadata filtering,
  • filter data within the dataset,
  • explore and create additional views (graphics, maps, calendars, dashboards),
  • download data in the format chosen (including RDF),
  • discuss datasets,
  • exploit the datasets and data via API,
  • expose the data as OData for easier integration in the Microsoft ecosystem.
Technical constraints: The software is offered as a service (SaaS). Pricing is unclear. License: Commercial. Contact Website:
Dataproofer Cross-platform desktop app to run a collection of tests over a data file supplied in XLSX, XLS, CSV, TSV or PSV format. Developer/maintainer: Knight Foundation and Vocativ. Used by: Vocativ. Functionalities: Dataproofer indicates which of a series of tests passed. By default, it loads 15 tests that check for:
  • string and numeric cells
  • empty and duplicate cells
  • outliers related to mean and median
  • incorrect geographical coordinates.
The tests can be extended. Technical constraints: None. Based on web technology. License: Open Source, GNU General Public License. Contact Website: Download:
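The kinds of checks listed above can be sketched in a few lines of Python. This is an illustration only and not Dataproofer's own test suite; the function names and the outlier rule (a multiple of the median absolute deviation) are assumptions chosen for the example.

```python
import statistics


def empty_cells(column):
    """Return the indices of cells that are None or blank."""
    return [i for i, v in enumerate(column) if v is None or str(v).strip() == ""]


def duplicate_values(column):
    """Return the values that occur more than once, in first-seen order."""
    seen, dupes = set(), []
    for v in column:
        if v in seen and v not in dupes:
            dupes.append(v)
        seen.add(v)
    return dupes


def outliers_from_median(values, factor=3.0):
    """Flag values further than `factor` median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged
    return [v for v in values if abs(v - med) > factor * mad]


print(empty_cells(["a", "", None, " "]))
print(duplicate_values([1, 2, 1, 3, 2, 1]))
print(outliers_from_median([10, 11, 9, 10, 12, 100]))
```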
Csv Lint Online service to verify the quality of CSV files. It can also be installed locally as a command-line application. Developer/maintainer: ODI. Used by: Many individuals and organisations, as indicated by the log. Functionalities: The software checks for common errors and warnings. In addition to the default list of checks, one can supply a table schema in JSON that declares additional constraints for data fields, for example: a field is required, a field value must be unique, a value must respect a minimum or maximum, or a value must follow a certain pattern. A report is generated enumerating the errors and warnings, if any. Technical constraints: Runs as a service on the web or as a command-line tool. The submitted file may not be larger than 700 MB. License: Open Source MIT. Contact Website: Download:
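To make the schema-driven checks more concrete, here is a minimal Python sketch that validates CSV rows against per-field constraints. It is not Csv Lint's implementation; the constraint keys (`required`, `unique`, `minimum`, `maximum`, `pattern`) and the schema layout are assumptions modelled on the constraint types named above, and the sample data are invented.

```python
import csv
import io
import re


def validate(csv_text, schema):
    """Check each row against per-field constraints; return a list of error strings."""
    errors = []
    seen = {f["name"]: set() for f in schema["fields"]}
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        for field in schema["fields"]:
            name = field["name"]
            rules = field.get("constraints", {})
            value = (row.get(name) or "").strip()
            if value == "":
                if rules.get("required"):
                    errors.append(f"line {line_no}: {name} is required")
                continue
            if rules.get("unique"):
                if value in seen[name]:
                    errors.append(f"line {line_no}: {name} is not unique")
                seen[name].add(value)
            if "pattern" in rules and not re.fullmatch(rules["pattern"], value):
                errors.append(f"line {line_no}: {name} does not match pattern")
            if "minimum" in rules or "maximum" in rules:
                try:
                    number = float(value)
                except ValueError:
                    errors.append(f"line {line_no}: {name} is not a number")
                    continue
                if "minimum" in rules and number < rules["minimum"]:
                    errors.append(f"line {line_no}: {name} is below minimum")
                if "maximum" in rules and number > rules["maximum"]:
                    errors.append(f"line {line_no}: {name} is above maximum")
    return errors


schema = {"fields": [
    {"name": "id", "constraints": {"required": True, "unique": True}},
    {"name": "postcode", "constraints": {"pattern": "[0-9]{4}"}},
    {"name": "age", "constraints": {"minimum": 0, "maximum": 120}},
]}
data = "id,postcode,age\n1,9000,34\n1,90A0,130\n,1000,5\n"
for message in validate(data, schema):
    print(message)
```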
ARX Open source software for anonymizing sensitive personal data. Developer/maintainer: Technische Universität München. Used by: Hundreds of users; the algorithms are also integrated and used in the Weka data mining tool. Functionalities: ARX comes with a cross-platform graphical tool, which supports data import and cleansing, wizards for creating transformation rules, ways of tailoring the anonymized dataset to the requirements, and visualizations of data utility and risks. ARX is also available as a software library with an API that delivers data anonymization capabilities to any Java program. ARX reads SQL databases, MS Excel and CSV files. Technical constraints: The GUI is available on Windows, Mac and Linux. License: Open source (Apache License, Version 2.0). Contact Website: Download:
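The idea of a transformation rule can be illustrated with a small Python sketch that generalises two attributes and then checks k-anonymity. k-anonymity is used here only as a well-known example of a privacy criterion; the text above does not state which models are applied. The code does not use the ARX Java API, and the records, column names and generalisation rules are hypothetical.

```python
from collections import Counter


def generalise_age(age, width=10):
    """Replace an exact age by a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalise_zip(zip_code, keep=2):
    """Keep the first `keep` characters of a postal code and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Invented sample records.
raw = [
    {"age": 34, "zip": "9000", "diagnosis": "flu"},
    {"age": 37, "zip": "9040", "diagnosis": "asthma"},
    {"age": 52, "zip": "1000", "diagnosis": "flu"},
    {"age": 58, "zip": "1050", "diagnosis": "diabetes"},
]
anonymised = [
    {**r, "age": generalise_age(r["age"]), "zip": generalise_zip(r["zip"])} for r in raw
]

print(is_k_anonymous(raw, ["age", "zip"], k=2))
print(is_k_anonymous(anonymised, ["age", "zip"], k=2))
```

Coarser generalisation makes the criterion easier to satisfy but lowers the utility of the data, which is the trade-off the utility and risk visualizations are meant to expose.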
OpenRefine Tool for working with messy data: cleaning it, transforming it from one format into another, and extending it via web services and external data. Developer/maintainer: It was originally developed by Google for adding structured and cleaned data to Freebase, described as an "open, shared database of the world's knowledge". On October 2nd, 2012, Google stopped supporting the project, which has since been taken over by a group of volunteers. Used by: This is the preferred tool in the Open Data space and is used in many open data curricula. Some examples: pILOD Netherlands, Open Data day Flanders, ODI, TUDelft, Cooper-Hewitt National Design Museum, LSU Libraries, University of Texas.
  • Librarians: DST4L – LODLAM
  • Journalists: NYT, Chicago Tribune, Le Monde, The Guardian
  • Open Data Communities: Sunlight Foundation, OKFN
  • Educational tool: School of Data
Functionalities: With OpenRefine one can import the following data formats: delimited text files, fixed width, JSON, XML, ODS spreadsheets, Excel, RDF. Profiling functionalities are available but not shown automatically: the user needs to choose which columns/fields to investigate. Per chosen facet the distribution of values (including NAs) is shown. Very elaborate clustering algorithms are available for merging entries with different spellings into a canonical one. There is a dedicated transformation language, GREL (General Refine Expression Language), which offers functions for transforming strings and arrays and for handling math, dates and booleans. Many of these functions can be called from the OpenRefine user interface. Furthermore, all actions and function calls are kept in a script that can be used to undo and redo actions and to replay a complete transformation. A lot of emphasis has also been put on facilities to reconcile values with entities found on the web. Extensions are available to add further functionalities. At the end, the cleansed data can be exported to Excel, ODF, and tab- and comma-separated text. With the RDF extension, the data can also be exported as triples to be used as linked open data. Technical constraints: OpenRefine is available for Mac, Windows and Linux. It starts up as a local web server. License: Open Source. Contact Website: Download:
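The clustering of entries with different spellings can be illustrated with a simplified key-collision approach: each value is reduced to a normalised "fingerprint", and values sharing a fingerprint are proposed as one cluster. The Python sketch below is an approximation written for this overview; OpenRefine's own methods may differ in detail, and the function names are assumptions.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Normalise a string into a key that ignores case, accents, punctuation and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(sorted(set(text.split())))  # unique tokens in sorted order


def cluster(values):
    """Group values sharing a fingerprint; return only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


print(cluster(["Müller, Hans", "Hans Muller", "Ghent"]))
```

After reviewing a proposed cluster, the user picks one spelling as the canonical value for all its members.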
Trifacta Wrangler Tool specifically made to help non-programmers address all data wrangling tasks in a highly interactive way. It is the commercial successor of Stanford’s Data Wrangler. Developer/maintainer: Trifacta. Used by: The New York Times, Atlassian, McGrawHill Education, Google etc. Functionalities: Trifacta Wrangler allows you to import delimited text files, JSON and Microsoft Excel files. Once the data are loaded, the software offers several profiling functionalities. It infers the data type and other properties of each field/column. For each column it shows a bar indicating the quality and a histogram of the distribution of the values. It has transformation functions for:
  • Restructuring (splitting/collapsing columns, deleting fields, pivoting rows)
  • Cleaning (replacing/deleting missing or mismatched values)
  • Enriching (joining multiple data sources)
These transformations are suggested based on user actions (clicking a diagram, selecting text…). All these transformations are translated into a domain-specific declarative language called Wrangle (which can be edited by a tech-savvy user) and recorded in a script that can be rerun, leading to reproducible results. At the end of the process the software helps you to validate the wrangling script on the full dataset. Once satisfied, one can publish to CSV, JSON and Tableau. Technical constraints: Windows or OSX (Mac) with min 4 GB RAM and 2 GB free disk space and an Internet connection. One needs to register. License: Commercial but the desktop version is free. Contact Website: Download:
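The column profiling described for this tool (type inference plus a quality bar) can be approximated as follows. This Python sketch is a simplification under stated assumptions: it knows only three types, and the names `infer_cell_type` and `profile_column` are hypothetical, not part of any Trifacta interface.

```python
from collections import Counter


def infer_cell_type(value):
    """Classify one raw text cell as 'missing', 'integer', 'decimal' or 'string'."""
    text = value.strip()
    if text == "":
        return "missing"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        return "string"


def profile_column(cells):
    """Infer the dominant type of a column and count valid, mismatched and missing cells."""
    kinds = Counter(infer_cell_type(c) for c in cells)
    missing = kinds.pop("missing", 0)
    if not kinds:
        return {"type": None, "valid": 0, "mismatched": 0, "missing": missing}
    column_type, valid = kinds.most_common(1)[0]
    return {
        "type": column_type,
        "valid": valid,
        "mismatched": sum(kinds.values()) - valid,
        "missing": missing,
    }


print(profile_column(["12", "7", "n/a", "", "40"]))
```

The three counts correspond to the segments of a quality bar: valid, mismatched and missing values.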
Talend Data Preparation Allows you to read a dataset, define a recipe with the necessary transformations and generate an output based on this preparation. Developer/maintainer: Talend. Used by: Not known. Functionalities: Talend Data Preparation reads CSV and Microsoft Excel files. Once the data are loaded, the software offers several profiling functionalities. It infers the data type and other properties of each field/column. For each column it shows a bar indicating the quality and a histogram of the distribution of the values. It has transformation functions for:
  • Restructuring (splitting columns, deleting fields).
  • Cleaning (replacing/deleting missing or mismatched values).
  • Enriching (joining multiple data sources).
These transformations are suggested depending on a row or column selection. However, deduplication is not yet supported. Transformations are kept in a script that can be rerun and exchanged. The results of the preparation can be exported to CSV and XLSX. Technical constraints: Windows and OSX with 1 GB RAM and 5 GB hard disk space. License: Unknown. Contact Website: Download:
Exploratory Desktop application for data wrangling, visualization, and advanced analytics based on the R programming language. Developer/maintainer: Exploratory. Used by: Unknown; the product is still in beta. Functionalities: Here we focus only on the data wrangling functionalities. Exploratory can read delimited and fixed-width files, Excel, SPSS/SAS/STATA, R data files and JSON from the local file system. Supported remote data sources are: MongoDB, MySQL, Redshift, PostgreSQL, Google spreadsheets, BigQuery etc. Exploratory comes with a range of transformation functions as provided by the R package dplyr, a grammar for data manipulation. These transformation functions can be called and run by choosing menu items in the interface. The dplyr package has transformation functions for: reshaping, subsetting rows and columns, summarizing, making new columns, grouping data and combining datasets. Technical constraints: Still in beta. Available on Mac and Windows and connects to the Exploratory website. Also installs R. License: Unclear. Commercial with a free plan. Limitations of the free plan are not known yet. Contact Website: Download:
Dataiku Data Science Studio Free Edition Integrated development platform for data professionals to turn raw data into predictions in a collaborative way. Developer/maintainer: Dataiku. Used by: Axa, l’Oréal, Cap Gemini, Coyote. DSS is one of the fastest growing products in the data science space. Functionalities: The description is limited to the data wrangling field. The free version connects to file systems (local, via http, ftp etc.) and supports delimited and fixed-width text files, JSON and Excel. Connections to MySQL and PostgreSQL are also available. When a data source is opened, DSS profiles the data, automatically detecting the data types and indicating the percentages of empty and wrong values. Furthermore, the distributions per field can be shown for outlier detection, and clustering algorithms comparable to those used in OpenRefine (5.4.3) can be applied. Next to the profiling functionalities, datasets can be sampled, split, grouped, joined with other datasets and reshaped (restructured), and the values of the cells can be transformed and cleansed. This is done with a very elaborate library containing 81 processors. Technical constraints: Available on Mac OS, Windows, Linux and as a VMware or VirtualBox virtual machine. It starts up a local web server using a connection to the Dataiku server. License: Commercial with a limited free version.
OpenRefine OpenRefine (cf. supra) takes TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents as input and is able to convert those into TSV, CSV, Excel (.xls, .xlsx), ODF spreadsheet and any text format using templates.
The DataTank Server that connects to source datasets, converts them into other formats and exposes them through a RESTful API. Developer/maintainer: Open Knowledge Belgium. Used by: The Flemish Open Data portal, the cities of Antwerp, Ghent, Kortrijk. Functionalities: Captures DCAT-AP compliant metadata of datasets. Imports from: CSV, XLS, XML, JSON-LD, SHP, JSON, RDF, SPARQL stores, MySQL stores. Publishes as CSV, JSON, XML, RDF. Depending on the content, data can be presented as HTML or as a map. License: Open source. Contact Website: Download:
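The conversion step behind such a server can be sketched in Python as follows: one source (here CSV text) is read once and serialised in whichever output format is requested. This is a generic illustration, not The DataTank's code; the function names, the element names `dataset` and `record`, and the sample data are assumptions.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def read_rows(csv_text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_json(rows):
    """Serialise the rows as a JSON array of objects."""
    return json.dumps(rows, ensure_ascii=False)


def to_xml(rows, root_name="dataset", row_name="record"):
    """Serialise the rows as XML; column names must be valid XML element names."""
    root = ET.Element(root_name)
    for row in rows:
        record = ET.SubElement(root, row_name)
        for key, value in row.items():
            ET.SubElement(record, key).text = value
    return ET.tostring(root, encoding="unicode")


def publish(csv_text, fmt):
    """Return the dataset in the requested format, as a format suffix in a URL might select."""
    converters = {"json": to_json, "xml": to_xml}
    if fmt not in converters:
        raise ValueError(f"unsupported format: {fmt}")
    return converters[fmt](read_rows(csv_text))


source = "city,population\nGhent,260000\n"
print(publish(source, "json"))
print(publish(source, "xml"))
```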
Data Voyager Visualization browser for open-ended data exploration. It is built using Vega-Lite, a high-level visualization grammar. Developer/maintainer: University of Washington Interactive Data Lab led by Jeffrey Heer, also co-founder of Trifacta. Used by: Not known. The underlying Vega technology is used in other products. Functionalities: Voyager creates a gallery of automatically generated visualizations which can be navigated in an interactive way. After loading a dataset the system detects the data types of the fields and calculates descriptive analytics: min, max, mean, standard deviation, median, a sample etc. For every field a visualization is proposed by a recommendation engine that uses the type and distribution of the data as input. One can select a field of interest, and Voyager then automatically updates the view with relevant visualizations of the chosen field and its relation to all the others. When one field is combined with another, scatterplots are built. If the user sees a visualization of interest, it can be bookmarked for further use. The authors indicate that this type of interaction is better suited to free exploration of a dataset. When a specific analytics question needs to be addressed they offer a companion product named Polestar which offers an approach comparable to Tableau. Technical constraints: Data Voyager is available as a web service. It can also be installed locally, in which case there is a dependency on node.js. License: Open Source. Contact Website: Download:
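The per-field descriptive analytics mentioned above are straightforward to reproduce. The Python sketch below uses only the standard library; the function name `describe` and the choice of the sample standard deviation are assumptions for illustration, not a description of Voyager's internals.

```python
import statistics


def describe(values):
    """Compute basic descriptive statistics for one numeric field, ignoring None."""
    numbers = [v for v in values if v is not None]
    return {
        "count": len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        # Sample standard deviation; undefined for fewer than two values.
        "stdev": statistics.stdev(numbers) if len(numbers) > 1 else 0.0,
    }


print(describe([2, 4, 4, None, 4, 5, 5, 7, 9]))
```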
Dataseed Online platform for interactive data visualisation, analysis and reporting. Developer/maintainer: Atchai. Used by: Harvest, Tailster, Resultsmark, hscic. Functionalities: One can upload spreadsheet files or connect with Google Drive, Github and DropBox. Data are automatically aggregated and visualised. The charts are chosen based on the nature of the data and are clickable for filtering and further exploration. The automatically generated charts can be refined and adapted to personal preferences. The charts can be published and shared. There is an open-source toolkit for creating custom visualisations driven by the Dataseed back-end. Technical constraints: The open source toolkit has dependencies on nodeJS and npm. License: For the open source toolkit the GNU Affero General Public License. Contact Website: Download:
Tableau Desktop Public Tool to visualize and share data. Developer/maintainer: Tableau. Used by: More than 190,000 users are known. Functionalities: One can open Excel, text files and statistical files and connect to Google Sheets and OData servers as data sources. The software automatically detects the data type of each field, and table restructuring functions are available. For each data source one can define multiple worksheet visualisations. The complete list of possibilities is: text tables, bar, line, pie, map, scatter plot, Gantt, bubble, histogram, heat, highlight, treemap, box-and-whisker plot. Once a visualisation is built, one can overlay it with analytics indicators such as an average line, the median with quartiles, a distribution band, box plot, … and the latest version v10.0 also offers clustering. It is also important to note that Tableau Desktop is able to work with multidimensional cubes as found in statistical datasets. Technical constraints: Available for Windows and Mac. License: Commercial, but a free public version is available. The visualisations made with this version become publicly available on the Tableau cloud server. Contact Website: Download:
Exploratory Relevant functionalities: For single variables, min, max, mean and median are calculated and the distribution is shown using histograms. The available chart types are: bar, line, area, histogram, scatter, boxplot, map, choropleth, heatmap, contour. Bivariate analysis can be done by constructing scatterplots.
Dataiku Data Science Studio Relevant functionalities: Dataiku offers: min, max, mean, median, standard deviation, distinct values and a histogram of the distribution. Available chart types are: bars, bars 100%, histogram, stacked, stacked 100%, lines, stacked area, 100% stacked area, pie, donut, scatter plot, bubble, hexagon, grouped bubbles, scatter and grid maps.
Google Fusion Tables Web application to visualize and share data tables. Used by: The Guardian, the Toronto Globe and Mail, UCSF Global Health Sciences, Honda, Texas Tribune etc. Functionalities: The service allows users to:
  • filter and summarize data
  • combine data with other datasets
  • visualize the data using a chart, map, network graph, or custom layout
  • embed and share
  • offer an API to the data
Technical constraints: A modern web browser. License: Google’s terms of service. Contact Website: Download:
Vega and Vega-Lite Vega is a declarative format for creating, saving, and sharing visualizations. With Vega, visualizations are described in JSON, and interactive views can be generated using either HTML5 Canvas or SVG. Vega-Lite provides a higher-level grammar for visual analysis that generates complete Vega specifications. Online text editors for both syntaxes are available. There is also an online design environment named LYRA that enables custom visualization design without writing any code. Developer/maintainer: IDL. Used by: The Vega family is used in Trifacta, DataVoyager, PoleStar. Vega can be used from Python, Julia, R and is integrated in ggvis, MediaWiki and Cedar. Functionalities: Vega-Lite allows a visualization to be described as a set of encodings that map data fields to the properties of graphical marks, using a JSON format. Vega-Lite supports data transformations such as aggregation, binning, filtering, and sorting and layout transformations including stacked layouts and faceting into small multiples. Technical constraints: Client-side it just uses JavaScript and web components. Server-side there is a dependency on node.js. License: Open Source. Contact Website: Download:
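To show what "a set of encodings that map data fields to the properties of graphical marks" looks like in practice, the Python sketch below assembles a small Vega-Lite specification as a dictionary and prints it as JSON. The helper name `bar_chart_spec`, the sample rows and the schema version in the `$schema` URL are assumptions; the resulting JSON would be pasted into a Vega-Lite editor or passed to a renderer.

```python
import json


def bar_chart_spec(rows, category, measure):
    """Build a Vega-Lite bar chart spec showing the mean of `measure` per `category`."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},  # inline data
        "mark": "bar",             # the graphical mark
        "encoding": {
            # Each encoding channel maps a data field to a property of the mark.
            "x": {"field": category, "type": "nominal"},
            "y": {"field": measure, "type": "quantitative", "aggregate": "mean"},
        },
    }


rows = [
    {"city": "Ghent", "pm10": 21},
    {"city": "Ghent", "pm10": 25},
    {"city": "Antwerp", "pm10": 30},
]
print(json.dumps(bar_chart_spec(rows, "city", "pm10"), indent=2))
```

The `aggregate` property is an example of the data transformations mentioned above: the mean is computed by the Vega-Lite runtime, not by the Python code.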
Plotly Cloud environment for the creation of data visualisations and dashboards with integrated collaboration facilities. Next to the cloud environment, high-level, declarative charting libraries for R, Python, JavaScript and MATLAB are available. The JS library has been open sourced. Developer/maintainer: Plotly. Used by: Google, US Air Force, New York University, Netflix, among others. Functionalities: Data can be imported. The data formats supported vary depending on the subscription level. The main focus is on high-end visualization. The supported charts, however, also depend on the subscription level. Plotly v2 also generates descriptive analytics per field: mean, median, quartiles, standard deviation, variance. Technical constraints: None for the cloud service except for a modern browser. License: Plotly is free for unlimited public use. The JavaScript library is open source. API use can be limited according to the plan subscribed to. Contact Website: Download:
Quadrigram A visual drag-and-drop data editor for creating interactive visualisations without coding. Multiple components can be combined for storytelling and everything made is shareable. Developer/maintainer: Bestiario. Used by: Mostly individuals. Functionalities: One can load data and store them on Google Drive. Data can be filtered, aggregated and sorted. The following charts are available: bar, scatter, stacked bar, stacked area and map. Charts can be connected. The result can be published and shared over the internet. The latest version also includes pivot tables for handling multidimensional values. Technical constraints: None. License: Terms of Use. Contact Website:
Datawrapper Service to create and publish graphics. Developer/maintainer: Journalism++ Cologne. Used by: Mostly used in the news and journalism domains: The Guardian, Washington Post, De Standaard. Functionalities: The available chart types are: line, bar, stacked bar, map, donut and table. Technical constraints: None. License: Free software under MIT license. Contact Website: Download:
Raw A web service for creating graphs which are not easily available in other tools: alluvial, bump, circle, circular dendrogram, cluster dendrogram, clustered force layout, convex hull, hexagonal bins, parallel coordinates, streamgraph, … It therefore offers no support for pie charts, histograms or line charts. Developer/maintainer: Density Design Research Lab. Used by: Unknown. Functionalities: Choose a graphic type and customise the graphic. Once satisfied, one can export the graphic as SVG or PNG. Raw is highly extensible and is accessible to developers via an API. Technical constraints: Raw can also be installed locally, in which case there are dependencies on git, bower and python. License: Open license (LGPL license). Contact Website: Download:
Tableau Public Desktop Tableau allows users to build dashboards. A dashboard is a collection of several worksheets and supporting information shown in a single place so you can compare and monitor a variety of data simultaneously. When you create a dashboard, you can add views from any worksheet. You can also add a variety of supporting objects such as text areas, web pages, and images. From the dashboard, you can format, annotate, drill down, edit axes, and more. Tableau also supports story building. A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. One can create stories to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, or simply make a compelling case.
Plotly Online Dashboards is an open source web application for arranging plotly graphs into web dashboards.
Quadrigram Quadrigram offers a canvas where multiple components (graphics, maps, shapes, text and media) can be combined.
BigML Web service that lets you build models and make predictions with these models. Developer/maintainer: BigML Inc. Used by: Datatricks, Persontyle, Quintl etc. Functionalities: BigML offers: decision trees, ensemble learning, clustering, anomaly detection and association discovery. Next to the web interface, there is a Mac app, an open source command-line tool and programming language bindings for Python, Java, node.js, Clojure and Swift. Technical constraints: For the web interface a modern browser. License: See terms of service. Contact Website: Download:
DataScienceStudio Offers decision trees and clustering, leveraging ML technologies (Scikit-Learn, MLlib, XGBoost, etc.). One can build and optimize models in Python or R, integrate any external ML library through code APIs (H2O, Skytree, etc.), and get instant visual and statistical feedback on the performance of the model.
SkyTree Express single-user Desktop Machine learning environment. Developer/maintainer: SkyTree. Used by: Amex, PayPal, Thomson Reuters, Panasonic. Functionalities: A machine-learning platform accessible via Python, Java, GUI or command line that automatically selects parameters and builds models, offers visualisations to explain the model results and automatically documents the whole process. Technical constraints: The single-user Desktop GUI version needs to be installed in a VirtualBox. Minimum hardware requirements: 8 GB RAM, 2 physical cores. The free version is limited to 100 million data elements. License: The attached license is valid for 1 year. Contact Website: Download:
RapidMiner Studio Open source data science platform of which RapidMiner Studio is the desktop version with a visual development environment. Developer/maintainer: RapidMiner GmbH. Used by: More than 100,000 users. Functionalities: The tool allows users to connect to datasets and to profile, cleanse and enrich them. At any point descriptive analytics can be calculated, and a full range of visualisations is offered. More than 120 modelling and prediction algorithms are available. It also has features to score and evaluate the models. Technical constraints: Available for Windows, Mac and Linux. The free version is limited to 10,000 data rows and 1 processor. License: An open source core is published under AGPL-3.0. The source code is available on GitHub. Contact Website: Download:
TopBraid Composer Free IDE for working with RDF triples and linked data. Developer/maintainer: TopQuadrant. Used by: KOOP, P&G, Mayo Clinic, Lockheed Martin, Thomson Reuters, AstraZeneca, UCB, Pearson, Lilly, NASA, JPMorganChase etc. Functionalities: TBC makes it possible to import RDF files, to integrate them, to edit triples, to validate the triples against constraints, to infer new triples based on ontologies and/or rules, and to query the triples by full-text search and via SPARQL. The triples can be exported again into several serialisations. The standard edition, which is available for evaluation for 30 days, adds a lot more, e.g. graphical representations of the resources and the model, many ways to convert legacy data such as TSV files and relational databases into RDF, and connections with the leading triple stores. Technical constraints: TBC runs in the Eclipse 4.3 platform and requires Java 8 (Oracle JRE/JDK). TBC is available for Windows, Mac and Linux. License: Closed source, commercial. Contact Website: Download:
fluidOps Information Workbench A platform for Linked Data application development. Developer/maintainer: fluidOps. Used by: Many cloud infrastructure and data center users. Functionalities: IWB comes with facilities to convert legacy data in formats such as CSV, TSV, relational databases, XML and JSON into RDF. IWB makes it possible to integrate several data sources and to search and query them. For every resource one can define a wiki page, into which several widgets can be plugged for visualisation, social media integration, LOD integration, editing etc. Technical constraints: IWB can be installed on Windows, Mac and Linux and then runs as a web service. License: Free licenses are available for educational use. Contact Website: Download:
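Converting legacy tabular data into RDF, as mentioned above, essentially means turning every cell into a subject-predicate-object triple. The Python sketch below writes such triples in N-Triples syntax. It is a generic illustration and not the Information Workbench's converter; the base IRI `http://example.org/`, the `resource/` and `property/` path segments and the function names are hypothetical, and key values and column names are assumed to be safe to embed in an IRI.

```python
import csv
import io


def escape_literal(text):
    """Escape characters that may not appear raw inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def csv_to_ntriples(csv_text, base_iri, key_column):
    """Emit one triple per non-empty cell; the key column names the subject."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base_iri}resource/{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # the key is the subject itself; empty cells yield no triple
            predicate = f"<{base_iri}property/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return "\n".join(lines)


print(csv_to_ntriples("id,name,country\n1,Ghent,Belgium\n2,Delft,\n", "http://example.org/", "id"))
```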