You can have access to 25 million datasets right now!


On January 23rd, 2020, Google released an impressive dataset search tool covering more than 25 million records.


Across the web, there are millions of datasets about nearly any subject that interests you. If you’re looking to buy a puppy, you could find datasets compiling complaints of puppy buyers or studies on puppy cognition. Or if you like skiing, you could find data on revenue of ski resorts or injury rates and participation numbers.
Google recently released “Dataset Search”, a free tool that indexes more than 25 million publicly available datasets, mostly related to geosciences, biology, and agriculture.
The new tool represents a single place to search for datasets and find links to where the data is, enabling users to find datasets stored across the Web through a simple keyword search. The tool surfaces information about datasets hosted in thousands of repositories across the Web, making these datasets universally accessible and useful.
The search tool has filters to limit results by license type (free or paid), format (CSV, images, etc.), and last update time. Each result also includes a description of the dataset’s contents as well as author citations.



Google’s dataset aggregation method differs from that of other dataset repositories, such as Amazon’s Registry of Open Data. Unlike repositories that curate and host the datasets themselves, Google neither curates the 25 million datasets nor offers direct access to them. Instead, Google relies on dataset publishers to describe their datasets’ metadata using the open standards of schema.org. Google then indexes that metadata and makes it searchable across publishers.
The giant from Mountain View also reports that over 2 million of the datasets belong to US government agencies.
Anyone can submit their datasets to the tool by describing them with the open standards of schema.org. This means the number of publicly available datasets is likely to keep growing as more publishers adopt the standard.
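To make the schema.org idea more concrete, here is a minimal sketch of what a publisher’s dataset description might look like, built in Python and serialized as JSON-LD. The property names (name, description, url, license, creator, distribution) come from the schema.org Dataset vocabulary; the helper function, the example.org URLs, and the sample values are hypothetical and only for illustration. It is not Google’s own tooling, and a real publisher should check the current schema.org and Google guidelines for the fields that apply.

```python
import json


def build_dataset_jsonld(name, description, url, license_url, creator_name, downloads):
    """Return a schema.org Dataset description as a JSON-LD string.

    `downloads` is a list of (encoding_format, content_url) pairs.
    This helper is a hypothetical illustration, not an official API.
    """
    metadata = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "creator": {"@type": "Organization", "name": creator_name},
        # Each downloadable file is described as a DataDownload entry.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": content_url,
            }
            for encoding_format, content_url in downloads
        ],
    }
    return json.dumps(metadata, indent=2)


# Hypothetical example values; example.org is a placeholder domain.
example = build_dataset_jsonld(
    name="Ski resort injury rates",
    description="Yearly injury rates and participation numbers for ski resorts.",
    url="https://example.org/datasets/ski-injuries",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creator_name="Example Ski Association",
    downloads=[("CSV", "https://example.org/datasets/ski-injuries/data.csv")],
)
print(example)
```

A publisher would typically embed the resulting JSON-LD in the dataset’s web page inside a script tag of type application/ld+json, which is how a crawler can pick up the metadata without the publisher handing over the data itself.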
The project should also bring two additional benefits: it helps create a data-sharing ecosystem that encourages data publishers to follow best practices for data storage and publication, and it gives data scientists a way to show the impact of their work through citations of the datasets they have produced.

Well done again, Google!

