24.03.2026 12:25 | Author: Viacheslav Vasipenok

Google Introduces Groundsource: Turning Millions of News Articles into Structured Data with Gemini


In a significant step toward bridging the gap between unstructured text and actionable machine learning datasets, Google Research has unveiled Groundsource — a new methodology that leverages the Gemini large language model to extract structured, geo-tagged event data from vast archives of public news reports.

Announced on March 12, 2026, Groundsource represents one of the first large-scale demonstrations of using frontier LLMs not just to generate content, but to systematically create high-quality, labeled training data from the open web.


The Problem Groundsource Solves

Flash floods — sudden, deadly deluges that often strike with little warning — are among the most lethal natural hazards worldwide, killing thousands annually. Yet reliable, high-resolution historical records of these events remain scarce outside of a few well-monitored countries. Traditional observation networks (stream gauges, satellite imagery) suffer from geographic sparsity, high cost, and inconsistent maintenance.

Public news articles, by contrast, are abundant: millions of reports in dozens of languages document floods, their locations, dates, and impacts. Until now, turning that noisy, unstructured text into usable quantitative datasets has been prohibitively labor-intensive.

Groundsource changes this equation.


How It Works

Google researchers processed over 5 million publicly available news articles spanning two decades and 150+ countries.

Gemini was used to:

  • Identify flood-related reports;
  • Extract structured facts (location, date/time, severity indicators);
  • Geocode events with high precision using Google Maps;
  • Filter and validate entries to reduce noise.

The result is Groundsource Flash Floods — an open-access dataset containing 2.6 million historical flood events. It dramatically expands coverage compared to existing archives such as GDACS (which lists ~10,000 major events) or national databases that are limited in scope and geography.

Manual validation and cross-matching against official sources (e.g., GDACS severe events 2020–2026) show the pipeline achieves **85–100% recall** for high-impact floods while capturing many smaller, localized incidents previously undocumented in global records.
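Cross-matching of this kind can be illustrated with a toy example: an extracted event counts as a hit if it falls within a distance and date window of a reference-catalog entry, and recall is the fraction of reference events matched. The 50 km / 2-day thresholds below are placeholder values, not the ones used in the actual validation:

```python
from datetime import date
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def recall(reference, extracted, max_km=50, max_days=2):
    """Fraction of reference events matched by at least one extracted event."""
    hits = 0
    for ref in reference:
        for ext in extracted:
            close = haversine_km(ref["lat"], ref["lon"],
                                 ext["lat"], ext["lon"]) <= max_km
            near_in_time = abs((ref["date"] - ext["date"]).days) <= max_days
            if close and near_in_time:
                hits += 1
                break
    return hits / len(reference)

# Toy catalogs: two reference events, one matched by the pipeline.
reference = [
    {"lat": 39.47, "lon": -0.38, "date": date(2024, 10, 29)},
    {"lat": 50.73, "lon": 7.10, "date": date(2021, 7, 14)},
]
extracted = [
    {"lat": 39.50, "lon": -0.40, "date": date(2024, 10, 30)},
]
print(recall(reference, extracted))  # 0.5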


From Dataset to Life-Saving Predictions

The Groundsource data immediately powered a new flash-flood forecasting model trained on historical patterns and real-time weather inputs. Built on a Long Short-Term Memory (LSTM) architecture, the model generates probabilistic forecasts up to **24 hours** in advance for urban areas across 150+ countries.
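To make the architecture concrete, here is a minimal single-cell LSTM forward pass mapping a 24-hour sequence of weather features to a flood probability. The LSTM choice comes from the announcement; everything else (feature set, hidden size, random untrained weights) is a placeholder for illustration:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Minimal LSTM cell plus a sigmoid read-out.

    Weights are random placeholders: the real model's features,
    size, and training procedure are not reproduced here.
    """
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]
        # One weight matrix per gate: input, forget, cell candidate, output.
        self.W = {g: mat(n_hidden, n_in + n_hidden) for g in "ifco"}
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.n_hidden = n_hidden

    def _gate(self, g, xh, act):
        return [act(sum(w * v for w, v in zip(row, xh))) for row in self.W[g]]

    def predict(self, sequence):
        """sequence: per-hour feature vectors (e.g. rainfall, soil moisture).
        Returns a flood probability in [0, 1]."""
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        for x in sequence:
            xh = x + h
            i = self._gate("i", xh, sigmoid)     # input gate
            f = self._gate("f", xh, sigmoid)     # forget gate
            g = self._gate("c", xh, math.tanh)   # cell candidate
            o = self._gate("o", xh, sigmoid)     # output gate
            c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
            h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        return sigmoid(sum(w * hi for w, hi in zip(self.w_out, h)))

# 24 hourly steps of [rainfall_mm, soil_moisture] placeholder inputs:
model = TinyLSTM(n_in=2, n_hidden=4)
p = model.predict([[5.0, 0.6]] * 24)
print(round(p, 3))
```

The production model would be trained on the Groundsource event labels paired with historical weather data; the sketch only shows how a recurrent cell folds an hourly weather sequence into a single probability.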

These predictions are now live in Google Flood Hub, the company’s public flood forecasting platform, significantly expanding its coverage and granularity. Local authorities, emergency responders, and vulnerable communities can access real-time risk maps to prepare for sudden deluges.


Broader Implications for AI & Data Science

Groundsource is more than a flood dataset — it is a proof-of-concept for a new paradigm in data creation:

  • LLMs as data factories — transforming unstructured internet text into clean, structured training corpora;
  • Filling long-standing gaps — where official statistics are sparse (climate disasters, economic shocks, disease outbreaks, geopolitical events), news archives can become a near-real-time signal;
  • Scalability — the methodology is domain-agnostic and can be applied to heat waves, landslides, economic indicators, public health trends, and more.

By making the Groundsource flash-flood dataset freely downloadable (via Zenodo and EarthArXiv preprints), Google invites the global research community to build on this work — improving models, validating findings, and extending the pipeline to other hazards.



Looking Ahead

As large language models continue to improve at reasoning over unstructured text, techniques like Groundsource could sharply cut the cost of creating high-quality labeled data across disciplines. The bottleneck in many machine-learning applications is no longer compute or algorithms — it is labeled data. When LLMs can reliably mine that data from the open web, progress in climate modeling, disaster resilience, public health forecasting, and beyond could accelerate dramatically.

Groundsource is an early, practical demonstration of that future — where AI not only predicts events, but first reconstructs the historical record needed to make those predictions trustworthy.

Explore the dataset and methodology in the official Google Research blog post and accompanying preprints. The forecasts are already live in Flood Hub for anyone to check.

