Portable Data Engineering Pipeline

CORDIS to Supabase Pipeline

A portable Python ETL flow that prepares cleaned CORDIS project data for the web application and loads the final web dataset into Supabase.

Python · Pandas · Parquet · Supabase · Data Cleaning · ETL

Document Preview

This document explains the Python ETL design, data cleaning logic, validation checks, Gold-style outputs and Supabase loading process.

82,370

Web-ready CORDIS project records validated after cleaning.

0 blanks

No blank programme or status values in the final validation output.

Portable

Designed to run outside Microsoft Fabric when needed.

The pipeline prepares the final web dataset in Python, so the data can be rebuilt from the source files and reloaded on demand.

Source files → Extract → Transform → Gold outputs → Web dataset → Supabase load
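
The stages of this flow can be sketched as plain Python functions chained together. This is a minimal illustration and not the project's actual code: the function names, the `id`, `title` and `programme` columns and the inline sample CSV are all assumptions, and it uses only the standard library so it runs without pandas.

```python
import csv
import io


def extract(source):
    """Read raw project rows from a CSV file-like object."""
    return list(csv.DictReader(source))


def transform(rows):
    """Trim whitespace and drop rows that have no project id."""
    cleaned = [{k: (v or "").strip() for k, v in row.items()} for row in rows]
    return [r for r in cleaned if r["id"]]


def build_gold(rows):
    """Split cleaned rows into a fact table and a programme dimension."""
    programmes = sorted({r["programme"] for r in rows})
    dim = [{"programme_key": i, "programme": p}
           for i, p in enumerate(programmes, start=1)]
    key_of = {d["programme"]: d["programme_key"] for d in dim}
    fact = [{"id": r["id"], "title": r["title"],
             "programme_key": key_of[r["programme"]]} for r in rows]
    return fact, dim


def build_web_dataset(fact, dim):
    """Denormalise the Gold outputs into one flat table for the web app."""
    name_of = {d["programme_key"]: d["programme"] for d in dim}
    return [{"id": f["id"], "title": f["title"],
             "programme": name_of[f["programme_key"]]} for f in fact]


def run(source):
    """Run extract, transform, Gold and web dataset stages in order."""
    rows = transform(extract(source))
    fact, dim = build_gold(rows)
    return build_web_dataset(fact, dim)


# Hypothetical sample input: three rows, one without an id.
SAMPLE = "id,title,programme\n1, Alpha ,H2020\n2,Beta,FP7\n,Orphan,FP7\n"
web_rows = run(io.StringIO(SAMPLE))
```

Keeping each stage as a separate function means any one of them can be rerun or tested on its own, which suits a pipeline that has to be rebuilt repeatedly.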

Standardised programme values across FP7, H2020 and Horizon Europe.
Improved country mapping and reduced unknown country labels.
Generated Gold-style fact and dimension outputs.
Created a web-optimised cordis_projects table for search and dashboard use.
Added validation checks for row counts, duplicate IDs, blank programme values and unknown country rows.
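
As a rough sketch of the programme standardisation and validation checks listed above, the example below works on plain dictionaries. The alias table, the column names (`id`, `programme`, `country`) and the "Unknown" country label are assumptions made for illustration; the real pipeline uses pandas and may apply different rules.

```python
from collections import Counter

# Hypothetical alias table: maps raw spellings to one canonical label each.
PROGRAMME_ALIASES = {
    "fp7": "FP7",
    "h2020": "H2020",
    "horizon 2020": "H2020",
    "horizon": "Horizon Europe",
    "horizon europe": "Horizon Europe",
}


def standardise_programme(value):
    """Return the canonical programme label, or "" if it is not recognised."""
    key = (value or "").strip().lower()
    return PROGRAMME_ALIASES.get(key, "")


def validate(rows):
    """Summarise row count, duplicate ids, blank programmes and unknown countries."""
    id_counts = Counter(r["id"] for r in rows)
    return {
        "row_count": len(rows),
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "blank_programme": sum(1 for r in rows if not r["programme"]),
        "unknown_country": sum(
            1 for r in rows if r["country"] in ("", "Unknown")
        ),
    }


# Made-up rows that trip each check at least once.
rows = [
    {"id": "1", "programme": " h2020 ", "country": "DE"},
    {"id": "2", "programme": "HORIZON", "country": "Unknown"},
    {"id": "2", "programme": "FP7", "country": "FR"},
    {"id": "3", "programme": "n/a", "country": ""},
]
for row in rows:
    row["programme"] = standardise_programme(row["programme"])

report = validate(rows)
```

Mapping unrecognised values to an empty string, instead of passing them through, lets the blank-programme check surface them so they cannot slip into the web dataset unnoticed.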

The pipeline supports the live CORDIS Research Explorer app and gives a repeatable path to rebuild the cleaned dataset.
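
One way the Supabase load could be made repeatable is to upsert the web dataset in batches. The sketch below is an assumption about how such a loader might look, not the project's implementation: `load_to_supabase`, `chunked` and the batch size of 500 are hypothetical, and the `client` argument is expected to behave like a supabase-py client (created with `supabase.create_client(url, key)`), which exposes `table(...).upsert(...).execute()`.

```python
def chunked(rows, size):
    """Yield consecutive slices of rows, each at most `size` long."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load_to_supabase(client, rows, table="cordis_projects", batch_size=500):
    """Upsert rows into a Supabase table in batches; return rows sent.

    `client` is assumed to follow the supabase-py interface:
    client.table(name).upsert(list_of_dicts).execute()
    """
    loaded = 0
    for batch in chunked(rows, batch_size):
        # Upsert, unlike a plain insert, lets a rebuild overwrite existing ids.
        client.table(table).upsert(batch).execute()
        loaded += len(batch)
    return loaded


# Usage with a real client (requires the supabase package and credentials):
#   from supabase import create_client
#   client = create_client(SUPABASE_URL, SUPABASE_KEY)
#   load_to_supabase(client, web_rows)
```

Batching keeps each request small, and upserting on the table's primary key means the same load can be run again after a rebuild without creating duplicate rows.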