Building data pipelines for fintech with Python

At Mercury, we needed to process millions of transactions daily for reconciliation, reporting, and compliance. Python's data ecosystem (Pandas, Polars, and Dask) made this manageable, but building reliable pipelines required careful design.

The Challenge

Financial data pipelines must be:

  • Accurate (no data loss or corruption)
  • Fast (process millions of records)
  • Reliable (handle failures gracefully)
  • Auditable (track every transformation)
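
The auditability requirement is the one that most shapes the design: every step should leave a trace of what it did to the data. One way to get there is a step-level logging wrapper; here is a minimal sketch (the decorator and its log fields are illustrative, not production code):

import functools
import logging

logger = logging.getLogger(__name__)

def audited(step_name: str):
    """Log row counts before and after a DataFrame-in, DataFrame-out step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            rows_in = len(df)
            result = func(df, *args, **kwargs)
            logger.info("step=%s rows_in=%d rows_out=%d", step_name, rows_in, len(result))
            return result
        return wrapper
    return decorator

Wrapping each transformation this way leaves a per-run trail of row counts alongside the application logs.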

Our Architecture

We used a combination of:

  • Pandas for in-memory transformations
  • Polars for larger datasets (faster, more memory-efficient)
  • Dask for distributed processing
  • Apache Airflow for orchestration
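
Airflow tied these stages together. A minimal sketch of what such a DAG can look like under Airflow 2.x; the dag_id, schedule, and placeholder callables are illustrative, not the production pipeline:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="transaction_reconciliation",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    process = PythonOperator(
        task_id="process_transactions",
        python_callable=lambda: None,  # placeholder: call process_transactions(...) here
    )
    reconcile = PythonOperator(
        task_id="reconcile_transactions",
        python_callable=lambda: None,  # placeholder: call reconcile_transactions(...) here
    )
    process >> reconcile  # reconciliation runs only after processing succeeds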

Example Pipeline

import pandas as pd
import polars as pl

def process_transactions(file_path: str) -> pd.DataFrame:
    # Read with Polars for speed
    df = pl.read_csv(file_path)

    # Keep only positive amounts
    df = df.filter(pl.col("amount") > 0)

    # Parse timestamps and use an exact decimal type for money
    df = df.with_columns([
        pl.col("timestamp").str.strptime(pl.Datetime, format="%Y-%m-%d %H:%M:%S"),
        pl.col("amount").cast(pl.Decimal(precision=10, scale=2))
    ])

    # Convert to Pandas for final processing
    return df.to_pandas()

def reconcile_transactions(transactions: pd.DataFrame,
                           ledger: pd.DataFrame) -> pd.DataFrame:
    # Outer-join the transaction feed against the ledger, tagging each row's source
    merged = transactions.merge(
        ledger,
        on=["transaction_id", "amount"],
        how="outer",
        indicator=True
    )

    # Rows that did not match on both sides are discrepancies
    discrepancies = merged[merged["_merge"] != "both"]
    return discrepancies
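
The indicator column records which side each discrepancy came from. A short usage sketch (the input frames are assumed to exist):

discrepancies = reconcile_transactions(transactions, ledger)

# In the transaction feed but missing from the ledger
missing_from_ledger = discrepancies[discrepancies["_merge"] == "left_only"]

# In the ledger but missing from the transaction feed
missing_from_feed = discrepancies[discrepancies["_merge"] == "right_only"]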

Error Handling

import logging
from typing import Optional

logger = logging.getLogger(__name__)

def safe_process(file_path: str) -> Optional[pd.DataFrame]:
    try:
        return process_transactions(file_path)
    except FileNotFoundError:
        logger.error(f"File not found: {file_path}")
        return None
    except (pd.errors.EmptyDataError, pl.exceptions.NoDataError):
        # The Polars reader raises its own error for an empty file; treat both as "no data"
        logger.warning(f"Empty file: {file_path}")
        return pd.DataFrame()
    except Exception as e:
        logger.exception(f"Unexpected error: {e}")
        raise
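
Note that the two degraded outcomes are distinct: None means the file never arrived, while an empty DataFrame means it arrived with nothing in it. A short caller-side sketch (the path is illustrative):

df = safe_process("/data/transactions/2023-04-01.csv")
if df is None:
    pass  # file missing: skip this run and alert
elif df.empty:
    pass  # file present but empty: nothing to reconcile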

Performance Optimization

  • Use Polars for large datasets
  • Parallelize with Dask when possible (see the sketch after this list)
  • Cache intermediate results
  • Use appropriate data types
  • Profile and optimize bottlenecks
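
Dask came in when a dataset was too large for a single machine's memory. A minimal sketch of a partitioned aggregation (the path pattern and column names are illustrative):

from dask import dataframe as dd

# Read a month of daily transaction files as one partitioned dataframe
transactions = dd.read_csv("/data/transactions/2023-04-*.csv")

# Build the aggregation lazily, then compute it across workers
per_account = transactions.groupby("account_id")["amount"].sum()
result = per_account.compute()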

Testing

def test_transaction_processing(tmp_path):
    # process_transactions expects a file path, so write the fixture data to a temp CSV
    test_data = pd.DataFrame({
        "transaction_id": ["T1", "T2"],
        "amount": [100.50, 200.75],
        "timestamp": ["2023-04-01 10:00:00", "2023-04-01 11:00:00"]
    })
    csv_path = tmp_path / "transactions.csv"  # tmp_path is pytest's temporary directory fixture
    test_data.to_csv(csv_path, index=False)

    result = process_transactions(str(csv_path))
    assert len(result) == 2
    assert result["amount"].sum() == 301.25

Lessons Learned

  1. Start with Pandas, optimize with Polars when needed
  2. Always validate data at pipeline boundaries (a minimal check is sketched below)
  3. Log everything for debugging
  4. Test with production-like data volumes
  5. Monitor pipeline performance and costs
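
On the second point, even a small hand-rolled check at each boundary catches most problems before they propagate. A minimal sketch (the required columns and rules are illustrative):

def validate_transactions(df: pd.DataFrame) -> pd.DataFrame:
    # Fail fast on missing columns or null values in key fields
    required = {"transaction_id", "amount", "timestamp"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if df["transaction_id"].isna().any():
        raise ValueError("Null transaction_id found")
    if df["amount"].isna().any():
        raise ValueError("Null amounts found")
    return df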

"Data pipelines are only as good as their error handling."
