Validate Pandas DataFrames using Pydantic: pandantic

Wessel Huising
4 min read · May 2, 2023


As I cycled through the beautiful centre of Amsterdam, I tuned in to the Python Bytes podcast. This time the hosts discussed a new release of the Pydantic package, which comes with a major increase in computational performance (potentially up to 50 times faster) since the Pydantic core has been rewritten in Rust. The hosts mentioned the implications for tools like FastAPI, but I remembered one of my earlier shower thoughts about using Pydantic in combination with another famous package.

I have always wondered why Pydantic is not used to validate the DataFrames of popular frameworks like Pandas.

Pydantic offers a great, widely used API, and its BaseModels serve as schema validators in many different projects. Except for Pandas DataFrames. Of course I searched the internet, and there are some GitHub gists lying around that could make validation of a DataFrame work. However, I never liked any of those proposed approaches. I concluded it must have been a performance issue that kept Pydantic away from the good old DataFrame. But now that Pydantic could potentially speed up computation up to 50 times, I decided to give it a go that same evening, and I got to a working solution pretty quickly (I thought).

pandantic

Why and when to use pandantic

One of the methods I appreciate most while using pydantic’s BaseModel is the parse_obj class method. Therefore, I figured that adding another parse method called parse_df for parsing DataFrame-specific objects would make the most sense. This is where the pandantic package comes in.

Why would you want to use pandantic over other DataFrame validation packages like pandera or great-expectations? Firstly, I would like to point out that pydantic’s well-designed API is very user friendly. Personally, I feel that the APIs of pandera and great-expectations are not as straightforward and easy to use as pydantic’s.

Another situation where pandantic could improve your life is when you need to validate single items and DataFrames with the same logic. For example, a BaseModel could be used to validate the JSON input of a FastAPI application serving real-time predictions from a trained ML model (quite a common approach). How convenient would it be if that same BaseModel could also validate the training DataFrame used to fit that very model? That’s when you might want to use pandantic.
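The idea can be illustrated without any libraries: one validation routine serves both the single-record path (the JSON body of an API request) and the batch path (the rows of a training set). A minimal, dependency-free sketch of the pattern — the helper names here are hypothetical, not part of pandantic or pydantic:

```python
# Conceptual sketch: one schema-like check reused for single records and
# batches. In practice pydantic/pandantic would replace validate_record.

def validate_record(record: dict) -> dict:
    """Validate a single record, e.g. the JSON body of an API request."""
    if not isinstance(record.get("example_str"), str):
        raise ValueError("example_str must be a string")
    value = record.get("example_int")
    if not isinstance(value, int) or isinstance(value, bool):
        raise ValueError("example_int must be a strict integer")
    return record

def validate_records(records: list) -> list:
    """Validate a whole batch, e.g. the rows of a training DataFrame."""
    return [validate_record(record) for record in records]

# The same logic guards both entry points:
single = validate_record({"example_str": "foo", "example_int": 2})
batch = validate_records([{"example_str": "foo", "example_int": 2}])

# A bad record fails in either path:
try:
    validate_record({"example_str": "foo", "example_int": "2"})
    rejected = False
except ValueError:
    rejected = True
```

The point is that the schema is written once and both entry points stay in sync automatically; pandantic gives you the same guarantee with a real pydantic BaseModel.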

A cute example

First install the pandantic package, which can be considered a fork of the pydantic package.

pip install pandantic

To call the parse_df class method, import the BaseModel from the pandantic package instead of from pydantic. Then, just like a pydantic user normally would, specify the schema in a BaseModel subclass:

from pandantic import BaseModel
from pydantic.types import StrictInt


class DataFrameSchema(BaseModel):
    """Example schema for testing."""

    example_str: str
    example_int: StrictInt

Let’s try this schema on a simple pandas.DataFrame. It will become more complex later on, I promise! Use the class method parse_df on the freshly defined DataFrameSchema and pass the `dataframe` that should be validated. In this example, the user wants to filter out the bad rows and return a dataset containing only the valid ones. There are more options for the errors argument, like "raise", which raises a ValueError after validating the whole DataFrame.

import pandas as pd

df_invalid = pd.DataFrame(
    data={
        "example_str": ["foo", "bar", 1],
        "example_int": ["1", 2, 3.0],
    }
)

df_filtered = DataFrameSchema.parse_df(
    dataframe=df_invalid,
    errors="filter",
)

df_filtered would consist only of the second row, containing bar and 2; all invalid records are filtered out of the DataFrame.
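Conceptually, errors="filter" boils down to a row-wise check: validate each row against the schema and keep only the rows that pass. A dependency-free sketch of that idea (the helper is hypothetical, not the actual pandantic internals):

```python
# Conceptual sketch of errors="filter": validate row by row, keep survivors.
# Hypothetical helper; not the actual pandantic implementation.

rows = [
    {"example_str": "foo", "example_int": "1"},  # "1" is a str, fails StrictInt
    {"example_str": "bar", "example_int": 2},    # valid
    {"example_str": 1, "example_int": 3.0},      # wrong types in both columns
]

def row_is_valid(row: dict) -> bool:
    """Check a row against the schema: a str column plus a strict int column."""
    str_ok = isinstance(row["example_str"], str)
    int_ok = isinstance(row["example_int"], int) and not isinstance(row["example_int"], bool)
    return str_ok and int_ok

filtered = [row for row in rows if row_is_valid(row)]
# filtered → [{"example_str": "bar", "example_int": 2}]
```

Note the strictness: the string "1" and the float 3.0 are both rejected for the integer column, exactly what StrictInt buys you over plain int coercion.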

Spicing things up using a custom validator

Let’s make things more interesting and add a custom field validator from Pydantic to the BaseModel class.

Let’s create a custom validation method to validate one of the fields in the pandantic.BaseModel. Imagine the BaseModel should ensure that the example_int field’s value is an even number. As with the regular pydantic package, it is as easy as applying the (new) field_validator decorator from pydantic and writing the logic that makes sure the integer is even.

from pandantic import BaseModel
from pydantic import field_validator
from pydantic.types import StrictInt


class DataFrameSchema(BaseModel):
    """Example schema for testing."""

    example_str: str
    example_int: StrictInt

    @field_validator("example_int")
    @classmethod
    def validate_even_integer(cls, x: int) -> int:
        """Example custom validator to validate if int is even."""
        # Custom validators raise ValueError; pydantic collects these
        # into a ValidationError for you.
        if x % 2 != 0:
            raise ValueError(f"example_int must be even, is {x}.")
        return x

Using the example BaseModel from above, it is possible to parse the following pandas.DataFrame. If validation does not pass, an error is raised. By setting the errors argument to "raise", the code will raise a ValueError after validating every row, since the first row contains an odd number.

example_df_invalid = pd.DataFrame(
    data={
        "example_str": ["foo", "bar", "baz"],
        "example_int": [1, 4, 12],
    }
)

df_raised_error = DataFrameSchema.parse_df(
    dataframe=example_df_invalid,
    errors="raise",
)
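Conceptually, errors="raise" differs from a naive loop in one way: every row is validated first, and only then is a single ValueError raised, so the error message can report all bad rows at once instead of stopping at the first. A dependency-free sketch of that behaviour (hypothetical helper names, not pandantic internals):

```python
# Conceptual sketch of errors="raise": check every row, collect all failures,
# then raise one ValueError. Hypothetical names; not the pandantic internals.

rows = [
    {"example_str": "foo", "example_int": 1},   # odd → fails the even check
    {"example_str": "bar", "example_int": 4},   # valid
    {"example_str": "baz", "example_int": 12},  # valid
]

def collect_errors(rows: list) -> list:
    """Run the even-number check on every row and collect the failures."""
    errors = []
    for index, row in enumerate(rows):
        if row["example_int"] % 2 != 0:
            errors.append(f"row {index}: example_int must be even, is {row['example_int']}.")
    return errors

errors = collect_errors(rows)
try:
    if errors:
        # Raised only after the whole batch was checked, so the message
        # can name every bad row at once.
        raise ValueError("; ".join(errors))
except ValueError as exc:
    message = str(exc)
```

Validating the whole DataFrame before raising costs a little time on bad input, but it turns "fix one row, rerun, repeat" into a single pass.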

Make it fly using multiple processes

In order to compete with the big guys (pandera and great-expectations), I decided to spend some more time on making the code run in parallel batches. This can speed up the validation process significantly for bigger datasets. It will not help on smaller datasets (but validation of smaller DataFrames is lightning fast anyway).

Set the number of processes using the n_jobs argument. By default only one process is used, but by specifying this argument you can utilize your processor to the fullest. Benchmark tests are on the way; the first results show great potential.
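The core of a parallel scheme like this is splitting the rows into one contiguous batch per worker, validating the batches independently, and recombining the results. A dependency-free sketch of just the batch-splitting step (the function name is hypothetical; pandantic handles this internally via n_jobs):

```python
# Conceptual sketch of batching for parallel validation. Each batch could be
# handed to its own worker process; here we only show the split itself.

def split_into_batches(rows: list, n_jobs: int) -> list:
    """Split rows into n_jobs roughly equal, contiguous batches."""
    batch_size = -(-len(rows) // n_jobs)  # ceiling division
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

rows = list(range(10))
batches = split_into_batches(rows, n_jobs=3)
# batches → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Row-wise validation is embarrassingly parallel (no row depends on another), which is why this pays off on large DataFrames while the process start-up overhead dominates on small ones.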

Documentation and next steps

Documentation on the parse_df method, including some similar usage examples, can be found here. Next steps for the pandantic project will focus on implementing the DataFrame validation method for other popular DataFrame frameworks like Polars and PySpark. Thank you for taking the time to read this blog. I hope the pandantic package is able to improve your coding workflow (and life).



Wessel Huising

Randomly Picked Stack Engineer. Amsterdam, slowly becoming older.