Validate Pandas DataFrames using Pydantic: pandantic

Wessel Huising
4 min readMay 2, 2023


PSA: The example code in this article has been updated accordingly, as a breaking API version (1.0.0) has been released. One of the notable features of this official release is the addition of a Pandas plugin. Please see Xavier’s article (he has joined the project as a contributor) for examples of how to use this plugin out of the box!

As I cycled through the beautiful centre of Amsterdam, I tuned in to the Python Bytes podcast. This time the hosts discussed a new release of the Pydantic package, which comes with a major increase in computational performance (potentially 50 times faster), since the Pydantic core has been rewritten in Rust. The hosts mentioned the implications for tools like FastAPI, but I remembered one of my earlier shower thoughts about using Pydantic in combination with another famous package.

I always wondered why Pydantic is not used to validate DataFrames from frameworks like Pandas.

Pydantic offers a great API, and its BaseModels are widely used as schema validators in a lot of different projects. Except for Pandas DataFrames. Of course I searched the internet, and there are some GitHub gists lying around that could make validation of a dataframe work. However, I never liked any of those proposed solutions. I concluded it must have been a performance issue that kept Pydantic away from the good old DataFrames. But now that Pydantic could potentially speed up computation up to 50 times, I decided to give it a go that same evening, and I got to a working solution pretty quickly (I thought).

pandantic

Why and when to use pandantic

One of the methods I appreciate the most while using pydantic’s BaseModel is the parse_obj class method. Therefore, I considered that adding another parse method called parse_df for parsing DataFrame-specific objects would make the most sense. This is where the package pandantic comes in.
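As a quick refresher, this is the classic pydantic flow that the parse_df idea builds on. In Pydantic v2, parse_obj has been renamed model_validate; the Record schema below is a hypothetical example for illustration, not part of pandantic:

```python
from pydantic import BaseModel, ValidationError


class Record(BaseModel):  # hypothetical schema, for illustration only
    name: str
    amount: int


# model_validate (formerly parse_obj) turns a dict into a validated instance.
record = Record.model_validate({"name": "foo", "amount": 3})
print(record.amount)

# Invalid input raises a ValidationError instead of silently passing through.
try:
    Record.model_validate({"name": "foo", "amount": "not a number"})
except ValidationError:
    print("invalid record rejected")
```

pandantic extends this one-object-at-a-time workflow to whole dataframes.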

Why would you want to use pandantic over other DataFrame validation packages like pandera or great-expectations? Firstly, I would like to point out that the well-designed API of pydantic is very user-friendly. Personally, I feel that the APIs of pandera and great-expectations are not as straightforward and easy to use as pydantic’s.

Another example where pandantic could improve your life is when you need to validate single items and dataframes with the same logic. For example, a BaseModel could be used for validating JSON as the input of a FastAPI application for real-time predictions on a trained ML model (which is a quite common approach). How convenient would it be if this same BaseModel could be utilized to validate the training DataFrame that is used to fit that same ML model? That’s when you might want to use pandantic.
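As a sketch of that “one schema, two jobs” idea, here is what sharing a single BaseModel looks like using plain pydantic and pandas (no pandantic): the same schema validates a single inference payload and, row by row, a training DataFrame. The PredictionInput model and its fields are illustrative assumptions, not part of any real API:

```python
import pandas as pd
from pydantic import BaseModel, ValidationError


class PredictionInput(BaseModel):  # hypothetical model-input schema
    age: int
    income: float


# Job 1: validate a single real-time payload, as a FastAPI endpoint would.
payload = PredictionInput.model_validate({"age": 42, "income": 30000.0})

# Job 2: validate a whole training DataFrame, row by row, with the same schema.
train_df = pd.DataFrame({"age": [42, "unknown"], "income": [30000.0, 1000.0]})
valid_rows = []
for row in train_df.to_dict(orient="records"):
    try:
        valid_rows.append(PredictionInput.model_validate(row).model_dump())
    except ValidationError:
        pass  # drop rows the schema rejects

clean_df = pd.DataFrame(valid_rows)
print(len(clean_df))  # only the first row survives
```

pandantic wraps this row-wise pattern behind a single validate call, so the schema is written once and reused in both places.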

A cute example

First, install the pandantic package, which can be considered a fork of the pydantic package.

pip install pandantic

The pandantic package contains a Pandantic class, which is used later on to initialize a validator with a given schema. Just as a pydantic user normally would, specify the schema as a BaseModel class:

from pydantic import BaseModel
from pydantic.types import StrictInt


# Define your schema using a Pydantic BaseModel.
class DataFrameSchema(BaseModel):
    """Example schema for testing."""

    example_str: str
    example_int: StrictInt

Let’s try this schema on a simple pandas.DataFrame. It will become more complex later on, promised! Use the validate method on a freshly initialized Pandantic instance and pass the `dataframe` that should be validated. In this example, the user wants to filter out the bad rows and return a dataset containing only valid rows. There are more options for the errors argument, such as "raise", which raises a ValueError after validating the whole dataframe.

import pandas as pd

from pandantic import Pandantic

validator = Pandantic(schema=DataFrameSchema)

# Example DataFrame with some invalid data.
df_invalid = pd.DataFrame(
    data={
        "example_str": ["foo", "bar", 1],  # last value is invalid (int instead of str)
        "example_int": ["1", 2, 3.0],  # first and last values are invalid (str and float)
    }
)

# Validate with error raising.
try:
    validator.validate(dataframe=df_invalid, errors="raise")
except ValueError:
    print("Validation failed!")

# Or filter out invalid rows.
df_valid = validator.validate(dataframe=df_invalid, errors="skip")
# Only the second row remains, as it's the only valid one.

df_valid consists only of the second row, containing bar and 2; all the invalid records have been filtered out of the dataframe.
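The reason only the second row survives is the StrictInt annotation: a plain int field would happily coerce the string "1" and the float 3.0, while StrictInt rejects anything that is not already an int. A minimal demonstration of the difference, using pydantic alone:

```python
from pydantic import BaseModel, ValidationError
from pydantic.types import StrictInt


class Loose(BaseModel):
    value: int  # lax mode: coerces numeric strings and integral floats


class Strict(BaseModel):
    value: StrictInt  # strict mode: only accepts actual ints


# Plain int coerces the string "1" to the integer 1.
print(Loose.model_validate({"value": "1"}).value)

# StrictInt rejects the same input outright.
try:
    Strict.model_validate({"value": "1"})
except ValidationError:
    print("StrictInt rejects the string '1'")
```

Pick StrictInt when silently coercing dirty data would hide real quality problems in your dataframe.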

Spicing things up using a custom validator

Let’s make things more interesting and implement a custom field validator from Pydantic in the BaseModel class, then apply it to a similar dataframe.

Imagine that the BaseModel should make sure the example_int field’s value is an even number. Just like with the normal pydantic package, it is as easy as applying the (new) field_validator decorator from pydantic and writing the logic that checks the integer is even.

from pandantic import Pandantic
from pydantic import BaseModel, field_validator


class CustomSchema(BaseModel):
    example_str: str
    example_int: int

    @field_validator("example_int")
    def must_be_even(cls, v: int) -> int:
        if v % 2 != 0:
            raise ValueError("Number must be even")
        return v


validator = Pandantic(schema=CustomSchema)

Using the example BaseModel from above, it is possible to validate the following pandas.DataFrame. If the validation does not pass, an error is raised. With the errors argument set to "raise", the code raises a ValueError after validating every row, as the first row contains an odd number.

example_df_invalid = pd.DataFrame(
    data={
        "example_str": ["foo", "bar", "baz"],
        "example_int": [1, 4, 12],
    }
)

# Raises a ValueError: 1 is not an even number.
validator.validate(dataframe=example_df_invalid, errors="raise")

Documentation and next steps

Documentation on the Pandantic class, including similar usage examples, can be found in the project’s repository. Next steps for the pandantic project will focus on implementing dataframe validation for other popular DataFrame frameworks like Polars and PySpark. Thank you for taking the time to read this blog. I hope the pandantic package improves your coding workflow (and life).
