Validate Pandas DataFrames using Pydantic: pandantic
As I cycled through the beautiful centre of Amsterdam, I tuned in to the Python Bytes podcast. This time the hosts discussed a new release of the Pydantic package, which comes with a major increase in computational performance (potentially 50 times faster) now that the Pydantic core has been rewritten in Rust. The hosts mentioned the implications for tools like FastAPI, but I remembered one of my earlier shower thoughts about using Pydantic in combination with another famous package.
I have always wondered why Pydantic is not used to validate DataFrames from the various frameworks like Pandas. Pydantic offers a great API that is widely used, and its `BaseModel`s serve as schema validators in a lot of different projects. Except for Pandas DataFrames. Of course I searched the internet, and there are some GitHub gists lying around that could make validation of a DataFrame work. However, I never liked any of those proposed solutions. I concluded it must have been a performance issue that kept Pydantic away from the good old DataFrame. But now that Pydantic can potentially speed up computation by up to 50 times, I decided to give it a go that same evening, and I got to a working solution pretty quickly (or so I thought).
pandantic
Why and when to use pandantic
One of the methods I appreciate the most while using pydantic’s `BaseModel` is the `parse_obj` class method. Therefore, I figured that adding another parse method, called `parse_df`, for parsing DataFrame objects would make the most sense. This is where the `pandantic` package comes in.
Why would you want to use `pandantic` over other DataFrame validation packages like `pandera` or `great-expectations`? Firstly, I would like to point out that the well-designed API of `pydantic` is very user friendly. Personally, I find the APIs of both `pandera` and `great-expectations` less straightforward and less easy to use in comparison to `pydantic`.
Another example where `pandantic` could improve your life is when you need to validate single items and DataFrames with the same logic. For example, a `BaseModel` could be used to validate the JSON input of a FastAPI application serving real-time predictions from a trained ML model (which is a quite common approach). How convenient would it be if that same `BaseModel` could be used to validate the training DataFrame used to fit that very model? That’s when you might want to use `pandantic`.
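To make that concrete, here is a minimal sketch of the dual use. It assumes that `pandantic`’s `BaseModel`, being a fork of pydantic’s, still works as a FastAPI request model; the field names `feature_a` and `feature_b` and the tiny training frame are made up for the demo.

```python
import pandas as pd
from fastapi import FastAPI
from pandantic import BaseModel


class InputSchema(BaseModel):
    """One schema for both the API payload and the training data."""

    feature_a: float
    feature_b: int


app = FastAPI()


@app.post("/predict")
def predict(payload: InputSchema) -> dict:
    # FastAPI validates the incoming JSON against InputSchema.
    return {"received": payload.model_dump()}


# The same schema validates the training DataFrame before fitting a model.
df_train = InputSchema.parse_df(
    dataframe=pd.DataFrame({"feature_a": [0.1, 0.2], "feature_b": [1, 2]}),
    errors="raise",
)
```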
A cute example
First install the `pandantic` package, which can be considered a fork of the `pydantic` package.

```bash
pip install pandantic
```
To call the `parse_df` class method, import the `BaseModel` from the `pandantic` package instead of from `pydantic`. Then, just like a `pydantic` user normally would, specify the schema in the `BaseModel` class:
```python
from pandantic import BaseModel
from pydantic.types import StrictInt


class DataFrameSchema(BaseModel):
    """Example schema for testing."""

    example_str: str
    example_int: StrictInt
```
Let’s try this schema on a simple `pandas.DataFrame`. It will become more complex later on, promised! Use the class method `parse_df` from the freshly defined `DataFrameSchema` and specify the `dataframe` that should be validated. In this example, the user wants to filter out the bad rows and return a dataset containing only the valid rows. There are more options for the `errors` argument, such as `"raise"`, which raises a `ValueError` after validating the whole DataFrame.
```python
import pandas as pd

df_invalid = pd.DataFrame(
    data={
        "example_str": ["foo", "bar", 1],
        "example_int": ["1", 2, 3.0],
    }
)

df_filtered = DataFrameSchema.parse_df(
    dataframe=df_invalid,
    errors="filter",
)
```
`df_filtered` will only consist of the second row, containing `bar` and `2`, filtering out all the invalid records of the DataFrame: the first row fails because `"1"` is a string rather than a strict integer, and the third row fails both on the integer `1` in `example_str` and the float `3.0` in `example_int`.
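For illustration, printing the result should look roughly like this (the exact formatting and index handling depend on pandas and `pandantic`):

```python
print(df_filtered)
#   example_str  example_int
# 1         bar            2
```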
Spicing things up using a custom validator
Let’s make things more interesting and implement a custom field validator from Pydantic in the `BaseModel` class, then apply it to the same DataFrame. Imagine the situation that the `BaseModel` should be able to make sure the `example_int` field’s value is an even number. Just like in the regular `pydantic` package, it is as easy as implementing the (new) `field_validator` decorator from `pydantic` and coding the right logic to make sure the integer is even. Note that a custom validator signals a failure by raising a `ValueError`, which pydantic wraps in a `ValidationError` for you:
```python
from pandantic import BaseModel
from pydantic import field_validator
from pydantic.types import StrictInt


class DataFrameSchema(BaseModel):
    """Example schema for testing."""

    example_str: str
    example_int: StrictInt

    @field_validator("example_int")
    @classmethod
    def validate_even_integer(cls, x: int) -> int:
        """Example custom validator to check that the int is even."""
        if x % 2 != 0:
            # Raising ValueError marks the field as invalid;
            # pydantic converts it into a ValidationError.
            raise ValueError(f"example_int must be even, is {x}.")
        return x
```
Using the example `BaseModel` from above, it is possible to parse the following `pandas.DataFrame`. If the validation does not pass, an error is raised. By setting the `errors` argument to `"raise"`, the code will raise a `ValueError` after validating every row, as the first row contains an odd number.
```python
example_df_invalid = pd.DataFrame(
    data={
        "example_str": ["foo", "bar", "baz"],
        "example_int": [1, 4, 12],
    }
)

df_raised_error = DataFrameSchema.parse_df(
    dataframe=example_df_invalid,
    errors="raise",
)
```
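If you would rather handle the failure than let it stop the script, you can wrap the call in a `try`/`except`; a small sketch (the exact error message comes from `pandantic` and may differ):

```python
try:
    DataFrameSchema.parse_df(
        dataframe=example_df_invalid,
        errors="raise",
    )
except ValueError as exc:
    # The odd value 1 in the first row trips the custom validator.
    print(f"Validation failed: {exc}")
```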
Make it fly using multiple processes
In order to compete with the big guys (`pandera` and `great-expectations`), I decided to spend some more time on making the code run in parallel batches. This can speed up the validation process significantly for bigger datasets. It will not help on smaller datasets (but validation of smaller DataFrames is lightning fast anyway).
Set the number of processes using the `n_jobs` argument. By default only one process is used, but by specifying this argument you can utilize the full capacity of your processor. Benchmark tests are on the way; the first results show great potential.
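For illustration, a parallel validation call could look like this; a sketch assuming the `DataFrameSchema` from above and a hypothetical larger frame `df_large`:

```python
df_validated = DataFrameSchema.parse_df(
    dataframe=df_large,
    errors="filter",
    n_jobs=4,  # validate in parallel batches across 4 processes
)
```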
Documentation and next steps
Documentation on the `parse_df` method can be found here, including some similar usage examples. Next steps for the `pandantic` project will focus on implementing the DataFrame validation method for other famous DataFrame frameworks like Polars and PySpark. Thank you for taking the time to read this blog. I hope the `pandantic` package is able to improve your coding workflow (and life).