shivamshinde92722

Data Reliability 101: A Practical Guide to Data Validation Using Pydantic in Data Science Projects

This article explains why data validation is needed in Python code, how it is done with the Pydantic library, and how to integrate it into your data science projects.



Table of Contents
  1. Why is Data Validation Needed in Python?

  2. Pydantic Python Library

  3. Pydantic Components: Models, Fields, Required, Optional, and Nullable Fields, Field Validators

  4. Using Pydantic for Data Validation of Data in DataFrame Format


Why is Data Validation Needed in Python?

Python is a dynamically typed language. That means the datatype of a variable is determined by its value, and you don’t have to declare a type when initializing the variable. The interpreter assigns types to variables at runtime.


This makes Python easy to start with. However, this approach has some disadvantages.

In Python, you can overwrite the value of a variable with another value of a different datatype.


a = 10 # Initializing the variable (note that we have not mentioned the type)
a = "ten" # Rebinding the variable to a new value of a different type (str)

This seems fine at this point, but it could unintentionally create issues later in the code.

Also, it is not easy to understand the datatype of a variable at first glance. This is particularly inconvenient in the case of functions.


Suppose we have a function named resize_image as follows:


def resize_image(image, dim):
    # ...

Looking at this function alone, we can’t tell the datatype of the parameter named dim. And even if we assume dim is a list or tuple, another question arises: which dimension comes first, x or y?
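One way to make the intent visible is to annotate the function with type hints. The following is only a sketch: the Dimensions class, the bytes type for image, and the placeholder function body are assumptions made for illustration and are not part of the original function. Note that Python does not enforce these hints at runtime, as the last line shows.

```python
from typing import NamedTuple


class Dimensions(NamedTuple):  # hypothetical helper type
    width: int   # x comes first
    height: int  # y comes second


def resize_image(image: bytes, dim: Dimensions) -> str:
    # Placeholder body: a real implementation would resize the image
    return f"resizing {len(image)} bytes to {dim.width}x{dim.height}"


message = resize_image(b"\x00" * 4, Dimensions(width=640, height=480))
print(message)

# Type hints are documentation only: this wrong type is accepted silently
unchecked = Dimensions(width="640", height=480)
```

The hints answer both questions from above (what dim is, and which dimension comes first), but nothing stops a caller from passing the wrong type.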


Another problem with dynamic typing is that it lets us create objects with incorrect datatypes, and we won’t know about the resulting errors until we use those objects. For example,


P1 = Person(name="Kelsier", age=24)
P2 = Person(name="Breeze", age="35")

Here, both objects are created even though one of the age values is passed as a string. Again, this could cause bugs at a later stage of our code.
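The snippet above does not show how Person is defined, so the sketch below assumes a plain Python class. The helper years_until_retirement is also hypothetical and is only there to show where the bug finally surfaces.

```python
class Person:  # assumed definition: a plain class without any validation
    def __init__(self, name, age):
        self.name = name
        self.age = age


def years_until_retirement(person, retirement_age=65):
    # Hypothetical helper that needs age to be a number
    return retirement_age - person.age


p1 = Person(name="Kelsier", age=24)
p2 = Person(name="Breeze", age="35")  # created without any complaint

print(years_until_retirement(p1))  # 41

late_error = None
try:
    years_until_retirement(p2)
except TypeError as exc:
    # The wrong type is only detected here, far from where it was introduced
    late_error = exc
    print(f"Fails only at use: {exc}")
```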


Developers want to learn about such errors as early as possible in the development life cycle.


Pydantic Python Library

Pydantic is a data validation library for Python. We can use Pydantic to validate data and its datatypes before using the data in any kind of operation. This way, we can avoid potential bugs like the ones mentioned earlier.


The Pydantic library does more than just validate datatypes, as we will see next.


Pydantic Components

Models


One of the basic ways of using validation is via models. Models are classes that inherit from the pydantic.BaseModel class and define their fields as annotated attributes.


When we pass data that may contain mismatched datatypes, Pydantic parses and validates it and guarantees that the fields of the resulting model instance conform to the datatypes declared in the model. If that is not possible, it raises a validation error.


Model Usage:


from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = 'Jane Doe'

user = User(id='123')

assert user.id == 123
assert isinstance(user.id, int)
# Note that '123' was coerced to an int and its value is 123

try:
    user2 = User(id="onetwothree")
except ValidationError as e:
    # Pydantic raises a validation error because "onetwothree" cannot be
    # converted to an int
    print(e)

In the above example, we have created a class User with two fields, id and name, and specified their datatypes. When we pass the string value “123” for the id field while creating an object, Pydantic automatically converts “123” into 123 and assigns it to id. If such a conversion is not possible (as when we tried to create the user2 object), it raises a validation error.


Notice that the name field has a default value, ‘Jane Doe’.
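When a validation error is raised, it can also be inspected programmatically instead of just printed. The sketch below assumes Pydantic v2 and reuses the User model from above. It reads the structured error details through the errors() method of ValidationError.

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = 'Jane Doe'


details = []
try:
    User(id='onetwothree')
except ValidationError as e:
    # errors() returns a list of dicts, one per failed field
    details = e.errors()
    for error in details:
        # 'loc' is the field location, 'type' the error code, 'msg' the message
        print(error['loc'], error['type'], error['msg'])
```

Having the field name and error code as data is handy when we want to log or count failures rather than stop at the first one.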


Fields


The Field function is used to customize the validation of fields of the model.


  1. Setting the default value to the field


We use the default keyword inside the Field function to give the field a default value.


from pydantic import BaseModel, Field

class User(BaseModel):
    name: str = Field(default='John Doe')

user = User()
print(user)
#> name='John Doe'

  2. Adding numerical constraints to the fields


We use the following keywords inside the Field function to add numerical constraints to a field:


  • gt - greater than

  • lt - less than

  • ge - greater than or equal to

  • le - less than or equal to

  • multiple_of - a multiple of the given number

  • allow_inf_nan - allow 'inf', '-inf', 'nan' values


from pydantic import BaseModel, Field


class Foo(BaseModel):
    positive: int = Field(gt=0)
    non_negative: int = Field(ge=0)
    negative: int = Field(lt=0)
    non_positive: int = Field(le=0)
    even: int = Field(multiple_of=2)
    love_for_pydantic: float = Field(allow_inf_nan=True)


foo = Foo(
    positive=1,
    non_negative=0,
    negative=-1,
    non_positive=0,
    even=2,
    love_for_pydantic=float('inf'),
)
print(foo)
"""
positive=1 non_negative=0 negative=-1 non_positive=0 even=2 love_for_pydantic=inf
"""

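The example above only passes valid values, so let’s also look at what happens when the constraints are violated. This is a minimal sketch assuming Pydantic v2. The Order model and its field names are hypothetical and chosen only for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class Order(BaseModel):  # hypothetical model, for illustration only
    quantity: int = Field(gt=0)                # must be greater than 0
    pack_size: int = Field(multiple_of=2)      # must be even
    ratio: float = Field(allow_inf_nan=False)  # inf and nan are rejected


failed_fields = set()
try:
    Order(quantity=0, pack_size=3, ratio=float('nan'))
except ValidationError as e:
    # All failing fields are reported together in a single ValidationError
    failed_fields = {error['loc'][0] for error in e.errors()}
    print(failed_fields)
```

Pydantic does not stop at the first bad field. It collects the failures of all fields, which makes it easier to fix a bad record in one pass.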
  3. String constraints


We use the following keywords to constrain the strings:


  • min_length: Minimum length of the string.

  • max_length: Maximum length of the string.

  • pattern: A regular expression that the string must match.


from pydantic import BaseModel, Field

class Foo(BaseModel):
    short: str = Field(min_length=3)
    long: str = Field(max_length=10)
    regex: str = Field(pattern=r'^\d*$')  

foo = Foo(short='foo', long='foobarbaz', regex='123')
print(foo)
#> short='foo' long='foobarbaz' regex='123'
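A detail worth noticing is that the pattern ^\d*$ used above also matches an empty string, because * means "zero or more". The sketch below, assuming Pydantic v2, contrasts it with ^\d+$ and shows a stricter pattern for an ID-like string. The Patient and Digits models and the accepts helper are hypothetical names introduced only for this illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class Patient(BaseModel):  # hypothetical model
    code: str = Field(pattern=r'^P\d{4}$')  # 'P' followed by exactly 4 digits


class Digits(BaseModel):  # hypothetical model
    loose: str = Field(pattern=r'^\d*$')   # zero or more digits, so '' passes
    strict: str = Field(pattern=r'^\d+$')  # at least one digit is required


def accepts(model, **data) -> bool:
    """Return True if the model validates the given data, else False."""
    try:
        model(**data)
        return True
    except ValidationError:
        return False


print(accepts(Patient, code='P1234'))       # True
print(accepts(Patient, code='P12'))         # False
print(accepts(Digits, loose='', strict='7'))  # True
print(accepts(Digits, loose='', strict=''))   # False
```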

Required, Optional, and Nullable Fields

We can declare fields so that they are required or optional, and so that they can or cannot be None.


We can use the following example as a guide when creating a class whose fields have these constraints:



from typing import Optional

from pydantic import BaseModel, ValidationError


class Foo(BaseModel):
    f1: str  # required, cannot be None
    f2: Optional[str]  # required, can be None - same as str | None
    f3: Optional[str] = None  # not required, can be None
    f4: str = 'Foobar'  # not required, but cannot be None


try:
    Foo(f1=None, f2=None, f4='b')
except ValidationError as e:
    print(e)
    """
    1 validation error for Foo
    f1
      Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    """

Here, the f1 field is required and cannot be None. However, we pass None for f1 while creating the object, and hence we get a validation error.
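The difference between f2 and f3 is easy to miss: both accept None, but only f3 may be left out. The sketch below assumes Pydantic v2 (in v1, an Optional field without a default behaved differently) and reuses the Foo model from above.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Foo(BaseModel):
    f1: str  # required, cannot be None
    f2: Optional[str]  # required, can be None
    f3: Optional[str] = None  # not required, can be None
    f4: str = 'Foobar'  # not required, but cannot be None


# f3 and f4 fall back to their defaults; f2 must be passed, even as None
ok = Foo(f1='a', f2=None)
print(ok)
#> f1='a' f2=None f3=None f4='Foobar'

missing = []
try:
    Foo(f1='a')  # f2 is omitted
except ValidationError as e:
    missing = [(error['loc'], error['type']) for error in e.errors()]
    print(missing)
    #> [(('f2',), 'missing')]
```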


Field Validators

If we want to apply custom validation to any of our fields, we can do that by writing a class method that implements the validation criteria and decorating it with @field_validator.


from pydantic import (
    BaseModel,
    ValidationError,
    ValidationInfo,
    field_validator,
)


class UserModel(BaseModel):
    id: int
    name: str

    @field_validator('name')
    @classmethod
    def name_must_contain_space(cls, v: str) -> str:
        if ' ' not in v:
            raise ValueError('must contain a space')
        return v.title()

    # you can select multiple fields, or use '*' to select all fields
    @field_validator('id', 'name')
    @classmethod
    def check_alphanumeric(cls, v: str, info: ValidationInfo) -> str:
        if isinstance(v, str):
            # info.field_name is the name of the field being validated
            is_alphanumeric = v.replace(' ', '').isalnum()
            assert is_alphanumeric, f'{info.field_name} must be alphanumeric'
        return v


print(UserModel(id=1, name='John Doe'))
#> id=1 name='John Doe'

try:
    UserModel(id=1, name='samuel')
except ValidationError as e:
    print(e)
    """
    1 validation error for UserModel
    name
      Value error, must contain a space [type=value_error, input_value='samuel', input_type=str]
    """

try:
    UserModel(id='abc', name='John Doe')
except ValidationError as e:
    print(e)
    """
    1 validation error for UserModel
    id
      Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='abc', input_type=str]
    """

try:
    UserModel(id=1, name='John Doe!')
except ValidationError as e:
    print(e)
    """
    1 validation error for UserModel
    name
      Assertion failed, name must be alphanumeric
    assert False [type=assertion_error, input_value='John Doe!', input_type=str]
    """

Here, we have created two custom validations using the methods name_must_contain_space(…) and check_alphanumeric(…). The first method applies to the name field. It guarantees that the name contains at least one space and returns the name in title case. The second method applies to both id and name and checks that any string value is alphanumeric, ignoring spaces.


Besides the valid UserModel(id=1, name='John Doe'), we also tried to create three invalid objects in the code.


UserModel(id=1, name='samuel')

Here, we get the validation error because of the name_must_contain_space() field validator, since the name field does not have a space in it.


UserModel(id='abc', name='John Doe')

Here, we get the validation error because the id is not of integer datatype and pydantic is not able to convert it into one.


UserModel(id=1, name='John Doe!')

Here, we get the validation error because of the check_alphanumeric(…) field validator. Since the name field contains an exclamation mark (!), it is no longer an alphanumeric string.
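Field validators are also useful for checking that a column only contains values from a known set, and for normalizing them along the way. The following sketch assumes Pydantic v2. The LabFlags model is hypothetical, and the assumption that the flag columns hold the values 't' and 'f' is made only for illustration.

```python
from pydantic import BaseModel, ValidationError, field_validator


class LabFlags(BaseModel):  # hypothetical model, for illustration only
    TSH_measured: str
    T3_measured: str

    # One validator can be reused for several fields
    @field_validator('TSH_measured', 'T3_measured')
    @classmethod
    def must_be_flag(cls, v: str) -> str:
        v = v.strip().lower()  # normalize before checking
        if v not in {'t', 'f'}:  # assumed set of allowed values
            raise ValueError("must be 't' or 'f'")
        return v  # the normalized value is stored on the model


flags = LabFlags(TSH_measured=' T ', T3_measured='f')
print(flags)

bad_fields = []
try:
    LabFlags(TSH_measured='yes', T3_measured='f')
except ValidationError as e:
    bad_fields = [error['loc'][0] for error in e.errors()]
    print(bad_fields)
```

Whatever the validator returns becomes the field value, so ' T ' is stored as 't'.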


Using Pydantic for Data Validation of Data in DataFrame Format

Now that we have seen how to use Pydantic for the validation of fields in a class, let’s extend this knowledge to our data science project.


We will use Pydantic validations to constrain data records in our dataframe. This ensures we don’t use any invalid data while building our machine-learning model.


For a demonstration of this, let’s use the ‘Thyroid Disease Dataset’ from Kaggle. I will skip the exploratory data analysis and data preprocessing steps since they are outside the scope of this article. If you are interested in all the steps of the project, check out the whole code using the below link.



Now let’s see how to perform data validation on the dataframe.


import pandas as pd
from pydantic import BaseModel, ValidationError, Field
from typing import List, Optional


class Dictvalidator(BaseModel):

    age: int = Field(gt=0, le=100)
    sex: Optional[str]
    on_thyroxine: Optional[str]
    query_on_thyroxine: Optional[str]
    on_antithyroid_meds: Optional[str]
    sick: Optional[str]
    pregnant: Optional[str]
    thyroid_surgery: Optional[str]
    I131_treatment: Optional[str]
    query_hypothyroid: Optional[str]
    query_hyperthyroid: Optional[str]
    lithium: Optional[str]
    goitre: Optional[str]
    tumor: Optional[str]
    hypopituitary: Optional[str]
    psych: Optional[str]
    TSH_measured: str
    TSH: Optional[float]
    T3_measured: str
    T3: Optional[float]
    TT4_measured: str
    TT4: Optional[float]
    T4U_measured: str
    T4U: Optional[float]
    FTI_measured: str
    FTI: Optional[float]
    TBG_measured: str
    TBG: Optional[float]
    referral_source: Optional[str]
    target: str
    patient_id: int


class dataframe_validator(BaseModel):

    df_dict: List[Dictvalidator]


if __name__ == '__main__':

    # raw_data_file_path must point to the raw dataset CSV file
    df = pd.read_csv(raw_data_file_path)

    # Raises a ValidationError if any record violates the constraints above
    dataframe_validator(df_dict=df.to_dict(orient='records'))

First, we create a class named Dictvalidator with all the features of the dataframe as fields, adding the constraints we want on these fields as shown in the code above.


Next, we will create another class named dataframe_validator which will have a field that is the list of Dictvalidator. Now when we create an instance of the dataframe_validator class and pass the dataframe records as a list of dictionaries, all the fields of the dataframe will be validated.
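Two practical points are worth sketching here. First, pandas represents missing values as NaN, which is a float, so a missing value in a text column would not pass an Optional[str] field unless it is converted to None first. Second, validating row by row lets us collect the positions of all invalid rows instead of stopping at the first error. The code below is a minimal sketch assuming Pydantic v2 and a recent pandas version. The Record model is a hypothetical, trimmed-down version of Dictvalidator, and the toy dataframe and the helper names clean_records and find_invalid_rows are made up for illustration.

```python
from typing import Optional

import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class Record(BaseModel):  # hypothetical, trimmed-down Dictvalidator
    age: int = Field(gt=0, le=100)
    sex: Optional[str]
    TSH: Optional[float]


def clean_records(df: pd.DataFrame) -> list:
    """Convert the dataframe to a list of dicts, replacing NaN with None."""
    return [
        {key: (None if pd.isna(value) else value) for key, value in row.items()}
        for row in df.to_dict(orient='records')
    ]


def find_invalid_rows(df: pd.DataFrame) -> dict:
    """Map the position of each invalid row to the names of its bad fields."""
    invalid = {}
    for position, row in enumerate(clean_records(df)):
        try:
            Record.model_validate(row)
        except ValidationError as e:
            invalid[position] = [error['loc'][0] for error in e.errors()]
    return invalid


# Made-up data: the second row has an out-of-range age
toy_df = pd.DataFrame({
    'age': [29, 455, 41],
    'sex': ['F', 'M', None],
    'TSH': [1.3, None, 0.9],
})

invalid_rows = find_invalid_rows(toy_df)
print(invalid_rows)
```

With the invalid positions at hand, we can decide whether to drop, fix, or investigate those records before training a model.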


Also, there is another way of validating the pandas dataframe. We can use a Python library called Pandantic for this. You can check out the article by Wessel Huising on how to use this library for the validation of dataframes. The link for the article is given in the reference section.


 
References





 


Thanks for reading!




Have a great day!
