
Unlocking the Potential of Pandas: Must-Know Methods for Efficient Data Handling in Python

In this article, you will learn some of the most important, must-know methods in the Python pandas library.



Creating a dataframe


Using a dictionary:


Before learning any data processing methods, one needs to know how to load data. To create a dataframe from a dictionary, we can use the pandas.DataFrame constructor.


# Importing the pandas library
import pandas as pd

# Creating a dataframe from the dictionary
df = pd.DataFrame(
  {"a": [2, 33, 94, 84],
   "z": [59, 73, 93, 29],
   "k": [24, 56, 84, 93]  
  }
)

Using a CSV file:


We do not always get the data in dictionary format; most of the time, it comes as a CSV file. We can read a CSV file into a dataframe using the pandas.read_csv method.


df = pd.read_csv("folder_location/filename.csv")

Similarly, the pandas library also provides methods to read data from TSV files, text files, etc.
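For example, a tab-separated file can be read with the same read_csv method by passing the separator (the filename here is just a placeholder):

df = pd.read_csv("folder_location/filename.tsv", sep="\t")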


Obtaining the subset of the dataframe (Rows)


Getting the first n rows of the dataframe:


df.head(n)

Getting the last n rows of the dataframe:


df.tail(n)

Randomly selecting rows of the dataframe:


# Selecting n rows
df.sample(n)

# Selecting 50% of rows (fractions must be passed via the frac parameter)
df.sample(frac=0.5)

Getting the n rows where a certain column has the highest or lowest values:


# Getting the rows where n largest values of column_name are present
df.nlargest(n, 'column_name')

# Getting the rows where n smallest values of column_name are present
df.nsmallest(n, 'column_name')

Getting rows by their position:


df.iloc[20:30]

here iloc means index location. So here, we will get the rows at index positions 20 through 29 (the end of an iloc slice is excluded).


Getting the rows that satisfy a certain condition:


df[df.age > 10]

here we only get the rows where the value of the ‘age’ feature is greater than 10.
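Multiple conditions can be combined with the & (and) and | (or) operators, with each condition wrapped in parentheses. A minimal sketch, assuming a hypothetical ‘salary’ column:

df[(df.age > 10) & (df.salary > 50000)]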


Obtaining the subset of the dataframe (Columns)


Getting a single column:


# Way - 1
df['column_name']

# Way - 2 (only works when the column name is a valid Python
# identifier and does not clash with an existing dataframe attribute)
df.column_name

Getting multiple columns at once:


df[['column_name1', 'column_name2', 'column_name3']]

here we pass a list of column names to the dataframe to obtain the data for those columns.


Filtering the columns using a regular expression:


df.filter(regex='regex_expression')

here we get all the columns whose names match the provided regular expression.
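For instance, to keep only the columns whose names start with a hypothetical prefix ‘col’:

df.filter(regex='^col')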


Obtaining the subset of the dataframe (Rows and Columns)


Getting the columns using ‘loc’ keyword:


loc stands for location. We can use this keyword to get single or multiple columns along with any filtering of rows.


# getting all the rows with only some of the columns
df.loc[:, 'a':'b']

here the first argument, the ‘:’ symbol, means that we are getting all the rows. The second argument, ‘a’:‘b’, means that we want the values for the columns from ‘a’ to ‘b’; note that, unlike positional slicing, a label slice with loc includes both endpoints.


df.loc[df['a'] > 15, 'a':'b']

Here, the first argument is a condition (df[‘a’] > 15) that is used to filter the rows. So, overall, the above code gives us the columns from ‘a’ to ‘b’ for the rows that satisfy the given condition.


Getting the columns using ‘iloc’ keyword:


‘iloc’ stands for index location. When we used the loc keyword to get the data, we used column names and a condition for row filtering. With the iloc keyword, we instead use the integer positions of the rows and columns.


df.iloc[:, [1,2,3]]

The above code gives us all the rows for the columns with indices 1, 2, and 3.


df.iloc[10:20, [4,5]]

The above code gives us the rows at positions 10 through 19 for the columns with indices 4 and 5.


Reshaping the dataframe


concatenating two dataframes row-wise:


pd.concat([df1, df2], axis=0)

here we give the two dataframes as input and get a new dataframe formed by stacking the input dataframes row-wise.
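Note that concat keeps the original index labels, so row-wise stacking can produce duplicate row indices; passing ignore_index=True renumbers the rows:

pd.concat([df1, df2], axis=0, ignore_index=True)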


concatenating two dataframes column-wise:


pd.concat([df1, df2], axis=1)

The only thing we need to change here is the value of the axis parameter from 0 to 1. This way, we get a new dataframe formed by stacking the input dataframes column-wise.


sorting the rows by the values of a certain column:


# in ascending order
df.sort_values('column_name', ascending=True)

# in descending order
df.sort_values('column_name', ascending=False)

The pandas sort_values method sorts the rows by the values of a certain column. The ‘ascending’ parameter determines whether we sort from low to high or from high to low.


renaming a column:


df.rename(columns = {'old_name': 'new_name'})
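Note that rename returns a new dataframe rather than modifying the original, so assign the result back to keep the change:

df = df.rename(columns = {'old_name': 'new_name'})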

making a new column:


df['new_column_name'] = [29, 2938, 382, 9439]

Grouping the data


Let’s say we have a column named ‘sex’ in our dataset. The values present in the ‘sex’ column are ‘Male’, ‘Female’, and ‘Other’. We also have another column named ‘salary’ in our data. Then to answer questions like ‘What is the average salary of all males in the data?’ we could use the grouping technique.


df.groupby(by = 'sex')['salary'].mean()

In the above code snippet, we first group the data by the ‘sex’ column, then select the ‘salary’ column, and then average its values for each of the sexes.
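To compute several statistics at once, we can call the agg method on the grouped data; a minimal sketch using the same ‘sex’ and ‘salary’ columns:

df.groupby(by = 'sex')['salary'].agg(['mean', 'median', 'count'])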


Summarizing the data


finding the count of each unique value for the column:


Let’s say we have a column named ‘day_of_the_week’, which can take only 7 values, and we want to know how many records are present for each day of the week. To calculate this, we can use the pandas value_counts method.


df['column_name'].value_counts()

This method is very important: if we use it on the target column of a classification problem, we get an idea of whether our data is imbalanced.


finding the unique values in the column and their number:


# finding the list of unique values for the column
df['column_name'].unique()

# finding the number of unique values present in the column
df['column_name'].nunique()

getting the information about the datatypes of the values for each column:
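

pandas exposes each column’s datatype through the dtypes attribute, and the info method prints a fuller summary:

# Datatype of each column
df.dtypes

# Column names, non-null counts, datatypes, and memory usage
df.info()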



getting a report about the statistics of all the numerical columns:


df.describe()

getting minimum, maximum, median, mean, variance, standard deviation, and quantiles of the values in the column:


df['column_name'].min()
df['column_name'].max()
df['column_name'].median()
df['column_name'].mean()
df['column_name'].var()
df['column_name'].std()
df['column_name'].quantile([0.25, 0.50, 0.75])

Applying a function to the column values


# Applying function to the whole dataframe
df.apply(lambda x: x**2)

# Applying function to the column or group of columns
df['column_name'].apply(lambda x: x+1)

This is one of the most important methods in pandas. As you can see in the above code snippet, we pass a function, and it is applied to the whole dataframe or to the columns mentioned.
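By default, DataFrame.apply passes each column to the function; passing axis=1 applies it row by row instead. A minimal sketch, assuming the columns ‘a’ and ‘z’ from the first example:

# Sum of two columns, computed row by row
df.apply(lambda row: row['a'] + row['z'], axis=1)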


Handling missing data


removing all the rows with missing values:


df.dropna()

filling the missing values with another value:


df['column_name'].fillna(value)

in most cases, we fill the missing values in a numerical column with that column’s median.
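A minimal sketch of median filling; note that fillna returns a new column, so we assign the result back:

df['column_name'] = df['column_name'].fillna(df['column_name'].median())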


Handling the duplicate data


finding all the duplicate records:


df.duplicated()

removing all the duplicate records:


df.drop_duplicates()
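Both duplicated and drop_duplicates accept a subset parameter, which considers only the given columns when deciding whether a record is a duplicate:

df.drop_duplicates(subset=['column_name'])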

Performing operations on string data


In pandas, we can perform most of the common string operations on string data through the ‘.str’ accessor on the column.


df['column_name'].str.upper()
df['column_name'].str.lower()
df['column_name'].str.len()
df['column_name'].str.count('a')
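The ‘.str’ accessor also combines nicely with row filtering, for example keeping only the rows whose value contains a given substring:

# na=False treats missing values as non-matches
df[df['column_name'].str.contains('a', na=False)]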

Creating plots using the pandas plot function


df.plot(kind='bar')

Here the kind of the plot could be ‘bar’, ‘barh’, ‘hist’, ‘box’, ‘kde’, ‘density’, ‘area’, ‘scatter’, ‘hexbin’, ‘pie’, etc.
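Some kinds need extra arguments; for example, a scatter plot needs the x and y columns. A minimal sketch, assuming the columns ‘a’ and ‘z’ from the first example:

df.plot(kind='scatter', x='a', y='z')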


Outro

Thanks for reading!


Have a great day!
