
How to drop rows of Pandas DataFrame whose value in a certain column is NaN?

Dropping rows with missing or NaN values is a common data-cleaning task in Pandas. Below are some straightforward ways to drop rows specifically where a certain column contains NaN.

1. Using dropna() with subset Parameter

The most direct way is to use dropna() and specify which column (or columns) to look at via the subset parameter.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, 7, 8],
    "C": [9, 10, 11, np.nan]
})

# Drop rows where column "A" has NaN
df_dropped = df.dropna(subset=["A"])
print(df_dropped)
  • subset=["A"] tells Pandas to only look at column A when deciding whether to drop a row.
  • Rows that have NaN in column A get dropped, but rows with NaNs in other columns are unaffected.
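
With the sample DataFrame above, the result would look roughly like this (exact spacing may differ): row index 2, where A is NaN, is removed, while the NaN values in columns B and C are left alone.

     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
3  4.0  8.0   NaN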

2. Using a Boolean Condition

Sometimes, you might want a more explicit approach. You can create a boolean mask that identifies rows where the specified column is not NaN, then filter the DataFrame:

df_dropped = df[df["A"].notna()]
print(df_dropped)
  • df["A"].notna() produces a boolean Series that is True where A is not NaN, and False otherwise.
  • Slicing df[...] with that boolean array returns only rows where A is non-null.
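
A nice property of the boolean-mask style is that it composes with other conditions in a single expression. As a minimal sketch (the threshold of 3 is purely illustrative, not part of the original example):

# Keep rows where A is not NaN and, additionally, A is greater than 3
df_filtered = df[df["A"].notna() & (df["A"] > 3)]
print(df_filtered)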

3. Dropping Multiple Columns’ NaNs

If you want to drop rows that have NaNs in any of several columns, specify multiple columns in subset:

# Drop rows where columns "A" OR "B" contain NaN
df_dropped = df.dropna(subset=["A", "B"])
print(df_dropped)

Now, a row is dropped if it has NaN in either A or B.
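
Conversely, if you only want to drop a row when all of the listed columns are NaN, dropna() also accepts a how parameter. A minimal sketch:

# Drop rows only if BOTH "A" and "B" are NaN in the same row
df_dropped = df.dropna(subset=["A", "B"], how="all")
print(df_dropped)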

4. Additional Tips

  1. inplace=True:

    • You can modify the original DataFrame without creating a new one by using inplace=True. For example:
      df.dropna(subset=["A"], inplace=True)
    • However, using inplace=True is generally discouraged in larger pipelines where functional approaches (returning a new DataFrame) are more predictable.
  2. Keep an Eye on Other Columns:

    • Make sure you’re only dropping rows based on the columns you truly don’t want NaNs in. If your logic is more complex (e.g., dropping rows only when two specific columns are both NaN), you may need to combine approaches; see the sketch after this list.
  3. Performance:

    • Dropping rows in large DataFrames can be expensive in memory and computation. Always validate that this operation is necessary, especially if you’re dealing with massive datasets.
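
Tying the last two tips together: before dropping anything, it can be worth counting how many rows would actually be affected, and the "both columns are NaN" case can be expressed directly with a boolean mask. A minimal sketch, reusing the sample df from above:

# Count rows where BOTH "A" and "B" are NaN before dropping anything
both_nan = df["A"].isna() & df["B"].isna()
print("Rows with NaN in both A and B:", both_nan.sum())

# Keep every row except those where both A and B are NaN
df_dropped = df[~both_nan]
print(df_dropped)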


Final Thoughts

Dropping rows based on NaN values in a particular column is usually done via:

df_dropped = df.dropna(subset=["column_name"])

or by filtering with a boolean mask:

df_dropped = df[df["column_name"].notna()]

Choose the method that best fits your workflow, and you’ll maintain clean, consistent data for analysis or further processing. Happy data wrangling!
