Chapter 02 — Pandas

Pandas Mastery

From raw CSV to clean, ML-ready data. Learn Series, DataFrames, cleaning, transforming, grouping, and encoding — all the tools a data scientist uses daily.

Topics: DataFrame · Series · groupby · EDA · get_dummies · datetime
04 — Pandas Series & DataFrame

Pandas has two main data structures. A Series is a 1D labeled array (like a single column). A DataFrame is a 2D table with labeled rows and columns — the most important object in data science.

Series
series.py
import pandas as pd

# Create Series from list
s = pd.Series([10, 20, 30, 40], index=['a','b','c','d'])

# Attributes
print(s.values)   # [10 20 30 40]
print(s.index)   # Index(['a','b','c','d'])
print(s.dtype)   # int64
print(s.name)    # None

# Access elements
print(s['b'])     # 20 (by label)
print(s.iloc[0])  # 10 (by position — s[0] is deprecated for labeled Series)

# Arithmetic (broadcasts like NumPy)
print(s * 2)     # a:20 b:40 c:60 d:80
print(s > 15)   # a:False b:True c:True d:True
Output
a    10
b    20
c    30
d    40
dtype: int64
DataFrame
dataframe.py
import pandas as pd

# Create from dictionary
df = pd.DataFrame({
    'name'  : ['Alice', 'Bob', 'Charlie'],
    'age'   : [25, 30, 22],
    'score' : [88.5, 92.0, 75.0]
})

# Inspect
print(df.shape)      # (3, 3)
print(df.columns)    # Index(['name','age','score'])
print(df.dtypes)     # name:object age:int64 score:float64
df.info()            # summary of shape, dtypes, nulls
df.describe()        # stats: count, mean, std, min, max

# Add / rename / drop columns
df['grade'] = df['score'] >= 85          # add column
df = df.rename(columns={'age': 'Age'})   # rename
df = df.drop('grade', axis=1)            # drop column

# Sorting
df.sort_values('score', ascending=False)
df['name'].value_counts()               # frequency of each name
Output
      name  age  score
0    Alice   25   88.5
1      Bob   30   92.0
2  Charlie   22   75.0

shape: (3, 3)
05 — CSV & Excel Read / Write

Reading and writing files is the very first step of every real-world data project. Pandas supports CSV, Excel, JSON, SQL and more.

read_write.py
import pandas as pd

# ── READ CSV ──
df = pd.read_csv('data.csv')

# Common parameters
df = pd.read_csv(
    'data.csv',
    sep=',',           # delimiter (use '\t' for TSV)
    header=0,          # row to use as column names
    index_col=0,        # use first col as row index
    usecols=['a','b'],  # load only these columns
    nrows=100,          # load only first 100 rows
    na_values=['?','NA'], # treat these as NaN
    parse_dates=['date'] # auto-parse date column
)

# READ EXCEL
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# ── QUICK INSPECTION after loading ──
df.head(5)          # first 5 rows
df.tail(3)          # last 3 rows
df.sample(5)        # random 5 rows
df.info()           # dtypes, non-null counts
df.describe()       # statistics

# ── WRITE ──
df.to_csv('output.csv', index=False)          # index=False avoids extra column
df.to_excel('output.xlsx', index=False, sheet_name='Results')
df.to_json('output.json', orient='records')
Tip
Always run df.info() right after loading — it tells you shape, dtypes, and how many nulls exist per column in one shot.
Always use index=False when saving
By default, to_csv() writes the DataFrame index as an extra column. Always pass index=False unless you specifically need the index saved.
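The stray index column is easy to see in a round trip. A minimal sketch using an in-memory buffer (so no file is written — the `io.StringIO` stand-in is just for demonstration):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write WITH the index (the default) and read it back
buf = io.StringIO()
df.to_csv(buf)                      # index=True by default
buf.seek(0)
with_index = pd.read_csv(buf)       # index comes back as 'Unnamed: 0'

# Write WITHOUT the index
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
without_index = pd.read_csv(buf2)

print(with_index.columns.tolist())     # ['Unnamed: 0', 'a', 'b']
print(without_index.columns.tolist())  # ['a', 'b']
```

That `Unnamed: 0` column is the classic symptom of forgetting `index=False`.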
06 — Data Selection

Pandas gives you multiple ways to select data. Understanding the difference between loc, iloc, and boolean indexing is essential.

loc and iloc
loc_iloc.py
import pandas as pd

df = pd.DataFrame({
    'name' : ['Alice','Bob','Charlie','Diana'],
    'age'  : [25, 30, 22, 28],
    'score': [88, 92, 75, 95]
})

# ── Column selection ──
df['name']                 # single column → Series
df[['name', 'score']]     # multiple columns → DataFrame

# ── loc — label-based (uses row labels / column names) ──
df.loc[0]                  # row at label 0
df.loc[0:2]               # rows 0 to 2 INCLUSIVE
df.loc[0, 'name']         # row 0, 'name' column → 'Alice'
df.loc[1:3, ['name','age']] # rows 1-3, specific cols

# ── iloc — integer position-based ──
df.iloc[0]                 # first row
df.iloc[-1]                # last row
df.iloc[0:2, 1:3]         # rows 0-1 (exclusive 2), cols 1-2
df.iloc[[0,3], :]          # rows 0 and 3, all columns
Key Difference
loc[0:2]  → rows 0, 1, 2 (INCLUSIVE end)
iloc[0:2] → rows 0, 1 (EXCLUSIVE end, like Python slicing)
loc vs iloc — Quick Rule
Use loc when you know the label/name. Use iloc when you know the position number. Think: loc = Label, iloc = Integer.
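The difference only becomes visible once row labels stop matching positions. A small sketch with a custom integer index (labels 10/20/30 are made up for illustration):

```python
import pandas as pd

# A toy frame whose row labels are NOT 0, 1, 2 — here loc and iloc diverge
df = pd.DataFrame(
    {'score': [88, 92, 75]},
    index=[10, 20, 30]          # custom row labels
)

print(df.loc[10, 'score'])      # 88 — loc matches the LABEL 10
print(df.iloc[0]['score'])      # 88 — iloc matches POSITION 0
print(df.loc[10:20])            # rows labeled 10 and 20 (inclusive)
print(df.iloc[0:2])             # rows at positions 0 and 1 (exclusive end)
```

With the default 0..n-1 index, `loc` and `iloc` happen to agree on single rows, which hides bugs — testing against a non-default index like this is a good habit.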
Boolean Indexing & Query
boolean_filter.py
# Single condition
df[df['age'] > 25]

# Multiple conditions — use & | ~
df[(df['age'] > 20) & (df['score'] >= 90)]
df[(df['age'] < 23) | (df['score'] > 93)]
df[~(df['name'] == 'Bob')]      # NOT Bob

# .isin() — filter by a list of values
df[df['name'].isin(['Alice', 'Diana'])]

# .between() — range filter
df[df['age'].between(22, 28)]

# .query() — SQL-style, cleaner syntax
df.query('age > 25 and score >= 90')
df.query('name in ["Alice", "Diana"]')
Output — df[df['score'] >= 90]
    name  age  score
1    Bob   30     92
3  Diana   28     95
07 — Missing Data Handling

Real-world datasets always have missing values. Properly handling them is one of the most critical steps in any ML pipeline — wrong choices here can silently ruin your model.

missing_data.py
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 5, 6, np.nan],
    'C': [7, 8, 9, 10]
})

# ── Detect ──
print(df.isnull().sum())   # count nulls per column
print(df.isnull().sum().sum())  # total nulls

# ── Drop ──
df.dropna()                     # drop any row with a null
df.dropna(axis=1)               # drop any COLUMN with a null
df.dropna(how='all')            # drop only if ALL values null
df.dropna(thresh=2)             # keep rows with at least 2 non-nulls
df.dropna(subset=['A'])         # drop only if col A is null

# ── Fill ──
df.fillna(0)                    # fill all nulls with 0
df['A'].fillna(df['A'].mean()) # fill with mean — very common in ML!
df['B'].ffill()                # forward fill (fillna(method='ffill') is deprecated)
df['B'].bfill()                # backward fill (use next value)

# ── Replace specific values ──
df.replace(-999, np.nan)       # replace sentinel value with NaN
df.replace({'A': {0: np.nan}}) # replace 0 in column A with NaN
Output — df.isnull().sum()
A    1
B    2
C    0
dtype: int64

Total nulls: 3
ML Best Practice — When to fill vs drop
Fill with mean/median for numerical columns with few nulls. Fill with mode for categorical. Drop rows only if the null % is very low. Drop columns if they have >50% nulls.
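That rule of thumb can be sketched as one small helper. The function name, the 50% threshold, and the median-over-mean choice are our own assumptions here — tune them per dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical helper implementing the best-practice rule above
def simple_impute(df, max_null_frac=0.5):
    df = df.copy()
    # Drop columns that are mostly null
    df = df.loc[:, df.isnull().mean() <= max_null_frac]
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])   # mode for categorical
        else:
            df[col] = df[col].fillna(df[col].median())    # median for numerical
    return df

df = pd.DataFrame({
    'num': [1.0, np.nan, 3.0, 5.0],
    'cat': ['x', 'y', None, 'x'],
    'mostly_null': [np.nan, np.nan, np.nan, 1.0],
})
clean = simple_impute(df)
print(clean.isnull().sum().sum())   # 0 — no nulls remain
print(clean.columns.tolist())       # ['num', 'cat'] — 'mostly_null' dropped
```

Median is used instead of mean because it is robust to outliers; either fits the "fill numerical with a central value" rule.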
08 — groupby

groupby splits the DataFrame into groups based on a column, applies a function to each group, and combines the results — a pattern called Split → Apply → Combine.

groupby.py
import pandas as pd

df = pd.DataFrame({
    'dept' : ['Eng','Mkt','Eng','Mkt','Eng'],
    'name' : ['Alice','Bob','Charlie','Diana','Eve'],
    'salary':[80,60,95,70,88],
    'years' :[3,5,7,2,4]
})

# Basic groupby
df.groupby('dept')['salary'].mean()
df.groupby('dept')['salary'].sum()
df.groupby('dept').size()            # count rows per group

# Multiple aggregations at once
df.groupby('dept').agg({
    'salary': ['mean', 'max', 'min'],
    'years':  ['sum', 'count']
})

# transform — returns same shape as original (useful for feature engineering)
df['dept_mean_sal'] = df.groupby('dept')['salary'].transform('mean')

# filter — keep only groups matching condition
df.groupby('dept').filter(lambda x: x['salary'].mean() > 75)
Output — groupby('dept')['salary'].mean()
dept
Eng    87.666667
Mkt    65.000000
Name: salary, dtype: float64
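One more agg form worth knowing: the dict style above returns MultiIndex columns, while *named aggregation* (keyword form) gives flat, readable column names. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'dept':   ['Eng', 'Mkt', 'Eng', 'Mkt', 'Eng'],
    'salary': [80, 60, 95, 70, 88],
})

# Named aggregation: each keyword becomes one flat output column
summary = df.groupby('dept').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    headcount=('salary', 'count'),
)
print(summary)
```

Flat names make the result easier to merge back or export, with no column-flattening step needed.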
08b — merge & concat

merge_concat.py
import pandas as pd

df1 = pd.DataFrame({'id':[1,2,3], 'name':['A','B','C']})
df2 = pd.DataFrame({'id':[2,3,4], 'score':[88,92,75]})

# ── merge (like SQL JOIN) ──
pd.merge(df1, df2, on='id', how='inner')   # only matching ids
pd.merge(df1, df2, on='id', how='left')    # all from df1
pd.merge(df1, df2, on='id', how='outer')   # all from both

# Different key names on each side (illustrative — assumes df2 keys on 'user_id')
pd.merge(df1, df2, left_on='id', right_on='user_id')

# ── concat — stack rows or columns ──
pd.concat([df1, df2])                     # stack rows (axis=0); non-shared cols → NaN
pd.concat([df1, df2], axis=1)             # stick columns side by side
pd.concat([df1, df2], ignore_index=True)  # reset index 0,1,2...
inner merge result
   id name  score
0   2    B     88
1   3    C     92
merge vs concat — when to use which
Use merge when combining tables based on a shared key (like SQL JOIN). Use concat when simply stacking rows or columns together.
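When a join behaves unexpectedly, `indicator=True` shows where each row came from. A sketch reusing the two frames from above:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [88, 92, 75]})

# indicator=True adds a '_merge' column: left_only / right_only / both
merged = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(merged)

# Rows that failed to match — handy for debugging joins
unmatched = merged[merged['_merge'] != 'both']
print(unmatched['id'].tolist())   # [1, 4]
```

Checking `_merge` counts before and after changing join keys is a quick way to catch key typos or dtype mismatches.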
08c — pivot_table & melt

pivot.py
import pandas as pd

df = pd.DataFrame({
    'dept'  :['Eng','Eng','Mkt','Mkt'],
    'gender':['M','F','M','F'],
    'salary':[90,85,70,65]
})

# pivot_table: group + aggregate in tabular form
pivot = pd.pivot_table(
    df,
    values='salary',
    index='dept',
    columns='gender',
    aggfunc='mean'
)
print(pivot)

# cross-tabulation (frequency table)
pd.crosstab(df['dept'], df['gender'])

# melt — wide format → long format (reverse of pivot)
wide = pd.DataFrame({'name':['Alice'], 'math':[90], 'english':[85]})
long  = wide.melt(id_vars='name', var_name='subject', value_name='score')
print(long)
pivot_table output
gender     F     M
dept
Eng     85.0  90.0
Mkt     65.0  70.0
09 — apply, map & lambda Functions

These functions let you apply custom logic to every element, row, or column in a DataFrame without writing loops.

apply_map.py
import pandas as pd

df = pd.DataFrame({
    'name' :['alice','bob','charlie'],
    'score':[88, 92, 75],
    'grade':['A', 'A', 'B']
})

# ── apply on Series (single column) ──
df['name'].apply(str.title)           # Alice, Bob, Charlie
df['score'].apply(lambda x: x + 5)   # add 5 bonus to each score

# apply on DataFrame rows (axis=1)
df.apply(lambda row: row['score'] * 2 if row['grade']=='A' else row['score'], axis=1)

# Custom function with apply
def classify(score):
    if score >= 90:  return 'Excellent'
    elif score >= 75: return 'Good'
    else: return 'Needs Work'

df['level'] = df['score'].apply(classify)

# ── map — label encoding with dict ──
df['grade_num'] = df['grade'].map({'A': 1, 'B': 0})

# ── sort_values with lambda key ──
df.sort_values('name', key=lambda x: x.str.len())  # sort by name length
Output — df['level']
0         Good
1    Excellent
2         Good
Name: level, dtype: object
10 — EDA Workflow

Exploratory Data Analysis (EDA) is the process of understanding your dataset before modeling. Follow these 5 steps for any new dataset.

eda_workflow.py
import pandas as pd

# ── STEP 1: Load & First Look ──
df = pd.read_csv('dataset.csv')
print(df.shape)        # (rows, cols)
df.info()              # dtypes + null counts
df.head()

# ── STEP 2: Clean ──
df.isnull().sum()                    # nulls per column
df.duplicated().sum()               # duplicate rows
df = df.drop_duplicates()           # remove duplicates
df.columns = df.columns.str.lower().str.strip()  # normalize column names
df['age'] = df['age'].astype(int)   # fix wrong dtype

# ── STEP 3: Univariate Analysis ──
df['category'].value_counts()       # categorical column frequencies
df['age'].describe()               # count mean std min max

# ── STEP 4: Bivariate Analysis ──
df.corr(numeric_only=True)          # correlation matrix (numeric cols only)
df.groupby('category')['price'].mean()  # mean price per category
pd.crosstab(df['gender'], df['outcome'])  # frequency cross-tab

# ── STEP 5: Feature Understanding ──
cat_cols = df.select_dtypes('object').columns
num_cols = df.select_dtypes('number').columns
df[cat_cols].nunique()              # cardinality of categorical cols
df['target'].value_counts(normalize=True)  # class balance check
EDA Checklist
✓ Shape & dtypes ✓ Nulls ✓ Duplicates ✓ Value counts ✓ Correlation ✓ Class balance
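Most of the checklist can be collapsed into one reusable table. A hypothetical helper (the function name and layout are our own, not a pandas API):

```python
import pandas as pd

# One-shot per-column summary covering dtype, nulls, and cardinality
def eda_summary(df):
    return pd.DataFrame({
        'dtype':    df.dtypes.astype(str),
        'nulls':    df.isnull().sum(),
        'null_pct': (df.isnull().mean() * 100).round(1),
        'unique':   df.nunique(),
    })

df = pd.DataFrame({
    'age':  [25, 30, None, 22],
    'city': ['NY', 'NY', 'LA', None],
})
print(eda_summary(df))
```

Running a helper like this on every new dataset gives you steps 1-2 of the workflow in a single glance.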
11 — String Operations — .str accessor

The .str accessor lets you apply string methods to every element in a Series without loops.

string_ops.py
import pandas as pd

s = pd.Series(['  Hello World  ', 'foo BAR', 'pandas 2.0'])

# Case
s.str.lower()          # ['  hello world  ', 'foo bar', 'pandas 2.0']
s.str.upper()
s.str.title()

# Strip whitespace
s.str.strip()           # remove leading/trailing spaces

# Search & Replace
s.str.contains('World')            # [True, False, False]
s.str.replace('pandas', 'Python')  # replace substring
s.str.startswith('foo')
s.str.endswith('0')

# Split & Length
s.str.split(' ')         # split into list on space
s.str.len()             # length of each string
s.str.count('o')        # count 'o' in each string

# Regex extraction
emails = pd.Series(['alice@gmail.com', 'bob@yahoo.com'])
emails.str.extract(r'(\w+)@(\w+)')   # extract username and domain
str.extract output
       0      1
0  alice  gmail
1    bob  yahoo
12 — DateTime Handling

Date and time features are extremely powerful in ML (seasonality, trends, time since event). The .dt accessor unlocks all of this.

datetime.py
import pandas as pd

# Parse date strings to datetime
df = pd.DataFrame({'date': ['2024-01-15', '2024-07-04', '2023-12-25']})
df['date'] = pd.to_datetime(df['date'])

# .dt accessor — extract parts
df['year']     = df['date'].dt.year
df['month']    = df['date'].dt.month
df['day']      = df['date'].dt.day
df['weekday']  = df['date'].dt.day_name()  # 'Monday', 'Tuesday'...
df['quarter']  = df['date'].dt.quarter
df['is_month_end'] = df['date'].dt.is_month_end

# Date arithmetic (timedelta)
from datetime import timedelta
df['date_plus_7'] = df['date'] + timedelta(days=7)

# Date range creation
dates = pd.date_range('2024-01-01', periods=12, freq='ME')  # 12 month-ends ('M' is deprecated)

# Filter by date
df[df['date'] > '2024-01-01']      # filter after a date
Output
        date  year  month  day   weekday  quarter
0 2024-01-15  2024      1   15    Monday        1
1 2024-07-04  2024      7    4  Thursday        3
2 2023-12-25  2023     12   25    Monday        4
ML Feature Engineering with Dates
Always extract year, month, day_of_week, quarter as separate numeric columns before feeding to an ML model — models can't use raw datetime objects.
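A minimal sketch of that advice — note `dt.dayofweek` (0 = Monday) replaces the string-valued `day_name()` used above, so every feature is numeric:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2024-01-15', '2024-07-04'])})

# All-numeric date features a model can consume directly
features = pd.DataFrame({
    'year':        df['date'].dt.year,
    'month':       df['date'].dt.month,
    'day':         df['date'].dt.day,
    'day_of_week': df['date'].dt.dayofweek,   # 0=Monday ... 6=Sunday
    'quarter':     df['date'].dt.quarter,
})
print(features)
```

After this step the original datetime column can be dropped before fitting the model.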
13 — ML-Ready Pandas — Encoding & Binning

The final step before passing data to a scikit-learn model: encoding categorical variables into numbers and binning continuous values.

encoding_binning.py
import pandas as pd

df = pd.DataFrame({
    'color':['Red','Blue','Green','Blue'],
    'size' :['S','M','L','XL'],
    'age'  :[23,45,34,67]
})

# ── One-Hot Encoding (pd.get_dummies) ──
# Best for nominal categories (no order: Red/Blue/Green)
encoded = pd.get_dummies(df['color'], prefix='color')
# color_Blue  color_Green  color_Red  (without prefix=, a Series yields Blue/Green/Red)

# drop_first=True avoids the "dummy variable trap" in linear models
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)

# ── Label Encoding with map ──
# Best for ordinal categories (S < M < L < XL)
size_map = {'S':0, 'M':1, 'L':2, 'XL':3}
df['size_num'] = df['size'].map(size_map)

# ── Binning ──
# pd.cut — equal-width bins
df['age_group'] = pd.cut(df['age'], bins=3, labels=['young','mid','senior'])

# pd.qcut — equal-frequency bins (same number of people per bin)
df['age_quartile'] = pd.qcut(df['age'], q=4, labels=['Q1','Q2','Q3','Q4'])
get_dummies output (drop_first=False)
   color_Blue  color_Green  color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
Encoding Before Splitting is a Mistake!
Always split your data into train/test FIRST, then fit your encoders on the training set only — and transform the test set with the same encoder. Fitting on the full dataset causes data leakage.
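A minimal pandas-only sketch of the correct order (in a real pipeline you would use scikit-learn's `train_test_split` and an encoder object; the positional split here is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']})

# Split FIRST — a simple positional split for the sketch
train, test = df.iloc[:4], df.iloc[4:]

# "Fit" the encoding on train only: train's columns define the schema
train_enc = pd.get_dummies(train, columns=['color'])

# Transform test with the SAME columns; categories unseen in train become all-zero
test_enc = pd.get_dummies(test, columns=['color']).reindex(
    columns=train_enc.columns, fill_value=0
)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())   # identical schema to train
```

The `reindex(columns=...)` step is what "transform with the same encoder" looks like in plain pandas: test rows are forced into the train schema rather than defining their own.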
What Comes Next After Pandas?
Once your DataFrame is cleaned and encoded, you're ready for Scikit-learn: StandardScaler for scaling, train_test_split for splitting, and then your first ML models!