Pandas Mastery
From raw CSV to clean, ML-ready data. Learn Series, DataFrames, cleaning, transforming, grouping, and encoding — all the tools a data scientist uses daily.
Pandas Series & DataFrame
Pandas has two main data structures. A Series is a 1D labeled array (like a single column). A DataFrame is a 2D table with labeled rows and columns — the most important object in data science.
import pandas as pd

# Create Series from list
s = pd.Series([10, 20, 30, 40], index=['a','b','c','d'])

# Attributes
print(s.values)   # [10 20 30 40]
print(s.index)    # Index(['a','b','c','d'])
print(s.dtype)    # int64
print(s.name)     # None

# Access elements
print(s['b'])     # 20
print(s.iloc[0])  # 10 — positional access (plain s[0] is deprecated for labeled indexes)

# Arithmetic (broadcasts like NumPy)
print(s * 2)      # a:20 b:40 c:60 d:80
print(s > 15)     # a:False b:True c:True d:True
import pandas as pd

# Create from dictionary
df = pd.DataFrame({
    'name' : ['Alice', 'Bob', 'Charlie'],
    'age'  : [25, 30, 22],
    'score': [88.5, 92.0, 75.0]
})

# Inspect
print(df.shape)    # (3, 3)
print(df.columns)  # Index(['name','age','score'])
print(df.dtypes)   # name:object age:int64 score:float64
df.info()          # summary of shape, dtypes, nulls
df.describe()      # stats: count, mean, std, min, max

# Add / rename / drop columns
df['grade'] = df['score'] >= 85          # add column
df = df.rename(columns={'age': 'Age'})   # rename
df = df.drop('grade', axis=1)            # drop column

# Sorting
df.sort_values('score', ascending=False)
df['name'].value_counts()                # frequency of each name
CSV & Excel Read / Write
Reading and writing files is the very first step of every real-world data project. Pandas supports CSV, Excel, JSON, SQL and more.
import pandas as pd

# ── READ CSV ──
df = pd.read_csv('data.csv')

# Common parameters
df = pd.read_csv(
    'data.csv',
    sep=',',               # delimiter (use '\t' for TSV)
    header=0,              # row to use as column names
    index_col=0,           # use first col as row index
    usecols=['a','b'],     # load only these columns
    nrows=100,             # load only first 100 rows
    na_values=['?','NA'],  # treat these as NaN
    parse_dates=['date']   # auto-parse date column
)

# ── READ EXCEL ──
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# ── QUICK INSPECTION after loading ──
df.head(5)      # first 5 rows
df.tail(3)      # last 3 rows
df.sample(5)    # random 5 rows
df.info()       # dtypes, non-null counts
df.describe()   # statistics

# ── WRITE ──
df.to_csv('output.csv', index=False)   # index=False avoids extra column
df.to_excel('output.xlsx', index=False, sheet_name='Results')
df.to_json('output.json', orient='records')
to_csv() writes the DataFrame index as an extra column. Always pass index=False unless you specifically need the index saved.
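The round trip below shows why this matters. It is a minimal sketch with made-up data, using the fact that to_csv() with no path returns the CSV text:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default index=True: the index comes back as an unnamed extra column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())      # ['Unnamed: 0', 'a', 'b']

# index=False keeps the file clean
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(without_index.columns.tolist())   # ['a', 'b']
```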
Data Selection
Pandas gives you multiple ways to select data. Understanding the difference between loc, iloc, and boolean indexing is essential.
import pandas as pd

df = pd.DataFrame({
    'name' : ['Alice','Bob','Charlie','Diana'],
    'age'  : [25, 30, 22, 28],
    'score': [88, 92, 75, 95]
})

# ── Column selection ──
df['name']              # single column → Series
df[['name', 'score']]   # multiple columns → DataFrame

# ── loc — label-based (uses row labels / column names) ──
df.loc[0]                     # row at label 0
df.loc[0:2]                   # rows 0 to 2 INCLUSIVE
df.loc[0, 'name']             # row 0, 'name' column → 'Alice'
df.loc[1:3, ['name','age']]   # rows 1-3, specific cols

# ── iloc — integer position-based ──
df.iloc[0]          # first row
df.iloc[-1]         # last row
df.iloc[0:2, 1:3]   # rows 0-1 (exclusive 2), cols 1-2
df.iloc[[0,3], :]   # rows 0 and 3, all columns
Use loc when you know the label/name. Use iloc when you know the position number. Think: loc = Label, iloc = Integer.
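The distinction bites hardest when row labels no longer match positions, for example after sorting or filtering. A minimal sketch with made-up data:

```python
import pandas as pd

# Labels are [2, 1, 0] — the opposite of the positions
s = pd.Series([10, 20, 30], index=[2, 1, 0])

print(s.loc[0])    # 30, the row whose LABEL is 0
print(s.iloc[0])   # 10, the row at POSITION 0
```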
# Single condition
df[df['age'] > 25]

# Multiple conditions — use & | ~
df[(df['age'] > 20) & (df['score'] >= 90)]
df[(df['age'] < 23) | (df['score'] > 93)]
df[~(df['name'] == 'Bob')]   # NOT Bob

# .isin() — filter by a list of values
df[df['name'].isin(['Alice', 'Diana'])]

# .between() — range filter
df[df['age'].between(22, 28)]

# .query() — SQL-style, cleaner syntax
df.query('age > 25 and score >= 90')
df.query('name in ["Alice", "Diana"]')
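A related pitfall when writing (not just reading) filtered data: assigning through a chained filter like df[df['score'] > 90]['score'] = 100 may modify a copy and silently do nothing. A minimal sketch with made-up data showing the safe pattern:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [88, 92]})

# One .loc call with (row mask, column) assigns on the original frame
df.loc[df['score'] > 90, 'score'] = 100
print(df['score'].tolist())   # [88, 100]
```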
Missing Data Handling
Real-world datasets almost always contain missing values. Handling them properly is one of the most critical steps in any ML pipeline — wrong choices here can silently ruin your model.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 5, 6, np.nan],
    'C': [7, 8, 9, 10]
})

# ── Detect ──
print(df.isnull().sum())        # count nulls per column
print(df.isnull().sum().sum())  # total nulls

# ── Drop ──
df.dropna()               # drop any row with a null
df.dropna(axis=1)         # drop any COLUMN with a null
df.dropna(how='all')      # drop only if ALL values null
df.dropna(thresh=2)       # keep rows with at least 2 non-nulls
df.dropna(subset=['A'])   # drop only if col A is null

# ── Fill ──
df.fillna(0)                     # fill all nulls with 0
df['A'].fillna(df['A'].mean())   # fill with mean — very common in ML!
df['B'].ffill()                  # forward fill (fillna(method='ffill') is deprecated)
df['B'].bfill()                  # backward fill (use next value)

# ── Replace specific values ──
df.replace(-999, np.nan)         # replace sentinel value with NaN
df.replace({'A': {0: np.nan}})   # replace 0 in column A with NaN
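A common imputation recipe combines the tools above: median for numeric columns (robust to outliers) and the most frequent value for categoricals. A minimal sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age' : [25.0, np.nan, 30.0, 35.0],
    'city': ['NY', 'LA', None, 'NY'],
})

# Numeric: fill with the median; categorical: fill with the mode
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(df.isnull().sum().sum())   # 0
```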
groupby
groupby splits the DataFrame into groups based on a column, applies a function to each group, and combines the results — a pattern called Split → Apply → Combine.
import pandas as pd

df = pd.DataFrame({
    'dept'  : ['Eng','Mkt','Eng','Mkt','Eng'],
    'name'  : ['Alice','Bob','Charlie','Diana','Eve'],
    'salary': [80, 60, 95, 70, 88],
    'years' : [3, 5, 7, 2, 4]
})

# Basic groupby
df.groupby('dept')['salary'].mean()
df.groupby('dept')['salary'].sum()
df.groupby('dept').size()   # count rows per group

# Multiple aggregations at once
df.groupby('dept').agg({
    'salary': ['mean', 'max', 'min'],
    'years' : ['sum', 'count']
})

# transform — returns same shape as original (useful for feature engineering)
df['dept_mean_sal'] = df.groupby('dept')['salary'].transform('mean')

# filter — keep only groups matching condition
df.groupby('dept').filter(lambda x: x['salary'].mean() > 75)
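The dict-of-lists form of agg() above produces a two-level column MultiIndex, which can be awkward downstream. Named aggregation gives flat, readable column names instead; a sketch reusing the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'dept'  : ['Eng','Mkt','Eng','Mkt','Eng'],
    'salary': [80, 60, 95, 70, 88],
})

# Each keyword becomes an output column: name=(input_column, function)
summary = df.groupby('dept').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    headcount=('salary', 'count'),
)
print(summary)
```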
merge & concat
import pandas as pd

df1 = pd.DataFrame({'id': [1,2,3], 'name': ['A','B','C']})
df2 = pd.DataFrame({'id': [2,3,4], 'score': [88,92,75]})

# ── merge (like SQL JOIN) ──
pd.merge(df1, df2, on='id', how='inner')   # only matching ids
pd.merge(df1, df2, on='id', how='left')    # all from df1
pd.merge(df1, df2, on='id', how='outer')   # all from both

# Different key names on each side (illustrative — assumes the
# right-hand table actually has a 'user_id' column):
# pd.merge(df1, df2, left_on='id', right_on='user_id')

# ── concat — stack rows or columns ──
pd.concat([df1, df2])                      # stack rows (axis=0)
pd.concat([df1, df2], axis=1)              # stick columns side by side
pd.concat([df1, df2], ignore_index=True)   # reset index 0,1,2...
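Two merge parameters worth knowing when debugging joins: indicator=True records where each row came from, and validate raises an error if the join cardinality is not what you expect. A sketch on the same toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'score': [88, 92, 75]})

# '_merge' column: 'both', 'left_only', or 'right_only'
out = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(out['_merge'].value_counts())

# Raises MergeError if 'id' is duplicated on either side
pd.merge(df1, df2, on='id', validate='one_to_one')
```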
merge when combining tables based on a shared key (like SQL JOIN). Use concat when simply stacking rows or columns together.
pivot_table & melt
import pandas as pd

df = pd.DataFrame({
    'dept'  : ['Eng','Eng','Mkt','Mkt'],
    'gender': ['M','F','M','F'],
    'salary': [90, 85, 70, 65]
})

# pivot_table: group + aggregate in tabular form
pivot = pd.pivot_table(
    df,
    values='salary',
    index='dept',
    columns='gender',
    aggfunc='mean'
)
print(pivot)

# cross-tabulation (frequency table)
pd.crosstab(df['dept'], df['gender'])

# melt — wide format → long format (reverse of pivot)
wide = pd.DataFrame({'name': ['Alice'], 'math': [90], 'english': [85]})
long = wide.melt(id_vars='name', var_name='subject', value_name='score')
print(long)
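One useful extension of pivot_table: margins=True appends an 'All' row and column holding the overall aggregate, like subtotals in a spreadsheet. A sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'dept'  : ['Eng','Eng','Mkt','Mkt'],
    'gender': ['M','F','M','F'],
    'salary': [90, 85, 70, 65]
})

pivot = pd.pivot_table(df, values='salary', index='dept',
                       columns='gender', aggfunc='mean', margins=True)
print(pivot)   # 'All' row/column holds the grand means
```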
apply, map & lambda Functions
These functions let you apply custom logic to every element, row, or column in a DataFrame without writing loops.
import pandas as pd

df = pd.DataFrame({
    'name' : ['alice','bob','charlie'],
    'score': [88, 92, 75],
    'grade': ['A', 'A', 'B']
})

# ── apply on Series (single column) ──
df['name'].apply(str.title)          # Alice, Bob, Charlie
df['score'].apply(lambda x: x + 5)   # add 5 bonus to each score

# apply on DataFrame rows (axis=1)
df.apply(lambda row: row['score'] * 2 if row['grade'] == 'A' else row['score'],
         axis=1)

# Custom function with apply
def classify(score):
    if score >= 90:
        return 'Excellent'
    elif score >= 75:
        return 'Good'
    else:
        return 'Needs Work'

df['level'] = df['score'].apply(classify)

# ── map — label encoding with dict ──
df['grade_num'] = df['grade'].map({'A': 1, 'B': 0})

# ── sort_values with lambda key ──
df.sort_values('name', key=lambda x: x.str.len())   # sort by name length
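apply runs a Python function once per element, which is slow on large frames. For threshold logic like classify above, np.select evaluates whole-column conditions at once; a sketch of the same tiering:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [88, 92, 75, 60]})

# Conditions are checked in order; first match wins
conditions = [df['score'] >= 90, df['score'] >= 75]
choices = ['Excellent', 'Good']
df['level'] = np.select(conditions, choices, default='Needs Work')

print(df['level'].tolist())   # ['Good', 'Excellent', 'Good', 'Needs Work']
```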
EDA Workflow
Exploratory Data Analysis (EDA) is the process of understanding your dataset before modeling. Follow these 5 steps for any new dataset.
import pandas as pd

# ── STEP 1: Load & First Look ──
df = pd.read_csv('dataset.csv')
print(df.shape)   # (rows, cols)
df.info()         # dtypes + null counts
df.head()

# ── STEP 2: Clean ──
df.isnull().sum()       # nulls per column
df.duplicated().sum()   # duplicate rows
df = df.drop_duplicates()                         # remove duplicates
df.columns = df.columns.str.lower().str.strip()   # normalize column names
df['age'] = df['age'].astype(int)                 # fix wrong dtype

# ── STEP 3: Univariate Analysis ──
df['category'].value_counts()   # categorical column frequencies
df['age'].describe()            # count mean std min max

# ── STEP 4: Bivariate Analysis ──
df.corr(numeric_only=True)                 # correlation matrix (numeric cols only)
df.groupby('category')['price'].mean()     # mean price per category
pd.crosstab(df['gender'], df['outcome'])   # frequency cross-tab

# ── STEP 5: Feature Understanding ──
cat_cols = df.select_dtypes('object').columns
num_cols = df.select_dtypes('number').columns
df[cat_cols].nunique()                      # cardinality of categorical cols
df['target'].value_counts(normalize=True)   # class balance check
String Operations — .str accessor
The .str accessor lets you apply string methods to every element in a Series without loops.
import pandas as pd

s = pd.Series([' Hello World ', 'foo BAR', 'pandas 2.0'])

# Case
s.str.lower()   # [' hello world ', 'foo bar', 'pandas 2.0']
s.str.upper()
s.str.title()

# Strip whitespace
s.str.strip()   # remove leading/trailing spaces

# Search & Replace
s.str.contains('World')             # [True, False, False]
s.str.replace('pandas', 'Python')   # replace substring
s.str.startswith('foo')
s.str.endswith('0')

# Split & Length
s.str.split(' ')   # split into list on space
s.str.len()        # length of each string
s.str.count('o')   # count 'o' in each string

# Regex extraction
emails = pd.Series(['alice@gmail.com', 'bob@yahoo.com'])
emails.str.extract(r'(\w+)@(\w+)')   # extract username and domain
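With plain groups, extract() returns columns named 0 and 1. Named capture groups give the result readable column names directly; a small sketch:

```python
import pandas as pd

emails = pd.Series(['alice@gmail.com', 'bob@yahoo.com'])

# (?P<name>...) names each group; names become the output columns
parts = emails.str.extract(r'(?P<user>\w+)@(?P<domain>\w+)')
print(parts.columns.tolist())   # ['user', 'domain']
```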
DateTime Handling
Date and time features are extremely powerful in ML (seasonality, trends, time since event). The .dt accessor unlocks all of this.
import pandas as pd

# Parse date strings to datetime
df = pd.DataFrame({'date': ['2024-01-15', '2024-07-04', '2023-12-25']})
df['date'] = pd.to_datetime(df['date'])

# .dt accessor — extract parts
df['year']         = df['date'].dt.year
df['month']        = df['date'].dt.month
df['day']          = df['date'].dt.day
df['weekday']      = df['date'].dt.day_name()   # 'Monday', 'Tuesday'...
df['quarter']      = df['date'].dt.quarter
df['is_month_end'] = df['date'].dt.is_month_end

# Date arithmetic (timedelta)
from datetime import timedelta
df['date_plus_7'] = df['date'] + timedelta(days=7)

# Date range creation ('ME' = month end; the old 'M' alias is deprecated)
dates = pd.date_range('2024-01-01', periods=12, freq='ME')   # 12 months

# Filter by date
df[df['date'] > '2024-01-01']   # filter after a date
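A very common datetime feature is "days since event": subtracting two datetime columns yields Timedeltas, and .dt.days converts them to a numeric column a model can use. A minimal sketch with made-up signup dates:

```python
import pandas as pd

df = pd.DataFrame({'signup': pd.to_datetime(['2024-01-01', '2024-03-01'])})

# Timestamp minus Timestamp gives a Timedelta; .dt.days makes it an int
ref = pd.Timestamp('2024-04-01')
df['days_active'] = (ref - df['signup']).dt.days

print(df['days_active'].tolist())   # [91, 31] (2024 is a leap year)
```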
Extract year, month, day_of_week, and quarter as separate numeric columns before feeding data to an ML model — models can't use raw datetime objects.
ML-Ready Pandas — Encoding & Binning
The final step before passing data to a scikit-learn model: encoding categorical variables into numbers and binning continuous values.
import pandas as pd

df = pd.DataFrame({
    'color': ['Red','Blue','Green','Blue'],
    'size' : ['S','M','L','XL'],
    'age'  : [23, 45, 34, 67]
})

# ── One-Hot Encoding (pd.get_dummies) ──
# Best for nominal categories (no order: Red/Blue/Green)
encoded = pd.get_dummies(df['color'])   # columns: Blue, Green, Red
# Passing the whole DataFrame prefixes names: color_Blue, color_Green, color_Red
# drop_first=True avoids the "dummy variable trap" in linear models
encoded = pd.get_dummies(df, columns=['color'], drop_first=True)

# ── Label Encoding with map ──
# Best for ordinal categories (S < M < L < XL)
size_map = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}
df['size_num'] = df['size'].map(size_map)

# ── Binning ──
# pd.cut — equal-width bins
df['age_group'] = pd.cut(df['age'], bins=3, labels=['young','mid','senior'])

# pd.qcut — equal-frequency bins (same number of people per bin)
df['age_quartile'] = pd.qcut(df['age'], q=4, labels=['Q1','Q2','Q3','Q4'])
Next steps: StandardScaler for scaling, train_test_split for splitting, and then your first ML models!
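Scaling itself is just arithmetic, so you can preview it in pure pandas before reaching for scikit-learn. A sketch of z-score standardization (what StandardScaler computes; note it uses the population std, ddof=0):

```python
import pandas as pd

df = pd.DataFrame({'age': [23, 45, 34, 67]})

# z-score: subtract the mean, divide by the std
mean, std = df['age'].mean(), df['age'].std(ddof=0)
df['age_scaled'] = (df['age'] - mean) / std

# Result has mean ~0 and std ~1 by construction
print(round(df['age_scaled'].mean(), 10), round(df['age_scaled'].std(ddof=0), 10))
```

In a real pipeline, fit the scaler on the training split only and reuse those statistics on the test split, so no test information leaks into training.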