Data Handling with Pandas & Matplotlib – The Masterclass

1. Introduction: Why Pandas?

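Welcome to the world of Data Science! In previous chapters, we used Lists to store data. Lists work well for small collections, but they become slow and memory-hungry at millions of rows, and they have no built-in idea of columns or labels. Enter Pandas (often expanded as Python Data Analysis Library; the name itself comes from "panel data").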

  • Built on Top of NumPy: Pandas is built on top of NumPy (Numerical Python). While NumPy deals with arrays (numbers), Pandas deals with Tables (mixed data like Excel).
  • The Two Titans: Pandas has two main data structures you must memorize:
    1. Series: One-dimensional (Like a single column).
    2. DataFrame: Two-dimensional (Like a whole table).

Installation Check:

To use these tools, they must be installed (for example with pip install pandas matplotlib) and then imported. These are the standard aliases used worldwide:

Python

import pandas as pd
import matplotlib.pyplot as plt

2. The Pandas Series (1D Data)

Think of a Series as a List with Superpowers. Unlike a list, a Series has an explicit Index (labels) associated with each value.

2.1 Creating a Series

Syntax: pd.Series(data, index=idx)

A. From a List:

Python

import pandas as pd
L = [10, 20, 30]
S = pd.Series(L)
print(S)
# Output:
# 0    10
# 1    20
# 2    30
# dtype: int64 (Note: It automatically assigns index 0, 1, 2)

B. From a Dictionary (Exam Favorite):

When creating from a dict, the Keys become the Index.

Python

D = {'Jan': 31, 'Feb': 28, 'Mar': 31}
S = pd.Series(D)
# Index will be 'Jan', 'Feb', 'Mar'.

C. From a Scalar Value:

If you pass a single value and an index list, Pandas repeats the value.

Python

S = pd.Series(50, index=['A', 'B', 'C'])
# Output: A=50, B=50, C=50

2.2 Vectorized Operations

You can do math on the whole Series at once (no loops needed!).

  • S + 2: Adds 2 to every element.
  • S1 + S2: Adds values where indexes match.
    • Critical Note: If an index is present in S1 but missing in S2, the result is NaN (Not a Number).
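The alignment rule above is easier to remember once you see it run. Here is a minimal sketch with two small Series whose values are made up for illustration. The add() call with fill_value is an extra that is not covered above; it treats a missing label as 0 instead of producing NaN.

```python
import pandas as pd

# Sample values, chosen only for illustration
s1 = pd.Series([10, 20], index=["A", "B"])
s2 = pd.Series([1, 2], index=["B", "C"])

# Scalar math touches every element, no loop needed
plus_two = s1 + 2
print(plus_two)

# Series + Series aligns on index labels, not on position
total = s1 + s2
print(total)  # only "B" exists in both, so "A" and "C" become NaN

# Optional: treat a missing label as 0 instead of NaN
filled = s1.add(s2, fill_value=0)
print(filled)
```

Note that the result becomes float64 as soon as NaN appears, because NaN is a floating-point value.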

3. The Pandas DataFrame (2D Data)

A DataFrame is like an Excel sheet or a SQL Table. It has Row Indexes and Column Headers.

3.1 Creating a DataFrame

A. From a Dictionary of Lists (Column-wise):

Python

data = {
    "Name": ["Ravi", "Anju", "Kiran"],
    "Marks": [90, 85, 92]
}
df = pd.DataFrame(data)
# Keys ("Name", "Marks") become Column Headers.

B. From a List of Dictionaries (Row-wise):

Python

data = [
    {"Name": "Ravi", "Marks": 90},
    {"Name": "Anju", "Marks": 85}
]
df = pd.DataFrame(data)
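Both styles build the same table, which a quick check can confirm. The sketch below reuses the names from the examples above. The last part assumes a hypothetical incomplete record (a row with no "Marks" key) to show that Pandas fills the gap with NaN.

```python
import pandas as pd

# Column-wise: dictionary of lists
by_column = pd.DataFrame({"Name": ["Ravi", "Anju"], "Marks": [90, 85]})

# Row-wise: list of dictionaries
by_row = pd.DataFrame([
    {"Name": "Ravi", "Marks": 90},
    {"Name": "Anju", "Marks": 85},
])

# Same labels, same values, same dtypes
same = by_column.equals(by_row)
print(same)

# Hypothetical incomplete record: the second row has no "Marks" key
ragged = pd.DataFrame([{"Name": "Ravi", "Marks": 90}, {"Name": "Kiran"}])
print(ragged)  # the missing Marks value shows up as NaN
```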

3.2 Inspection Functions

  • df.head(n): Returns the top n rows (default 5).
  • df.tail(n): Returns the bottom n rows (default 5).
  • df.shape: Attribute (no brackets). Returns the tuple (rows, columns).
  • df.columns: Attribute. Returns the column names as an Index object; use list(df.columns) for a plain list.
  • df.T: Attribute. Transposes the DataFrame (Rows become Columns).
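A short run-through of these inspection tools, as a sketch on the three-student table from Section 3.1:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ravi", "Anju", "Kiran"],
    "Marks": [90, 85, 92],
})

print(df.head(2))        # first two rows
print(df.tail(1))        # last row only
print(df.shape)          # (3, 2): 3 rows, 2 columns
print(list(df.columns))  # ['Name', 'Marks']
print(df.T)              # column names become the row labels
```

Notice that head() and tail() are called with brackets, while shape, columns and T are written without them.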

4. Selecting Data: The Battle of loc vs iloc

This is the most important topic for the PGT exam. Confusing these two is the easiest way to lose marks.

Feature     | loc (Label Based)                | iloc (Integer Position Based)
Logic       | Uses the Name of the row/column. | Uses the Index Number (0, 1, 2…).
Slicing End | Inclusive (Start to End).        | Exclusive (Start to End-1).
Example     | df.loc['Row1', 'Name']           | df.iloc[0, 1]

Teacher’s Example:

Imagine a row with Label “Row10” sitting at index position 0.

  • df.loc['Row10'] fetches that row by its label.
  • df.iloc[0] fetches the same row by its position.
  • df.iloc[10] fetches the row at position 10 (which might be “Row25”).
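The difference is easiest to see on a table whose labels are not 0, 1, 2. The sketch below assumes a small sample table with made-up row labels (and one made-up extra student) and shows both the lookup and the slicing behaviour from the table above.

```python
import pandas as pd

# Sample data with custom row labels (values are illustrative)
df = pd.DataFrame(
    {"Name": ["Ravi", "Anju", "Kiran", "Meena"], "Marks": [90, 85, 92, 78]},
    index=["Row10", "Row11", "Row12", "Row13"],
)

by_label = df.loc["Row10"]   # label lookup
by_position = df.iloc[0]     # position lookup, same row here

# loc slicing includes the end label
label_slice = df.loc["Row10":"Row12"]
print(len(label_slice))      # 3 rows: Row10, Row11, Row12

# iloc slicing excludes the stop position
position_slice = df.iloc[0:2]
print(len(position_slice))   # 2 rows: positions 0 and 1

# Single cell: (row, column) by label versus by position
cell_by_label = df.loc["Row11", "Marks"]
cell_by_position = df.iloc[1, 1]
print(cell_by_label, cell_by_position)
```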

4.3 Boolean Indexing (Filtering)

Selecting rows based on a condition.

  • df[df['Marks'] > 90]
  • Translation: “Show me rows where the Marks column is greater than 90.”
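Under the hood, the condition produces a Series of True/False values (a "mask"), and the DataFrame keeps only the rows where the mask is True. A minimal sketch on the sample table from Section 3.1; the combined condition at the end is an extra, and it needs & with brackets around each part rather than the word "and".

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ravi", "Anju", "Kiran"],
    "Marks": [90, 85, 92],
})

# Step 1: the condition gives one True/False per row
mask = df["Marks"] > 90
print(mask.tolist())

# Step 2: the mask selects the matching rows
toppers = df[mask]
print(toppers)

# Two conditions: wrap each in brackets and join with & (and) or | (or)
mid_range = df[(df["Marks"] >= 85) & (df["Marks"] <= 90)]
print(mid_range)
```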

5. Manipulating Data (CRUD)

5.1 Adding Elements

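  • Add Column: df['Grade'] = ['A', 'B', 'A'] (the list needs one value per existing row).
  • Add Row: Using loc or concat. df.loc[3] = ['NewStudent', 88] (the list needs one value per column, so add a third value if the Grade column already exists).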
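Here is how both additions look in sequence, as a sketch on the sample table. The extra student passed to concat is a made-up record used only to show the second approach.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ravi", "Anju", "Kiran"],
    "Marks": [90, 85, 92],
})

# New column: one value per existing row
df["Grade"] = ["A", "B", "A"]

# New row via loc: one value per column (three columns now)
df.loc[3] = ["NewStudent", 88, "B"]
print(df)

# New row via concat: build a one-row DataFrame and stack it
# ("Another" is a hypothetical record for illustration)
extra = pd.DataFrame([{"Name": "Another", "Marks": 75, "Grade": "C"}])
combined = pd.concat([df, extra], ignore_index=True)
print(combined)
```

ignore_index=True renumbers the rows 0 to 4; without it the new row would keep its own label 0 and the index would contain a duplicate.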

5.2 Deleting Elements

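  • Drop Column: df.drop('Grade', axis=1)
  • Drop Row: df.drop(0, axis=0)
    • Note: axis=1 stands for Columns (Vertical). axis=0 stands for Rows (Horizontal).
    • Note: drop() returns a new DataFrame and leaves df unchanged. Assign the result back (df = df.drop(...)) to keep the change.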
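A common slip is to call drop() and then wonder why the column is still there. The sketch below, on the sample table with a Grade column, shows that the original DataFrame is untouched until the result is stored.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ravi", "Anju", "Kiran"],
    "Marks": [90, 85, 92],
    "Grade": ["A", "B", "A"],
})

# axis=1: drop a column
no_grade = df.drop("Grade", axis=1)
print(list(no_grade.columns))

# axis=0: drop the row labelled 0
no_first_row = df.drop(0, axis=0)
print(no_first_row.index.tolist())

# The original still has all three columns and all three rows
print(df.shape)
```

The keyword forms df.drop(columns='Grade') and df.drop(index=0) do the same job without needing to remember the axis number.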

5.3 Merging DataFrames

The KVS syllabus mentions Joining/Merging.

  • pd.concat([df1, df2]): Stacks tables on top of each other.
  • pd.merge(df1, df2, on='ID'): Joins tables side-by-side based on a common key (like SQL Join). By default this is an inner join: only IDs present in both tables are kept.
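The two functions answer different questions: concat adds more rows, merge adds more columns. A minimal sketch with small made-up tables; the how='outer' call at the end is an extra option of merge that keeps unmatched IDs as well.

```python
import pandas as pd

# Illustrative tables sharing an "ID" column
df1 = pd.DataFrame({"ID": [1, 2], "Name": ["Ravi", "Anju"]})
df2 = pd.DataFrame({"ID": [3], "Name": ["Kiran"]})
marks = pd.DataFrame({"ID": [1, 2, 4], "Marks": [90, 85, 70]})

# concat: same columns, more rows
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)

# merge: match rows on the key; the default keeps only IDs found in both
joined = pd.merge(df1, marks, on="ID")
print(joined)

# how="outer" keeps every ID and fills the gaps with NaN
outer = pd.merge(df1, marks, on="ID", how="outer")
print(outer)
```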

6. Handling CSV Files

You don’t need the csv module here. Pandas makes it one line.

  • Reading: df = pd.read_csv("data.csv")
    • Useful Params: sep=',' (the separator character), header=None (the file has no header row), index_col=0 (use the first column as the row index).
  • Writing: df.to_csv("output.csv", index=False)
    • Tip: Always set index=False unless you want the row numbers saved in your file.
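To try these parameters without creating files, the sketch below reads from an in-memory string through io.StringIO, which read_csv accepts in place of a file name. The CSV text is sample data. Calling to_csv() with no path returns the CSV as a string, which makes the effect of index=False easy to see.

```python
import io
import pandas as pd

# Sample CSV content held in memory instead of a file on disk
csv_text = "Name,Marks\nRavi,90\nAnju,85\n"

# Default: first line is the header, rows are numbered 0, 1, ...
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# index_col=0: the first column becomes the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.index.tolist())

# header=None: the first line is treated as data, columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.shape)

# Writing: compare the first line with and without the index
with_index = df.to_csv()
without_index = df.to_csv(index=False)
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the index included, the header line starts with an empty field for the row numbers, which is why re-reading such a file produces an extra "Unnamed: 0" column.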

7. Data Visualization: Matplotlib

Visualizing data helps us spot trends. We use the Pyplot interface.

Python

import matplotlib.pyplot as plt

7.1 The Anatomy of a Plot

  1. Figure: The entire canvas.
  2. Axes: The specific plot (X-axis, Y-axis).
  3. Legend: Explains what the colors mean.
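The three parts map directly onto objects in code. The sketch below uses plt.subplots(), which returns the Figure and one Axes together. The data values and labels are made up for illustration.

```python
import matplotlib.pyplot as plt

# Figure = the canvas, Axes = the plot drawn on it
fig, ax = plt.subplots()

# label= is the text the legend will show (sample data)
ax.plot([1, 2, 3], [10, 20, 15], color="red", label="Sales")
ax.set_title("Anatomy Demo")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")

# Legend: built from the label= of each plotted line
ax.legend()
```

Call plt.show() afterwards to display the window. The plt.title() and plt.xlabel() calls used in the next section act on the current Axes, so both styles produce the same picture.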

7.2 Types of Plots

A. Line Plot (Trends over time)

Python

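import matplotlib.pyplot as plt

x = ["Jan", "Feb", "Mar"]  # sample months (illustrative)
y = [120, 150, 135]        # sample revenue values (illustrative)
plt.plot(x, y, color='red', marker='o')  # red line with a circle at each point
plt.title("Sales Trend")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.grid(True)   # show grid lines
plt.show()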

B. Bar Graph (Comparison)

Python

plt.bar(categories, values, width=0.5, color='blue')

C. Histogram (Frequency Distribution)

Used to see how data is distributed (e.g., How many students got between 80-90?).

Python

plt.hist(marks_list, bins=5, edgecolor='black')

Exam Question: What is the difference between Bar and Histogram?

  • Bar: Compares categories (Apples vs Oranges). Gaps between bars.
  • Histogram: Shows frequency of continuous data (Marks 0-10, 10-20). No gaps between bars.
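Drawing both side by side makes the difference concrete. This is a sketch with made-up marks: the bar chart gets one bar per category that we supply, while hist() works out the bin ranges and counts by itself and returns them.

```python
import matplotlib.pyplot as plt

# Sample data, for illustration only
categories = ["Ravi", "Anju", "Kiran"]
values = [90, 85, 92]
marks_list = [45, 52, 58, 61, 67, 72, 75, 78, 81, 84, 88, 93]

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(9, 3))

# Bar: we give the heights, one bar per category, gaps in between
ax_bar.bar(categories, values, width=0.5, color="blue")
ax_bar.set_title("Bar: marks per student")

# Histogram: Pandas-free counting, bins=5 splits the range into 5 equal intervals
counts, edges, patches = ax_hist.hist(marks_list, bins=5, edgecolor="black")
ax_hist.set_title("Histogram: marks distribution")

print(edges)   # 6 edges mark the boundaries of 5 bins
print(counts)  # how many marks fell into each bin
```

Call plt.show() to display the figure. The counts always add up to the number of values passed in, which is a handy sanity check.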

Exam Corner: 5 High-Value MCQs

  1. Q: If S1 has indices [A, B] and S2 has indices [B, C], what is the result of S1 + S2?
    • A) Indices [A, B, C] with values summed.
    • B) Indices [B] only.
    • C) Indices [A, B, C] where A and C are NaN.
    • Ans: C. Operations align by index. Missing indices result in NaN.
  2. Q: Which attribute is used to fetch the column names of a DataFrame df?
    • A) df.names
    • B) df.columns
    • C) df.keys
    • Ans: B.
  3. Q: In df.iloc[1:5], how many rows are selected?
    • A) 5
    • B) 4
    • C) 3
    • Ans: B. iloc excludes the stop value. Rows 1, 2, 3, 4 are selected.
  4. Q: Which argument in to_csv() prevents writing row numbers to the file?
    • A) row_numbers=False
    • B) header=False
    • C) index=False
    • Ans: C.
  5. Q: Which function flips the rows and columns of a DataFrame?
    • A) df.invert()
    • B) df.transpose() or df.T
    • C) df.flip()
    • Ans: B.

Pandas is not just a syllabus topic; it is the skill that gets you hired in the industry. For the exam, focus heavily on Slicing (iloc) and Series Math (NaN behavior).

This concludes the Programming section of the syllabus! Next, we shift gears to Database Management & SQL, where we will learn how to store structured data professionally.

Next Lesson: [Module 5: Database Management & SQL]


SRIRAM

Sriram is a seasoned Computer Science educator and mentor. He is UGC NET Qualified twice (2014 & 2019) and holds State Eligibility Test (SET) qualifications for both Andhra Pradesh (AP) and Telangana (TG). With years of experience teaching programming languages, he simplifies complex CS concepts for aspirants of UGC NET Computer Science, KVS, NVS, EMRS, and other competitive exams.
