1. Introduction: Why Pandas?
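Welcome to the world of Data Science! In previous chapters, we used Lists to store data. But a plain Python list stores every value as a separate object and needs an explicit loop for each calculation, so it becomes slow and memory-hungry with millions of rows. Enter Pandas (Python Data Analysis Library).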
- Built on Top of NumPy: Pandas is built on top of NumPy (Numerical Python). While NumPy deals with arrays (numbers), Pandas deals with Tables (mixed data like Excel).
- The Two Titans: Pandas has two main data structures you must memorize:
  - Series: One-dimensional (Like a single column).
  - DataFrame: Two-dimensional (Like a whole table).
Installation Check:
To use these tools, we must import them. This is the standard alias used worldwide:
Python
import pandas as pd
import matplotlib.pyplot as plt
2. The Pandas Series (1D Data)
Think of a Series as a List with Superpowers. Unlike a list, a Series has an explicit Index (labels) associated with each value.
2.1 Creating a Series
Syntax: pd.Series(data, index=idx)
A. From a List:
Python
import pandas as pd
L = [10, 20, 30]
S = pd.Series(L)
print(S)
# Output:
# 0 10
# 1 20
# 2 30
# dtype: int64 (Note: It automatically assigns index 0, 1, 2)
B. From a Dictionary (Exam Favorite):
When creating from a dict, the Keys become the Index.
Python
D = {'Jan': 31, 'Feb': 28, 'Mar': 31}
S = pd.Series(D)
# Index will be 'Jan', 'Feb', 'Mar'.
C. From a Scalar Value:
If you pass a single value and an index list, Pandas repeats the value.
Python
S = pd.Series(50, index=['A', 'B', 'C'])
# Output: A=50, B=50, C=50
2.2 Vectorized Operations
You can do math on the whole Series at once (no loops needed!).
- S + 2: Adds 2 to every element.
- S1 + S2: Adds values where indexes match.
  - Critical Note: If an index is present in S1 but missing in S2, the result is NaN (Not a Number).
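The alignment rule above can be checked with a minimal sketch. The two Series and their values are made-up sample data; `Series.add` with `fill_value` is shown only as one possible way to avoid NaN, not as something the syllabus requires.

```python
import pandas as pd

# Hypothetical sample data: two Series whose labels only partly overlap.
s1 = pd.Series([10, 20], index=["A", "B"])
s2 = pd.Series([1, 2], index=["B", "C"])

scalar_sum = s1 + 2                      # adds 2 to every element
aligned_sum = s1 + s2                    # aligns on labels; "A" and "C" have no partner
filled_sum = s1.add(s2, fill_value=0)    # treats a missing label as 0 instead of NaN

print(scalar_sum)
print(aligned_sum)
print(filled_sum)
```

Only label B exists in both Series, so it is the only label with a numeric result in `aligned_sum`. The result also switches to a float dtype, because NaN is a floating-point value.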
3. The Pandas DataFrame (2D Data)
A DataFrame is like an Excel sheet or a SQL Table. It has Row Indexes and Column Headers.
3.1 Creating a DataFrame
A. From a Dictionary of Lists (Column-wise):
Python
data = {
"Name": ["Ravi", "Anju", "Kiran"],
"Marks": [90, 85, 92]
}
df = pd.DataFrame(data)
# Keys ("Name", "Marks") become Column Headers.
B. From a List of Dictionaries (Row-wise):
Python
data = [
{"Name": "Ravi", "Marks": 90},
{"Name": "Anju", "Marks": 85}
]
df = pd.DataFrame(data)
3.2 Inspection Functions
- df.head(n): Returns the top n rows (default 5).
- df.tail(n): Returns the bottom n rows (default 5).
- df.shape: Returns a tuple (rows, columns).
- df.columns: Returns the column names (as an Index object; use list(df.columns) for a plain list).
- df.T: Transposes the DataFrame (Rows become Columns).
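As a quick illustration of these inspection functions, here is a small sketch. The six-row table is made-up sample data that extends the Name/Marks example.

```python
import pandas as pd

# Hypothetical sample table with six rows.
df = pd.DataFrame({
    "Name": ["Ravi", "Anju", "Kiran", "Meena", "Arjun", "Sara"],
    "Marks": [90, 85, 92, 78, 88, 95],
})

top = df.head()                 # first 5 rows by default
bottom = df.tail(2)             # last 2 rows
n_rows, n_cols = df.shape       # shape is an attribute, not a method: no parentheses
col_names = list(df.columns)    # df.columns is an Index; list() converts it
flipped = df.T                  # columns become rows

print(top)
print(bottom)
print(n_rows, n_cols, col_names)
print(flipped.shape)
```

Note that `shape`, `columns` and `T` are attributes, so they are written without parentheses, while `head()` and `tail()` are methods.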
4. Selecting Data: The Battle of loc vs iloc
This is the most important topic for the PGT exam. Confusing these two is the easiest way to lose marks.
| Feature | loc (Label Based) | iloc (Integer Position Based) |
| --- | --- | --- |
| Logic | Uses the Name (label) of the row/column. | Uses the integer position (0, 1, 2…). |
| Slicing End | Inclusive (Start to End). | Exclusive (Start to End-1). |
| Example | df.loc['Row1', 'Name'] | df.iloc[0, 1] |
Teacher’s Example:
Imagine a row with Label “Row10” sitting at index position 0.
- df.loc['Row10'] fetches that row.
- df.iloc[10] fetches the row at position 10 (which might be “Row25”).
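The difference is easiest to see when the row labels are not plain numbers. The sketch below uses a made-up four-row table with string labels; the label names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical table whose row labels are strings, not 0, 1, 2...
df = pd.DataFrame(
    {"Name": ["Ravi", "Anju", "Kiran", "Meena"], "Marks": [90, 85, 92, 78]},
    index=["Row10", "Row11", "Row12", "Row13"],
)

by_label = df.loc["Row10", "Name"]       # lookup by row label and column name
by_position = df.iloc[0, 0]              # same cell: row position 0, column position 0

label_slice = df.loc["Row10":"Row12"]    # inclusive: Row10, Row11, Row12
position_slice = df.iloc[0:2]            # exclusive: positions 0 and 1 only

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

The loc slice includes its end label and returns three rows, while the iloc slice stops before its end position and returns two.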
4.3 Boolean Indexing (Filtering)
Selecting rows based on a condition.
- df[df['Marks'] > 90]
  - Translation: “Show me rows where the Marks column is greater than 90.”
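Filtering happens in two steps: the condition first produces a Series of True/False values, and that Series then selects the rows. A minimal sketch using the Name/Marks table from earlier (the combined condition is an extra illustration, not part of the original notes):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Anju", "Kiran"], "Marks": [90, 85, 92]})

mask = df["Marks"] > 90     # Boolean Series, one True/False per row
toppers = df[mask]          # keeps only the rows where mask is True

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mid_range = df[(df["Marks"] >= 85) & (df["Marks"] <= 90)]

print(mask.tolist())
print(toppers)
print(mid_range)
```

Ravi scored exactly 90, so the strict `>` comparison leaves him out of `toppers`.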
5. Manipulating Data (CRUD)
5.1 Adding Elements
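- Add Column: df['Grade'] = ['A', 'B', 'A'] (the list needs one value per row).
- Add Row: Using loc or concat.
  - df.loc[3] = ['NewStudent', 88] (one value per column; if the Grade column has already been added, supply a third value).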
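A short sketch of both ways of adding a row, assuming the Grade column is added first so that every new row needs three values. The student named "Sara" and her marks are made-up sample data.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Anju", "Kiran"], "Marks": [90, 85, 92]})

# New column: the list length must equal the number of rows.
df["Grade"] = ["A", "B", "A"]

# New row via loc: one value per column, so three values now.
df.loc[3] = ["NewStudent", 88, "B"]

# New row via concat: build a one-row DataFrame and stack it underneath.
extra = pd.DataFrame([{"Name": "Sara", "Marks": 95, "Grade": "A"}])
df = pd.concat([df, extra], ignore_index=True)   # ignore_index renumbers rows 0..4

print(df)
```

`pd.concat` returns a new DataFrame, so the result has to be assigned back to `df`.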
5.2 Deleting Elements
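- Drop Column: df.drop('Grade', axis=1)
- Drop Row: df.drop(0, axis=0)
  - Note: axis=1 stands for Columns (Vertical). axis=0 stands for Rows (Horizontal).
  - Note: drop returns a new DataFrame and leaves df unchanged; assign the result back (df = df.drop(...)) to keep the change.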
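A common slip is to call drop and expect the original DataFrame to change. This sketch, on a made-up three-column table, shows that drop hands back a new DataFrame instead.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ravi", "Anju", "Kiran"],
    "Marks": [90, 85, 92],
    "Grade": ["A", "B", "A"],
})

no_grade = df.drop("Grade", axis=1)   # new DataFrame without the Grade column
no_first = df.drop(0, axis=0)         # new DataFrame without the row labelled 0

# df itself still has all 3 rows and 3 columns.
print(df.shape, no_grade.shape, no_first.shape)
```

To keep the change, reassign the result, for example `df = df.drop("Grade", axis=1)`.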
5.3 Merging DataFrames
The KVS syllabus mentions Joining/Merging.
- pd.concat([df1, df2]): Stacks tables on top of each other.
- pd.merge(df1, df2, on='ID'): Joins tables side-by-side based on a common key (like SQL Join; an inner join by default).
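The two functions behave quite differently on the same inputs. The sketch below assumes two small made-up tables that share an ID column; the `how="left"` variant is included only to show that the join type can be changed.

```python
import pandas as pd

# Hypothetical tables sharing an "ID" key.
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ravi", "Anju", "Kiran"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Marks": [85, 92, 70]})

stacked = pd.concat([df1, df2], ignore_index=True)      # 6 rows; NaN where a column is absent
joined = pd.merge(df1, df2, on="ID")                    # inner join: only IDs 2 and 3 match
left_joined = pd.merge(df1, df2, on="ID", how="left")   # keeps every row of df1

print(stacked)
print(joined)
print(left_joined)
```

In `left_joined`, ID 1 has no match in df2, so its Marks value is NaN.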
6. Handling CSV Files
You don’t need the csv module here. Pandas makes it one line.
- Reading: df = pd.read_csv("data.csv")
  - Useful Params: sep=',', header=None, index_col=0.
- Writing: df.to_csv("output.csv", index=False)
  - Tip: Always set index=False unless you want the row numbers saved in your file.
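To see what index=False changes without creating files on disk, the sketch below writes the CSV to a string and reads it back through `io.StringIO`. Using an in-memory buffer is an assumption made for the demo; with real files you would pass a path such as "data.csv".

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Ravi", "Anju"], "Marks": [90, 85]})

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts a file path or any file-like object.
round_trip = pd.read_csv(io.StringIO(csv_text))

# With the default index=True, the row labels become an extra first column,
# which index_col=0 turns back into the index when reading.
with_index = df.to_csv()
print(with_index)
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
```

Compare the two printed header lines: the second starts with a comma, because the saved index column has no name.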
7. Data Visualization: Matplotlib
Visualizing data helps us spot trends. We use the Pyplot interface.
import matplotlib.pyplot as plt
7.1 The Anatomy of a Plot
- Figure: The entire canvas.
- Axes: The specific plot area inside the Figure, containing the X-axis and Y-axis.
- Legend: Explains what the colors mean.
7.2 Types of Plots
A. Line Plot (Trends over time)
Python
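x = ["Jan", "Feb", "Mar", "Apr"]  # sample months (illustrative values)
y = [120, 135, 128, 150]  # sample revenue figures (illustrative values)
plt.plot(x, y, color='red', marker='o')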
plt.title("Sales Trend")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.grid(True)
plt.show()
B. Bar Graph (Comparison)
Python
plt.bar(categories, values, width=0.5, color='blue')
C. Histogram (Frequency Distribution)
Used to see how data is distributed (e.g., How many students got between 80-90?).
Python
plt.hist(marks_list, bins=5, edgecolor='black')
Exam Question: What is the difference between Bar and Histogram?
- Bar: Compares categories (Apples vs Oranges). Gaps between bars.
- Histogram: Shows frequency of continuous data (Marks 0-10, 10-20). No gaps between bars.
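The contrast can be drawn side by side. This is a rough sketch with made-up subject averages and marks. It assumes the non-interactive "Agg" backend and saves to a hypothetical file name, so it runs without a display; in an interactive session you would call plt.show() instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical sample data.
subjects = ["Maths", "Physics", "CS"]
averages = [72, 65, 88]
marks_list = [35, 48, 52, 55, 61, 67, 70, 74, 78, 81, 85, 93]

plt.figure(figsize=(9, 4))

plt.subplot(1, 2, 1)  # left panel: one bar per category, with gaps
plt.bar(subjects, averages, width=0.5, color="blue")
plt.title("Bar: one bar per category")

plt.subplot(1, 2, 2)  # right panel: marks grouped into 5 equal-width ranges
counts, bin_edges, _ = plt.hist(marks_list, bins=5, edgecolor="black")
plt.title("Histogram: count per range")

plt.savefig("bar_vs_hist.png")  # hypothetical output file
print(counts, bin_edges)
```

`plt.hist` returns the count in each bin and the bin edges, so 5 bins give 6 edges, and the counts add up to the number of marks.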
Exam Corner: 5 High-Value MCQs
- Q: If S1 has indices [A, B] and S2 has indices [B, C], what is the result of S1 + S2?
  - A) Indices [A, B, C] with values summed.
  - B) Indices [B] only.
  - C) Indices [A, B, C] where A and C are NaN.
  - Ans: C. Operations align by index. Missing indices result in NaN.
- Q: Which function is used to fetch the column names of a DataFrame df?
  - A) df.names
  - B) df.columns
  - C) df.keys
  - Ans: B.
- Q: In df.iloc[1:5], how many rows are selected?
  - A) 5
  - B) 4
  - C) 3
  - Ans: B. iloc excludes the stop value. Rows 1, 2, 3, 4 are selected.
- Q: Which argument in to_csv() prevents writing row numbers to the file?
  - A) row_numbers=False
  - B) header=False
  - C) index=False
  - Ans: C.
- Q: Which function flips the rows and columns of a DataFrame?
  - A) df.invert()
  - B) df.transpose() or df.T
  - C) df.flip()
  - Ans: B.
Pandas is not just a syllabus topic; it is the skill that gets you hired in the industry. For the exam, focus heavily on Slicing (iloc) and Series Math (NaN behavior).
This concludes the Programming section of the syllabus! Next, we shift gears to Database Management & SQL, where we will learn how to store structured data professionally.
Next Lesson: [Module 5: Database Management & SQL]
