Basics of Pandas: A Step By Step Guide

Data processing is an important part of data analysis, because data is not always available in the desired format. Various processing steps, such as cleaning, restructuring or merging, are often required before the data can be analyzed. NumPy, SciPy, Cython and pandas are tools available in Python that can be used for fast data processing. Pandas is built on top of NumPy and provides a rich set of functions for processing various types of data. Working with pandas is fast and easy, and its code is often more expressive than the equivalent written with lower-level tools. It combines the fast array processing of NumPy with the flexible data manipulation of spreadsheets and relational databases. Lastly, pandas integrates well with the matplotlib library, which makes it a very handy tool for data analysis.

Data Structures in Pandas

Series:

The Series is a one-dimensional labeled array that can store values of any data type, including mixed data types. The row labels of a Series are called the index. Any list, tuple or dictionary can be converted into a Series using the pd.Series constructor, as shown below,

import pandas as pd
# converting tuple to Series
h = ('AA', '2012-02-01', 100, 10.2)
s = pd.Series(h)
print(type(s))
print(s)

<class 'pandas.core.series.Series'>
0  AA
1  2012-02-01
2  100
3  10.2
dtype: object
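
Lists are converted in the same way. As a minimal sketch (the variable names prices and mixed are illustrative and not part of the original example), a list that holds only integers is stored with an integer dtype, while a list of mixed values falls back to the generic object dtype:

```python
import pandas as pd

# A list containing only integers is stored with an integer dtype
prices = pd.Series([10, 20, 30])
print(prices.dtype)

# Mixing strings and numbers falls back to the generic object dtype
mixed = pd.Series(['AA', 100, 10.2])
print(mixed.dtype)
```

The dtype matters in practice because numeric operations are much faster on a numeric dtype than on object.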

Converting dictionary to Series:

# converting dictionary to Series
d = {'name' : 'IBM', 'date' : '2010-09-08', 'shares' : 100, 'price' : 10.2}
ds = pd.Series(d)
type(ds)

<class 'pandas.core.series.Series'>

print(ds)

name             IBM
date      2010-09-08
shares           100
price           10.2
dtype: object

Note that in the tuple conversion the index is set to 0, 1, 2 and 3. We can provide custom index labels as follows:

f = ['FB', '2001-08-02', 90, 3.2]
f = pd.Series(f, index = ['name', 'date', 'shares', 'price'])
print(f)

name     FB
date     2001-08-02
shares   90
price    3.2
dtype: object

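Elements of the Series can be accessed by index label, e.g. f['shares'], or by position, e.g. f.iloc[0]. (Plain f[0] also works in older pandas versions, but positional access through [] on a label-indexed Series is deprecated in recent versions.)

f['shares']

90

f.iloc[0]

'FB'

Further, specific elements can be selected by passing a list of index labels,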

f[['shares', 'price']]

shares   90
price    3.2
dtype: object
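
The same selections can be written with the explicit indexers of a Series. The following is a small sketch that rebuilds the Series f from above and shows label-based access with loc next to position-based access with iloc:

```python
import pandas as pd

f = pd.Series(['FB', '2001-08-02', 90, 3.2],
              index=['name', 'date', 'shares', 'price'])

# .loc selects by index label
print(f.loc['shares'])

# .iloc selects by integer position
print(f.iloc[0])

# Both accept a list to select several elements at once
print(f.loc[['shares', 'price']])
print(f.iloc[[2, 3]])
```

Using loc and iloc makes it explicit whether a label or a position is meant, which avoids ambiguity when the index itself contains integers.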

DataFrame:

In pandas, a DataFrame is a two-dimensional labeled data structure that resembles a table or a spreadsheet. It is one of the primary data structures provided by the pandas library, which is widely used for data manipulation and analysis in Python.

A DataFrame consists of rows and columns, where each column can contain different types of data (e.g., numbers, strings, or dates). The rows are labeled with an index, which allows for easy identification and retrieval of specific rows.

DataFrames can be created from various data sources such as CSV or Excel files, SQL databases, or even manually by providing data in the form of lists, dictionaries, or NumPy arrays.

Here’s an example of creating a DataFrame from a dictionary:

data = { 'name' : ['AA', 'IBM', 'GOOG'],
   'date' : ['2001-12-01', '2012-02-10', '2010-04-09'],
   'shares' : [100, 30, 90],
   'price' : [12.3, 10.3, 32.2]
      }
df = pd.DataFrame(data)
df

   name        date  shares  price
0    AA  2001-12-01     100   12.3
1   IBM  2012-02-10      30   10.3
2  GOOG  2010-04-09      90   32.2
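
The other sources mentioned above work in a similar way. The sketch below is illustrative only: the names rows, arr and csv_text are made up for this example, and io.StringIO stands in for a CSV file on disk so that the snippet runs without external files.

```python
import io

import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
rows = [{'name': 'AA', 'shares': 100},
        {'name': 'IBM', 'shares': 30}]
df_rows = pd.DataFrame(rows)

# From a NumPy array: column names are supplied separately
arr = np.array([[100, 12.3], [30, 10.3]])
df_arr = pd.DataFrame(arr, columns=['shares', 'price'])

# From CSV text: io.StringIO stands in for a file on disk
csv_text = "name,shares,price\nAA,100,12.3\nIBM,30,10.3\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_rows)
print(df_arr)
print(df_csv)
```

With a real file, the path would be passed to pd.read_csv in place of the StringIO object.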

Additional columns can be added after defining a DataFrame as below,

df['owner'] = 'Unknown'
df

   name        date  shares  price    owner
0    AA  2001-12-01     100   12.3  Unknown
1   IBM  2012-02-10      30   10.3  Unknown
2  GOOG  2010-04-09      90   32.2  Unknown
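
A new column does not have to be a constant; it can also be computed from existing columns. As a minimal sketch, assuming a hypothetical value column (not part of the original example) that holds shares multiplied by price:

```python
import pandas as pd

df = pd.DataFrame({'name': ['AA', 'IBM', 'GOOG'],
                   'shares': [100, 30, 90],
                   'price': [12.3, 10.3, 32.2]})

# Element-wise product of two columns, stored as a new column
df['value'] = df['shares'] * df['price']
print(df)
```

The multiplication is applied row by row without an explicit loop, which is the usual way of deriving columns in pandas.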

Currently, the row index is set to 0, 1 and 2. It can be changed using the index attribute as below,

df.index = ['one', 'two', 'three']
df

       name        date  shares  price    owner
one      AA  2001-12-01     100   12.3  Unknown
two     IBM  2012-02-10      30   10.3  Unknown
three  GOOG  2010-04-09      90   32.2  Unknown

Further, any column of the DataFrame can be set as the index using the set_index() method, as shown below,

df = df.set_index(['name'])
df

            date  shares  price    owner
name
AA    2001-12-01     100   12.3  Unknown
IBM   2012-02-10      30   10.3  Unknown
GOOG  2010-04-09      90   32.2  Unknown
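
Setting a column as the index can be undone with reset_index(). The following sketch uses a reduced version of the stock data (the names indexed and restored are illustrative) to show the round trip:

```python
import pandas as pd

df = pd.DataFrame({'name': ['AA', 'IBM', 'GOOG'],
                   'shares': [100, 30, 90]})

# Promote the 'name' column to the row index
indexed = df.set_index('name')
print(indexed.loc['IBM', 'shares'])

# reset_index() moves the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())
```

Note that set_index() returns a new DataFrame, which is why the result is assigned to a variable.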

Data can be accessed in two ways, i.e. by column label or by row index. Column access is shown first,

# access data using column-index
df['shares']

name
AA      100
IBM      30
GOOG     90
Name: shares, dtype: int64

In pandas, you can access data by row index using the loc or iloc indexer.

The loc indexer is used to access rows and columns in a DataFrame using label-based indexing. You can specify the row index label(s) inside the loc indexer to retrieve the corresponding row(s).

Here’s an example:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data, index=['A', 'B', 'C'])

# Access a single row by label
row_a = df.loc['A']
print(row_a)

# Access multiple rows by labels
rows_b_c = df.loc[['B', 'C']]
print(rows_b_c)

Output:

Name         John
Age            25
City     New York
Name: A, dtype: object

    Name  Age    City
B  Alice   30  London
C    Bob   35   Paris

The iloc indexer is used for accessing rows and columns based on integer position indexing, similar to indexing in Python lists. You can specify the row index position(s) inside the iloc indexer to retrieve the corresponding row(s).

Here’s an example:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)

# Access a single row by position
row_0 = df.iloc[0]
print(row_0)

# Access multiple rows by positions
rows_1_2 = df.iloc[[1, 2]]
print(rows_1_2)

Output:

Name         John
Age            25
City     New York
Name: 0, dtype: object

    Name  Age    City
1  Alice   30  London
2    Bob   35   Paris

Both loc and iloc allow you to access specific rows in a DataFrame using different indexing methods based on your requirements.
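
Both indexers also accept a column selection and slices. The sketch below reuses the labeled DataFrame from the loc example; the one behavioral difference worth noting is that label slices in loc include the end label, while position slices in iloc exclude the end position, as in Python lists.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']},
                  index=['A', 'B', 'C'])

# Row label and column label together return a single value
print(df.loc['B', 'City'])

# Label slices with loc include the end label: rows A and B
print(df.loc['A':'B', ['Name', 'Age']])

# Position slices with iloc exclude the end position: rows 0 and 1
print(df.iloc[0:2, 0:2])
```

The last two selections return the same two rows and two columns, once addressed by label and once by position.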

Access all rows for a column:

To access all rows for a specific column in a pandas DataFrame, you can use the indexing operator [] with the column name.

Here’s an example:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)

# Access all rows for the 'Name' column
name_column = df['Name']
print(name_column)

Output:

0     John
1    Alice
2      Bob
Name: Name, dtype: object
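
Several columns can be selected at once by passing a list of column names to the indexing operator. As a short sketch (the names ages and subset are illustrative), a single name returns a Series, while a list of names returns a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

# A single column name returns a Series
ages = df['Age']
print(type(ages).__name__)

# A list of column names returns a DataFrame
subset = df[['Name', 'City']]
print(type(subset).__name__)
print(subset)
```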

Access specific element from the DataFrame:

To access a specific element from a pandas DataFrame, you can use the at or iat accessor for label-based or integer-based indexing, respectively.

The at accessor is used to access a scalar value by label-based indexing. You can specify the row and column labels to retrieve the corresponding element.

Here’s an example:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)

# Access a specific element by row and column labels
element = df.at[1, 'Name']
print(element)

Output:

Alice

In the above code, df.at[1, 'Name'] retrieves the element at row label 1 and column label 'Name'. Because the DataFrame uses the default integer index, row label 1 coincides with row position 1.

The iat accessor is used for accessing a scalar value by integer position indexing. You can specify the row and column positions to retrieve the corresponding element.

Here’s an example:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)

# Access a specific element by row and column positions
element = df.iat[1, 0]
print(element)

Output:

Alice

In the above code, df.iat[1, 0] retrieves the element at row position 1 and column position 0.

Both at and iat accessors allow you to access specific elements from a DataFrame based on labels or integer positions, respectively.
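
The same accessors can also be used to change a single value. The following is a minimal sketch; the replacement values 31 and 'Berlin' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

# Set one value by row label and column label
df.at[1, 'Age'] = 31

# Set one value by row position and column position
df.iat[2, 2] = 'Berlin'

print(df)
```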

Any column can be deleted from a pandas DataFrame using either the del statement or the drop() method.

Using del: The del statement allows you to delete a column in-place by specifying the column name after the del keyword.

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)

# Delete the 'City' column using del
del df['City']

print(df)

Output:

    Name  Age
0   John   25
1  Alice   30
2    Bob   35

In the above code, the del df[‘City’] statement deletes the ‘City’ column from the DataFrame df.

Using drop(): The drop() method allows you to remove a column by specifying the column name and setting the axis parameter to 1.

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}

df = pd.DataFrame(data)

# Delete the 'City' column using drop()
df = df.drop('City', axis=1)

print(df)

Output:

    Name  Age
0   John   25
1  Alice   30
2    Bob   35

In the above code, df.drop(‘City’, axis=1) removes the ‘City’ column from the DataFrame df and returns a new DataFrame without the specified column.

Both methods, del and drop(), allow you to delete a column from a pandas DataFrame, but they differ in how they handle the modification of the original DataFrame. del modifies the DataFrame in-place, while drop() returns a new DataFrame with the column removed, leaving the original DataFrame unchanged.
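
The difference can be checked directly. In the sketch below (the names trimmed, cols_after_drop and cols_after_del are illustrative), drop() is called with the columns keyword, which is an alternative to passing axis=1, and the column lists are compared before and after each operation:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

# drop() returns a new DataFrame; df itself keeps the 'City' column
trimmed = df.drop(columns=['City'])
cols_after_drop = df.columns.tolist()
print(cols_after_drop)
print(trimmed.columns.tolist())

# del removes the column from df itself
del df['City']
cols_after_del = df.columns.tolist()
print(cols_after_del)
```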

Conclusion

In conclusion, pandas is a powerful Python library for data manipulation and analysis. It provides two versatile and efficient data structures: the Series, a one-dimensional labeled array, and the DataFrame, which resembles a table or a spreadsheet.

These basics of pandas provide a foundation for further exploration and utilization of the library’s extensive functionality. Pandas offers a wide range of data manipulation, analysis, and visualization capabilities, making it an essential tool for working with structured data in Python.