Data processing is an important part of data analysis, because data is not always available in the desired format. Steps such as cleaning, restructuring, or merging are often required before the analysis can begin. NumPy, SciPy, Cython, and pandas are Python tools that can be used for fast data processing. Pandas is built on top of NumPy and provides a rich set of functions for processing various types of data. Working with pandas is also fast, easy, and often more expressive than working with lower-level tools: it combines the fast array processing of NumPy with the flexible data manipulation of spreadsheets and relational databases. Lastly, pandas integrates well with the matplotlib library, which makes it a handy tool for data analysis.
Data Structures in Pandas
Series:
The Series is a one-dimensional array that can store various data types, including mixed data types. The row labels in a Series are called the index. Any list, tuple, or dictionary can be converted into a Series using the pd.Series constructor, as shown below:
import pandas as pd
# converting tuple to Series
h = ('AA', '2012-02-01', 100, 10.2)
s = pd.Series(h)
type(s)
print(s)
<class 'pandas.core.series.Series'>
0 AA
1 2012-02-01
2 100
3 10.2
dtype: object
Converting dictionary to Series:
# converting dictionary to Series
d = {'name' : 'IBM', 'date' : '2010-09-08', 'shares' : 100, 'price' : 10.2}
ds = pd.Series(d)
type(ds)
<class 'pandas.core.series.Series'>
print(ds)
date 2010-09-08
name IBM
price 10.2
shares 100
dtype: object
Note that in the tuple conversion, the index is set to 0, 1, 2 and 3. Custom index labels can be provided as follows:
f = ['FB', '2001-08-02', 90, 3.2]
f = pd.Series(f, index = ['name', 'date', 'shares', 'price'])
print(f)
name FB
date 2001-08-02
shares 90
price 3.2
dtype: object
f['shares']
90
f[0]
'FB'
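Elements of the Series can be accessed by index label, e.g. f['shares'], or by position, e.g. f[0], as in the code above. In recent pandas versions, positional access on a label-indexed Series is better written as f.iloc[0], because treating an integer key as a position is deprecated. Several elements can be selected at once by passing a list of index labels: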
f[['shares', 'price']]
shares 90
price 3.2
dtype: object
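As a small sketch of the access patterns above, the snippet below rebuilds the same Series and reads values through the explicit .loc (label) and .iloc (position) indexers. The variable names by_label, by_position, and subset are illustrative only.

```python
import pandas as pd

# Same data as above, with custom index labels
f = pd.Series(['FB', '2001-08-02', 90, 3.2],
              index=['name', 'date', 'shares', 'price'])

by_label = f.loc['shares']            # label-based access
by_position = f.iloc[0]               # position-based access
subset = f.loc[['shares', 'price']]   # several labels at once, returns a Series

print(by_label)
print(by_position)
print(list(subset.index))
```

Using .loc and .iloc makes it explicit whether a key is meant as a label or as a position, which matters once the index itself contains integers.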
DataFrame:
In pandas, a DataFrame is a two-dimensional labeled data structure that resembles a table or a spreadsheet. It is one of the primary data structures provided by the pandas library, which is widely used for data manipulation and analysis in Python.
A DataFrame consists of rows and columns, where each column can contain different types of data (e.g., numbers, strings, or dates). The rows are labeled with an index, which allows for easy identification and retrieval of specific rows.
DataFrames can be created from various data sources such as CSV or Excel files, SQL databases, or even manually by providing data in the form of lists, dictionaries, or NumPy arrays.
Here’s an example of creating a DataFrame from a dictionary:
data = { 'name' : ['AA', 'IBM', 'GOOG'],
'date' : ['2001-12-01', '2012-02-10', '2010-04-09'],
'shares' : [100, 30, 90],
'price' : [12.3, 10.3, 32.2]
}
df = pd.DataFrame(data)
df
date name price shares
0 2001-12-01 AA 12.3 100
1 2012-02-10 IBM 10.3 30
2 2010-04-09 GOOG 32.2 90
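Besides a dictionary, the lists and NumPy arrays mentioned earlier can also serve as input. The following is a minimal sketch, assuming a subset of the same stock data; the names rows, arr, df_rows, and df_arr are hypothetical and chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# From a list of rows, with the column names given explicitly
rows = [['AA', 100, 12.3],
        ['IBM', 30, 10.3],
        ['GOOG', 90, 32.2]]
df_rows = pd.DataFrame(rows, columns=['name', 'shares', 'price'])

# From a 2-D NumPy array; every value in the array shares one dtype,
# so the integer share counts are stored as floats here
arr = np.array([[100, 12.3],
                [30, 10.3],
                [90, 32.2]])
df_arr = pd.DataFrame(arr, columns=['shares', 'price'])

print(df_rows.dtypes)
print(df_arr.dtypes)
```

A list of rows keeps a separate dtype per column, whereas a single NumPy array forces one common dtype, which is worth keeping in mind when mixing integers and floats.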
Additional columns can be added after the DataFrame has been defined:
df['owner'] = 'Unknown'
df
date name price shares owner
0 2001-12-01 AA 12.3 100 Unknown
1 2012-02-10 IBM 10.3 30 Unknown
2 2010-04-09 GOOG 32.2 90 Unknown
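A new column does not have to be a constant. As a hedged sketch, the snippet below adds one scalar column and one column computed from existing columns. The frame name stocks and the column name value are hypothetical and used only for this illustration.

```python
import pandas as pd

stocks = pd.DataFrame({'name': ['AA', 'IBM', 'GOOG'],
                       'shares': [100, 30, 90],
                       'price': [12.3, 10.3, 32.2]})

# A scalar is broadcast to every row
stocks['owner'] = 'Unknown'

# A column can also be computed element-wise from existing columns
stocks['value'] = stocks['shares'] * stocks['price']

print(stocks)
```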
Currently, the row index is set to 0, 1 and 2. It can be changed using the index attribute:
df.index = ['one', 'two', 'three']
df
date name price shares owner
one 2001-12-01 AA 12.3 100 Unknown
two 2012-02-10 IBM 10.3 30 Unknown
three 2010-04-09 GOOG 32.2 90 Unknown
Further, any column of the DataFrame can be set as the index using the set_index() method:
df = df.set_index(['name'])
df
date price shares owner
name
AA 2001-12-01 12.3 100 Unknown
IBM 2012-02-10 10.3 30 Unknown
GOOG 2010-04-09 32.2 90 Unknown
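Note that set_index() returns a new DataFrame rather than changing the existing one, which is why the result is assigned back to df above. The sketch below illustrates this on a smaller, hypothetical frame called stocks, and uses reset_index() to turn the index back into an ordinary column.

```python
import pandas as pd

stocks = pd.DataFrame({'name': ['AA', 'IBM', 'GOOG'],
                       'shares': [100, 30, 90]})

indexed = stocks.set_index('name')   # new DataFrame; stocks itself is unchanged
restored = indexed.reset_index()     # moves 'name' back to an ordinary column

print(list(stocks.columns))
print(list(indexed.columns))
print(indexed.loc['IBM', 'shares'])
```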
Data can be accessed in two ways: by column or by row index. Column access uses the column name:
# access data using column-index
df['shares']
name
AA 100
IBM 30
GOOG 90
Name: shares, dtype: int64
In pandas, you can access data by row index using the loc or iloc indexer.
The loc indexer is used to access rows and columns in a DataFrame using label-based indexing. You can specify the row index label(s) inside the loc indexer to retrieve the corresponding row(s).
Here’s an example:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data, index=['A', 'B', 'C'])
# Access a single row by label
row_a = df.loc['A']
print(row_a)
# Access multiple rows by labels
rows_b_c = df.loc[['B', 'C']]
print(rows_b_c)
Output:
Name John
Age 25
City New York
Name: A, dtype: object
Name Age City
B Alice 30 London
C Bob 35 Paris
The iloc indexer is used for accessing rows and columns based on integer position indexing, similar to indexing in Python lists. You can specify the row index position(s) inside the iloc indexer to retrieve the corresponding row(s).
Here’s an example:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Access a single row by position
row_0 = df.iloc[0]
print(row_0)
# Access multiple rows by positions
rows_1_2 = df.iloc[[1, 2]]
print(rows_1_2)
Output:
Name John
Age 25
City New York
Name: 0, dtype: object
Name Age City
1 Alice 30 London
2 Bob 35 Paris
Both loc and iloc allow you to access specific rows in a DataFrame using different indexing methods based on your requirements.
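Both indexers also accept a second argument that selects columns, and both accept slices. The following sketch, based on the same example data, shows one difference that is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. The names by_label and by_position are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']},
                  index=['A', 'B', 'C'])

# Rows 'A' through 'B' (inclusive) and two columns, selected by label
by_label = df.loc['A':'B', ['Name', 'Age']]

# Rows 0 and 1 and columns 0 and 1, selected by position (end excluded)
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_label.equals(by_position))
```

Here both expressions select the same two rows and two columns, so the final comparison reports that the results are equal.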
Access all rows for a column:
To access all rows for a specific column in a pandas DataFrame, you can use the indexing operator [] with the column name.
Here’s an example:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Access all rows for the 'Name' column
name_column = df['Name']
print(name_column)
Output:
0 John
1 Alice
2 Bob
Name: Name, dtype: object
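The same operator also accepts a list of column names. As a brief sketch on the same data, a single name returns a Series, while a list of names returns a DataFrame, even when the list contains only one name. The variable names single and several are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

single = df['Name']               # one column name -> Series
several = df[['Name', 'City']]    # list of column names -> DataFrame

print(type(single).__name__)
print(type(several).__name__)
print(list(several.columns))
```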
Access specific element from the DataFrame:
To access a specific element from a pandas DataFrame, you can use the at or iat accessor for label-based or integer-based indexing, respectively.
The at accessor is used to access a scalar value by label-based indexing. You can specify the row and column labels to retrieve the corresponding element.
Here’s an example:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Access a specific element by row and column labels
element = df.at[1, 'Name']
print(element)
Output:
Alice
In the above code, df.at[1, 'Name'] retrieves the element at row label 1 and column label 'Name'. Because this DataFrame uses the default integer index, row label 1 is also row position 1.
The iat accessor is used for accessing a scalar value by integer position indexing. You can specify the row and column positions to retrieve the corresponding element.
Here’s an example:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Access a specific element by row and column positions
element = df.iat[1, 0]
print(element)
Output:
Alice
In the above code, df.iat[1, 0] retrieves the element at row position 1 and column position 0.
Both at and iat accessors allow you to access specific elements from a DataFrame based on labels or integer positions, respectively.
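The difference between the two accessors is easier to see when the row labels are not integers. The sketch below assumes the same data with the string index 'A', 'B', 'C' used in the loc example, and also shows that at can be used to assign a single value.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']},
                  index=['A', 'B', 'C'])

by_label = df.at['B', 'Name']    # row label 'B', column label 'Name'
by_position = df.iat[1, 0]       # second row, first column

# at (and iat) can also set a single value in place
df.at['B', 'Age'] = 31

print(by_label, by_position)
print(df.at['B', 'Age'])
```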
A column can be deleted from a pandas DataFrame using either the del statement or the drop() method.
- Using del: The del statement allows you to delete a column in-place by specifying the column name after the del keyword.
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Delete the 'City' column using del
del df['City']
print(df)
Output:
Name Age
0 John 25
1 Alice 30
2 Bob 35
In the above code, the del df['City'] statement deletes the 'City' column from the DataFrame df.
- Using drop(): The drop() method allows you to remove a column by specifying the column name and setting the axis parameter to 1.
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 30, 35],
'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
# Delete the 'City' column using drop()
df = df.drop('City', axis=1)
print(df)
Output:
Name Age
0 John 25
1 Alice 30
2 Bob 35
In the above code, df.drop('City', axis=1) returns a new DataFrame without the 'City' column. The result is assigned back to df, so df now refers to the DataFrame without that column.
Both methods, del and drop(), allow you to delete a column from a pandas DataFrame, but they differ in how they handle the modification of the original DataFrame. del modifies the DataFrame in-place, while drop() returns a new DataFrame with the column removed, leaving the original DataFrame unchanged.
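The difference can be checked directly. In the sketch below, drop() is called with the columns keyword, which is equivalent to passing axis=1, and its result is stored under a separate, illustrative name (without_city) so that the original frame can be inspected before del is applied.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'London', 'Paris']})

# drop() returns a new DataFrame; df still has all three columns
without_city = df.drop(columns=['City'])
columns_after_drop = list(df.columns)

# del removes the column from df itself
del df['City']
columns_after_del = list(df.columns)

print(columns_after_drop)
print(list(without_city.columns))
print(columns_after_del)
```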
Conclusion
In conclusion, pandas is a powerful Python library for data manipulation and analysis. It provides a versatile and efficient data structure called DataFrame, which resembles a table or a spreadsheet.
These basics of pandas provide a foundation for further exploration and utilization of the library’s extensive functionality. Pandas offers a wide range of data manipulation, analysis, and visualization capabilities, making it an essential tool for working with structured data in Python.