Data Processing using Pandas

Data processing using pandas involves using the pandas library in Python to perform various operations on structured data, such as cleaning, transforming, aggregating, and analyzing data. Pandas provides powerful data structures and functions specifically designed for data manipulation and analysis.

What is Data Processing?

Data processing refers to the manipulation and transformation of raw data to extract useful information and derive meaningful insights. It involves a series of steps that aim to clean, organize, analyze, and visualize data to uncover patterns, trends, and relationships. Data processing plays a crucial role in various fields, including business, science, research, and technology, where data is generated and collected at an unprecedented rate.

Data processing typically involves the following steps (a minimal pandas sketch of a few of them is shown after the list):

  1. Data Collection: Gathering data from various sources, such as databases, files, sensors, APIs, or surveys.
  2. Data Cleaning: Identifying and handling missing values, outliers, inconsistencies, and errors in the data. This step ensures the data is accurate, complete, and ready for analysis.
  3. Data Transformation: Performing operations to reshape, reformat, or normalize the data. This may involve converting data types, scaling values, encoding categorical variables, or creating derived features.
  4. Data Integration: Combining multiple datasets or sources to create a unified view of the data. This step may involve merging, joining, or appending data based on common identifiers or keys.
  5. Data Analysis: Applying various statistical, mathematical, or computational techniques to gain insights from the data. This includes performing calculations, aggregations, correlations, or running machine learning algorithms.
  6. Data Visualization: Representing data visually through charts, graphs, or plots to facilitate understanding and interpretation. Visualizations help communicate patterns, trends, and relationships effectively.
  7. Data Interpretation: Interpreting the analyzed data to draw conclusions, make predictions, or support decision-making. This step involves extracting meaningful insights and actionable information from the processed data.
  8. Data Storage or Dissemination: Storing the processed data in databases, data warehouses, or files for future use. It may also involve sharing or disseminating the results with stakeholders through reports, dashboards, or presentations.
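
As a rough illustration, the following sketch walks through a few of these steps in pandas. The column names and values here are made up purely for demonstration:

import pandas as pd
import numpy as np

# hypothetical raw data with a missing value and a duplicate row
raw = pd.DataFrame({'city': ['Pune', 'Pune', 'Delhi', 'Delhi'],
                    'sales': [100, 100, np.nan, 250]})
clean = raw.drop_duplicates()                  # cleaning: remove duplicate rows
clean = clean.fillna(clean['sales'].mean())    # cleaning: fill the missing value
clean['sales_k'] = clean['sales'] / 1000       # transformation: derived feature
clean.groupby('city')['sales'].sum()           # analysis: aggregation per city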

Pandas, along with other Python libraries, provides a high-performance, flexible, and high-level environment for processing data.

Hierarchical indexing

Hierarchical indexing is an important feature of pandas that enables us to have multiple index levels.

Creating a MultiIndex:

Following is an example of a Series with multiple index levels.

import pandas as pd
data = pd.Series([10, 20, 30, 40, 15, 25, 35, 25],
                 index=[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                        ['obj1', 'obj2', 'obj3', 'obj4',
                         'obj1', 'obj2', 'obj3', 'obj4']])
data
a  obj1    10
   obj2    20
   obj3    30
   obj4    40
b  obj1    15
   obj2    25
   obj3    35
   obj4    25
dtype: int64

There are two levels of index here, i.e. (a, b) and (obj1, obj2, . . . obj4). The index can be seen using the ‘index’ attribute as shown below.

data.index
MultiIndex([('a', 'obj1'),
            ('a', 'obj2'),
            ('a', 'obj3'),
            ('a', 'obj4'),
            ('b', 'obj1'),
            ('b', 'obj2'),
            ('b', 'obj3'),
            ('b', 'obj4')],
           )

Partial indexing:

Choosing a particular index from a hierarchical index is known as partial indexing.

In the code below, the outer index ‘b’ is extracted from the data.

data['b']
obj1    15
obj2    25
obj3    35
obj4    25
dtype: int64

Further, the data can be extracted based on the inner level, i.e. ‘obj’. The result below shows the two available values for ‘obj2’ in the Series.

data[:, 'obj2']
a    20
b    25
dtype: int64

Unstack the data:

Unstack changes a row header to a column header. Since a row index is converted to a column index, the Series becomes a DataFrame in this case.

# unstack based on first level i.e. a, b
# note that data row-labels are a and b
data.unstack(0)
	a	b
obj1	10	15
obj2	20	25
obj3	30	35
obj4	40	25
# unstack based on second level i.e. 'obj'
data.unstack(1)
	obj1	obj2	obj3	obj4
a	10	20	30	40
b	15	25	35	25
# by default the innermost level is used for unstacking
d = data.unstack()
d
	obj1	obj2	obj3	obj4
a	10	20	30	40
b	15	25	35	25

The ‘stack()’ operation converts the column index back to a row index. In the above code, DataFrame ‘d’ has ‘obj’ as its column index; this can be converted into a row index using the ‘stack’ operation.

d.stack()
a  obj1    10
   obj2    20
   obj3    30
   obj4    40
b  obj1    15
   obj2    25
   obj3    35
   obj4    25
dtype: int64

Column indexing:

Remember that column indexing is possible for a DataFrame only (not for a Series), because column indexing requires two-dimensional data. Let’s create a new DataFrame as below to understand columns with multiple index levels.

import numpy as np
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],
                  columns=[['num1', 'num2', 'num3'], ['red', 'green', 'red']])
df
	        num1	num2	num3
                red	green	red
a	one	0	1	2
        two	3	4	5
b	three	6	7	8
        four	9	10	11
# display row index
df.index
MultiIndex([('a',   'one'),
            ('a',   'two'),
            ('b', 'three'),
            ('b',  'four')],
           )
# display column index
df.columns
MultiIndex([('num1',   'red'),
            ('num2', 'green'),
            ('num3',   'red')],
           )

We can give names to the indexes as below.

df.index.names=['key1', 'key2']
df.columns.names=['n', 'color']
df
	 n	num1	num2	num3
        color	red	green	red
key1	key2			
a	one	0	1	2
        two	3	4	5
b	three	6	7	8
        four	9	10	11

Now we can perform partial indexing operations. The following code shows various ways to access the data in the DataFrame.

# accessing the column for num1
df['num1']
	color	red
key1	key2	
a	one	0
        two	3
b	three	6
        four	9
# accessing the column for a
df.loc['a']
n	num1	num2	num3
color	red	green	red
key2			
one	0	1	2
two	3	4	5
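
Beyond this, a single column can be selected with a tuple of (outer, inner) labels, and ‘xs’ can pick rows by a named index level. A short sketch using the same ‘df’:

# select a single column using a tuple of column labels
df[('num1', 'red')]
# cross-section: rows where the index level 'key2' equals 'one'
df.xs('one', level='key2')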

Swap and sort level:

We can swap index levels using the ‘swaplevel’ method, which takes two level names (or numbers) as input.

df.swaplevel('key1', 'key2')
	n	num1	num2	num3
        color	red	green	red
key2	key1			
one	a	0	1	2
two	a	3	4	5
three	b	6	7	8
four	b	9	10	11

Levels can be sorted using the ‘sort_index’ method. In the code below, the data is sorted by the ‘key2’ labels, i.e. key2 is arranged alphabetically.

df.sort_index(level='key2')
	n	num1	num2	num3
        color	red	green	red
key1	key2			
b	four	9	10	11
a	one	0	1	2
b	three	6	7	8
a	two	3	4	5

File operations

In this section, various methods for reading and writing files are discussed.

Note: The CSV files can be downloaded from the link below.

Link: https://bitbucket.org/pythondsp/pandasguide/downloads/

Reading files:

Pandas supports various file formats, e.g. CSV, text, Excel, and different databases. Files are often stored in different shapes as well, e.g. they may or may not contain a header, footer, or comments, so the contents need some processing while being read. Pandas provides features that handle some of this common processing while reading a file.

Files can be read using the ‘read_csv’ or ‘read_table’ functions, as shown below. (Older pandas versions also offered ‘DataFrame.from_csv’, but it has been removed in modern releases.) Note that the output of both methods is the same, but we need to provide different parameters to read the file correctly.

# read_csv
df = pd.read_csv('ex1.csv')
df
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
# read_table
df = pd.read_table('ex1.csv', sep=',')
df
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
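
Real files often lack a header row or contain comment lines; ‘read_csv’ has parameters for such cases. A hedged sketch (the file name ‘ex2.csv’ and the column names are assumptions for illustration):

# read a file without a header row, supplying the column names ourselves
df2 = pd.read_csv('ex2.csv', header=None,
                  names=['a', 'b', 'c', 'd', 'message'])
# skip lines starting with '#' and treat the string 'NULL' as NaN
df3 = pd.read_csv('ex2.csv', comment='#', na_values=['NULL'])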

Writing data to a file:

In pandas, you can use the to_csv() function to write data from a DataFrame to a CSV file. Here’s an example code snippet that demonstrates how to use to_csv():

import pandas as pd

# Assuming you have a DataFrame named df
# Write the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
  1. df.to_csv(‘output.csv’, index=False): This line uses the to_csv() function to write the data from the DataFrame df to a CSV file named “output.csv”.
    • ‘output.csv’: This is the name of the output file you want to create. You can specify a different file name or path as per your requirement.
    • index=False: This parameter specifies whether to write the row index along with the data. By setting it to False, we exclude the row index from being written to the CSV file.
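
to_csv() accepts further parameters to control the output, e.g. a different separator, a subset of columns, or a placeholder for missing values. A short sketch (the file names are illustrative):

# write only selected columns, tab-separated, without the row index
df.to_csv('output.tsv', sep='\t', columns=['a', 'b'], index=False)
# represent missing values as the string 'NULL' in the output file
df.to_csv('output.csv', index=False, na_rep='NULL')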

Data transformation:

Transformation means cleaning and filtering the data, e.g. removing duplicate entries and replacing NaN values.

Data transformation in pandas refers to the process of modifying, reorganizing, or aggregating data to create a new representation that better suits the analysis or visualization needs. Pandas provides a wide range of functions and methods to perform various data transformation operations.

Removing duplicates

Removing duplicate entries is quite easy with the ‘drop_duplicates’ method. Also, the ‘duplicated()’ method can be used to check for duplicate entries, as shown below.

# create DataFrame with duplicate entries
df = pd.DataFrame({'k1':['one']*3 + ['two']*4, 'k2':[1,1,2,3,3,4,4]})
df
	k1	k2
0	one	1
1	one	1
2	one	2
3	two	3
4	two	3
5	two	4
6	two	4
# see the duplicate entries
df.duplicated()
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
# drop the duplicate entries
df.drop_duplicates()
	k1	k2
0	one	1
2	one	2
3	two	3
5	two	4

By default, the ‘drop_duplicates’ method keeps the first occurrence and removes the later duplicates. If we want to keep the last entry instead, the ‘keep’ keyword can be used.

df.drop_duplicates(keep="last")
	k1	k2
1	one	1
2	one	2
4	two	3
6	two	4

We can drop duplicate entries based on specific columns as well.

# drop duplicate entries based on k1 only
df.drop_duplicates(['k1'])
	k1	k2
0	one	1
3	two	3

Replacing values

Replacing values is very easy using pandas, as below.

# replace 'one' with 'One'
df.replace('one', 'One')
	k1	k2
0	One	1
1	One	1
2	One	2
3	two	3
4	two	3
5	two	4
6	two	4

Arguments can be passed as a dictionary as well.

df.replace({'one':'One', 3:30})
	k1	k2
0	One	1
1	One	1
2	One	2
3	two	30
4	two	30
5	two	4
6	two	4

Groupby and data aggregation:

Groupby splits the data into groups based on one or more keys; aggregate statistics can then be computed per group. First, create a sample DataFrame as below.

df = pd.DataFrame({'k1':['a', 'a', 'b', 'b', 'a'],
'k2':['one', 'two', 'one', 'two', 'one'],
'data1': [2, 3, 3, 2, 4],
'data2': [5, 5, 5, 5, 10]})
df
	k1	k2	data1	data2
0	a	one	2	5
1	a	two	3	5
2	b	one	3	5
3	b	two	2	5
4	a	one	4	10

Now, create a group based on ‘k1’ and find the mean value as below. In the following code, rows (0, 1, 4) and (2, 3) are grouped together; therefore, the mean values are 3 and 2.5 respectively.

gp1 = df['data1'].groupby(df['k1'])
gp1
gp1.mean()
k1
a    3.0
b    2.5
Name: data1, dtype: float64
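
Groups can be formed on multiple keys as well; the result then carries a hierarchical index, connecting back to the earlier section:

gp2 = df['data1'].groupby([df['k1'], df['k2']]).mean()
gp2
k1  k2
a   one    3.0
    two    3.0
b   one    3.0
    two    2.0
Name: data1, dtype: float64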

Iterating over group

The groupby operation supports iteration, which generates tuples with two values, i.e. the group name and the corresponding data.

for name, group in gp1:
    print(name)
    print(group)
a
0    2
1    3
4    4
Name: data1, dtype: int64
b
2    3
3    2
Name: data1, dtype: int64
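
A common idiom is to collect the groups into a dictionary for quick lookup by group name:

# build a {group-name: data} dictionary from the groupby object
groups = dict(list(gp1))
groups['a']
0    2
1    3
4    4
Name: data1, dtype: int64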

Data aggregation

We can perform various aggregation operations on the grouped data as well.

gp1.max()
k1
a    4
b    3
Name: data1, dtype: int64
gp1.min()
k1
a    2
b    2
Name: data1, dtype: int64
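
Several aggregations can also be computed in one call using ‘agg’, which accepts a list of function names:

gp1.agg(['min', 'max', 'mean'])
    min  max  mean
k1
a     2    4   3.0
b     2    3   2.5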

Conclusion

Data processing is a crucial step in working with data, regardless of the field or domain. It involves a series of operations to transform raw data into a format that is more meaningful, useful, and actionable.

By effectively cleaning, integrating, transforming, analyzing, and visualizing data, data processing empowers decision-makers to make informed decisions, solve problems, optimize processes, and drive innovation in various domains.

By Akshay Tekam

Software developer, data science enthusiast, content creator.
