Pandas is a popular open-source Python library that provides powerful and efficient data manipulation and analysis tools. It is widely used for working with structured data, making it an essential tool in data science, machine learning, and data analysis workflows. The name “pandas” is derived from “panel data” and “Python data analysis.” It was created by Wes McKinney in 2008 as a project to enhance data analysis capabilities in Python.
pandas provides a dedicated function for each common file type. The sections below explain what reading a file means and then cover the most commonly used reading functions.
What Reading a File Means
Reading a file in pandas refers to the process of importing data from an external file into a pandas DataFrame. The DataFrame is a two-dimensional data structure that allows for efficient handling, analysis, and manipulation of structured data.
Reading a file in pandas involves using specific functions provided by the library that are designed to read different file formats such as CSV, Excel, JSON, SQL databases, and more. These functions parse the contents of the file and create a DataFrame object that holds the data.
Note: The CSV files used in this article can be downloaded from the link below.
Link: https://bitbucket.org/pythondsp/pandasguide/downloads/
CSV Files:
To read data from a CSV (Comma-Separated Values) file, you can use the read_csv() function.
import pandas as pd
# Read a CSV file
df = pd.read_csv('data.csv')
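By default, the read_csv() function expects a comma as the delimiter and creates a DataFrame from the data in the CSV file. For files that use another separator, pass it with the sep parameter (for example, sep=';'). With sep=None and engine='python', pandas tries to detect the delimiter automatically.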
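As a small illustration of how the separator affects parsing, the sketch below reads a semicolon-separated sample from memory instead of from disk. The sample text and the variable names are assumptions made for this example only.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for a semicolon-separated file.
raw = "title;year\nThe Rising Son;1990\nCountry;2000\n"

# With the default sep=',' each line is read as one single column.
df_default = pd.read_csv(io.StringIO(raw))

# Passing the separator explicitly splits the data into two columns.
df_semicolon = pd.read_csv(io.StringIO(raw), sep=";")

print(df_default.shape)
print(df_semicolon.shape)
```

Comparing the two shapes is a quick way to notice that a file was read with the wrong separator.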
Excel Files:
To read data from an Excel file, you can use the read_excel() function.
import pandas as pd
# Read an Excel file
df = pd.read_excel('data.xlsx')
The read_excel() function reads the first sheet of the Excel file by default, but you can specify a specific sheet name or index using the sheet_name parameter.
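The sheet_name parameter could be used along the following lines. This is only a sketch: the helper name load_sheets is hypothetical, and reading .xlsx files additionally assumes that an Excel engine such as openpyxl is installed.

```python
import pandas as pd

def load_sheets(path):
    """Hypothetical helper: return the first sheet and all sheets of a workbook."""
    # sheet_name=0 (the default) selects the first sheet by position.
    first = pd.read_excel(path, sheet_name=0)
    # sheet_name=None loads every sheet and returns a dict of DataFrames
    # keyed by sheet name.
    all_sheets = pd.read_excel(path, sheet_name=None)
    return first, all_sheets
```

A call such as load_sheets('data.xlsx') would then return one DataFrame plus a dictionary of DataFrames. A sheet can also be selected by its name, for example sheet_name='Sheet1', if the workbook contains a sheet with that name.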
JSON Files:
To read data from a JSON (JavaScript Object Notation) file, you can use the read_json() function.
import pandas as pd
# Read a JSON file
df = pd.read_json('data.json')
The read_json() function parses the JSON content and creates a DataFrame from it. The expected layout of the JSON data can be adjusted with the orient parameter.
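To make the JSON case more concrete, here is a minimal sketch that parses two common layouts from memory: a list of records and JSON Lines (one object per line). The sample strings are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical JSON content: a list of records, one object per row.
records = '[{"title": "Country", "year": 2000}, {"title": "Rebel", "year": 1970}]'
df_records = pd.read_json(io.StringIO(records))

# Hypothetical JSON Lines content: one JSON object per line.
json_lines = '{"title": "Country", "year": 2000}\n{"title": "Rebel", "year": 1970}\n'
df_lines = pd.read_json(io.StringIO(json_lines), lines=True)

print(df_records)
print(df_lines)
```

Both calls should produce the same two-column DataFrame; the lines=True argument is what tells pandas to treat each line as a separate record.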
SQL Databases:
To read data from an SQL database, you can use the read_sql() function, which requires a database connection.
import pandas as pd
import sqlite3
# Connect to the database
conn = sqlite3.connect('database.db')
# Read data from a SQL query
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, conn)
The read_sql() function executes the SQL query and retrieves the result as a DataFrame.
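The following self-contained sketch shows the same idea without needing an existing database file. It builds a temporary in-memory SQLite database; the table name, column names, and rows are assumptions chosen for the example.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database (hypothetical table and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("Country", 2000), ("Rebel", 1970), ("Suzanne", 1996)],
)
conn.commit()

# Pass values through params instead of formatting them into the SQL string.
query = "SELECT title, year FROM titles WHERE year >= ? ORDER BY year"
df = pd.read_sql(query, conn, params=(1990,))
conn.close()

print(df)
```

Using the params argument keeps the query text fixed and leaves value quoting to the database driver.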
These are just a few examples of reading files in pandas. The library provides functions for reading data from many other file formats, including HTML, HDF5, Parquet, and more. The appropriate function depends on the specific file format you are working with.
In this section, two data files are used: ‘titles.csv’ and ‘cast.csv’. The ‘titles.csv’ file contains a list of movies with their release years, whereas the ‘cast.csv’ file has six columns, which store the movie title, release year, cast member name, type (actor/actress), character, and a rating for the actor (column ‘n’), as shown below:
import pandas as pd
casts = pd.read_csv('cast.csv', index_col=None)
casts.head()
                  title  year      name   type                character     n
0        Closet Monster  2015  Buffy #1  actor                  Buffy 4  31.0
1       Suuri illusioni  1985    Homo $  actor                   Guests  22.0
2   Battle of the Sexes  2017   $hutter  actor          Bobby Riggs Fan  10.0
3  Secret in Their Eyes  2015   $hutter  actor          2002 Dodger Fan   NaN
4            Steve Jobs  2015   $hutter  actor  1988 Opera House Patron   NaN
titles = pd.read_csv('titles.csv', index_col=None)
titles.tail()
- read_csv : reads the data from the CSV file.
- index_col=None : no column of the file is used as the index, i.e. the first column is treated as data and pandas generates a numeric index.
- head() : shows only the first five rows of the DataFrame.
- tail() : shows only the last five rows of the DataFrame.
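The effect of index_col, head() and tail() can be tried out without downloading anything. The sketch below uses a seven-row in-memory sample standing in for 'titles.csv'; the sample and the variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'titles.csv'.
raw = (
    "title,year\n"
    "The Rising Son,1990\n"
    "The Thousand Plane Raid,1969\n"
    "Crucea de piatra,1993\n"
    "Country,2000\n"
    "Gaiking II,2011\n"
    "Rebel,1970\n"
    "Suzanne,1996\n"
)

# index_col=None: pandas generates a 0..n-1 index; 'title' stays a data column.
titles_default = pd.read_csv(io.StringIO(raw), index_col=None)

# index_col=0: the first column ('title') becomes the row index instead.
titles_by_title = pd.read_csv(io.StringIO(raw), index_col=0)

first_rows = titles_default.head()  # first five rows
last_rows = titles_default.tail()   # last five rows

print(first_rows)
print(last_rows)
print(titles_by_title.head(2))
```

With index_col=0 the DataFrame has only one data column left, because the titles have moved into the index.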
If reading the file fails because of an encoding error, the encoding can be specified explicitly:
titles = pd.read_csv('titles.csv', index_col=None, encoding='utf-8')
If we simply type the name of the DataFrame (i.e. titles in the code below), pandas displays a truncated view of a large file: only the first and last few rows, with the exact number depending on the pandas version and display settings. The size of this view can be limited with ‘set_option’, as shown below. The total number of rows and columns is displayed at the end of the table.
pd.set_option('display.max_rows', 10, 'display.max_columns', 10)
titles
                         title  year
0               The Rising Son  1990
1      The Thousand Plane Raid  1969
2             Crucea de piatra  1993
3                      Country  2000
4                   Gaiking II  2011
...                        ...   ...
49995                    Rebel  1970
49996                  Suzanne  1996
49997                    Bomba  2013
49998     Aao Jao Ghar Tumhara  1984
49999               Mrs. Munck  1995
[50000 rows x 2 columns]
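If the display limit should apply only to a single printout rather than to the whole session, one possible approach is pd.option_context, which restores the previous settings when the block ends. The sketch below uses a made-up 100-row DataFrame in place of the titles data.

```python
import pandas as pd

# Hypothetical DataFrame, large enough to be truncated when displayed.
df = pd.DataFrame({"n": range(100)})

# Remember the session-wide setting so it can be compared afterwards.
before = pd.get_option("display.max_rows")

# Inside the block at most 10 rows are shown; the limit is temporary.
with pd.option_context("display.max_rows", 10):
    short_view = repr(df)

after = pd.get_option("display.max_rows")

print(short_view)
print(before == after)
```

The full option names ('display.max_rows', 'display.max_columns') are used here because shortened names can be ambiguous in some pandas versions.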
len():
The built-in ‘len’ function can be used to see the total number of rows in the DataFrame:
len(titles)
50000
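A related way to get the row count is the DataFrame's shape attribute, which reports the number of rows and columns together. The three-row DataFrame below is a hypothetical stand-in for the titles data.

```python
import pandas as pd

# Hypothetical stand-in for the 'titles' DataFrame.
titles_sample = pd.DataFrame(
    {"title": ["Country", "Rebel", "Suzanne"], "year": [2000, 1970, 1996]}
)

n_rows = len(titles_sample)            # number of rows only
shape_rows, n_cols = titles_sample.shape  # (rows, columns) tuple

print(n_rows, shape_rows, n_cols)
```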
Note: The head() and tail() commands are useful for reminding ourselves of the header and contents of the file. By default they show the first and last five rows of the DataFrame, respectively. The number of rows displayed can be changed by passing it as an argument:
titles.head(3)
title year
0 The Rising Son 1990
1 The Thousand Plane Raid 1969
2 Crucea de piatra 1993
Conclusion
In conclusion, reading files in pandas is a fundamental task that allows you to bring external data into a pandas DataFrame for further analysis and manipulation. Here are the key points to remember:
- File Formats: Pandas provides specialized functions for reading various file formats, including CSV, Excel, JSON, SQL databases, and more.
- File Reading Functions: Some commonly used functions for reading files in pandas are read_csv() for CSV files, read_excel() for Excel files, read_json() for JSON files, and read_sql() for SQL databases.
- DataFrame Creation: The file reading functions in pandas create a DataFrame object that holds the data from the file. The DataFrame allows for easy manipulation and analysis of the data.
- Options and Parameters: File reading functions provide various options and parameters to customize the reading process, such as specifying the file path, sheet name, delimiter, data types, and more.
- Data Cleaning: After reading the file into a DataFrame, you can perform data cleaning tasks, such as handling missing values, data type conversions, removing duplicates, and restructuring the data if necessary.
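The last two points can be sketched together: a few reading options are set while loading, followed by basic cleaning steps. The sample below is hypothetical (it reuses values from the cast output above and deliberately repeats one row), and the choice of filling missing 'n' values with 0 is only an illustration, not a recommendation.

```python
import io
import pandas as pd

# Hypothetical sample with a duplicated row and missing values in 'n'.
raw = (
    "title,year,name,n\n"
    "Closet Monster,2015,Buffy #1,31\n"
    "Steve Jobs,2015,$hutter,\n"
    "Steve Jobs,2015,$hutter,\n"
)

# Reading options: keep only some columns and fix the dtype of 'year'.
df = pd.read_csv(
    io.StringIO(raw),
    usecols=["title", "year", "n"],
    dtype={"year": "int64"},
)

# Cleaning: drop exact duplicate rows, then replace missing 'n' values with 0.
cleaned = df.drop_duplicates().fillna({"n": 0})

print(df)
print(cleaned)
```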
By leveraging the file reading capabilities of pandas, you can efficiently bring data from different file formats into a convenient and versatile DataFrame structure. This enables you to perform powerful data analysis and manipulation tasks using the rich functionality provided by pandas and its integration with other Python libraries.