How to Read a File in Pandas

Pandas is a popular open-source Python library that provides powerful and efficient data manipulation and analysis tools. It is widely used for working with structured data, making it an essential tool in data science, machine learning, and data analysis workflows. The name “pandas” is derived from “panel data” and “Python data analysis.” It was created by Wes McKinney in 2008 as a project to enhance data analysis capabilities in Python.

In pandas, you can read various types of files using specific functions provided by the library. This tutorial covers the most commonly used ones.

What Does Reading a File Mean?

Reading a file in pandas refers to the process of importing data from an external file into a pandas DataFrame. The DataFrame is a two-dimensional data structure that allows for efficient handling, analysis, and manipulation of structured data.

Reading a file in pandas involves using specific functions provided by the library that are designed to read different file formats such as CSV, Excel, JSON, SQL databases, and more. These functions parse the contents of the file and create a DataFrame object that holds the data.

Note: The CSV files used in this tutorial can be downloaded from the link below.

Link: https://bitbucket.org/pythondsp/pandasguide/downloads/

CSV Files:

To read data from a CSV (Comma-Separated Values) file, you can use the read_csv() function.

import pandas as pd

# Read a CSV file
df = pd.read_csv('data.csv')

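By default, the read_csv() function assumes that fields are separated by commas and creates a DataFrame from the data in the CSV file. If the file uses a different delimiter, such as a semicolon or a tab, specify it with the sep parameter.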
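When a file uses a separator other than a comma, it can be passed explicitly. The following is a minimal sketch: the sample data is made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data; StringIO stands in for a real file.
raw = "title;year\nThe Rising Son;1990\nCountry;2000\n"

# sep tells read_csv which character separates the fields (the default is ',').
df = pd.read_csv(io.StringIO(raw), sep=";")

print(df.columns.tolist())
print(df.shape)
```

With a real file, the first argument would simply be the file path instead of the StringIO object.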

Excel Files:

To read data from an Excel file, you can use the read_excel() function.

import pandas as pd

# Read an Excel file
df = pd.read_excel('data.xlsx')

The read_excel() function reads the first sheet of the Excel file by default, but you can specify a specific sheet name or index using the sheet_name parameter.

JSON Files:

To read data from a JSON (JavaScript Object Notation) file, you can use the read_json() function.

import pandas as pd

# Read a JSON file
df = pd.read_json('data.json')

The read_json() function creates a DataFrame from the data in the JSON file.
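How the JSON maps onto rows and columns depends on its layout. As a small sketch, assuming the file holds a list of objects (one object per row), the keys become the column names. The sample records are invented for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical JSON: a list of records, one object per row.
raw = '[{"title": "Country", "year": 2000}, {"title": "Rebel", "year": 1970}]'

# Each object becomes a row; the keys become column names.
df = pd.read_json(io.StringIO(raw))

print(df)
```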

SQL Databases:

To read data from an SQL database, you can use the read_sql() function, which requires a database connection.

import pandas as pd
import sqlite3

# Connect to the database
conn = sqlite3.connect('database.db')

# Read data from a SQL query
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, conn)

The read_sql() function executes the SQL query and retrieves the result as a DataFrame.
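The snippet above assumes that database.db and table_name already exist. For a version that can be run as-is, the sketch below builds a small in-memory SQLite database first. The table name, rows, and the year filter are all made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical 'titles' table, so nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("Country", 2000), ("Rebel", 1970), ("Suzanne", 1996)],
)

# The ? placeholder is filled in from params, so the SQL is not built by hand.
query = "SELECT title, year FROM titles WHERE year > ? ORDER BY year"
df = pd.read_sql(query, conn, params=(1980,))
conn.close()

print(df)
```

Passing values through params rather than formatting them into the query string is generally the safer habit, since it leaves quoting to the database driver.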

These are just a few examples of reading files in pandas. The library provides functions for reading data from many other file formats, including HTML, HDF5, Parquet, and more. The appropriate function depends on the specific file format you are working with.

In this section, two data files are used, i.e. ‘titles.csv’ and ‘cast.csv’. The ‘titles.csv’ file contains the list of movies with their release years, whereas the ‘cast.csv’ file has six columns, which store the title of the movie, the release year, the cast member's name, the type (actor/actress), the character and the rating for the actor (column n), as shown below,

import pandas as pd
casts = pd.read_csv('cast.csv', index_col=None)
casts.head()

                  title  year      name   type                character     n
0        Closet Monster  2015  Buffy #1  actor                  Buffy 4  31.0
1       Suuri illusioni  1985    Homo $  actor                   Guests  22.0
2   Battle of the Sexes  2017   $hutter  actor          Bobby Riggs Fan  10.0
3  Secret in Their Eyes  2015   $hutter  actor          2002 Dodger Fan   NaN
4            Steve Jobs  2015   $hutter  actor  1988 Opera House Patron   NaN

titles = pd.read_csv('titles.csv', index_col=None)
titles.tail()

read_csv : reads the data from the CSV file.
index_col=None : no column is used as the index, i.e. the first column is treated as data.
head() : shows only the first five rows of the DataFrame.
tail() : shows only the last five rows of the DataFrame.
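To make the effect of index_col concrete, the sketch below reads the same data twice, once with index_col=None and once with index_col=0. The two-row sample is a stand-in for a real file and the variable names are chosen for illustration only.

```python
import io

import pandas as pd

# Small stand-in for a CSV file on disk.
raw = "title,year\nThe Rising Son,1990\nCountry,2000\n"

# index_col=None: pandas adds a default integer index (0, 1, ...).
default_index = pd.read_csv(io.StringIO(raw), index_col=None)

# index_col=0: the first column ('title') becomes the index instead.
title_index = pd.read_csv(io.StringIO(raw), index_col=0)

print(default_index.index.tolist())
print(title_index.index.tolist())
print(title_index.columns.tolist())
```

In the second case 'title' is no longer a regular column, so only 'year' remains in the column list.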

If the file cannot be read because of an encoding error, the encoding can be specified explicitly,

titles = pd.read_csv('titles.csv', index_col=None, encoding='utf-8')
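The encoding argument matters most when a file was saved in something other than UTF-8. As a hedged illustration, the sketch below encodes a made-up row as Latin-1 and reads it back with the matching encoding. io.BytesIO stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical file content saved as Latin-1 rather than UTF-8.
raw_bytes = "title,year\nAmélie,2001\n".encode("latin-1")

# Passing the matching encoding lets pandas decode the accented character.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(df.loc[0, "title"])
```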

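If we simply type the name of the DataFrame (i.e. titles in the code below), pandas displays all of the rows only if the DataFrame is small. For a large DataFrame, only the first and last few rows are shown (the exact number depends on the pandas version and display settings), along with the columns. These limits can be changed using ‘set_option’ as below. Further, when the output is truncated, the total number of rows and columns is displayed at the end of the table.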

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
titles

                         title  year
0               The Rising Son  1990
1      The Thousand Plane Raid  1969
2             Crucea de piatra  1993
3                      Country  2000
4                   Gaiking II  2011
...                        ...   ...
49995                    Rebel  1970
49996                  Suzanne  1996
49997                    Bomba  2013
49998     Aao Jao Ghar Tumhara  1984
49999               Mrs. Munck  1995

[50000 rows x 2 columns]
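A display option set with set_option stays in effect for the rest of the session. To change it only temporarily, pandas also provides option_context. The sketch below uses it on a made-up 100-row DataFrame and captures the truncated text that would be displayed.

```python
import pandas as pd

# Hypothetical 100-row DataFrame, large enough to be truncated.
df = pd.DataFrame({"n": range(100)})

# option_context applies the setting only inside the with-block.
with pd.option_context("display.max_rows", 10):
    text = repr(df)

print(text)
```

Outside the with-block the previous value of display.max_rows is restored automatically.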

Len:

The built-in len() function returns the total number of rows in the DataFrame,

len(titles)
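When the number of columns is needed as well, the shape attribute returns both counts at once. A brief sketch on a made-up three-row DataFrame:

```python
import pandas as pd

# Small hypothetical DataFrame standing in for 'titles'.
df = pd.DataFrame(
    {"title": ["Country", "Rebel", "Suzanne"], "year": [2000, 1970, 1996]}
)

n_rows = len(df)              # number of rows only
shape_rows, n_cols = df.shape  # (rows, columns) as a tuple

print(n_rows, shape_rows, n_cols)
```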

Note: The head() and tail() commands can be used to remind ourselves about the header and contents of the DataFrame. By default, these two commands show its first and last five rows respectively. Further, we can change the number of rows displayed by passing it as an argument,

titles.head(3)

                     title  year
0           The Rising Son  1990
1  The Thousand Plane Raid  1969
2         Crucea de piatra  1993
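The same argument works for tail(). As a short sketch on a made-up five-row DataFrame, head(2) and tail(2) return the first two and last two rows, and both keep the original index labels.

```python
import pandas as pd

# Hypothetical sample; the real 'titles' DataFrame has many more rows.
df = pd.DataFrame({"year": [1990, 1969, 1993, 2000, 2011]})

first_two = df.head(2)  # rows with index 0 and 1
last_two = df.tail(2)   # rows with index 3 and 4

print(first_two.index.tolist())
print(last_two.index.tolist())
```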

Conclusion

In conclusion, reading files in pandas is a fundamental task that allows you to bring external data into a pandas DataFrame for further analysis and manipulation. Here are the key points to remember:

File Formats: Pandas provides specialized functions for reading various file formats, including CSV, Excel, JSON, SQL databases, and more.
File Reading Functions: Some commonly used functions for reading files in pandas are read_csv() for CSV files, read_excel() for Excel files, read_json() for JSON files, and read_sql() for SQL databases.
DataFrame Creation: The file reading functions in pandas create a DataFrame object that holds the data from the file. The DataFrame allows for easy manipulation and analysis of the data.
Options and Parameters: File reading functions provide various options and parameters to customize the reading process, such as specifying the file path, sheet name, delimiter, data types, and more.
Data Cleaning: After reading the file into a DataFrame, you can perform data cleaning tasks, such as handling missing values, data type conversions, removing duplicates, and restructuring the data if necessary.

By leveraging the file reading capabilities of pandas, you can efficiently bring data from different file formats into a convenient and versatile DataFrame structure. This enables you to perform powerful data analysis and manipulation tasks using the rich functionality provided by pandas and its integration with other Python libraries.