APIs, or Application Programming Interfaces, are sets of protocols and tools that allow different software applications to communicate with each other. Data collection using APIs involves accessing these interfaces provided by online services, platforms, or data providers to retrieve structured data. Using an API for data collection is a powerful way to obtain real-time or historical data for data analysis, machine learning models, or any data-driven application. So, if you want to learn how to create a dataset by collecting data with an API, this article is for you. In this article, I’ll take you through the task of Data Collection with an API using Python.
Data Collection with an API
Most of the time, a data engineer is responsible for working with APIs to collect data and create datasets according to the needs of the business. Below is the process you can follow for the task of data collection with an API:
- Clearly outline what data is needed, the purpose of the data collection, and how it will be used in your analysis or modelling.
- Read API documentation to know what data you can get, in what format you can get the data, and how you can get it.
- Register or sign up to use the API, if necessary, to obtain API keys.
- Use programming languages that support HTTP requests, like Python with libraries such as requests or urllib for making API calls.
- Develop a script that makes requests to the API endpoints you identified. Handle pagination and iterate over pages of data if the API splits the data across multiple responses.
- Code the script to parse the received data (usually in JSON or XML format) and convert it into a usable format like a DataFrame in Python using Pandas.
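The request, pagination, and parsing steps above can be sketched in a few lines. This is a minimal illustration that is not tied to any particular service: the `fetch_page` callable, the `items` and `next_page` field names, and the `fake_fetch` data are all assumptions made for the example. The documentation of a real API tells you the actual parameter and field names.

```python
import json

def collect_all_pages(fetch_page):
    """Call fetch_page(page_number) until the API reports no further page.

    fetch_page is a hypothetical callable returning a JSON string shaped like
    {"items": [...], "next_page": <int or null>}.
    """
    records = []
    page = 1
    while page is not None:
        payload = json.loads(fetch_page(page))  # parse the JSON response body
        records.extend(payload["items"])        # keep this page's records
        page = payload["next_page"]             # JSON null (None) ends the loop
    return records

# Stand-in for real HTTP calls so the sketch runs offline (made-up data).
FAKE_PAGES = {
    1: {"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], "next_page": 2},
    2: {"items": [{"id": 3, "name": "c"}], "next_page": None},
}

def fake_fetch(page):
    return json.dumps(FAKE_PAGES[page])

rows = collect_all_pages(fake_fetch)
print(len(rows))  # 3
```

With a real API, `fetch_page` would wrap an HTTP call (for example with requests), and the resulting list of dictionaries could be passed to `pd.DataFrame(rows)`.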
So, in this article, I will be using the Spotify API to collect real-time music data from Spotify and create a dataset of tracks with their audio features and popularity.
Data Collection with Spotify API using Python
So, to collect data from the Spotify API, you first need to know what data you can get and in what format it comes. You can learn everything about it from the Spotify Web API documentation.
Now, follow the process mentioned below to sign up for using the API for data collection:
- Create a Spotify Developer account at Spotify for Developers.
- Go to Create an App and get your Client ID and Client Secret.
- If it asks for a website, feel free to use datatoinfolabs.com if you don’t have a website.
Now, let’s start with the data collection task with the Spotify API. I’ll first write code to authenticate with the Spotify API and obtain an access token using the Client Credentials Flow:
import requests
import base64

# replace with your own client id and client secret
CLIENT_ID = 'Your Client ID'
CLIENT_SECRET = 'Your Client Secret'

# Base64 encode the client id and client secret
client_credentials = f"{CLIENT_ID}:{CLIENT_SECRET}"
client_credentials_base64 = base64.b64encode(client_credentials.encode())

# request the access token
token_url = 'https://accounts.spotify.com/api/token'
headers = {
    'Authorization': f'Basic {client_credentials_base64.decode()}'
}
data = {
    'grant_type': 'client_credentials'
}
response = requests.post(token_url, data=data, headers=headers)

if response.status_code == 200:
    access_token = response.json()['access_token']
    print("Access token obtained successfully.")
else:
    print("Error obtaining access token.")
    exit()
The access token obtained is crucial as it is used in subsequent requests to the Spotify API to authenticate and authorize those requests. Without this token, your application will not be able to interact with Spotify’s data and services under the Client Credentials flow. This flow is specifically used for server-to-server interactions where no user authorization is required, which is suitable for accessing publicly available information such as music data, playlists, etc.
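To make the role of the token concrete, here is a small sketch of how it is attached to a later request: it travels in the Authorization header as a Bearer token. The helper name `build_track_request` and the placeholder token are assumptions for illustration. The sketch uses the standard library's urllib only to build the request without sending it, so it runs without network access.

```python
import urllib.request

API_BASE = "https://api.spotify.com/v1"

def build_track_request(track_id, access_token):
    """Build (but do not send) a GET request for one track's metadata."""
    request = urllib.request.Request(f"{API_BASE}/tracks/{track_id}")
    # The access token goes in the Authorization header as a Bearer token.
    request.add_header("Authorization", f"Bearer {access_token}")
    return request

# Placeholder token for illustration only.
req = build_track_request("1iZLpuGMr4tn1F5bZu32Kb", "YOUR_ACCESS_TOKEN")
print(req.full_url)
print(req.get_header("Authorization"))
```

With the requests library used earlier, the equivalent is to pass `headers={'Authorization': f'Bearer {access_token}'}` to `requests.get`.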
Now, install Spotipy, a lightweight open-source Python library for the Spotify Web API (it is community-maintained, not an official Spotify product). You can install it in your Python environment by executing the command below in your terminal or command prompt:
- pip install spotipy
Now, I’ll write a Python function to extract detailed information about each track in any Spotify playlist:
import pandas as pd
import spotipy

def get_trending_playlist_data(playlist_id, access_token):
    # set up spotipy with the access token
    sp = spotipy.Spotify(auth=access_token)

    # get the tracks from the playlist (a single call returns at most 100 items)
    playlist_tracks = sp.playlist_tracks(playlist_id, fields='items(track(id, name, artists, album(id, name)))')

    # extract relevant information and store in a list of dictionaries
    music_data = []
    for item in playlist_tracks['items']:
        track = item['track']
        # skip entries that carry no track data or no track ID
        if not track or not track.get('id'):
            continue

        track_name = track['name']
        artists = ', '.join(artist['name'] for artist in track['artists'])
        album_name = track['album']['name']
        album_id = track['album']['id']
        track_id = track['id']

        # get audio features for the track (empty dict if the request fails)
        try:
            audio_features = sp.audio_features(track_id)[0] or {}
        except Exception:
            audio_features = {}

        # get release date of the album
        try:
            release_date = sp.album(album_id)['release_date'] if album_id else None
        except Exception:
            release_date = None

        # get popularity, explicit flag and external URL of the track
        try:
            track_details = sp.track(track_id) or {}
        except Exception:
            track_details = {}

        # add additional track information to the track data
        track_data = {
            'Track Name': track_name,
            'Artists': artists,
            'Album Name': album_name,
            'Album ID': album_id,
            'Track ID': track_id,
            'Popularity': track_details.get('popularity'),
            'Release Date': release_date,
            'Duration (ms)': audio_features.get('duration_ms'),
            'Explicit': track_details.get('explicit'),
            'External URLs': track_details.get('external_urls', {}).get('spotify'),
            'Danceability': audio_features.get('danceability'),
            'Energy': audio_features.get('energy'),
            'Key': audio_features.get('key'),
            'Loudness': audio_features.get('loudness'),
            'Mode': audio_features.get('mode'),
            'Speechiness': audio_features.get('speechiness'),
            'Acousticness': audio_features.get('acousticness'),
            'Instrumentalness': audio_features.get('instrumentalness'),
            'Liveness': audio_features.get('liveness'),
            'Valence': audio_features.get('valence'),
            'Tempo': audio_features.get('tempo'),
            # add more attributes as needed (go through the documentation to know what more you can add)
        }
        music_data.append(track_data)

    # create a pandas dataframe from the list of dictionaries
    df = pd.DataFrame(music_data)
    return df
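The function above reads only the first page of the playlist, which holds at most 100 tracks. For longer playlists, the pagination step from the process list applies. Below is a minimal sketch, assuming `sp` is a `spotipy.Spotify` client; the helper name `fetch_all_playlist_items` is made up for this example. It relies on Spotipy's `playlist_items` and `next` methods and on the `next` field of each page.

```python
def fetch_all_playlist_items(sp, playlist_id):
    """Collect every item of a playlist by following the paging links.

    sp is expected to be a spotipy.Spotify client.
    """
    page = sp.playlist_items(playlist_id, limit=100)
    items = list(page["items"])
    while page.get("next"):      # "next" holds the URL of the following page, or None
        page = sp.next(page)     # spotipy fetches that URL for us
        items.extend(page["items"])
    return items
```

If you pass a `fields` filter as in the function above, the filter would also need to include `next`, otherwise the loop cannot see that further pages exist.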
Now, let’s call our function get_trending_playlist_data with a specific Spotify playlist ID and the access token obtained earlier:
# you can add the playlist id of any playlist on Spotify here
playlist_id = '1gfWsOG1WAoxNeUMMktZbq'
# call the function to get the music data from the playlist and store it in a DataFrame
music_df = get_trending_playlist_data(playlist_id, access_token)
print(music_df)
Track Name \
0 Bijlee Bijlee
1 Expert Jatt
2 Kaun Nachdi (From "Sonu Ke Titu Ki Sweety")
3 Na Na Na Na
4 Patiala Peg
.. ...
95
96 Move Your Lakk
97 Patola (From "Blackmail")
98 Ban Ja Rani (From "Tumhari Sulu")
99 Hauli Hauli (From "De De Pyaar De")
Artists \
0 Harrdy Sandhu
1 Nawab
2 Guru Randhawa, Neeti Mohan
3 J Star
4 Diljit Dosanjh
.. ...
95
96 Diljit Dosanjh, Badshah, Sonakshi Sinha
97 Guru Randhawa, Preet Hundal
98 Guru Randhawa
99 Garry Sandhu, Neha Kakkar, Mellow D
Album Name Album ID \
0 Bijlee Bijlee 3tG0IGB24sRhGFLs5F1Km8
1 Expert Jatt 2gibg5SCTep0wsIMefGzkd
2 High Rated Gabru - Guru Randhawa 6EDbwGsQNQRLf73c7QwZ2f
3 Na Na Na Na 4xBqgoiRSOMU1VlKuntVQW
4 Do Gabru - Diljit Dosanjh & Akhil 1uxDllRe9CPhdr8rhz2QCZ
.. ... ...
95 2jw92hf4mnISbYywvU3Anj
96 Move Your Lakk 0V06TMGQQQkvKxNmFlKyEj
97 Patola (From "Blackmail") 2XAAIDEpPb57NsKgAHLGVQ
98 High Rated Gabru - Guru Randhawa 6EDbwGsQNQRLf73c7QwZ2f
99 Dance Syndrome 6e1XB070vlPaxGDAsi8AF6
Track ID Popularity Release Date Duration (ms) Explicit \
0 1iZLpuGMr4tn1F5bZu32Kb 67 2021-10-30 168450 False
1 7rr6n1NFIcQXCsi43P0YNl 63 2018-01-18 199535 False
2 3s7m0jmCXGcM8tmlvjCvAa 61 2019-03-02 183373 False
3 5GjxbFTZAMhrVfVrNrrwrG 54 2015 209730 False
4 6TikcWOLRsPq66GBx2jk67 46 2018-07-10 188314 False
.. ... ... ... ... ...
95 3OZr3vo7SmYpn5XqeQEAOM 0 0000 203207 False
96 3aYMKdSitJeHUCZO8Wt6fw 48 2017-03-29 194568 False
97 17LZzRCY0iFWlDDuAG7BlM 55 2018-03-05 184410 False
98 7cQtGVoPCK9DlspeYjdHOA 57 2019-03-02 225938 False
99 4XyKoSEWrkQjI4AekJYWNx 34 2019-09-03 209393 False
External URLs ... Energy Key \
0 https://open.spotify.com/track/1iZLpuGMr4tn1F5... ... 0.670 1
1 https://open.spotify.com/track/7rr6n1NFIcQXCsi... ... 0.948 6
2 https://open.spotify.com/track/3s7m0jmCXGcM8tm... ... 0.830 4
3 https://open.spotify.com/track/5GjxbFTZAMhrVfV... ... 0.863 3
4 https://open.spotify.com/track/6TikcWOLRsPq66G... ... 0.811 5
.. ... ... ... ...
95 https://open.spotify.com/track/3OZr3vo7SmYpn5X... ... 0.842 6
96 https://open.spotify.com/track/3aYMKdSitJeHUCZ... ... 0.816 2
97 https://open.spotify.com/track/17LZzRCY0iFWlDD... ... 0.901 3
98 https://open.spotify.com/track/7cQtGVoPCK9Dlsp... ... 0.692 9
99 https://open.spotify.com/track/4XyKoSEWrkQjI4A... ... 0.982 1
Loudness Mode Speechiness Acousticness Instrumentalness Liveness \
0 -5.313 0 0.1430 0.26900 0.000000 0.0733
1 -2.816 0 0.1990 0.29800 0.000000 0.0784
2 -3.981 0 0.0455 0.03570 0.000000 0.0419
3 -3.760 1 0.0413 0.37600 0.000014 0.0916
4 -3.253 0 0.1840 0.02590 0.000000 0.3110
.. ... ... ... ... ... ...
95 -4.109 1 0.0745 0.00814 0.000013 0.2120
96 -5.595 1 0.1480 0.03790 0.000153 0.1230
97 -2.719 0 0.0508 0.12600 0.000000 0.0692
98 -4.718 0 0.0577 0.20600 0.000000 0.1240
99 -3.376 1 0.0788 0.02120 0.000032 0.3370
Valence Tempo
0 0.643 100.004
1 0.647 172.038
2 0.753 127.999
3 0.807 95.000
4 0.835 175.910
.. ... ...
95 0.915 156.051
96 0.744 99.992
97 0.914 87.998
98 0.487 102.015
99 0.571 94.990
[100 rows x 21 columns]
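To get the playlist ID of any other playlist on Spotify, copy the link of the playlist. The playlist ID is the string that comes right after /playlist/ in the URL, up to the question mark if the link has one. For example, the ID used above appears in a link of the form https://open.spotify.com/playlist/1gfWsOG1WAoxNeUMMktZbq.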
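If you prefer to extract the ID in code, a small helper like the one below can do it. This is a sketch: the function name `playlist_id_from_url` and the `si=example` query string are illustrative assumptions, not part of the Spotify API.

```python
from urllib.parse import urlparse

def playlist_id_from_url(url):
    """Return the ID that follows /playlist/ in a Spotify playlist link."""
    # urlparse separates the path from any query string such as ?si=...
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "playlist" not in parts:
        raise ValueError(f"Not a playlist URL: {url}")
    idx = parts.index("playlist")
    if idx + 1 >= len(parts):
        raise ValueError(f"No playlist ID found in: {url}")
    return parts[idx + 1]

link = "https://open.spotify.com/playlist/1gfWsOG1WAoxNeUMMktZbq?si=example"
print(playlist_id_from_url(link))  # 1gfWsOG1WAoxNeUMMktZbq
```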
Now, here’s how you can add this data to a CSV file:
music_df.to_csv("musicdata.csv")
Interacting with other APIs follows the same process. Reading the documentation thoroughly is half of the work, and writing the script for data collection is the other half. Get the full code here.
Summary
So, this is how you can collect data from an API using Python. Using an API for data collection is a powerful way to obtain real-time or historical data for data analysis, machine learning models, or any data-driven application.
I hope you liked this article on Data Collection with an API using Python. Feel free to ask valuable questions in the comments section below. You can follow me for more content like this.