Stock Sentiment Analysis using News Headlines Machine Learning Mini Project

In this Stock Sentiment Analysis mini project, we will build a simple machine learning model that uses news headlines to analyze stock sentiment. The goal is to predict whether the stock will go up or not.

Overview

In the world of finance, making informed decisions about stock investments is crucial. Investors often rely on various factors, including company performance, market trends, and news headlines, to predict stock movements. One powerful tool in this domain is sentiment analysis, which involves analyzing text data to determine the sentiment or emotion expressed within it. In this project we skip exploratory data analysis (EDA) and apply only light data preprocessing plus one NLP step, a bag-of-words representation built with CountVectorizer. The data is already fairly clean. You can download it from here.

Understanding Stock Sentiment Analysis

Stock sentiment analysis aims to gauge the sentiment or opinion of investors and traders towards particular stocks or the overall market. It involves analyzing news articles, social media posts, financial reports, and other textual data to extract insights about market sentiment. Tracking how sentiment shifts over time can help investors make more informed investment decisions.

Machine Learning in Stock Sentiment Analysis

Machine learning techniques play a crucial role in automating the process of sentiment analysis. By training machine learning models on historical data, these models can learn to classify news headlines or other textual data into positive, negative, or neutral sentiments. These models can then be used to analyze new headlines and provide sentiment predictions in real-time.

Building a Stock Sentiment Analysis Model

Get Data: We pass the encoding parameter to the read_csv function to avoid a UnicodeDecodeError.

import pandas as pd

df = pd.read_csv('Data.csv', encoding = 'ISO-8859-1')
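
To see why the encoding argument matters, here is a minimal sketch using a hypothetical two-line file held in memory (the column names and headline are made up for illustration, not taken from Data.csv). The pound sign is stored as the single byte 0xA3 in ISO-8859-1, which is not valid UTF-8 on its own.

```python
import io
import pandas as pd

# Hypothetical file contents, encoded the way the article assumes Data.csv is.
raw = "Date,Label,Top1\n2000-01-03,0,Shares fall \u00a3100m\n".encode("ISO-8859-1")

try:
    pd.read_csv(io.BytesIO(raw))  # read_csv decodes as UTF-8 by default
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Telling pandas the real encoding lets the same bytes load cleanly.
sample = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(utf8_failed)
print(sample.loc[0, "Top1"])
```

If a file's encoding is unknown, ISO-8859-1 will decode any byte sequence without raising, but it may show the wrong characters when the file was actually written in a different encoding.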

Preview the data. The call below shows the first five rows of the DataFrame. The Label column is the dependent variable (the target): 0 means the stock does not go up the next day, and 1 means it does.

df.head(5)

We need to train the model first and then test it on held-out data. Let's split the DataFrame into train and test sets based on the Date column. All rows dated before 2015-01-01 go into the training set, and the rest go into the testing set. In time series problems, older data usually goes into the training set and newer data into the testing set. The Date column is parsed first so that the comparison is chronological rather than a character-by-character string comparison.

# Parse the dates, then split chronologically
df['Date'] = pd.to_datetime(df['Date'])
train = df[df['Date'] < '2015-01-01']

test = df[df['Date'] >= '2015-01-01']
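
Comparing dates as plain strings is only safe when both sides use the same format. The sketch below uses a hypothetical toy frame and assumes hyphenated date strings such as 2015-01-02 (the real format in Data.csv may differ). It shows how a string cutoff written without hyphens can put a 2015 row into the training set, and how parsing the dates avoids that.

```python
import pandas as pd

# Hypothetical toy frame; the hyphenated date format is an assumption.
toy = pd.DataFrame({
    "Date": ["2014-12-31", "2015-01-02", "2016-07-01"],
    "Label": [0, 1, 1],
})

# String comparison: '-' sorts before '0', so '2015-01-02' < '20150101' is True.
leaky_train = toy[toy["Date"] < "20150101"]

# Parsing to datetime makes the comparison chronological.
toy["Date"] = pd.to_datetime(toy["Date"])
cutoff = pd.Timestamp("2015-01-01")
train_toy = toy[toy["Date"] < cutoff]
test_toy = toy[toy["Date"] >= cutoff]

print(len(leaky_train), len(train_toy), len(test_toy))  # 2 1 2
```

With the string comparison, the 2015 row lands in the training set even though it belongs to the test period, which would leak test data into training.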

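Remove punctuation from the text columns. The regular expression below replaces every character that is not a letter (periods, commas, semicolons, question marks, dashes, digits, and any other symbol) with a space. We select the 25 headline columns, top 1 to top 25, which sit at column positions 2 to 26.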

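# Removing punctuation: keep letters only, replace everything else with a space.
# replace() returns a new DataFrame, so the slice of train is not modified in place.
data = train.iloc[:, 2:27].replace("[^a-zA-Z]", " ", regex=True)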
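
A quick way to check what this pattern does is to run it on a couple of hypothetical headlines (made up here, not taken from the dataset):

```python
import pandas as pd

# Hypothetical headlines for illustration only.
toy = pd.DataFrame({"Top1": ["U.S. stocks rally 3%!", "Oil: prices don't move"]})

# Every character outside a-z / A-Z becomes a single space.
cleaned = toy.replace("[^a-zA-Z]", " ", regex=True)

for text in cleaned["Top1"]:
    print(text.split())
```

Note the side effects: digits disappear along with punctuation, and an apostrophe splits "don't" into "don" and "t".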

Let's rename the independent columns so that they are easier to access. The 25 columns are named '0' to '24'.

# Renaming column names for ease of access
new_Index = [str(i) for i in range(25)]
data.columns = new_Index
data.head(5)

To simplify modeling, we convert all text in the dataset to lower case.

# Converting headlines to lower case
for index in new_Index:
    data[index]=data[index].str.lower()
data.head(2)

Let's join the 25 headlines in each row into a single string and store the results in a list named headlines. Combining the headlines of a row gives one document per day, which we can then convert into a vector. So headlines holds one combined string per row.

headlines = []
for row in range(0,len(data.index)):
    headlines.append(' '.join(str(x) for x in data.iloc[row,0:25]))
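
The same join can be written without an explicit loop. This is only an alternative sketch on a hypothetical two-column frame; it produces the same kind of list as the loop above.

```python
import pandas as pd

# Hypothetical cleaned headline columns, named like the renamed columns above.
toy = pd.DataFrame({
    "0": ["stocks rise", "oil falls"],
    "1": ["fed holds rates", "banks report losses"],
})

# Cast every cell to str, then join each row's cells with a space.
joined = toy.astype(str).apply(" ".join, axis=1).tolist()
print(joined)
```

Either way, a missing headline is turned into text by the string conversion, so it is worth checking for missing values before joining.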

We will use CountVectorizer, a class in the sklearn library, to extract features from the text. It converts each document into a vector of counts, one entry per vocabulary term, recording how often that term occurs in the document. This is the bag-of-words model: apart from the order kept inside each n-gram, word order is ignored, and every term is weighted only by its count. With ngram_range=(2,2) the terms are bigrams, that is, pairs of consecutive words.

from sklearn.feature_extraction.text import CountVectorizer

# implement BAG OF WORDS
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(headlines)

Now let's train a Random Forest classifier on these features.

from sklearn.ensemble import RandomForestClassifier

# implement RandomForest Classifier
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(traindataset,train['Label'])
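
The vectorizer and the classifier can also be chained into a single object. The following is a sketch on hypothetical toy headlines, which are far too few to learn anything meaningful. The random_state argument is an addition for repeatable runs and is not part of the original code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration only.
toy_headlines = [
    "profits surge on strong demand",
    "shares plunge after weak earnings",
    "strong demand lifts profits",
    "weak earnings drag shares lower",
]
toy_labels = [1, 0, 1, 0]

model = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),
    RandomForestClassifier(n_estimators=200, criterion="entropy", random_state=0),
)
model.fit(toy_headlines, toy_labels)

# Bigrams that were never seen during fitting are simply ignored at predict time.
preds = model.predict(["strong demand returns", "weak earnings again"])
print(preds)
```

A pipeline applies the fitted vocabulary to new text automatically, which removes the risk of calling fit_transform on the test data by mistake.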

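Once the model is trained with the fit() method, we can use it to predict labels for the test dataset. The test headlines are joined the same way and passed through the already fitted vectorizer with transform(), not fit_transform(), so that they share the training vocabulary.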

## Predict for the Test Dataset
test_transform= []
for row in range(0,len(test.index)):
    test_transform.append(' '.join(str(x) for x in test.iloc[row,2:27]))
test_dataset = countvector.transform(test_transform)
predictions = randomclassifier.predict(test_dataset)
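
The training headlines were cleaned explicitly, while the test headlines above rely on CountVectorizer's default lowercasing and tokenization. One way to keep both sets on exactly the same path is a small helper. The name prepare_headlines and its parameters are hypothetical, a sketch and not part of the original project.

```python
import pandas as pd

def prepare_headlines(frame: pd.DataFrame, first_col: int = 2, last_col: int = 27) -> list:
    """Clean the headline columns and join each row into one lower-case string."""
    text = frame.iloc[:, first_col:last_col].astype(str)
    # Same rule as the training step: anything that is not a letter becomes a space.
    text = text.replace("[^a-zA-Z]", " ", regex=True)
    return text.apply(lambda row: " ".join(row).lower(), axis=1).tolist()

# Hypothetical one-row frame with the assumed layout: Date, Label, then headlines.
toy = pd.DataFrame({
    "Date": ["2015-01-02"],
    "Label": [1],
    "Top1": ["Markets UP 2%!"],
    "Top2": ["Oil slips."],
})
prepared = prepare_headlines(toy)
print(prepared[0].split())
```

Calling the same function on both the train and test frames would guarantee that digits and punctuation are handled identically in both.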

Now we will use some metrics to check the performance of the trained model.

## Import library to check accuracy
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
print("---------------------------")
score=accuracy_score(test['Label'],predictions)
print(score)
print("---------------------------")
report=classification_report(test['Label'],predictions)
print(report)

Looking at the confusion matrix, we can say that the data is not imbalanced. The model reaches an accuracy of almost 85 percent on the test set.
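
For readers new to these metrics, here is a short sketch of how to read a binary confusion matrix, using hypothetical labels and predictions (not the project's results). Rows are true classes and columns are predicted classes, so each row sum gives the number of samples in that class, which is how class balance can be read off the matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions (the diagonal of the matrix).
manual_accuracy = (tn + tp) / cm.sum()

# Row sums are the true class counts; similar values indicate balanced classes.
class_counts = cm.sum(axis=1)

print(cm)
print(manual_accuracy, accuracy_score(y_true, y_pred))
print(class_counts)
```

When the classes are roughly balanced, as in this toy case with three samples each, accuracy is a reasonable summary. With a skewed split it should be compared against the share of the majority class.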