Stock Price Forecasting (Time Series)
Overview
Time series data consist of sequential observations recorded over a period of time. Such data let us forecast what will happen in the next time period by capturing the patterns and trends of previous periods. A common example is sales and revenue: suppose we have a market's sales and revenue figures for the last 3 years. Based on this data, we can do feature engineering and analysis to forecast the sales for the coming year.
There are essentially two ways to do so:
- Statistical analysis, then estimating based on the observed trend.
- A machine learning or deep learning based approach.
Problem Statement
In this blog I will apply the second approach to forecast the stock prices of different companies. The dataset consists of the stock price history of 31 different companies:
['VZ', 'AXP', 'MMM', 'AAPL', 'BA', 'CAT', 'CVX', 'CSCO', 'KO',
'DIS', 'XOM', 'GE', 'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'JPM', 'MCD',
'MRK', 'MSFT', 'NKE', 'PFE', 'PG', 'TRV', 'UTX', 'UNH', 'WMT',
'GOOGL', 'AMZN', 'AABA']
Below is a glimpse of the dataset:
As we have multiple companies, we must create a custom data generator in order to get batches of data corresponding to a particular company, say AAPL (Apple Inc.).
Data Loader
The idea is to apply time series logic to the "Close" column: our objective is to predict the closing price given a history of closing prices. Say the prices over the last 7 days were 35, 36, 39, 34, 42, 43, 44 and we want to predict tomorrow's price. We need to convert the problem into a supervised learning problem by maintaining a sliding window of some size K: the first K-1 prices become our training input and the K-th price becomes the target label. In the above example, X would be 35, 36, 39, 34, 42, 43 (a 6-day sequence) and y would be 44 (the 7th day).
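To make the windowing concrete, here is a tiny standalone sketch using the toy numbers above:

#sliding-window split for the 7-day toy example:
prices = [35, 36, 39, 34, 42, 43, 44]
K = 7                     # window size
X = prices[:K-1]          # [35, 36, 39, 34, 42, 43] -> model input
y = prices[K-1]           # 44 -> target label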
I created a custom data generator in PyTorch that produces train, validation, and test dataset objects for ease of training.
#dataset class for loading a datapoint:
import glob
import os

import pandas as pd
import torch
from torch.utils.data import Dataset

class StockData(Dataset):
    def __init__(self, company, directory, data_set, seq_len):
        self.company = company
        self.data_set = data_set
        self.directory = directory
        self.seq_len = seq_len
        self.files = glob.glob(self.directory + '/*.csv')
        self.data = self.load_data(self.company)
        # chronological 70/15/15 split: train, then test, then val
        total = self.__len__()
        test_idx = int(0.7 * total)
        val_idx = int(0.85 * total)
        if self.data_set == 'train':
            self.data = self.data[:test_idx]
        elif self.data_set == 'test':
            self.data = self.data[test_idx:val_idx]
        elif self.data_set == 'val':
            self.data = self.data[val_idx:]

    def __len__(self):
        # each datapoint needs seq_len inputs plus one target
        return len(self.data) - self.seq_len

    def load_data(self, company_name):
        # concatenate every CSV belonging to this company and keep
        # only the "Close" column as a 1-D tensor
        data_ = []
        for file in self.files:
            comp_file = os.path.basename(file).split('_')[0]
            if comp_file == company_name:
                data_.append(pd.read_csv(file))
        df = pd.concat(data_, ignore_index=True)
        return torch.tensor(df['Close'].values)

    def __getitem__(self, i):
        X = self.data[i:i+self.seq_len].float()   # seq_len closing prices
        y = self.data[i+self.seq_len].float()     # next day's close
        return (X, y)
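Below is a minimal sketch of how these dataset objects might be wired into PyTorch DataLoaders. The data directory name is an assumption for illustration, and the batch size of 512 matches the one used during training later:

#hypothetical usage of the dataset class:
from torch.utils.data import DataLoader

seq_len = 20
train_set = StockData('AAPL', 'data', 'train', seq_len)
test_set = StockData('AAPL', 'data', 'test', seq_len)
val_set = StockData('AAPL', 'data', 'val', seq_len)

# train batches are shuffled windows; test/val keep chronological order
train_loader = DataLoader(train_set, batch_size=512, shuffle=True)
test_loader = DataLoader(test_set, batch_size=512)
val_loader = DataLoader(val_set, batch_size=512)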
Then we need to train a model on the data prepared this way.
Model Training
Firstly, I created and trained a fully connected dense network as shown below:
#Model:
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, num_inputs):
        super(MLP, self).__init__()
        self.num_inputs = num_inputs
        self.l1 = nn.Linear(num_inputs, 1024)
        self.b = nn.BatchNorm1d(1024)
        self.drop1 = nn.Dropout()
        self.l2 = nn.Linear(1024, 1)
        # defined but unused: forward stops at l2
        self.drop2 = nn.Dropout()
        self.l3 = nn.Linear(512, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, X):
        X = self.l1(X)
        X = self.b(X)
        X = self.sigmoid(X)
        X = self.drop1(X)
        X = self.l2(X)
        #X = self.sigmoid(X)
        #X = self.drop2(X)
        #X = self.l3(X)
        return X
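A quick sanity check of the input and output shapes with a dummy batch (a hypothetical snippet; the sequence length of 20 matches the training setup described next):

#shape check with a dummy batch:
import torch

model = MLP(num_inputs=20)
dummy = torch.randn(512, 20)     # [batch, seq_len]
print(model(dummy).shape)        # torch.Size([512, 1])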
I kept a sequence length of 20 for training and used the SGD optimiser with a learning rate of 0.00001 and a mean squared error loss, training the model for 1000 epochs. Below are my training results:
A point to note here: BatchNorm1d significantly helped performance, and adding weight decay to the SGD optimiser also helped a lot.
At the 1000th epoch, the losses were:
Epoch : 1000 || Train Loss : 0.03550402447581291 || Val Loss : 15.789423942565918
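For reference, here is a minimal sketch of the training loop described above, assuming the hypothetical DataLoader setup from earlier. The weight_decay value is an assumption; the post only notes that adding weight decay helped:

#sketch of the training loop (weight_decay value assumed):
import torch
import torch.nn as nn

model = MLP(num_inputs=20)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.00001,
                            weight_decay=1e-5)   # assumed value

for epoch in range(1, 1001):
    model.train()
    for X, y in train_loader:                    # X: [batch, 20], y: [batch]
        optimizer.zero_grad()
        loss = criterion(model(X).squeeze(-1), y)
        loss.backward()
        optimizer.step()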
After this, I trained an LSTM network as shown below:
#the original code assumes a global `device`; defined here for completeness:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class LSTM(nn.Module):
    def __init__(self, num_hidden, num_inputs, num_layers):
        super(LSTM, self).__init__()
        self.num_inputs = num_inputs
        self.num_hidden = num_hidden
        self.num_layers = num_layers
        self.lstm1 = nn.LSTM(input_size=num_inputs, hidden_size=num_hidden,
                             num_layers=num_layers, batch_first=True)
        # defined but never used in forward (see note below)
        self.lstm2 = nn.LSTM(input_size=num_hidden, hidden_size=num_hidden,
                             num_layers=num_layers, batch_first=True)
        self.fc1 = nn.Linear(num_hidden, 512)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(512, 1)

    def forward(self, x):
        # fresh zero hidden and cell states for every batch
        num_samples = x.size(0)
        h0 = torch.zeros(self.num_layers, num_samples, self.num_hidden).to(device)
        c0 = torch.zeros(self.num_layers, num_samples, self.num_hidden).to(device)
        out, (hn, cn) = self.lstm1(x, (h0, c0))
        out = self.fc1(out[:, -1, :])   # keep only the last time step
        out = self.fc2(out)
        return out
Note: self.lstm2 is not used in forward; I settled on a single LSTM block.
I used 512 hidden cells with 5 LSTM layers, and the input size was just one (the closing price). Our network looked like this:
The input shape was torch.Size([512, 20, 1]), where 512 is the batch size, 20 is the sequence length, and 1 is the feature dimension, matching the input size of our LSTM. Sizes are something that often confuse us, so I have listed the size at each step below:
Input size : torch.Size([512, 20, 1])
Hidden state size : torch.Size([5, 512, 512])
Context vector size : torch.Size([5, 512, 512])
LSTM output size : torch.Size([512, 20, 512])
FC1 input size : torch.Size([512, 512])
FC1 output size : torch.Size([512, 512])
Final output size : torch.Size([512, 1])
Notice that the hidden state and context vector have shape [5, 512, 512]: the 5 is the number of layers in our LSTM, the first 512 is the batch size, and the second 512 is the number of hidden cells.
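These sizes can be verified with a dummy batch (a hypothetical check, not from the original training code):

#verifying the shapes with a dummy batch:
model = LSTM(num_hidden=512, num_inputs=1, num_layers=5).to(device)
x = torch.randn(512, 20, 1).to(device)   # [batch, seq_len, features]
out, (hn, cn) = model.lstm1(x)           # h0/c0 default to zeros
print(out.shape)                         # torch.Size([512, 20, 512])
print(hn.shape, cn.shape)                # torch.Size([5, 512, 512]) each
print(model(x).shape)                    # torch.Size([512, 1])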
I trained with the same settings as the MLP: SGD optimiser with a learning rate of 0.00001 and mean squared error loss, for 1000 epochs. Below are my training results:
At the 1000th epoch, the losses were:
Epoch : 1000 || Train Loss : 0.020185701549053192 || Val Loss : 3.842329263687134
Conclusion
The LSTM did significantly better on test data than the simple dense network. There are many areas where one could experiment to improve overall performance, such as trying GRUs or vanilla RNNs, and one can also tweak the sequence length, optimiser, number of epochs, and so on.
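As one example, here is a hypothetical GRU variant of the model above (a sketch only; nothing like this was trained in this post):

#hypothetical GRU variant (not trained in this post):
class GRUNet(nn.Module):
    def __init__(self, num_hidden, num_inputs, num_layers):
        super(GRUNet, self).__init__()
        self.gru = nn.GRU(input_size=num_inputs, hidden_size=num_hidden,
                          num_layers=num_layers, batch_first=True)
        self.fc1 = nn.Linear(num_hidden, 512)
        self.fc2 = nn.Linear(512, 1)

    def forward(self, x):
        # a GRU carries no separate cell state, so no c0 is needed
        out, hn = self.gru(x)                # h0 defaults to zeros
        out = self.fc1(out[:, -1, :])        # last time step only
        return self.fc2(out)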