For this project I used TA-Lib (Technical Analysis Library) to create most of my features. Below is a chart showing all the features I engineered for my analysis. Notice that I computed some of these features over several different time intervals. I did this to capture both short- and long-term indicators for each stock, similar to the techniques professional day traders use in the market.
- Introduction
- Exploratory Data Analysis
- Feature Selection
- Model Performance
- Test vs Train distributions
- Conclusion: What I learned
- Exploratory Data Analysis
Below, you can see the positive and negative relationships between the features and the ROI high targets. For the time series forecast (TSF200) we see a strong positive correlation, which makes sense: if we forecast a stock to increase in value, we should expect it to do so, and vice versa. Next, we have the 60-day moving average (ma60), which takes the moving average and divides it by the Adjusted Close price of the stock. Thus, when you see a value (x-axis) larger than 1, you know the price has crossed below the MA line and is "likely" to regress back above it. This is why, as the ma60 value increases, our ROI also tends to increase. Last, we see the RSI (relative strength index), a fundamental momentum indicator that traders use to judge whether a stock is over-bought or over-sold. Conventionally, values over 70 are over-bought and values under 30 are over-sold, hence the negative relationship: as RSI values increase, the stock looks over-bought and will "likely" sell off, and conversely, as the RSI decreases, the stock looks over-sold and will "likely" regress back up with more buying volume.
Below, notice that the highest ROIs came on days with very depressed markets, and after previous days with heavy losses (current_roi).
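As a rough illustration of the ma60 feature and the RSI thresholds described above, the sketch below computes a moving average divided by price in plain Python. The function names `ma_ratio` and `rsi_zone`, the sample prices, and the shortened window are assumptions made for this example; the project itself relied on TA-Lib for its indicators.

```python
from statistics import fmean


def ma_ratio(prices, window):
    """Return SMA(window) / price for each day that has a full window.

    Values above 1 mean the price sits below its moving average.
    """
    ratios = []
    for i in range(window - 1, len(prices)):
        sma = fmean(prices[i - window + 1 : i + 1])
        ratios.append(sma / prices[i])
    return ratios


def rsi_zone(rsi, overbought=70, oversold=30):
    """Label an RSI reading using the conventional 70/30 thresholds."""
    if rsi > overbought:
        return "over-bought"
    if rsi < oversold:
        return "over-sold"
    return "neutral"


# Hypothetical adjusted-close prices; the window is shortened from 60 to 3
# so the example stays small.
prices = [10.0, 11.0, 12.0, 9.0]
ratios = ma_ratio(prices, window=3)
# The last ratio is above 1 because the price (9.0) fell below its 3-day average.
```

Calling `ma_ratio` with several window lengths is one way to get the short- and long-term versions of the same feature.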
- Feature Selection
Decreasing our feature space is important because, as you add more and more features, it becomes increasingly difficult to find local minima and maxima in that higher-dimensional space. Thus, it is best to use some kind of process to reduce the number of features. You can use various techniques such as PCA (principal component analysis), Pearson correlation, random forests and several others. I chose random forests. The only issue with using random forest feature selection is that it has a bias toward high-frequency features. This means I could have some indicators that are sparse but predict well when they do occur; since they don't occur often, random forests weigh them less than other features. As you can see below, I went from 43 features down to 29 features. I chose these features because, as I started to take more features off, my accuracy tended to decrease.
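The pruning idea above (drop features until accuracy starts to fall) can be sketched as a small loop. This is a minimal illustration, not the code used in the project: `prune_features`, the toy importances, and the toy scoring function are all hypothetical. In practice the importances might come from a fitted random forest (for example scikit-learn's `feature_importances_` attribute) and `score_fn` would retrain and evaluate a model on the given features.

```python
def prune_features(importances, score_fn, tolerance=0.0):
    """Drop the least important features one at a time while accuracy holds.

    importances: dict mapping feature name -> importance score.
    score_fn: callable taking a list of feature names, returning accuracy.
    """
    # Most important first, so the last element is the next candidate to drop.
    kept = sorted(importances, key=importances.get, reverse=True)
    best = score_fn(kept)
    while len(kept) > 1:
        candidate = kept[:-1]
        score = score_fn(candidate)
        if score < best - tolerance:
            break  # removing this feature hurt accuracy, so stop here
        kept, best = candidate, max(best, score)
    return kept, best


# Toy data: three informative features and one uninformative one ("noise").
importances = {"tsf200": 0.4, "ma60": 0.3, "rsi": 0.2, "noise": 0.1}
informative = {"tsf200", "ma60", "rsi"}


def toy_score(features):
    # Stand-in for "train a model and measure accuracy" (made-up numbers).
    return 0.3 + 0.1 * len(informative & set(features))


kept, best = prune_features(importances, toy_score)
# kept is ["tsf200", "ma60", "rsi"]: dropping "noise" cost nothing,
# but dropping "rsi" would have lowered the score.
```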
- Model Performance
From the above chart we can see how well each of our models did. Accuracy here shows how well a model predicted the exact class, or quartile range. Alternatively, Cohen's Kappa ranges from -1 to 1, with values above 0 indicating better than random chance and values below 0 indicating worse than random guessing. In all cases, we did better than random guessing. However, when I built my ensemble model, I chose to take in only the Random Forest and AdaBoost models since they were the best predictors. Also, as I will discuss later, I will only be concerned with how well each one predicts the likelihood of being above Q1. I do this because the probabilities are higher when the problem is framed this way.
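To make the kappa statistic concrete, here is a small plain-Python sketch of Cohen's Kappa, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name and the sample labels are made up for illustration; a library routine such as scikit-learn's `cohen_kappa_score` would normally be used instead.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    # Observed agreement: fraction of exact matches (this is plain accuracy).
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: probability of matching if both sides labeled at
    # random according to their own class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical quartile labels for four days.
y_true = ["Q1", "Q1", "Q2", "Q2"]
y_pred = ["Q1", "Q2", "Q2", "Q2"]
kappa = cohens_kappa(y_true, y_pred)  # 0.5: better than chance, short of perfect
```

Note that the formula is undefined when chance agreement equals 1, for example when both label lists contain a single class.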
Below are the confusion matrices for my models. Notice a few things about the predictions. First, it is important to understand how I calculated my accuracy. Using each column as my predictor, I counted all classifications whose actual value was above Q1 as accurate and those whose actual value was below Q1 as inaccurate. I did this in part because we can use these percentages as likelihoods of being correct about the minimum high for that given day. Additionally, we can use that information to make informed decisions about holding out for higher gains on certain days to reap larger returns. Second, the ensemble model that I used didn't do any better than my component models. The ensemble was built using the probabilities of each target from the first two models along with the top 4 predictive features. This resulted in the ensemble being averaged down in overall accuracy. Lastly, in future work I would ensemble using voting, taking in only the best predictor columns from each model. Also, with these predictors from various models, I could buy the highest-probability stocks on given days to maximize returns.
Below we are looking at the distributions of my model predictions vs. open-to-open price ROI. Notice that the models' predictions have a much higher expected value (mean) than buy-and-hold over time. This can be misleading; however, it still shows that we are more likely to capitalize using these models than by randomly guessing when to buy and sell each day. The percentages show the average percentage gain from each distribution. This information could be useful for a day trader to understand the likelihood of returns for that day and to find decent prices at which to put stop limits on their stock. Stop limits are prices at which a trader chooses to buy or sell an equity once the price of that equity reaches that limit.
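One possible reading of the column-wise accuracy described above is sketched below. It assumes the confusion matrix has actual classes as rows and predicted classes as columns, with index 0 standing for Q1; that layout, the function name, and the counts are assumptions for illustration only, not the project's actual numbers.

```python
def above_q1_rate(confusion, column):
    """Share of predictions in `column` whose actual class was above Q1.

    confusion[i][j] counts samples with actual class i and predicted class j,
    with index 0 standing for Q1 (an assumed layout).
    """
    col = [row[column] for row in confusion]
    total = sum(col)
    # Everything except the first row (actual Q1) counts as accurate.
    return sum(col[1:]) / total if total else float("nan")


# Made-up 4x4 confusion matrix for quartile classes Q1..Q4.
confusion = [
    [5, 2, 1, 0],  # actual Q1
    [3, 6, 2, 1],  # actual Q2
    [1, 2, 7, 2],  # actual Q3
    [1, 0, 2, 9],  # actual Q4
]
rates = [above_q1_rate(confusion, j) for j in range(4)]
# rates == [0.5, 0.8, 0.9166..., 1.0]
```

Each rate can be read as the likelihood that a day predicted in that column actually finished above Q1.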
- Test vs Train distributions
After looking at various stocks, I wanted to know how the relationship between the test and train distributions affects accuracy. To study this, I used p-values from t-tests. To visualize this, I broke up my test and train sets into various ranges to test these conditions. As you can see, accuracy increases when p-values are larger, that is, when the train and test sets are more similar in variance and mean. The only contrary evidence is when the mean is higher for the test set. This makes sense though.
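As a hedged sketch of the comparison above, the snippet below computes a two-sided p-value for the difference in means between a train and a test sample using Welch's t statistic. To stay within the standard library it uses a normal approximation to the t distribution, which is only reasonable for large samples; a routine such as SciPy's `scipy.stats.ttest_ind` would give the exact value. The function name and the sample data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def welch_p_value(train, test):
    """Two-sided p-value for a difference in means (Welch's t statistic).

    Uses a normal approximation to the t distribution, which is reasonable
    only for large samples.
    """
    se = sqrt(variance(train) / len(train) + variance(test) / len(test))
    t = (fmean(train) - fmean(test)) / se
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical ROI-like samples.
train = [0.01 * i for i in range(100)]
same_test = list(train)                 # identical distribution
shifted_test = [x + 1 for x in train]   # same spread, much higher mean

p_same = welch_p_value(train, same_test)        # 1.0: means are identical
p_shifted = welch_p_value(train, shifted_test)  # close to 0: means differ
```

A large p-value here means the train and test means are statistically hard to tell apart, which is the situation where accuracy was higher.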
- Conclusion: What I learned
- Random Forest and AdaBoost were the best predictors
- Similar test and train distributions produce higher accuracy
- Machine learning can do better than random guessing
- The highest-ROI days came during sharp downturns for the stock