A stock price prediction competition recently ended on Kaggle. Although I didn’t join it, the result is rather interesting, and EMH (Efficient Market Hypothesis) believers should be happy. The 427 participating teams, which probably included some of the best data scientists in the world, could not do much better than the Random Walk benchmark. The winning solution is only negligibly better than using the last observed value as the prediction (a 0.0036 improvement in mean absolute error).
The data set contains data collected by Deltix over a two-year period, comprising 200 days of training data and 310 days of testing data. For each trading day, the prices of 198 securities and the values of 244 undisclosed “indicators” are recorded at 5-minute intervals from 9:30AM to 1:55PM. The task is to predict the closing price of each security roughly 2 hours later, at 4PM.
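To make the benchmark concrete, here is a minimal sketch of a last-observed-value forecast scored by mean absolute error. The function names and the toy prices are my own, made up for illustration; they are not the competition's data or its official scoring code.

```python
import numpy as np

def last_value_forecast(intraday_prices):
    """Random-walk benchmark: predict each close as the last observed price.

    intraday_prices: array of shape (..., n_snapshots), ordered in time,
    so the final column is the latest snapshot (1:55PM in this competition).
    """
    prices = np.asarray(intraday_prices, dtype=float)
    return prices[..., -1]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up prices for two securities on one day (three snapshots each).
intraday = np.array([[10.00, 10.02, 10.01],
                     [25.10, 25.05, 25.08]])
closes = np.array([10.03, 25.04])  # hypothetical 4PM closing prices

benchmark = last_value_forecast(intraday)
print("benchmark MAE:", mean_absolute_error(closes, benchmark))
```

Any model worth submitting has to beat this number on held-out days, and in this competition almost nothing did by a meaningful margin.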
It is not quite right to say the competition result confirmed EMH, but given the data provided, it is very hard to make predictions. The features in the data are not disclosed: you don’t know what indicator I230 actually means, so you can’t use your prior knowledge to build the model. The indicators are also not security-specific, so things like “special news related to stock 1234 announced” are not included in the data set.
In addition, the snapshots of the indicator values also end at 1:55PM, so you cannot use the indicator values at 4PM to aid your prediction. One can imagine that the change in the market index between 1:55PM and 4:00PM would provide some value in making the prediction.
Nevertheless, the biggest lesson learnt from this competition has nothing to do with finance. “Believe your cross-validation score,” as someone said. Many teams relied too heavily on the “Public Leaderboard” score (calculated using 30% of the test data) when optimizing or choosing their models. These over-fitted models performed very badly when the final predictions were scored at the end of the competition.
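A rough simulation shows why choosing a model by its public score is risky. This is only a sketch under my own assumptions: the 200 "models" are pure noise with identical true skill, the helper names are hypothetical, and only the 30% public fraction comes from the competition.

```python
import numpy as np

def public_private_split(n_rows, public_fraction, rng):
    """Randomly mark a fraction of test rows as 'public'; the rest are 'private'."""
    mask = np.zeros(n_rows, dtype=bool)
    n_public = int(round(public_fraction * n_rows))
    mask[rng.choice(n_rows, size=n_public, replace=False)] = True
    return mask

def pick_by_public_score(abs_errors, public_mask):
    """Choose the model with the lowest public MAE and report both of its scores.

    abs_errors: array of shape (n_models, n_rows) holding absolute errors.
    """
    public_mae = abs_errors[:, public_mask].mean(axis=1)
    private_mae = abs_errors[:, ~public_mask].mean(axis=1)
    best = int(np.argmin(public_mae))
    return best, float(public_mae[best]), float(private_mae[best])

# Synthetic illustration: 200 hypothetical models, all equally (un)skilled.
rng = np.random.default_rng(42)
abs_errors = np.abs(rng.normal(size=(200, 1000)))
mask = public_private_split(1000, 0.30, rng)

best, pub, priv = pick_by_public_score(abs_errors, mask)
print(f"chosen model {best}: public MAE {pub:.3f}, private MAE {priv:.3f}")
print(f"average private MAE over all models: {abs_errors[:, ~mask].mean():.3f}")
```

Because the winner is picked for having the lowest public error among many equally skilled candidates, its public score is biased toward looking good, while its private score has no reason to differ from the average. A cross-validation score computed on the training days is not subject to this selection on the leaderboard subset.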
