“FORECASTING IS HARD, ESPECIALLY THE FUTURE” (image generated with DALL-E)

The challenge of Financial Forecasting, M6 edition.

Miguel Perez Michaus

--

An insight into the results

M-competitions are legendary. There have been six editions since 1982, each with a different design and focus, and together they are easily the most ambitious research experiment on real-life forecasting approaches. The latest one, M6, is about to end after almost a year of monthly updates, and both my forecasting-track and global rankings are looking good.

What was the competition about? In a nutshell, both ranking and trading a universe of 100 financial products, evaluated in real time, monthly, throughout a year, across three tracks:

  • A forecasting track: for each product, forecast the probability distribution of its monthly performance rank (quintile) relative to the others.
  • A decisions track: investment positions on the same 100 products, rebalanced monthly and scored with a metric that accounts for the volatility of returns.
  • A “duathlon” track: the equally weighted average of the previous two, with prize money double that of each individual track.

Substantial prize money is awarded in this research competition, and crucially it is won not only at the end of the competition but also quarterly for each of the three tracks. This multi-horizon evaluation heavily conditions any prize-maximization strategy and keeps incentives high for the participants.

There are some genius aspects to this design. The dilemma of optimizing for quarterly prizes versus the global ranking mimics the real-life situation of different relevant timeframes for different agents, and the noisy interaction of the two scores makes adversarial tactics quite difficult.

Overall, the competition design builds a simplified but valid instance of the competitive environment of financial markets.

By rewarding the duathlon track much more heavily, the organizers made clear that their research objective is not only forecasting performance but also forecasting applicability and impact. The question is not only “Can you forecast markets?” but, mainly, “Does that forecasting edge make a difference to the bottom line?”

My personal bias as a problem solver makes me favor robust, non-parametric solutions, especially if the task at hand is noisy. And noisy is an understatement for this task. So my path to answering the question has meant asking another one: “How can you apply Machine Learning to Portfolio Optimization?”

It is hard to convey just how hard the task at hand is. A hint might be the fact that Nobel Prizes have been awarded for work related to the topic. Even Machine Learning practitioners, if not used to dealing with financial time series, can easily underestimate the difficulty. Let’s look at the results of the competition after almost a year of monthly rebalanced submissions.

“DART-THROWING MONKEY” (image generated with Stable Diffusion 2)

Regarding the forecasting task, let’s compare the performance with the rank-probability version of the famous “dart-throwing monkey”: that is, the naive RPS score of 0.16 obtained by predicting “I don’t know” and assigning equal probabilities to every possible ranking quintile of each product one month ahead.
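Where does that 0.16 come from? Here is a minimal sketch, assuming the M6-style setup of five rank quintiles per asset and a ranked probability score averaged over the five categories (the exact normalization is my reading of the guidelines): a uniform forecast scores 0.16 in expectation, so anything below that value carries at least some information.

```python
import numpy as np

def rps(forecast_probs, outcome_quintile, n_categories=5):
    """Ranked probability score for one asset: mean squared difference between
    the cumulative forecast and the cumulative (one-hot) observed quintile."""
    cum_forecast = np.cumsum(forecast_probs)
    observed = np.zeros(n_categories)
    observed[outcome_quintile] = 1.0
    cum_observed = np.cumsum(observed)
    return np.mean((cum_forecast - cum_observed) ** 2)

# "Dart-throwing monkey": equal 0.2 probability for each of the 5 quintiles.
uniform = np.full(5, 0.2)

# By construction 1/5 of the assets end up in each quintile, so the expected
# score is the average RPS over the five possible outcomes.
expected_rps = np.mean([rps(uniform, q) for q in range(5)])
print(expected_rps)  # ~0.16
```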

How many teams are beating the naive “monkey” forecasting benchmark after one year? Only 20% of forecasters.

And it gets worse. How many beat the naive forecast on a monthly basis, consistently forecasting better than random in each and every one of the 12 months of the competition? With the 12th month’s score still open, only 3 teams, that is, less than 2% of the teams. Those 3 teams consistently better than random at a monthly level are “Dan”, “invest to get rich” and “Miguel Pérez Michaus” (me). I optimized my model for low variance in score, keeping that “monkey” 0.16 threshold about 2 standard deviations above the mean expected score, which still implies it will eventually lose to the monkey benchmark one month every two years, on average.
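As a rough sanity check on that trade-off, and only under the simplifying assumption that monthly scores are approximately normal, the expected frequency of losing months is just the upper-tail probability beyond the benchmark; the numbers below are illustrative placeholders, not my model’s actual figures.

```python
from scipy.stats import norm

# Illustrative numbers only: monthly RPS mean and standard deviation of a
# low-variance model, compared against the 0.16 naive benchmark.
benchmark = 0.16
mean_score, score_sd = 0.1580, 0.0011

z = (benchmark - mean_score) / score_sd   # distance to the benchmark, in SDs
p_losing_month = 1 - norm.cdf(z)          # P(monthly RPS ends up above 0.16)
years_between_losses = 1 / p_losing_month / 12

print(f"z = {z:.1f} SD, P(losing month) = {p_losing_month:.1%}, "
      f"about one losing month every {years_between_losses:.1f} years")
```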

Let that sink in. For 98% of the competitors, in 2023, no AI tool or statistical model has been able to provide a consistent edge over a random forecast on a monthly basis. In the most famous international forecasting competition. With the motivation of big money prizes. That says something about the difficulty of the task.

How much has the naive benchmark been improved upon? At the time of writing, two weeks before the end of the competition, the best score is 0.1566, but it keeps moving. It seems the top positions on the forecasting leaderboard will be contested until the very last moment.

The question of market predictability is a disputed one. In the “Efficient Market Hypothesis” line of reasoning, usable forecasts, if public, lead to actions that front-run the very events forecasted. My personal view on the topic is that this doesn’t negate the possibility of forecasting, to a limited extent, by a small subset of players.

What about the decisions track?

It is harder to reach conclusions regarding the decisions track, and it is still moving. The organizers provided a dummy submission, a simple equal-weight (1/n) selection of all products, that only about 38% of teams are currently beating. This can be seen as a “long only” benchmark, and since most portfolio components are quite correlated with the overall market, the decisions IR score for this dummy is negative.
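For reference, the decisions track is scored with an information-ratio-style metric: roughly, the sum of daily portfolio log returns divided by their standard deviation over the evaluation period. The sketch below applies my reading of that formula to the equal-weight long dummy; the prices are a random-walk placeholder, so only the mechanics, not the printed number, are meaningful.

```python
import numpy as np
import pandas as pd

def m6_style_ir(prices: pd.DataFrame, weights: np.ndarray) -> float:
    """IR-style score: sum of daily portfolio log returns divided by their
    standard deviation (my reading of the M6 decisions metric)."""
    asset_rets = prices.pct_change().dropna()           # daily simple returns
    port_rets = np.log1p(asset_rets.values @ weights)   # daily portfolio log returns
    return port_rets.sum() / port_rets.std(ddof=1)

# Placeholder data: random-walk prices for 100 products over ~one trading year.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=(252, 100)), axis=0))
)

# The organizers' dummy submission: equal weight, long only (1/n per product).
dummy_weights = np.full(100, 1 / 100)
print(m6_style_ir(prices, dummy_weights))
```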

How many teams are currently showing positive returns? Only 23% of teams achieve positive IR as of today.

I am much more cautious about jumping to conclusions regarding this track, given the adversarial nature of the contest environment, the “paper money” condition of the investments, the lack of commissions, and the idiosyncrasy of the chosen IR metric, which made adopting market-neutral strategies (conditional on initial outperformance) advisable as a competition risk-minimization tool.

Still, it is easy to build an alternative randomized naive benchmark by randomly choosing both a product and a sign (long/short) and observing the distribution of outcomes from constantly sticking to those 200 bets throughout the competition year.
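Here is a minimal Monte Carlo sketch of that randomized benchmark, continuing from the previous snippet (same imports, placeholder prices and m6_style_ir helper); giving every product a random sign with equal absolute weight is my assumption about the construction.

```python
def random_long_short_scores(prices: pd.DataFrame, n_portfolios: int = 5000,
                             seed: int = 1) -> np.ndarray:
    """IR scores of portfolios that assign a random sign (long/short) to every
    product, with equal absolute weight, and hold those bets all year."""
    rng = np.random.default_rng(seed)
    n_assets = prices.shape[1]
    scores = np.empty(n_portfolios)
    for k in range(n_portfolios):
        signs = rng.choice([-1.0, 1.0], size=n_assets)  # random long/short per product
        scores[k] = m6_style_ir(prices, signs / n_assets)
    return scores

scores = random_long_short_scores(prices)
print(np.percentile(scores, [5, 50, 95]))  # spread of the randomized benchmark
```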

“RISK AVERSION FOR POSITIVE IR SCORES? LONG BIAS?” (chart: IR distribution of participants vs. the randomized benchmark)

The tails of the two distributions look compatible, but the participants’ IR distribution seems to cluster at or below a zero score, with overall underperformance relative to the randomized benchmark. Why could this be? The main suspects to explain this underperformance are, in my opinion:

  • a possible bias of investors favouring long positioning over short, which might explain the underperformance given the period in which the competition has taken place and the portfolio’s positive correlation to the overall market;
  • the adversarial nature of the decisions track, with the incentive to either increase risk to maximal levels or neutralize it with market-neutral positioning, depending on proximity to quarterly milestones;
  • the big impact of decisions-track risk management on the duathlon score, which might make the decisions track a subordinate tool for maximizing the probability of success in the duathlon.

There are still two weeks to the end of the competition and the leaderboard is still moving, but I think some insights are already apparent:

  • Prediction of financial markets is possible only to a very limited extent without using exogenous, non-public, alternative data. (Given that my forecasting model uses exclusively historical end-of-day price data and consistently outperforms, I am assuming no relevant information was extracted from external data in the competition that wasn’t already implicit in historical prices.)
  • Portfolio ranking forecasting is as difficult as returns forecasting (at its core, forecasting relative performance implies a directional forecast as well as a volatility one).
  • Machine Learning can help to optimize up to the limit of what is feasible, but giving the model some generalization capacity is not trivial for this task.
  • Competitive environments can influence risk management decisions, potentially making them, on average, suboptimal.

And that is all regarding the takeaways from the competition’s general results; I hope it was of some interest. I will be sharing more details about my approach in the near future.
