
AI-Assisted Investment Research: Benefits and Risks of LLMs

Machine-generated investment analysis can improve analysts’ efficiency, but human evaluation of that analysis is still necessary, especially as tasks grow more complex.

Key Takeaways

  • Large language models perform worse as the complexity of information retrieval and text summarization tasks increases.

  • Human oversight is still required for complex arithmetic and logical-reasoning-intensive tasks due to factual knowledge gaps.

  • Large language models themselves can be used to evaluate the information they generate in an automated process.

Artificial intelligence and large language models like ChatGPT have demonstrated incredible potential over the last couple of years. They have infiltrated nearly every major industry, but the technology is still so new and unprecedented that there isn’t much historical wisdom to draw from. For the investment research world in particular, research on the application of large language models is limited.

Existing studies have mostly focused on evaluating LLMs for passing financial analyst exams or tasks like numerical reasoning. Investment analysts are especially interested in opportunities for LLMs to take on time-consuming day-to-day investment research tasks, but there are serious challenges to overcome first.

The Morningstar Quantitative Research team published a comprehensive report that this article is based on. Download the AI-assisted research report here.

How Are Investment Analysts Using AI?

Large language models can help make sense of unstructured data and identify patterns across a data set. The Morningstar study explored a diverse array of practical investment research tasks, spanning increasing levels of complexity:

  • Information mining—extracting financial information from financial reports, news articles, or social media posts.
  • Text condensation—turning large volumes of information, such as financial reports, into easily digestible summaries.
  • General research Q&A—answering potential client questions about investments and associated services.
  • SQL code generation—writing SQL code to screen universes, fetch data points, and perform aggregation calculations (a brief sketch of this task follows the list).
  • Numerical reasoning—drawing quick calculations from structured data assets.
  • Drafting narrations—studying, analyzing, and writing about investment opportunities.
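To make the SQL code-generation task concrete, here is a minimal sketch of how an analyst might prompt an LLM to write such a query. The fund_universe schema, the prompt wording, and the use of the OpenAI chat-completions client are illustrative assumptions, not details taken from the Morningstar study.

```python
# A minimal sketch of the SQL code-generation task, assuming a hypothetical
# "fund_universe" table and the OpenAI chat-completions client; the schema,
# prompt wording, and model choice are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = "fund_universe(ticker TEXT, category TEXT, expense_ratio REAL, star_rating INTEGER)"

question = "List the five cheapest large-blend funds rated 4 stars or better."

prompt = (
    "You are an assistant that writes SQL for investment research.\n"
    f"Table schema:\n{schema}\n"
    f"Question: {question}\n"
    "Return only the SQL query."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
# Expected shape of the answer (illustrative):
# SELECT ticker, expense_ratio FROM fund_universe
# WHERE category = 'Large Blend' AND star_rating >= 4
# ORDER BY expense_ratio ASC LIMIT 5;
```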

Manager research teams face practical barriers to wide-scale adoption of AI technology. Large language models can suffer from hallucinations, or the generation of factually incorrect information. They can also struggle with diverse data formats and types, unique linguistic styles, domain-specific intent, and entity identification, particularly as datasets evolve. To fine-tune LLMs, asset managers and advisors would have to provision computational resources and absorb ongoing model maintenance costs.

RAG Systems and the Challenges Facing Artificial Intelligence

To address some of these challenges, retrieval-augmented generation, or RAG, systems have emerged.

RAG systems fetch up-to-date or context-specific data from an external research database and make it available to an LLM during the text-generation process. Morningstar’s own chatbot, which was built with the Morningstar Intelligence Engine, uses a RAG. These systems can cite their sources, improving auditability and transparency, which is a key requirement for regulated investment research entities.

An illustration showing how a retrieval-augmented generation system is inserted into the pathway of a large language model to produce more accurate output.
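As a rough illustration of the retrieve-then-generate pattern, the sketch below assembles a prompt from a toy in-memory document store. The note texts, the keyword-overlap retriever, and the function names are assumptions for illustration; a production system would use a vector database and learned embeddings instead.

```python
# A toy retrieval-augmented generation loop, assuming a small in-memory document
# store and simple keyword-overlap scoring; the note texts are invented.
documents = {
    "note_01": "Fund ABC lowered its expense ratio to 0.45% in its latest filing.",
    "note_02": "Manager tenure at Fund XYZ now exceeds ten years.",
    "note_03": "Fund ABC's fixed-income sleeve shifted toward short-duration bonds.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank documents by how many query words they share and return the top k."""
    terms = set(query.lower().split())
    scored = sorted(
        documents.items(),
        key=lambda item: len(terms & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Insert the retrieved, citable passages into the LLM prompt."""
    sources = retrieve(query)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in sources)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

print(build_prompt("What changed in Fund ABC's expense ratio?"))
# The assembled prompt is then passed to the LLM; because each passage carries
# an id, the generated answer can cite its sources, aiding auditability.
```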

Testing the Effectiveness of AI-Powered Investment Research

The Morningstar Quantitative Research team conducted an experiment to uncover the challenges of using RAG systems for the practical, content-generation research tasks that consume most of an investment analyst’s time.

The team curated a list of over 1,250 real-world questions and answers representative of these tasks. The evaluation data set was curated with live financial data using analyst notes, research papers, and filing documents available on Morningstar Direct, the comprehensive application that asset and wealth managers use to build and manage investment portfolios.

Exhibit 2 describes the structure of the data and some of its characteristics, such as the average prompt length for LLMs, the percentage of numerical data, and the logical reasoning required to solve each task.

A chart showing the data distribution of investment research tasks evaluated in the Morningstar quantitative research report "Scaling AI-Assisted Research."

Next, the team generated content for each task using different flavors of closed- and open-source LLMs, such as GPT-4, Claude-v2, and Mistral-7b, which have shown impressive performance on various benchmarks. They evaluated these models in both zero-shot and few-shot prompt settings to check whether the results improved.
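For readers unfamiliar with the terms, the sketch below contrasts the two prompt settings. The questions and the worked example are invented for illustration and do not come from the evaluation data set.

```python
# A minimal sketch of zero-shot versus few-shot prompting; the questions and
# worked example are invented for illustration.
question = "A fund returned 8% in year one and -3% in year two. What is the cumulative return?"

# Zero-shot: the model sees only the task.
zero_shot_prompt = f"Question: {question}\nAnswer:"

# Few-shot: the model first sees one or more worked examples that demonstrate
# the expected reasoning and answer format before the new question is appended.
few_shot_prompt = (
    "Question: A fund returned 10% in year one and 5% in year two. "
    "What is the cumulative return?\n"
    "Answer: (1.10 * 1.05 - 1) = 15.5%\n\n"
    f"Question: {question}\nAnswer:"
)

print(zero_shot_prompt)
print(few_shot_prompt)
```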

Human evaluation was conducted first on the machine-generated content to uncover potential gaps, grading outputs on multiple dimensions such as relevancy, groundedness, and conciseness. The team then had LLMs grade the same machine-generated text against the same metrics, mirroring the human evaluation.
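Below is a minimal sketch of what such LLM-aided grading might look like. The rubric wording, the 1-to-5 scale, and the OpenAI client call are assumptions for illustration, not the report’s actual evaluation harness.

```python
# A minimal sketch of LLM-aided evaluation on the same dimensions the report
# uses (relevancy, groundedness, conciseness); the rubric wording, score scale,
# and OpenAI client call are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def grade(question: str, source: str, answer: str) -> dict:
    """Ask an LLM to score a generated answer on a 1-5 scale per dimension."""
    rubric = (
        "Grade the answer on relevancy, groundedness, and conciseness, each 1-5.\n"
        "Groundedness means every claim is supported by the source text.\n"
        f"Question: {question}\nSource: {source}\nAnswer: {answer}\n"
        'Reply with JSON like {"relevancy": 5, "groundedness": 4, "conciseness": 5}.'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": rubric}],
    )
    # A production harness would validate the reply before parsing it as JSON.
    return json.loads(response.choices[0].message.content)

scores = grade(
    question="What is the fund's expense ratio?",
    source="The latest filing lists an expense ratio of 0.45%.",
    answer="The fund charges 0.45%.",
)
print(scores)
```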

Here are some of the key findings.

  • Machine-generated text displays higher efficacy on simpler information retrieval and text summarization tasks, holding promise to augment analysts’ efficiency.
  • Complex arithmetic calculations and logical-reasoning-intensive research tasks remain challenging for LLMs today, so continued expert human oversight is needed to cover factual knowledge gaps.
  • Automated evaluation of machine-generated text using LLMs themselves yields a scalable and cost-efficient approach, aiding adoption of this technology. Based on the experiments, there is 80% alignment between LLM-aided evaluations and human assessments.
  • Based on the experiments, the GPT-4 model came out on top for both text generation and evaluation of investment research tasks.

How Will AI Change Investment Research?

The experiment uncovered potential challenges for RAG-based systems in generating machine text on real-world, time-consuming investment research tasks. Based on human evaluations, AI tools can perform simple tasks like information mining and text condensation effectively. However, as the complexity of tasks grows to require arithmetic computations and logical-reasoning skills, they start to falter. Investors cannot rely solely on machine-generated text for investment decisions. These tools should be treated as a means of achieving efficiency gains.

Read the full report for the specific details of the experiment as well as robust commentary on the outcomes.
