Predicting and Analyzing Markets Utilizing a News Archive

LLMs can play a crucial role in decision-making, offering insights into market directions, strategic choices, and an accurate portrayal of the market landscape.

Exploring the power of Large Language Models (LLMs) has become a major trend, with their applications extending across many different fields. Although their potential in the financial world is vast, it remains largely untapped.

One key element in understanding the current market is stock news, which serves as a significant indicator: it not only reflects present market conditions but also provides insight for predicting the future. The Benzinga Stock News API makes accessing stock news straightforward, and the resulting data becomes a valuable source of information when combined with various models and technologies.

This article delves into the exploration of Large Language Models and their application, shedding light on their potential to enhance market analysis.

Installing Libraries

First, we'll install and import the libraries this project depends on: the Benzinga client to fetch the news data, and the OpenAI client to query the Large Language Model.

!pip install benzinga
!pip install openai

from benzinga import news_data
import pandas as pd
import re
import openai

openai.api_key = "YOUR_OPENAI_KEY"
api_key = "YOUR_BENZINGA_API_KEY"

OpenAI is a leading artificial intelligence research organization that provides powerful language models, such as GPT-3.5, for natural language processing and generation.

Benzinga, on the other hand, is a financial news platform that offers the “benzinga” library, enabling users to access and analyze real-time stock market news and data. 

Getting the Dataset

Now, we’ll use the functions from the Benzinga Library to get our dataset. This dataset consists of stock news and comes with various details linked to each piece of news.

The obtained dataset is characterized by the following attributes:

  • id: A unique identifier for each entry in the dataset.
  • author: The author or contributor of the article.
  • created: The timestamp indicating when the article was originally created.
  • updated: The timestamp indicating the last update to the article.
  • title: The title of the article.
  • teaser: A brief preview or summary of the article content.
  • body: The main text or body content of the article.
  • url: The URL link to the full article.
  • image: Information about the article’s associated image, if available.
  • channels: Categories or channels to which the article belongs.
  • stocks: Information about relevant stocks mentioned in the article.
  • tags: Tags associated with the article, providing additional information or context.

We need to clean the body of each article to remove the HTML tags and entities it comes with. The following function uses the regular expressions library to strip them out.
def remove_symbols(text):
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', '', text)

    # Remove HTML entities such as &amp; or &#8217;
    clean_text = re.sub(r'&[a-zA-Z0-9#]+;', '', clean_text)

    return clean_text


df['body'] = df['body'].apply(remove_symbols)
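As a quick sanity check, the cleaner can be applied to a made-up snippet (the function is repeated here so the example is self-contained):

```python
import re

def remove_symbols(text):
    # Remove HTML tags
    clean_text = re.sub(r'<.*?>', '', text)
    # Remove HTML entities such as &amp;
    clean_text = re.sub(r'&[a-zA-Z0-9#]+;', '', clean_text)
    return clean_text

sample = "<p>Shares of <b>AAPL</b> rose &amp; closed higher.</p>"
print(remove_symbols(sample))  # Shares of AAPL rose  closed higher.
```

Note that removed entities leave their surrounding whitespace behind, which is harmless for our purposes.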

The dataset has a lot of potential and can be used for various real-life applications:

  • Stock market prediction: Utilize historical data on stocks and associated tags to train machine learning models for predicting future stock market trends.
  • Financial news sentiment analysis: Analyze the sentiment in article titles and teasers to gauge market sentiment, helping traders and investors make informed decisions.
  • Influencer investment strategies: Extract information on disclosed investments from influential figures like Bill Ackman to understand and potentially replicate successful investment strategies.
  • Media coverage analysis: Assess media coverage of companies (e.g., Apple) and personalities (e.g., Alexis Ohanian) for public relations and brand management purposes.
  • Training data for NLP models: Use the dataset to train natural language processing (NLP) models for various tasks, including sentiment analysis and topic modeling in financial and tech domains.

Prompting the Model

Next, we’ll define our model. For this instance, I’m opting for gpt-3.5-turbo. Feel free to experiment with different models to find the one that suits your requirements best. 

def get_chatcompletion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,  # the degree of randomness of the model's output
    )
    return response.choices[0].message.content

Refining a prompt effectively is crucial to achieving the desired outcomes. Through extensive experimentation, I've developed a prompt that combines few-shot examples with logical reasoning, striving to reduce mistakes and improve precision.

The prompt creation process is delineated into five straightforward phases, applicable to tasks of any complexity. Each phase below will involve experimentation, adjustments, and refinement.

Phase 1:

Firstly, grasp the essence of the task. Document your requirements, starting with the expected response from the prompt. Then, detail the input you’ll provide. Highlight responses that you want the model to avoid. Anticipate potential scenarios the model might generate based on your input and instruct it explicitly to avoid those. This stage naturally involves trial and error.

Phase 2:

Enrich the prompt with context or examples, understanding that more examples can be beneficial but not always necessary. Clarify what the model should extract from these examples and to what extent. Indicate the elements that must be included in the model’s response after processing the context.

Phase 3:

Define the desired response format, preferably mirroring the format used in the provided context. A more structured response format is generally more effective. However, variations might be needed depending on the case, which requires testing different formats.

Phase 4:

Organize the prompt in this sequence: task, context, response format, instructions, and input. This order prioritizes the information’s relevance, decreasing from the beginning to the end. Different tasks might necessitate a different order or additional sections.

Phase 5:

Analyze the responses obtained in each phase to understand the causes of any inaccuracies or ‘hallucinations’ by the model. Guide the model to exclude these inaccuracies by adding a “Note:” at the end of the prompt structure.
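The five phases above can be condensed into a small prompt-builder sketch. The section texts below are illustrative placeholders, not the exact prompt used in this article; the ordering follows Phase 4 — task, context, response format, instructions (as a note), then input:

```python
# Hypothetical prompt builder illustrating the five-phase layout.
def build_prompt(task, examples, response_format, note, news_body):
    # Context: few-shot examples formatted like the target input/output pair.
    example_block = "\n\n".join(
        f'Stock News: "{body}"\nSentiment: "{label}"' for body, label in examples
    )
    # Phase 4 ordering: task, context, response format, instructions, input.
    return (
        f"{task}\n\n"
        f"{example_block}\n\n"
        f"{response_format}\n\n"
        f"Note: {note}\n\n"
        f"Stock News: ```{news_body}```"
    )

prompt = build_prompt(
    task="Predict the sentiment of the stock news below.",
    examples=[("Shares surged after earnings beat.", "Positive")],
    response_format='Respond as: Sentiment: "Positive" or "Negative", then Reasoning.',
    note="Base the sentiment on the entire article.",
    news_body="The company missed revenue estimates.",
)
print(prompt)
```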

def get_sentiment(news_body):
    prompt = f"""
Given a news article on stocks delimited by triple backticks (```), your task is to predict the sentiment that the news conveys - "Positive" or "Negative" - the direction in which the market will go - "up" or "down" - and the logical reasoning behind your choices. You are provided with examples of stock news and their sentiments; use these to understand the task and give an accurate and reasonable response.


Stock News: "{df['body'][0]}"
Sentiment: "Positive"


Stock News: "{df['body'][1]}"
Sentiment: "Positive"


Stock News: "{df['body'][10]}"
Sentiment: "Negative"


Note: The sentiment should be analyzed on the basis of reading the entire article given as Stock News.


Stock News: ```{news_body}```
"""
    response = get_chatcompletion(prompt)
    print(response)

Results

Now, we’ll analyze the output for a piece of stock news from our Benzinga dataset. The logical reasoning provided aligns with the sentiment and market direction, suggesting that our model is performing well.

get_sentiment(df['body'][3])
Sentiment: "Positive"

The sentiment of the news is positive because it announces the launch of mobile and retail sports betting operations in Ohio by BetMGM, a leading sports betting and iGaming operator. The CEO of BetMGM, Adam Greenblatt, describes it as a "monumental day of growth" for both BetMGM and the sports betting industry. The news highlights the opportunity for passionate sports fans in Ohio to enjoy the excitement and benefits of wagering with BetMGM. Additionally, the integration with MGM Rewards allows users to redeem their gameplay for experiences at MGM Resorts properties nationwide. The opening of two new retail sportsbooks in the Cleveland and Cincinnati metro areas is also mentioned, further emphasizing the expansion and accessibility of BetMGM's services. The announcement of responsible gambling education resources, such as the GameSense program, further adds to the positive sentiment by demonstrating BetMGM's commitment to providing a safe and entertaining digital experience for customers.
get_sentiment(df['body'][11])
Sentiment: "Neutral"


The given stock news does not provide any sentiment as it is a list of the top searched items on Google in 2022. It is a factual report without any positive or negative connotations.


Reasoning: The news simply presents a list of the most searched items on Google in 2022, including topics such as the Wordle game, election results, the deaths of Betty White and Bob Saget, the invasion of Ukraine by Russia, lottery jackpots, and other notable events. There is no indication of positive or negative sentiment in this news.

get_sentiment(df['body'][6])
Sentiment: "Negative"


The sentiment of this news is negative because it highlights the financial troubles and legal issues faced by Twitter, which is owned by Elon Musk. The news mentions that Twitter has been sued for failing to pay rent for its office space in San Francisco and for refusing to pay for its charter flights. It also mentions that Twitter has not paid rent for its headquarters or any of its other global offices. These financial issues and lawsuits indicate a negative situation for the company.


Reasoning: The news clearly states that Twitter has been sued for failing to pay rent and for refusing to pay for its charter flights. These actions indicate financial difficulties and a lack of responsibility on the part of the company. Additionally, the mention of Twitter's net loss of $270 million in the second quarter further supports the negative sentiment. Overall, the news presents a picture of financial troubles and legal issues for Twitter, which contributes to the negative sentiment.
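Since `get_sentiment` only prints the model's free-text reply, downstream code may want a structured label instead. A minimal, hypothetical helper can extract it, assuming the reply begins with a line like `Sentiment: "Positive"` as in the runs above:

```python
import re

# Hypothetical helper: pull the sentiment label out of the model's
# free-text reply, which in our runs starts with: Sentiment: "Positive"
def parse_sentiment(response_text):
    match = re.search(r'Sentiment:\s*"(\w+)"', response_text)
    return match.group(1) if match else None

sample = 'Sentiment: "Negative"\n\nThe sentiment of this news is negative...'
print(parse_sentiment(sample))  # Negative
```

Returning `None` on a missing label lets the caller detect replies that drifted from the requested format (such as the "Neutral" case above).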

Conclusion

In conclusion, the integration of large language models, exemplified by our use of the Benzinga News Archive and API, opens up immense possibilities in the financial sector. The capabilities these tools offer play a pivotal role in automating processes within the industry. The richness of the data available through Benzinga not only facilitates comprehensive analysis but also paves the way for innovative advancements.

As we navigate this landscape of evolving technology, the synergy between large language models and financial datasets stands as a promising avenue for transforming and optimizing various aspects of the financial realm.

With that being said, you’ve reached the end of the article. Hope you learned something new and useful today, don’t hesitate to reach out if you have any questions!
