Create a dashboard with SEC text for financial NLP in Amazon SageMaker JumpStart

Amazon SageMaker JumpStart helps you quickly get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed easily with just a few clicks. JumpStart also includes a collection of multimodal financial text analysis tools, including example notebooks, text models, and solutions, which use APIs from a JumpStart SDK.
SEC filings are very important in finance. Companies file these reports with the SEC to inform the world about their business conditions and the future outlook of the company. Because of their potential predictive value, these filings are good sources of information for many people, ranging from the average high-school investor to executives of large financial corporations. Although these filings are freely available to anyone, downloading parsed filings and constructing a clean dataset with added features is a time-consuming exercise, even for skilled technologists. We make this possible in a few API calls.
There are numerous types of filings, but the three that we focus on here are 10-Ks, 10-Qs, and 8-Ks. The functionality we discuss in this post provides an overall dashboard to represent these three types of filings with attribute scoring. We use specialized word lists derived from natural language processing (NLP) techniques to score the actual texts of these filings for several attributes such as risk, uncertainty, litigiousness, readability, and sentiment, providing accessible numbers to represent these attributes.
This post shows how to do the following in a notebook titled Dashboarding SEC Filings, available from SageMaker JumpStart:

client = boto3.client('s3')
client.download_file(S3_BUCKET_NAME,
                     '{}/{}'.format(S3_FOLDER_NAME, '10k_10q_8k_2019_2021.csv'),
                     '10k_10q_8k_2019_2021.csv')
df_forms = pd.read_csv('10k_10q_8k_2019_2021.csv')

items_10K.rename(columns=header_mappings_10K, inplace=True)
df_10K = pd.merge(df_10K, items_10K, left_index=True, right_index=True)
df_10K.head(10)

We use specialized word lists derived from NLP techniques to score the actual texts of these filings for these attributes. We show how to use the NLP scoring API to add numerical scores as new columns or features that can be used to build a dashboard.
SEC filings are widely used by financial services companies as a source of information about companies in order to make trading, risk, lending, and investment management decisions.

# Prepare the SageMaker session's default S3 bucket
# and a folder to store processed data
session = sagemaker.Session()
bucket = session.default_bucket()
secdashboard_processed_folder = 'jumpstart_industry_secdashboard_processed'

data_loader.load(
    dataset_config,
    'dataset_10k_10q_8k_2019_2021.csv',  # output file name
    wait=True,
    logs=True)

This is saved as an HTML file, which you can select in Studio and open in your web browser. You can filter, sort, and search the table interactively in your browser. Both search boxes and sliders provide interactivity.
The following is a sample table with example numbers, which are only for illustration.

scores = pd.read_csv('stock_sec_scores.csv')
# Choose whichever filings you want to compare for the 2nd and 3rd parameters
createRadarChart(scores, 2, 9)

%%time
qdf['summary'] = ''
for i in range(len(qdf)):
    print(i, end='.')
    qdf.loc[i, 'summary'] = summary(qdf.loc[i, 'text2score'])  # summary helper provided in the notebook

nlp_scorer_config = NLPScorerConfig(score_type_list)

You can see the new column added in the following screenshot.

!pip install --no-index smjsindustry-1.0.0-py3-none-any.whl

The following screenshot shows our results.

data_loader = DataLoader(
    role=sagemaker.get_execution_role(),    # loading job execution role
    instance_count=1,                       # number of instances; the limit varies with instance type
    instance_type='ml.c5.2xlarge',          # instance type
    volume_size_in_gb=30,                   # size in GB of the EBS volume to use
    volume_kms_key=None,                    # KMS key for the processing volume
    output_kms_key=None,                    # KMS key ID for processing job outputs
    max_runtime_in_seconds=None,            # timeout in seconds; default is 24 hours
    sagemaker_session=sagemaker.Session(),  # session object
    tags=None)                              # a list of key-value pairs

The top part specifies the following:

The API assigns the rows of the dataframe to the chosen machine instance, and the processing logs show the progress of the job.
We write the file to Amazon S3. Now let's take a look at the dataframe.

Subset the dataframe for the 10-K filings.
Extract the sections for each 10-K filing and put them in columns in a separate dataframe.
Merge this dataframe with the dataframe from Step 1.

Obtain parsed 10-K, 10-Q, and 8-K filings. Retrieving these filings from SEC's EDGAR service is complicated, and parsing these forms into plaintext for further analysis can be very time-consuming. You now have the ability to create a curated dataset in a single API call.
Create separate dataframes for each of the three types of forms, along with separate columns for each extracted section.
Combine two or more sections of the 10-K forms. We implement this (as shown in the following sections) and save the combined text in a column called text2score. We demonstrate how to use the NLP scoring API to add numerical scores as new columns or features that can be used to build a dashboard.
Add a column with a summary of the text2score column.
Prepare a final dataframe that can be used as input for a dashboard.
Prepare an interactive (in the browser) data table.

df_10K["text2score"] = [i + ' ' + j for i, j
    in zip(df_10K["Management's Discussion and Analysis of Financial Condition and Results of Operations"],
           df_10K["Quantitative and Qualitative Disclosures about Market Risk"])]
df_10K[['ticker', 'text2score']].to_csv('text2score.csv', index=False)

NLP scoring can be slow for large documents such as SEC filings, which contain anywhere from 20,000–100,000 words. Matching against long word lists (typically 200 words or more) can be time-consuming.
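One way to keep multi-list scoring tractable on documents of this size is to tokenize once, count the tokens, and then score every word list against the counts. This is only a sketch of the idea, not the library's implementation, and the word lists below are hypothetical stand-ins:

```python
import re
from collections import Counter

def score_all(text: str, word_lists: dict) -> dict:
    """Score several word lists in a single pass over the tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    n = len(tokens) or 1
    return {name: sum(counts[w] for w in wl) / n
            for name, wl in word_lists.items()}

# Tiny illustrative lists; real lexicons contain hundreds of terms
lists = {"risk": {"risk", "adverse"}, "litigious": {"litigation", "lawsuit"}}
print(score_all("Litigation risk and adverse outcomes.", lists))  # {'risk': 0.4, 'litigious': 0.2}
```

Because membership tests hit a Counter rather than rescanning the text per list, the cost of adding another word list is proportional to the list's length, not the document's.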
The input to the API requires the following:

The dataset has 248 rows and 6 columns. The column text contains the full plaintext of the filing. The column mdna is for the Management Discussion and Analysis section and is only present in the 10-K and 10-Q forms, not in the 8-K form.
Create the dataframe for the extracted item sections from the 10-K filings.
Next, we create the dataframe for the extracted item sections.

The middle section shows how to assign system resources and has default values in place.
The last part runs the API.

The output file name used in the following example is all_scores.csv, but you can change this to any other file name. It's saved in the S3 bucket and then, as shown in the following code, we copy it into SageMaker Studio to process it into a dashboard.
The API call is as follows:

Next, we load the S3 bucket from the SageMaker session:

%%time

# Download scripts from S3

qdf = pd.read_csv('all_scores.csv')

%run sec-dashboard/SEC_Section_Extraction_Functions.ipynb

Note that we have loaded the SageMaker Python SDK, Boto3, and classes from smjsindustry. You're now ready to download SEC filings for curating your text dataframe. As we discuss next, this is done in a single API call.
Download the filings you wish to work with.
Downloading SEC filings is done from the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website, which provides open data access. EDGAR is the primary system under the SEC for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average. For more information, see Accessing EDGAR Data. In this section, we provide a single API call that creates a dataset in a few lines of code, for any period of time and for a large number of tickers.
We wrapped the retrieval functionality into a SageMaker processing container and provide an example notebook in JumpStart so you can download a dataset of filings with metadata such as dates and parsed plaintext, which you can then use for ML with other SageMaker tools. You only need to specify a date range and a list of ticker symbols, and this API does the rest.
The extracted dataframe is written to Amazon S3 storage as a CSV file.
The following API specifies the machine to be used and the volume size. It also specifies the tickers or CIK codes for the companies to be covered, as well as the three form types (10-K, 10-Q, 8-K) to be retrieved. The date range is also specified, along with the file name (CSV) where the retrieved filings are stored.
The API comprises three parts:

nlp_score_processor.calculate(
    nlp_scorer_config,
    "text2score",                                                              # input column
    's3://{}/{}/{}'.format(S3_BUCKET_NAME, S3_FOLDER_NAME, 'text2score.csv'),  # input from S3 bucket
    's3://{}/{}'.format(S3_BUCKET_NAME, S3_FOLDER_NAME),                       # output S3 prefix (both bucket and folder names are required)
    'all_scores.csv')                                                          # output file name

score_type_list = list(
    NLPScoreType(score_type, [])
    for score_type in NLPScoreType.DEFAULT_SCORE_TYPES
    if score_type not in NLPSCORE_NO_WORD_LIST)
score_type_list.extend([NLPScoreType(score_type, None)
                        for score_type in NLPSCORE_NO_WORD_LIST])

Visit the example notebook Dashboarding SEC Filings in JumpStart to see similar section extraction for the 10-Q and 8-K forms. For the latter, not all sections are populated. This is because 8-K forms are typically filed to report one kind of material event that affects a company, such as Bankruptcy/Receivership, Termination of a Material Definitive Agreement, Regulation FD Disclosure, and so on.
NLP scoring of the forms for specific sections
Financial text has been scored using word lists for some time. For a detailed review, see Textual Analysis in Finance.
The smjsindustry library provides 11 NLP score types by default: positive, negative, litigious, polarity, risk, readability, fraud, safe, sentiment, uncertainty, and certainty. Each score (except readability and sentiment) has its own word list, which is used for scanning and matching against an input text dataset.
NLP scoring delivers a score as the fraction of words in a document that appear in the relevant scoring word list. Some scores, such as readability, use standard formulae such as the Gunning-Fog score.
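The fraction-of-words idea can be sketched in a few lines (this is an illustration, not the library's code, and RISK_WORDS is a tiny hypothetical list; the real lexicons run to hundreds of terms):

```python
import re

def wordlist_score(text: str, word_list: set) -> float:
    """Fraction of tokens in `text` that appear in `word_list`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in word_list)
    return hits / len(tokens)

# Tiny illustrative word list
RISK_WORDS = {"risk", "uncertainty", "adverse", "litigation", "volatile"}

score = wordlist_score("Litigation risk and adverse market conditions remain.", RISK_WORDS)
print(round(score, 3))  # 3 of 7 tokens match -> 0.429
```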
These NLP scores are added as new numerical columns to the text dataframe; this creates a multimodal dataframe, which is a mixture of tabular data and long-form text, called TabText. When submitting this multimodal dataframe for ML, it's a good idea to normalize the columns of NLP scores (usually with standard normalization or min-max scaling).
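Min-max scaling of the score columns might be sketched as follows (the column names here are hypothetical, chosen only to illustrate the idea):

```python
import pandas as pd

def minmax_scale_columns(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Rescale each named column to [0, 1]; a constant column maps to 0."""
    out = df.copy()
    for c in cols:
        lo, hi = out[c].min(), out[c].max()
        out[c] = 0.0 if hi == lo else (out[c] - lo) / (hi - lo)
    return out

# Hypothetical NLP-score columns for three filings
scores = pd.DataFrame({"risk": [0.02, 0.05, 0.08], "sentiment": [0.1, 0.1, 0.4]})
print(minmax_scale_columns(scores, ["risk", "sentiment"]))
```

Scaling each score column to [0, 1] keeps any one NLP attribute from dominating downstream models simply because its raw fractions are larger.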
You can automatically score any chosen text column using the tools in JumpStart. We demonstrate this with the following code example. We combine the MD&A section (Item 7) and the Risk section (Item 7A), and then apply NLP scoring. We compute 11 additional columns for the various types of scores.
To begin, allocate the text for NLP scoring by creating a new column that combines two columns into a single column called text2score. A new file is saved in your Amazon S3 bucket.

After this, import the required libraries as follows:

This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice.
This post uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR's access terms.

Prepare a dashboard of an interactive screening table and visualize the data.
After you curate this dataframe, wouldn't it be great to interact with it? This can be done in a few lines of code. We need the R programming language, specifically the DT package, to get this working. We use this final CSV file to build the screening table. The file stock_sec_scores.csv is the same as all_scores.csv except without the text2score and summary columns.
Use the following code in R to build the dashboard:

df = pd.read_csv('10k_10q_8k_2019_2021.csv')
df_10K = df[df.form_type == "10-K"]
# Construct the DataFrame row by row
items_10K = pd.DataFrame(columns=columns_10K, dtype=object)
for i in df_10K.index:
    form_text = df_10K.text[i]
    item_iter = get_form_items(form_text, "10-K")
    items_10K.loc[i] = items_to_df_row(item_iter, columns_10K, "10-K")

A comparison of scores from two documents is obtained using a radar plot, for which the function createRadarChart is provided in the helper notebook. This is useful for comparing two SEC filings using their normalized NLP scores. The scores are normalized using min-max scaling on each NLP score. See the following code:

Specification of system resources, such as the number and type of machine instances to be used.
Which NLP scores to generate, each one resulting in a new column in the dataframe.
The S3 bucket and file name in which to store the enhanced dataframe as a CSV file.
A section that starts the API.

# R code
import subprocess
ret_code = subprocess.call(['/usr/bin/Rscript', 'sec-dashboard/Dashboard.R'])

Derrick Zhang is a Software Development Engineer at Amazon SageMaker. He focuses on building machine learning tools and products for customers.
Daniel Zhu is a Software Development Engineer Intern at Amazon SageMaker. He is currently a third-year Computer Science major at UC Berkeley, with a focus on machine learning.
Bodhisatta Saha is a high-school senior at the Harker School in San Jose, California. He enjoys working on software projects in the areas of social benefit, finance, and natural language.
Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and PhD from New York University) and Computer Science (MS from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice President at Citibank. He works on multimodal machine learning in the area of financial applications.

In this post, we showed how to curate a dataset of SEC filings, use NLP for feature engineering on the dataset, and present the features in a dashboard.
To get started, you can refer to the example notebook in JumpStart titled Dashboarding SEC Filings. You can also refer to the example notebook in JumpStart titled Create a TabText Dataset of SEC Filings in a Single API Call, which contains more details of SEC forms retrieval, summarization, and NLP scoring.
For an overview of financial ML tools in JumpStart, see Amazon SageMaker JumpStart introduces new multimodal (long-form text, tabular) financial analysis tools.
For related posts with use cases to get started with, see Use SEC text for ratings classification using multimodal ML in Amazon SageMaker JumpStart and Use pre-trained financial language models for transfer learning in Amazon SageMaker JumpStart.
For additional documentation, see SageMaker JumpStart Industry Python SDK and Amazon SageMaker JumpStart Industry.

The radar plot shows the overlap (and consequently, the difference) between documents on various attributes.

nlp_score_processor = NLPScorer(
    sagemaker.get_execution_role(),         # loading job execution role
    1,                                      # number of instances; the limit varies with instance type
    'ml.c5.18xlarge',                       # EC2 instance type to run the scoring job
    volume_size_in_gb=30,                   # size in GB of the EBS volume to use
    volume_kms_key=None,                    # KMS key for the processing volume
    output_kms_key=None,                    # KMS key ID for processing job outputs
    max_runtime_in_seconds=None,            # timeout in seconds; default is 24 hours
    sagemaker_session=sagemaker.Session(),  # session object
    tags=None)                              # a list of key-value pairs


One of the features of this dashboarding process shown in the notebook is breaking out the long SEC filings into separate sections, each of which deals with different aspects of a company's reporting. This makes parts of the text of each filing readily accessible to investors or their algorithms.
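The notebook ships its own helper functions for this; a simplified sketch of item-based sectioning (the heading pattern and form text below are assumptions for illustration, not the notebook's actual parser) might look like:

```python
import re

# Illustrative pattern for item headings such as "Item 1." or "Item 1A.";
# real forms have many more items and messier formatting.
ITEM_PATTERN = re.compile(r"^item\s+(\d+[a-z]?)\.", re.IGNORECASE | re.MULTILINE)

def split_items(form_text: str) -> dict:
    """Map each item number to the text between its heading and the next one."""
    matches = list(ITEM_PATTERN.finditer(form_text))
    sections = {}
    for m, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(form_text)
        sections[m.group(1).upper()] = form_text[m.end():end].strip()
    return sections

text = "Item 1. Business\nWe sell widgets.\nItem 1A. Risk Factors\nDemand may fall."
print(split_items(text))
```

Each extracted section can then become its own dataframe column, which is what makes per-section scoring possible later.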
Financial NLP
Financial NLP is a subset of the rapidly increasing use of ML in finance, but it is the largest. The starting point for a vast amount of financial NLP is text in SEC filings.
SEC filings are widely used by financial services companies as a source of information about companies in order to make trading, lending, investment, and risk management decisions. They contain forward-looking information that helps with forecasts and are written with a view to the future. In addition, in recent times, the value of historical time series data has deteriorated, because economies have been structurally transformed by trade wars, pandemics, and political upheavals. Text as a source of forward-looking information has been gaining relevance.
There has been an exponential growth in downloads of SEC filings. How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI reports that the number of machine downloads of corporate 10-K and 10-Q filings increased from 360,861 in 2003 to 165,318,719 in 2016.
A vast body of academic and practitioner research is based on financial text, a significant part of which is based on SEC filings. A recent review article summarizing this work is Textual Analysis in Finance (2020).
We describe how a user can quickly retrieve a set of forms, break them into sections, score the text in each section using the provided word lists, and then prepare dashboard elements to analyze the data.
This post demonstrates how to curate a dataset of SEC filings with a single API call. This can save financial analysts weeks of work in developing a pipeline to curate and download SEC text, especially given its substantial scale.
Set up the smjsindustry library
We deliver the APIs through the smjsindustry client library. The first step requires pip installing a Python package that interacts with a SageMaker processing container. The retrieval, parsing, transforming, and scoring of text is a complex process and uses different algorithms and packages. To make this seamless and stable for the user, the functionality is packaged into an Amazon Simple Storage Service (Amazon S3) bucket. For installation and maintenance of the workflow, this approach reduces your effort to a pip install followed by a single API call.
The following code blocks copy the wheel file to install the smjsindustry library. They also download a synthetic example dataset and dependencies to demonstrate the functionality of curating the TabText dataframe.

About the Authors

This also installs all the necessary packages and APIs that are required to build the financial NLP dataset.
Helper functions
We developed various helper functions to enable sectioning the SEC forms, each of which has its own sectioning structure, with sections assigned item numbers. The following code invokes several helper functions needed to parse out the various sections in the 10-K, 10-Q, and 8-K forms:

This starts the processing job running in a SageMaker container and makes sure that even a very large retrieval can run without losing the notebook connection.

You can examine the cells in the following dataframe to see the text from each section:

import smjsindustry
from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST
from smjsindustry import NLPScorer
from smjsindustry import NLPScorerConfig

dataset_config = EDGARDataSetConfig(
    tickers_or_ciks=['amzn', 'goog', '27904', 'fb', 'msft', 'uber', 'nflx'],  # list of stock tickers or CIKs
    form_types=['10-K', '10-Q', '8-K'],  # list of SEC form types
    filing_date_start='2019-01-01',      # starting filing date
    filing_date_end='2020-12-31',        # ending filing date
    email_as_user_agent='your-email@example.com')  # user agent email (replace with your own)

%%time

This example notebook uses data obtained from the SEC EDGAR database. Note that you are responsible for complying with EDGAR's access terms. For more information, see Accessing EDGAR Data.
For this post, we downloaded three types of filings for seven companies over a period of two years. The completed dataset is saved in Amazon S3 as a CSV file titled 10k_10q_8k_2019_2021.

import smjsindustry
from smjsindustry.finance import utils
from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST
from smjsindustry import NLPScorerConfig, JaccardSummarizerConfig, KMedoidsSummarizerConfig
from smjsindustry import Summarizer, NLPScorer
from smjsindustry.finance.processor import DataLoader, SECXMLFilingParser
from smjsindustry.finance.processor_config import EDGARDataSetConfig

The following screenshot shows our results.

Add a column with summaries of the text being scored.
We can further enhance the dataframe with summaries of the target text column. As an example, we used the abstractive summarizer from Hugging Face. Because this summarizer can only accommodate roughly 300 words of text, it's not directly applicable to our text, which is much longer (thousands of words). Therefore, we applied the Hugging Face summarizer to groups of paragraphs and stitched the results together to make a single summary. This happens automatically in the summary function in the following code.
The dataframe is now extended with an additional summary column. Note that an abstractive summarizer restructures the text and loses the original sentences. This is in contrast to an extractive summarizer, which retains the original sentence structure.
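The chunk-and-join approach can be sketched as follows; `summarize_fn` stands in for the Hugging Face summarizer (the real one would be a transformers pipeline), and the word-based chunking is an assumption for illustration:

```python
def chunk_words(text: str, max_words: int = 300):
    """Yield consecutive chunks of at most `max_words` words."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

def summarize_long(text: str, summarize_fn, max_words: int = 300) -> str:
    """Summarize each ~300-word chunk, then join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))

# Demo with a stand-in summarizer that keeps the first five words of each chunk
fake_summarizer = lambda s: " ".join(s.split()[:5])
doc = " ".join(f"w{i}" for i in range(700))
print(summarize_long(doc, fake_summarizer))
```

Joining per-chunk summaries trades global coherence for the ability to handle documents far beyond the model's input limit.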
The code for this is as follows:

# Download smjsindustry SDK
sdk_bucket = f's3://{notebook_artifact_bucket}/{notebook_sdk_prefix}'  # bucket/prefix variables provided by the notebook
!aws s3 sync $sdk_bucket ./

# Download helper scripts
scripts_bucket = f's3://{notebook_artifact_bucket}/{notebook_script_prefix}'
!aws s3 sync $scripts_bucket ./sec-dashboard

The readability score indicates the number of years of schooling required to read the material. The higher this value, the lower the readability.
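The Gunning-Fog index mentioned earlier estimates this as 0.4 × (average words per sentence + 100 × fraction of complex words). A rough sketch follows; the vowel-group syllable counter is a crude approximation, not the formula's formal definition of complex words:

```python
import re

def syllables(word: str) -> int:
    """Crude syllable count: number of vowel groups in the word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text: str) -> float:
    """Gunning-Fog index: 0.4 * (words/sentence + 100 * complex_words/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if syllables(w) >= 3]  # 3+ syllables
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(round(gunning_fog("The cat sat. The dog ran away quickly."), 2))
```

Short sentences of short words score low (few years of schooling needed); long SEC-style sentences full of polysyllabic terms push the index, and thus the implied reading difficulty, up.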

%pylab inline
import boto3
import pandas as pd
import sagemaker
pd.set_option("display.max_columns", None)

The tickers or SEC CIK codes for the companies whose forms are being retrieved.
The SEC forms (in this case 10-K, 10-Q, 8-K).
The date range of forms by filing date.
The output CSV file and S3 bucket in which to store the dataset.
