Process and add additional file formats to your Amazon Kendra Index

If you have a corpus of internal documents that you regularly search through, Amazon Kendra can help you find your content faster and more easily. Amazon Kendra is a fully managed service powered by machine learning (ML).
One of the popular features in Amazon Kendra is natural language question answering. You can query Amazon Kendra in natural language and it returns an answer from within your documents.
As of September 2021, Amazon Kendra supports the following document types:

Plaintext
HTML
PDF
Microsoft PowerPoint
Microsoft Word

In this post, we demonstrate how to add other formats, including RTF and markdown, to your Amazon Kendra indexes. We also show how you can extend the solution to support additional file formats.
Solution overview
The following diagram illustrates our architecture.

Our solution has an event-driven serverless architecture with the following steps:

You put your RTF or markdown files in your Amazon Simple Storage Service (Amazon S3) bucket. This event, through AWS CloudTrail, invokes Amazon EventBridge.
EventBridge generates messages and puts them in an Amazon Simple Queue Service (Amazon SQS) queue. Using EventBridge together with Amazon SQS provides high availability and fault tolerance, making sure all the newly placed files in the S3 bucket are processed and added to Amazon Kendra.
EventBridge also invokes an AWS Lambda function, which in turn starts AWS Step Functions. Step Functions provides serverless orchestration for our solution, which further enhances its high availability and fault tolerance.
Step Functions makes sure that each newly placed file in Amazon S3 is processed. Step Functions calls Lambda functions to triage and process the files residing in Amazon S3. At this step, we first triage the files based on their extensions and then process each file in a dedicated Lambda function. This architecture lets you add support for additional file formats.
The processing Lambda functions (RTF Lambda and MD Lambda) extract the text from each file, store the extracted text files in Amazon S3, and update the Amazon Kendra index.
After the files are processed and the SQS queue is empty, all services, except Amazon S3 and Amazon Kendra, shut down.
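
The triage step described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the solution: the PROCESSORS mapping, the triage function, and the "unsupported" label are hypothetical names chosen for this example.

```python
import os

# Hypothetical mapping from file extension to the processing function that
# should handle it. The names mirror the RTF Lambda and MD Lambda in the text.
PROCESSORS = {
    ".rtf": "RTF Lambda",
    ".md": "MD Lambda",
}


def triage(s3_key: str) -> str:
    """Return the processor name for an S3 object key, or "unsupported"."""
    _, extension = os.path.splitext(s3_key)
    # Compare case-insensitively so NOTES.RTF and notes.rtf route the same way.
    return PROCESSORS.get(extension.lower(), "unsupported")


if __name__ == "__main__":
    for key in ("reports/summary.rtf", "docs/README.md", "images/logo.png"):
        print(key, "->", triage(key))
```

With this shape, supporting another format comes down to adding one entry to the mapping and writing the matching processing function.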

Customize and enhance the solution
You can easily process additional file types by creating new Lambda functions and adding them to the processing list. All you need to do is slightly change the code of the triage function to include your new file type and create corresponding Lambda functions to process those files.
The following is the code for the triage Lambda function:

    Parameters
    ----------
    event: dict, required
        Input event to the Lambda function

    Returns
    -------
    dict: Object containing details of the stock selling transaction
    """
    try:
        receipt_handle = event.get("receipt_handle", "Not Found")
        key = event.get("key", "Not Found")
        s3_key = event.get("s3_key", "Not Found")

        # Environment variables
        outputBucketName = os.environ.get("OUTPUT_BUCKET_NAME")
        rawBucketName = os.environ.get("RAW_BUCKET_NAME")
    except Exception as error:
        logger.error(f"Error getting the environ or event variables, error text follows:\n{error}")
        raise error

For Stack name, enter a unique name.
For LoggingLevel, enter your desired logging level (DEBUG, INFO, or WARNING).
For Prefix, enter your desired S3 bucket prefix.

We append the AWS account ID to prevent global S3 bucket name collisions.


Select the acknowledgement check boxes, and choose Create stack.

You should use Amazon Kendra Enterprise Edition for production workloads.

    return {
        "receipt_handle": receipt_handle,
        "bucketName": outputBucketName,
        "key": s3_key,
    }

def lambda_handler(event, context):
    """Sample Lambda function

    context: object, required
        Lambda Context runtime methods and attributes

Deploy the solution
To deploy the solution, we use an AWS CloudFormation template. Complete the following steps:

About the Authors
Gaurav Rele is a Data Scientist at the Amazon ML Solutions Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.
Sia Gholami is a Senior Data Scientist at the Amazon ML Solutions Lab, where he builds AI/ML solutions for customers across various industries. He is passionate about natural language processing (NLP) and deep learning. Outside of work, Sia enjoys spending time in nature and playing tennis.

For KendraIndex, enter the IndexId (not the index name) for an existing Amazon Kendra index in your account and Region.

Excluding the S3 buckets and the Amazon Kendra index, the AWS CloudFormation stack creates the rest of our resources and gets our solution up and running. You're now ready to add RTF and markdown files to your Amazon Kendra index.
Clean up
To avoid incurring unnecessary charges, you can use the AWS CloudFormation console to delete the stack that you deployed. This removes all the resources you created when deploying the solution. The data that resides in Amazon S3 and your Amazon Kendra index will not be deleted.
Conclusion
In this post, we presented a highly available, fault-tolerant serverless solution to add additional file formats to your Amazon Kendra index. We implemented this solution for RTF and markdown files and provided guidance on how to extend it to other similar file formats.
You can use this solution as a starting point for your own solution. For expert assistance, Amazon ML Solutions Lab, AWS Professional Services, and partners are ready to help you in your journey. To learn more about how Amazon Kendra can help your business, visit the website. Learn more about the Amazon ML Solutions Lab and how they can help your business. Contact us today!

Choose Launch Stack:

from datetime import datetime
from random import randint
from shared.s3_utils import get_s3_object, upload_to_s3
from shared.log import logger
from striprtf.striprtf import rtf_to_text
import os

    # Read the raw RTF object from the input bucket and decode it to a string
    s3_response = get_s3_object(rawBucketName, key)
    rtf_decoded = s3_response.decode("UTF-8")
    # Convert RTF to plain text and drop stray pipe characters
    text = rtf_to_text(rtf_decoded)
    text = text.replace("|", "")
    # Store the extracted text in the output bucket
    upload_to_s3(outputBucketName, f"{s3_key}", text)
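
The post does not show how the markdown counterpart (MD Lambda) extracts text. As a rough illustration only, the following sketch strips common markdown syntax using the Python standard library. The function name markdown_to_text and the set of patterns it handles are assumptions made for this example, not the implementation used in the solution, and it covers only a small part of markdown syntax.

```python
import re


def markdown_to_text(markdown: str) -> str:
    """Strip common markdown syntax and return approximate plain text."""
    text = markdown
    # Images: ![alt](url) -> alt
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Links: [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)
    # Leading heading, blockquote, and bullet markers at the start of a line
    text = re.sub(r"^[ \t]{0,3}(#{1,6}|>|[-*+])[ \t]+", "", text, flags=re.MULTILINE)
    # Bold (**text**) and then italic (*text*) emphasis
    text = re.sub(r"\*\*([^*\n]+)\*\*", r"\1", text)
    text = re.sub(r"\*([^*\n]+)\*", r"\1", text)
    # Inline code markers
    text = text.replace("`", "")
    return text.strip()


if __name__ == "__main__":
    sample = "# Title\n\nSome **bold** text with a [link](https://example.com).\n"
    print(markdown_to_text(sample))
```

In a processing function, the returned text would take the place of the output of rtf_to_text before the upload step. A dedicated markdown parser would be a more robust choice for real documents.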
