Perform audio redaction for personally identifiable information with Amazon Transcribe

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to your applications. Automatic content redaction is a feature of Amazon Transcribe that can automatically remove sensitive personally identifiable information (PII) from your transcription results.
A popular content redaction use case is the automatic transcription of customer calls (such as in call centers and telemarketing). To build datasets for downstream analytics and natural language processing (NLP) tasks, such as sentiment analysis, you may need to remove all PII to protect privacy and comply with local laws and regulations. This post builds on a previous post about redacting PII in Amazon Transcribe and shows a technique for redacting PII from both a text transcription and the source audio file.
Service introduction
The following figure shows an example architecture for performing PII audio redaction, using Amazon Simple Storage Service (Amazon S3) and AWS Lambda. In addition, we use the AWS SDK for Python (Boto3) for the Lambda functions.

For this post, we provide an AWS CloudFormation audio redaction template, which contains the full details of the implementation to enable repeatable deployments. If you use the template, you must specify the name of the input S3 bucket where the audio files are uploaded and the name of the output S3 bucket where the generated artifacts are saved.
Redact PII from a text transcription
When instructed, Amazon Transcribe can natively identify and redact sensitive PII from text transcription output in its supported languages. Supported PII entities include the following:

Bank account number
Bank routing number
Credit or debit card number
Credit or debit card CVV code
Credit or debit card expiration date
Credit or debit card PIN
Email address
United States mailing address
Name
United States phone number
Social Security number

Identified PII entities are replaced with a [PII] tag in the transcribed text. A redaction confidence score (instead of the usual ASR score) and the associated start and end timestamps are also provided for each entity. These timestamps let you easily locate the PII in the original audio source files for redaction.
Redact PII from an audio file
As depicted in our architecture, you can demonstrate the workflow for redacting PII from the transcribed audio by first uploading the sample MP3 file with simulated personal information to an S3 bucket. You can do this either directly through the AWS Management Console or with the AWS Command Line Interface (AWS CLI). Lossless audio formats such as FLAC or WAV can also be used for improved accuracy.
The next step is to transcribe the source audio file using the StartTranscriptionJob API. The following is a snippet of the Lambda function to create the Amazon Transcribe job:

The following code is the command used for the sample audio:

"items": [ …

response = transcribe.start_transcription_job(

Automatic content redaction is available for US English in the following AWS Regions:

US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon)
Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo)
Canada (Central)
Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris)
Middle East (Bahrain)
South America (São Paulo)
AWS GovCloud (US-West)

After all the relevant timestamps for each PII entity, such as start_time and end_time, have been extracted, we use FFmpeg to suppress the audio volume for the identified PII time segments. FFmpeg is a free and open-source software project consisting of a large suite of libraries and programs for handling video, audio, and other multimedia files and streams.
As depicted, the start_time and end_time for each PII entity from the JSON output file produced by Amazon Transcribe are passed as parameters to the FFmpeg executable. Note that you can pass all the PII time ranges at once; the sample audio has timestamps for seven PII entities.
We use the following command format:

/opt/bin/ffmpeg -i <input_audio> -af "volume=enable='between(t,<start_time>,<end_time>)':volume=0, …, volume=enable='between(t,<start_time>,<end_time>)':volume=0" <output_audio>
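As a rough illustration of the command format above, the following Python sketch builds the -af filter string from a list of PII time ranges. The function names are hypothetical and are not taken from the post's Lambda code. The /opt/bin/ffmpeg path and the sample file names come from the post. The single quotes around each between(...) expression keep FFmpeg from treating its commas as filter separators.

```python
from typing import List, Sequence, Tuple


def build_mute_filter(ranges: Sequence[Tuple[float, float]]) -> str:
    """Return an FFmpeg -af value that sets the volume to 0 inside each range.

    Hypothetical helper: one volume filter is emitted per (start, end) pair.
    """
    parts = []
    for start, end in ranges:
        if end <= start:
            raise ValueError(f"invalid time range: {start}-{end}")
        parts.append(f"volume=enable='between(t,{start},{end})':volume=0")
    return ",".join(parts)


def build_ffmpeg_command(
    src: str,
    dst: str,
    ranges: Sequence[Tuple[float, float]],
    ffmpeg: str = "/opt/bin/ffmpeg",  # path used in the post's commands
) -> List[str]:
    """Return the argument list for muting every range of src into dst."""
    if not ranges:
        raise ValueError("at least one time range is required")
    return [ffmpeg, "-i", src, "-af", build_mute_filter(ranges), dst]


command = build_ffmpeg_command(
    "original-audio.mp3",
    "redacted-original-audio.mp3",
    [(3.13, 4.18), (11.24, 16.25)],
)
print(" ".join(command))
```

The argument list can be handed to subprocess.run(command, check=True). Passing a list rather than a shell string means the filter needs no extra shell quoting.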

For each transcription job with automatic content redaction enabled, you can generate either the redacted transcript only or both the redacted and the unredacted transcript (see the ContentRedaction settings in the Lambda function code snippet). Both redacted and unredacted transcripts are stored in the same output S3 bucket you specify or in the default S3 bucket managed by the service. This feature of Amazon Transcribe gives you additional control over sensitive customer information, because you can restrict access to the redacted and unredacted data through user-defined permission groups.
Upon delivery of the Amazon Transcribe JSON file to an S3 bucket, a second Lambda function is triggered, which determines the timestamps of the PII entities.
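The timestamp lookup described above can be sketched in Python as follows. This is a minimal sketch, not the post's Lambda code. It assumes the redacted transcript JSON stores words under results.items, that each redacted item has "[PII]" as the content of its first alternative, and that start_time and end_time are strings. The function name pii_time_ranges is hypothetical.

```python
import json
from typing import List, Tuple


def pii_time_ranges(transcript_json: str, tag: str = "[PII]") -> List[Tuple[float, float]]:
    """Return (start, end) pairs in seconds for every item redacted as `tag`.

    Assumed layout: results.items[].alternatives[0].content, with
    start_time and end_time given as strings on each timed item.
    """
    document = json.loads(transcript_json)
    ranges = []
    for item in document.get("results", {}).get("items", []):
        alternatives = item.get("alternatives") or [{}]
        is_pii = alternatives[0].get("content") == tag
        # Punctuation items carry no timestamps, so they are skipped.
        if is_pii and "start_time" in item and "end_time" in item:
            ranges.append((float(item["start_time"]), float(item["end_time"])))
    return ranges


# Small hand-written document in the assumed layout, for illustration only.
sample = json.dumps({
    "results": {
        "items": [
            {"start_time": "2.10", "end_time": "3.05", "type": "pronunciation",
             "alternatives": [{"content": "is"}]},
            {"start_time": "3.13", "end_time": "4.18", "type": "pronunciation",
             "alternatives": [{"content": "[PII]"}]},
            {"type": "punctuation", "alternatives": [{"content": ","}]},
            {"start_time": "11.24", "end_time": "16.25", "type": "pronunciation",
             "alternatives": [{"content": "[PII]"}]},
        ]
    }
})
print(pii_time_ranges(sample))
```

The resulting list of ranges is the input that the FFmpeg volume filter needs.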

You can review the final redacted version of the sample MP3 file.
Conclusion
This post demonstrated how to redact PII from both text transcriptions and source audio files. The Amazon Transcribe content redaction feature is available for US English in the Regions listed earlier in this post.

Excerpt from the transcript portion of the JSON output for the sample audio:

… [PII] [PII] And I hope that Amazon transcribe is doing a great job at redacting that personal information away.

/opt/bin/ffmpeg -i original-audio.mp3 -af "volume=enable='between(t,3.13,4.18)':volume=0, volume=enable='between(t,11.24,16.25)':volume=0, volume=enable='between(t,19.06,22.99)':volume=0, volume=enable='between(t,24.85,25.96)':volume=0, volume=enable='between(t,28.83,33.06)':volume=0, volume=enable='between(t,35.71,38.46)':volume=0, volume=enable='between(t,40.66,44.75)':volume=0" redacted-original-audio.mp3

    TranscriptionJobName=jobName,
    LanguageCode='en-US',
    MediaFormat=media_format,
    Media={
        'MediaFileUri': media_uri  # media_uri is an assumed name for the S3 URI of the source audio
    },
    OutputBucketName=transcribe_output_bucket,
    ContentRedaction={
        'RedactionType': 'PII',
        'RedactionOutput': 'redacted'
    }
)
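For reference, the StartTranscriptionJob arguments above can be assembled as in the following sketch. It is an illustration under stated assumptions, not the Lambda function from the CloudFormation template. The function names, the keep_unredacted flag, and the example bucket and job names are hypothetical. The parameter names and the two RedactionOutput values (redacted and redacted_and_unredacted) are those of the Amazon Transcribe API.

```python
from typing import Any, Dict


def build_redaction_job_args(
    job_name: str,
    media_uri: str,
    media_format: str,
    output_bucket: str,
    keep_unredacted: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for start_transcription_job with PII redaction."""
    redaction_output = "redacted_and_unredacted" if keep_unredacted else "redacted"
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",
        "MediaFormat": media_format,
        "Media": {"MediaFileUri": media_uri},
        "OutputBucketName": output_bucket,
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": redaction_output,
        },
    }


def start_redaction_job(**kwargs: Any) -> Dict[str, Any]:
    """Start the job; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builder works without AWS access

    transcribe = boto3.client("transcribe")
    return transcribe.start_transcription_job(**build_redaction_job_args(**kwargs))


# Hypothetical names, for illustration only.
job_args = build_redaction_job_args(
    job_name="redaction-demo-job",
    media_uri="s3://example-input-bucket/original-audio.mp3",
    media_format="mp3",
    output_bucket="example-output-bucket",
)
print(job_args["ContentRedaction"])
```

Keeping the argument builder separate from the API call makes the redaction settings easy to inspect before any job is started.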

Take a look at the pricing page, give the feature a try, and send us feedback either in the AWS forum for Amazon Transcribe or through your usual AWS Support contacts.

Full details on the Lambda function can be found in the CloudFormation template.
The transcription is a JSON document containing detailed information about each word, along with a full transcript portion.

About the Authors
Erwin Gilmore is a Senior Specialist Technical Account Manager in Artificial Intelligence and Machine Learning at Amazon Web Services. He provides technical guidance and helps customers accelerate their ability to innovate by showing the art of the possible on AWS. In his spare time, he enjoys spending time, hiking, and traveling with his family.
Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of innovation and education. Outside of work, Esther enjoys long walks along the beach, dinners with friends, and friendly rounds of Mahjong.
Priyank Goyal is a Principal PM-Tech for Amazon Transcribe.
