This repository showcases a serverless solution that processes PDF forms using AWS Step Functions Distributed Map, extracts data with Amazon Textract, and stores results in Amazon S3 Tables in Iceberg format.
The repository demonstrates how to:
- Process PDF forms at scale using Step Functions Distributed Map
- Extract structured data from PDFs using Amazon Textract
- Store data in Amazon S3 Tables (Iceberg format) via Amazon Data Firehose
- Schedule automated processing with Amazon EventBridge Scheduler
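As a rough illustration of the Distributed Map pattern listed above, the sketch below builds a state definition as a Python dictionary and prints it as JSON. It is not the definition shipped in this repository: the state name `ExtractForm`, the bucket and prefix values, the concurrency limit, and the choice of the Textract `FORMS` feature are all assumptions made for the example.

```python
import json


def distributed_map_state(bucket: str, prefix: str, max_concurrency: int = 10) -> dict:
    """Return a hypothetical Distributed Map state that fans out over S3 objects."""
    return {
        "Type": "Map",
        # List the objects under the prefix; each listed object becomes one item.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": prefix},
        },
        # Pass the bucket name along with the key of each listed object.
        "ItemSelector": {
            "Bucket": bucket,
            "Key.$": "$$.Map.Item.Value.Key",
        },
        "MaxConcurrency": max_concurrency,  # assumed limit, tune for your account
        "ItemProcessor": {
            # DISTRIBUTED mode runs each item in its own child workflow execution.
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ExtractForm",  # hypothetical state name
            "States": {
                "ExtractForm": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::aws-sdk:textract:analyzeDocument",
                    "Parameters": {
                        "Document": {
                            "S3Object": {"Bucket.$": "$.Bucket", "Name.$": "$.Key"}
                        },
                        "FeatureTypes": ["FORMS"],  # assumption: form key-value extraction
                    },
                    "End": True,
                }
            },
        },
        "End": True,
    }


# Placeholder bucket and prefix, for illustration only.
print(json.dumps(distributed_map_state("my-forms-bucket", "RawInterestForms/2025/07/"), indent=2))
```

Building the definition in code keeps the bucket and prefix as parameters; the deployed stack may define the workflow differently.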
Warning: This application is not ready for production use. It was written for demonstration and educational purposes. Review the Security section of this README and consult with your security team before deploying this stack. No warranty is implied in this example.
Note: This architecture creates resources that have costs associated with them. Please see the AWS Pricing page for details and make sure to understand the costs before deploying this stack.
The solution comprises the following steps:
- A user uploads customer interest forms as scanned PDFs to an Amazon S3 bucket.
- An Amazon EventBridge Scheduler schedule triggers at a regular interval, initiating a Step Functions workflow execution.
- The workflow execution activates a Distributed Map state, which lists all PDF files uploaded to Amazon S3 since the previous run.
- The Distributed Map iterates over the list of objects and passes each object's metadata (Bucket, Key, Size, ETag) to a child workflow execution.
- For each object, the child workflow calls Amazon Textract with the provided Bucket and Key to extract raw text and relevant fields (name, email address, mailing address, interest area) from the PDF.
- The child workflow writes the extracted data to an Amazon Data Firehose stream, which is configured to deliver data to Amazon S3 Tables.
- Firehose batches the incoming records from the child workflows and writes them to the S3 table at a preconfigured interval.
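The extraction and delivery steps above can be sketched as two pure functions: one that flattens a Textract response into field names and values, and one that wraps the result as a Firehose record. This is a minimal sketch, assuming the child workflow uses Textract form (key-value) analysis; the actual workflow may use a different Textract feature or do this shaping inside the state machine. The `source_key` column and the function names are hypothetical.

```python
import json


def _text(block: dict, by_id: dict) -> str:
    """Join the WORD children of a Textract block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(response: dict) -> dict:
    """Map each detected form key to its value text (assumes FORMS analysis)."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        # Strip the trailing colon that printed form labels often carry.
        key = _text(block, by_id).rstrip(":").strip()
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                values.extend(_text(by_id[value_id], by_id) for value_id in rel["Ids"])
        fields[key] = " ".join(v for v in values if v)
    return fields


def firehose_record(item: dict, fields: dict) -> dict:
    """Build a newline-delimited JSON record in the shape Firehose expects."""
    row = {"source_key": item["Key"], **fields}  # "source_key" is a hypothetical column
    return {"Data": (json.dumps(row) + "\n").encode("utf-8")}
```

In a real table the JSON keys would need to line up with the Iceberg table's column names, so a mapping from form labels to columns would likely sit between these two functions.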
- AWS CLI configured with appropriate permissions
- AWS SAM CLI installed
- Clone the repository:
  git clone https://github.com/aws-samples/sample-exporting-to-amazon-s3-tables-with-aws-step-functions-distributed-map.git
  cd sample-exporting-to-amazon-s3-tables-with-aws-step-functions-distributed-map
- Deploy the stack:
  sam build
  sam deploy --guided
- Upload test PDFs. Upload PDF forms to your S3 bucket under the path:
  RawInterestForms/YYYY/WW/
- Trigger processing. Run the Step Functions state machine manually, or wait for the next scheduled run.
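To build the `RawInterestForms/YYYY/WW/` upload prefix for a given date, a small helper like the one below may be handy. It assumes that `WW` is the ISO 8601 week number and `YYYY` the matching ISO year; the README does not say which week convention the workflow expects, so check the state machine definition before relying on it. The function name is hypothetical.

```python
from datetime import date


def upload_prefix(day: date) -> str:
    """Return the S3 prefix for a date, assuming ISO year and ISO week numbering."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"RawInterestForms/{iso_year:04d}/{iso_week:02d}/"


# Prefix for today's uploads under the ISO-week assumption.
print(upload_prefix(date.today()))
```

Under this convention, dates in late December or early January can fall in a week that belongs to the neighbouring ISO year, so the year in the prefix may differ from the calendar year.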
To avoid ongoing charges, delete the stack and associated resources:
sam delete
Manual cleanup required:
- S3 Tables bucket and data (if not empty)
- CloudWatch log groups (if retention is set)
- Any uploaded PDF files in the source bucket
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
The sample code for this solution architecture is provided without any guarantees and is not recommended for production-grade workloads. It is intended as material to build with and learn from. Be sure to read the licensing terms.