Serverless Headless Document Retrieval

Designed and built a containerized AWS Lambda function that automates document retrieval from construction industry portals using headless Chrome. The function authenticates with multiple provider systems, navigates complex web interfaces, applies configurable download rules from DynamoDB, splits multi-page PDFs, and archives results in S3 before forwarding work to downstream AI enrichment stages via SQS FIFO queues.
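The hand-off to the enrichment stage can be illustrated with a minimal sketch of building an SQS FIFO message. The function name, the payload fields (`project_id`, `archive_key`), and the choice of grouping by project are assumptions made for illustration, not details taken from the project. Only the keyword names match boto3's `send_message` parameters.

```python
import hashlib
import json


def build_fifo_message(queue_url: str, project_id: str, archive_key: str) -> dict:
    """Build keyword arguments for boto3's SQS send_message call.

    Hypothetical payload shape: the real message schema is not described
    in the summary above.
    """
    # sort_keys keeps the body byte-identical for identical inputs,
    # so the deduplication hash below is stable.
    body = json.dumps(
        {"project_id": project_id, "archive_key": archive_key},
        sort_keys=True,
    )
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # FIFO queues preserve ordering within a message group;
        # grouping by project is an assumption.
        "MessageGroupId": project_id,
        # A content hash lets SQS drop accidental duplicates of the same
        # message within its deduplication interval.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

With boto3 available, the result could be sent as `boto3.client("sqs").send_message(**build_fifo_message(url, project_id, key))`.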

Key Achievements

  • Headless Chrome in Lambda: Packaged Chrome and ChromeDriver inside a Lambda container image using a dynamic installer script, with single-process mode and isolated temp directories optimized for the Lambda execution environment.
  • Dual-source workflow architecture: Implemented a strategy pattern supporting two document providers (ConstructConnect and Dodge), each with distinct authentication flows, page navigation, and download mechanisms. New sources can be added without modifying core orchestration.
  • Rule-based document filtering: Built a rule engine that reads configurable download rules from DynamoDB, matches documents by type (division, plan, matched), and generates metadata tracking which rules matched which files for downstream analytics.
  • PDF splitting and artifact management: Automated splitting of multi-page PDFs into single-page files using pypdf, enabling parallel AI enrichment of individual pages. The function archives results as timestamped ZIP files in S3 alongside structured metadata JSON.
  • Event-driven observability: Published project lifecycle events (processing, retrying, completed, error) to Amazon EventBridge, enabling cross-system visibility into pipeline health without coupling to the orchestrator.
  • Comprehensive debug capture: On failure, captures Chrome browser logs, CDP performance logs, page source HTML, and download directory contents to an S3 debug bucket for rapid post-mortem analysis.
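The rule-based filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `DownloadRule` fields, the glob-style `pattern`, and the document dictionaries are hypothetical, and the rules are passed in as plain objects rather than read from DynamoDB.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class DownloadRule:
    """Hypothetical rule shape; the real DynamoDB schema is not shown here."""
    rule_id: str
    doc_type: str  # e.g. "division", "plan", or "matched"
    pattern: str   # glob applied to the document name (an assumption)


def match_documents(documents, rules):
    """Return {document name: [matching rule ids]} for matched documents.

    Documents that match no rule are omitted, so the result doubles as
    the download list and as metadata for downstream analytics.
    """
    matches = {}
    for doc in documents:
        name = doc["name"]
        hits = [
            rule.rule_id
            for rule in rules
            if rule.doc_type == doc["type"]
            # Lowercase both sides so matching is case-insensitive
            # regardless of platform.
            and fnmatchcase(name.lower(), rule.pattern.lower())
        ]
        if hits:
            matches[name] = hits
    return matches
```

Keeping the matcher a pure function of documents and rules makes it testable without a browser session or AWS access.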

Technologies

  • AWS Lambda (Container Image)
  • Python 3.12
  • Selenium (Headless Chrome)
  • Docker
  • Amazon S3
  • Amazon DynamoDB
  • Amazon SQS (FIFO)
  • Amazon EventBridge
  • AWS SSM Parameter Store
  • AWS SAM
  • pypdf

Year

2025