When distributed systems become necessary, favor minimal interdependence
Single engineer
Data pipeline
sequenceDiagram
participant O as Orchestrator
participant SI as Spot Instance
participant S3 as S3 Storage
participant PyProc as Python Process
participant NVMe as Local NVMe
participant MStore as Mapping Store
participant KVS as Key-Value Store (KVS) Data
participant ES as Elasticsearch Data
participant CR as Container Registry
participant User as User
O->>O: Monthly check for new IRS data
O->>SI: Launch spot instance if new data
Note right of SI: Instance starts up <br/>with NVMe storage
SI->>S3: Download new IRS XML data
S3-->>SI: Raw XML files
SI->>NVMe: Store raw IRS data locally
SI->>MStore: Load existing mappings & historical data
MStore-->>SI: Previous mappings and enriched sets
Note over PyProc: Python process begins transformations
SI->>PyProc: Invoke Python script with raw + historical data
PyProc->>PyProc: Parse and merge XML files
PyProc->>PyProc: Create time series per nonprofit
PyProc->>PyProc: Compute segment statistics across nonprofits
PyProc->>NVMe: Write intermediate and final transformed data
Note over PyProc: Final data preparation for services
PyProc->>KVS: Populate sections and skeletons into KVS format
PyProc->>ES: Build Elasticsearch index files (search indices)
Note over PyProc: Package the data into Docker images
PyProc->>CR: Build and push Docker image containing KVS data
PyProc->>CR: Build and push Docker image containing ES indices
PyProc->>SI: Notify completion of data processing & image creation
SI->>O: Signal job completion
Note over O: Future runs of the web app
O->>SI: Launch new spot instance for serving requests
SI->>CR: Pull pre-built Docker images for KVS & ES
SI->>KVS: Run KVS container with pre-populated data
SI->>ES: Run ES container with pre-built indices
SI->>User: Serve web requests via FastAPI + React frontend
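
The transformation steps in the diagram (parse and merge XML files, create a time series per nonprofit, compute segment statistics across nonprofits) could be sketched in a single Python process roughly as follows. This is a minimal illustration, not the actual pipeline: the tag names `Filing`, `EIN`, `Year`, `Segment`, and `Revenue`, the sample data, and the function names are all assumptions made for the example. Real IRS XML uses namespaces and a much larger schema.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily simplified filings (assumed layout, not the IRS schema).
SAMPLE_FILINGS = [
    "<Filing><EIN>111</EIN><Year>2022</Year><Segment>arts</Segment><Revenue>120</Revenue></Filing>",
    "<Filing><EIN>111</EIN><Year>2021</Year><Segment>arts</Segment><Revenue>100</Revenue></Filing>",
    "<Filing><EIN>222</EIN><Year>2022</Year><Segment>arts</Segment><Revenue>80</Revenue></Filing>",
    "<Filing><EIN>222</EIN><Year>2021</Year><Segment>arts</Segment><Revenue>60</Revenue></Filing>",
    "<Filing><EIN>333</EIN><Year>2022</Year><Segment>health</Segment><Revenue>500</Revenue></Filing>",
]


def parse_filing(xml_text):
    """Parse one filing document into a flat record."""
    root = ET.fromstring(xml_text)
    return {
        "ein": root.findtext("EIN"),
        "year": int(root.findtext("Year")),
        "segment": root.findtext("Segment"),
        "revenue": float(root.findtext("Revenue")),
    }


def build_time_series(filings):
    """Group records by nonprofit (EIN) and sort each group by year."""
    series = defaultdict(list)
    for f in filings:
        series[f["ein"]].append((f["year"], f["revenue"]))
    return {ein: sorted(points) for ein, points in series.items()}


def segment_statistics(filings):
    """Median revenue per (segment, year) across all nonprofits."""
    buckets = defaultdict(list)
    for f in filings:
        buckets[(f["segment"], f["year"])].append(f["revenue"])
    return {key: statistics.median(vals) for key, vals in buckets.items()}


filings = [parse_filing(x) for x in SAMPLE_FILINGS]
series = build_time_series(filings)
stats = segment_statistics(filings)

print(series["111"])
print(stats[("arts", 2022)])
```

Everything here runs in memory inside one process, which matches the diagram's shape: one instance reads raw files, transforms them locally, and writes finished artifacts, with no coordination between machines during the transformation itself.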