Constraints

  • Keep costs as low as possible
  • Avoid distributed systems when reasonable
    • When distributed systems become necessary, favor minimal interdependence
  • Single engineer

Data pipeline

sequenceDiagram
    participant O as Orchestrator
    participant SI as Spot Instance
    participant S3 as S3 Storage
    participant PyProc as Python Process
    participant NVMe as Local NVMe
    participant MStore as Mapping Store
    participant KVS as Key-Value Store (KVS) Data
    participant ES as Elasticsearch Data
    participant CR as Container Registry
    participant User

    O->>O: Monthly check for new IRS data
    O->>SI: Launch spot instance if new data is available
    Note right of SI: Instance starts up <br/>with NVMe storage
    SI->>S3: Download new IRS XML data
    S3-->>SI: Raw XML files

    SI->>NVMe: Store raw IRS data locally
    SI->>MStore: Load existing mappings & historical data
    MStore-->>SI: Previous mappings and enriched sets

    Note over PyProc: Python process begins transformations
    SI->>PyProc: Invoke Python script with raw + historical data
    PyProc->>PyProc: Parse and merge XML files
    PyProc->>PyProc: Create time series per nonprofit
    PyProc->>PyProc: Compute segment statistics across nonprofits
    PyProc->>NVMe: Write intermediate and final transformed data

    Note over PyProc: Final data preparation for services
    PyProc->>KVS: Write sections and skeletons in KVS format
    PyProc->>ES: Build Elasticsearch index files (search indices)

    Note over PyProc: Package the data into Docker images
    PyProc->>CR: Build and push Docker image containing KVS data
    PyProc->>CR: Build and push Docker image containing ES indices

    PyProc->>SI: Notify completion of data processing & image creation
    SI->>O: Signal job completion

    Note over O: Future runs of the web app
    O->>SI: Launch new spot instance for serving requests
    SI->>CR: Pull pre-built Docker images for KVS & ES
    SI->>KVS: Run KVS container with pre-populated data
    SI->>ES: Run ES container with pre-built indices
    SI->>User: Serve web requests via FastAPI + React frontend
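
Transformation sketch

The three Python-process steps in the diagram (parse and merge XML, create a time series per nonprofit, compute segment statistics) could look roughly like the sketch below. It is illustrative only: the flat `<Return>` layout, the `Segment` field, the sample values, and the function names `parse_filing`, `build_time_series`, and `segment_medians` are all assumptions made for this example, and real IRS e-file XML is namespaced and far more detailed. The median is used as a stand-in for whatever statistics the pipeline actually computes.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified filings. Element names are assumptions,
# not the real IRS schema.
SAMPLE_FILINGS = [
    "<Return><EIN>111</EIN><TaxYear>2021</TaxYear>"
    "<Segment>arts</Segment><TotalRevenue>100</TotalRevenue></Return>",
    "<Return><EIN>111</EIN><TaxYear>2022</TaxYear>"
    "<Segment>arts</Segment><TotalRevenue>150</TotalRevenue></Return>",
    "<Return><EIN>222</EIN><TaxYear>2022</TaxYear>"
    "<Segment>arts</Segment><TotalRevenue>250</TotalRevenue></Return>",
    "<Return><EIN>333</EIN><TaxYear>2022</TaxYear>"
    "<Segment>health</Segment><TotalRevenue>900</TotalRevenue></Return>",
]


def parse_filing(xml_text):
    """Parse one filing into a flat record (assumed layout)."""
    root = ET.fromstring(xml_text)
    return {
        "ein": root.findtext("EIN"),
        "year": int(root.findtext("TaxYear")),
        "segment": root.findtext("Segment"),
        "revenue": float(root.findtext("TotalRevenue")),
    }


def build_time_series(filings):
    """Group revenue by nonprofit (EIN), ordered by tax year."""
    series = defaultdict(dict)
    for filing in filings:
        series[filing["ein"]][filing["year"]] = filing["revenue"]
    return {ein: dict(sorted(years.items())) for ein, years in series.items()}


def segment_medians(filings):
    """Median revenue per (segment, year) across all nonprofits."""
    grouped = defaultdict(list)
    for filing in filings:
        grouped[(filing["segment"], filing["year"])].append(filing["revenue"])
    return {key: statistics.median(values) for key, values in grouped.items()}


# Merge step: parse every file into one list of records.
filings = [parse_filing(text) for text in SAMPLE_FILINGS]
series = build_time_series(filings)
medians = segment_medians(filings)
```

Keeping every step as a plain function over in-memory records fits the single-machine constraint: the whole transform runs in one process on the spot instance, with no coordination between workers.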