David's raw ML reference notes

    API gateway

    Feb 14, 2025, 1 min read

    An API gateway is ingress middleware that sits between clients and backend services and handles cross-cutting tasks such as service discovery, load balancing, rate limiting, and TLS termination.
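To make those responsibilities concrete, here is a minimal in-memory sketch of three of them: prefix-based routing as a stand-in for service discovery, round-robin load balancing, and per-client token-bucket rate limiting. All names here (`TokenBucket`, `ApiGateway`, the backend addresses) are hypothetical and chosen for illustration. TLS termination and real network I/O are left out, so this is a toy model and not how any particular gateway product works.

```python
import itertools
import time


class TokenBucket:
    """Toy rate limiter: holds up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity, rate, now):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now):
        # Refill for the time elapsed since the last request, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ApiGateway:
    """Hypothetical gateway: rate-limit per client, then route to a backend."""

    def __init__(self, registry, capacity=2, rate=1.0):
        # registry maps a path prefix to backend addresses (a static stand-in
        # for service discovery); itertools.cycle gives round-robin selection.
        self._backends = {
            prefix: itertools.cycle(addrs) for prefix, addrs in registry.items()
        }
        self._buckets = {}
        self._capacity = capacity
        self._rate = rate

    def handle(self, client_id, path, now=None):
        """Return (status_code, backend_address_or_None) for a request."""
        now = time.monotonic() if now is None else now
        if client_id not in self._buckets:
            self._buckets[client_id] = TokenBucket(self._capacity, self._rate, now)
        if not self._buckets[client_id].allow(now):
            return 429, None  # Too Many Requests
        for prefix, backends in self._backends.items():
            if path.startswith(prefix):
                return 200, next(backends)
        return 404, None  # no service registered for this path


gateway = ApiGateway({"/users": ["users-1:8080", "users-2:8080"]})
first = gateway.handle("alice", "/users/42", now=0.0)
second = gateway.handle("alice", "/users/42", now=0.0)
third = gateway.handle("alice", "/users/42", now=0.0)   # bucket is empty
fourth = gateway.handle("alice", "/users/42", now=1.0)  # one token refilled
```

Passing `now` explicitly keeps the example deterministic. A real gateway would read the clock itself and would usually keep rate-limit counters in a shared store so that multiple gateway instances agree.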

    AWS named one of its products “API Gateway,” which led me to believe that the term was AWS-specific. It is a general category, though: there are open-source API gateways such as Kong, and Google Cloud offers Apigee (an API management platform) alongside a product that is itself named API Gateway.

    Briefly discussed in (Xu) System Design Interview, vol. 1, ch. 4 (“design a rate limiter”).

