David's raw ML reference notes

        • Approximate nearest neighbor search
        • Bedrock
        • Concept drift
        • Datadog
        • Elastic Beats
        • Feature store
        • Fluentd
        • Google Bigtable
        • Hybrid fanout
        • Kibana
        • Logstash
        • Loss functions
        • Mean functions
        • Monitoring across the ML stack
        • Open990 System design diagrams
        • Pointwise, pairwise, and listwise ranking
        • Prometheus
        • Python parallelism options for APIs
        • Redis
        • Retrieval systems playbook
        • SageMaker Feature Store
        • Splunk
        • SurveyMonkey panel
        • Untitled
      • Latex Suite configuration
          • 2024-05-30 First attempt
        • (Kopec) Classic Computer Science Problems in Python
        • (La Rocca) Advanced Algorithms and Data Structures
        • (Raff) Inside Deep Learning - Math, Algorithms, Models
        • Alammar 2018
        • Kingma and Ba (2014)
        • LeCun, Bengio, and Hinton (2015)
        • Notes from "PyTorch autograd mechanics" (documentation)
        • Notes from "PyTorch modules" (documentation)
        • Nwankpa, et al. (2018)
        • Ruder 2017
        • Rumelhart, Hinton, and Williams (1986)
      • 01 Statistical (machine) learning (data science)
            • 00 Mean squared error (MSE) loss ("regression loss")
          • Nadaraya-Watson regression
          • 01 Classification
            • Accuracy and precision
            • Accuracy vs F-scores
            • Area under the ROC curve (AUROC, AUC)
            • Binary classification metrics
            • Binary classification
            • F-beta score ("f-score")
            • F1 score
            • Fall-out (False-positive rate, FPR)
            • False discovery rate (FDR, precision error rate)
            • False positive rate (FPR, false alarm ratio, fall-out rate)
            • Positive and negative predictive value
            • Precision and recall
            • Receiver operating characteristic (ROC) curve
            • Sensitivity and specificity
            • True negative rate (TNR)
            • True positive rate (TPR)
            • Type I vs Type II error
            • 01 Cross-entropy loss
            • 02 Binary cross-entropy loss
            • 03 Softmax cross-entropy loss
            • 04 Focal loss
            • 05 Hinge loss, aka support vector machine (SVM) loss
            • Loss functions for classification
            • Confusion (error) matrix
            • Micro, macro, and weighted averaging
            • Multi-class classification metrics
            • Multi-class classification
            • Precision and recall in multi-class classification
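
The classification-metric entries above all reduce to counts from the confusion matrix. A minimal NumPy sketch tying a few of them together (the function name and test values are my own, for illustration only):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 from 0/1 labels via confusion-matrix counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # aka sensitivity, TPR
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (0.667, 0.667, 0.667)
```
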
          • 02 Ranking
            • Comparing Siamese, triplet, and two-tower networks
            • Contrastive learning
            • Contrastive vs triplet loss
            • Siamese neural network
            • The original "contrastive loss"
            • Triplet loss
            • Triplet neural network
          • Multi-stage ranking systems
            • 01 Precision and recall at K
            • 02 Average precision
            • 03 Mean average precision (MAP)
            • 04 R-precision
            • 05 Cumulative gain (normalized, discounted) (NDCG)
            • 06 Graded precision, see cumulative gain
            • 07 Mean reciprocal rank (MRR)
            • 08 Ranking metrics in practice
            • Ranking metrics
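
A minimal sketch of two of the ranking metrics listed above, precision at K and average precision; mean average precision (MAP) is just the latter averaged over queries. Function names and the toy ranking are my own:

```python
import numpy as np

def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k ranked items that are relevant."""
    return len(set(ranked[:k]) & set(relevant)) / k

def average_precision(relevant, ranked):
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

ranked, relevant = ["a", "b", "c", "d"], {"a", "c"}
print(precision_at_k(relevant, ranked, 2))  # 0.5
print(average_precision(relevant, ranked))  # (1/1 + 2/3) / 2 = 0.833...
```
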
          • 03 Time series
            • Autoregression ("auto-regression")
          • 7 Regularization
            • 00 Learning rate and regularization
            • 01 L1 regularization (and LASSO)
            • 02 L2 regularization
            • 04 Data augmentation
          • Structural prior
              • Backpropagation of errors
              • Backpropagation through time (for RNNs)
              • Implementation of Rumelhart 1986 network
            • Cybenko's universal function approximation theorem
            • Deep learning
            • Feedforward neural network
            • Hidden (latent) state
            • Neural network
            • Neuron (neural network)
              • Multi-layer perceptrons
              • Perceptron (disambiguation)
              • Perceptron (neuron)
              • Single-layer perceptron
            • Stochastic gradient descent justifies everything
              • Pretrained neural network playbook
            • Fully connected (aka linear) layer
            • Max pooling
            • Residual connection (ResNet), aka skip connection
            • Softmax
            • 02 Activation functions
            • 03 Architectures
              • Attention (neural networks)
              • Bahdanau, Cho, and Bengio (2014)
              • Chaudhari, et al. (2021)
              • Context vector (aka attention vector)
              • Cross-attention (aka Encoder-Decoder attention)
                • Additive (Bahdanau) attention
                • Dense attention
                • Dot-product (multiplicative) attention
                • Efficient attention
                • Scaled dot-product attention
              • Entropy of self-attention as a function of sequence length
              • Even practitioners struggle to understand attention
              • Masked self-attention
              • Multi-head attention
                • LSH-based attention (Reformer attention)
                • Sparse attention
            • Deep mixture-of-experts
              • Cho et al. (2014) encoder-decoder model
              • Encoder-decoder architecture
              • Recurrent neural network (RNN)
                • 00 Constants used in analysis of The Annotated Transformer
                • 01 Implementation of scaled dot-product attention
                • 02 Use of multi-head attention throughout the codebase
                • 03 Implementation of multi-head attention
                • 04 Why the (B x 1 x L) mask must be unsqueezed to (B x 1 x 1 x L)
                • 05 Implementation of the transformer encoder
                • 06 Implementation of the transformer decoder
                • 07 Implementation of sublayer connection
                • 08 Uses of masking in the encoder and the decoder
                • 09 Implementation of the position-wise feedforward network
                • 10 Implementation of positional encoding
                • 11 Implementation of the embedding model
                • 12 Scaffolding of the transformer model (encoder-decoder, generator, decoder)
                • 13 Implementation of the transformer model factory (make_model)
                • Annotated Transformer, The
                • Self-attention
                • Transformer block
                • Vaswani transformer model
                • Vaswani, et al. (2017)
                • What limits transformer sequence length?
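
A minimal PyTorch sketch of scaled dot-product attention with an optional mask, in the spirit of the Annotated Transformer entries above. This is my own condensed version of the standard formulation, not the Annotated Transformer's code; shapes and names are my own:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (B, heads, L, d_k); mask broadcastable to the (B, heads, L, L) scores
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    return weights @ v, weights

B, H, L, d_k = 2, 4, 10, 16
q = k = v = torch.randn(B, H, L, d_k)
causal = torch.tril(torch.ones(L, L)).view(1, 1, L, L)  # masked self-attention
out, attn = scaled_dot_product_attention(q, k, v, causal)
print(out.shape, attn.shape)  # (2, 4, 10, 16) and (2, 4, 10, 10)
```
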
            • Two-tower neural network
            • Dropout
            • Gradient clipping
            • Layer normalization
            • Layer vs batch normalization
            • Naming of layer and batch normalizations
            • Stochastic gradient descent as a regularizer
            • Weight decay is equivalent to L2 regularization
          • Linear regression
          • Logistic regression
          • Mixture-of-Experts (MoE)
            • 00 Most important forms of regularization for gradient boosting
            • 01 Limiting number of trees
            • 02 Limiting tree depth
            • 03 Bootstrap sampling (parallel ensembles)
            • 04 Minimum samples per leaf
            • 05 Feature subsampling
            • 06 Limiting number of leaf nodes
            • Example subsampling (loosely called "bagging")
          • Computer vision
          • ResNet (pretrained models)
          • 0 Text features, see Feature Engineering - Text features
            • 0 Text embedding, see Feature engineering - Text features - Text embeddings
            • Constituency vs dependency parsing
            • Named entity recognition (NER)
            • Part-of-speech (POS) tagging
            • Sentence segmentation finds sentence boundaries
            • 0 Specific models, see "NLP - NLP-specific models"
            • Beam search in autoregressive language models
            • Causal language modeling
            • Cross-entropy loss in language models
              • Beam search (token decoding)
              • Decoding (token selection) strategies
              • Greedy decoding
              • Temperature sampling
              • Top-k sampling
              • Top-p (nucleus) sampling
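
The decoding strategies above compose naturally: scale the logits by temperature, optionally truncate to the top-k tokens, then sample (greedy decoding would take the argmax instead). A minimal PyTorch sketch; the function name and vocabulary size are my own assumptions:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Temperature scaling followed by optional top-k truncation, then sampling."""
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]  # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

vocab_logits = torch.randn(32000)  # one decoding step's logits over the vocabulary
token = sample_next_token(vocab_logits, temperature=0.8, top_k=50)
print(token.item())
```
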
            • Distributional hypothesis
            • Generative pre-training
            • How Transformers handle novel (unknown) tokens
            • Next-word prediction
              • Chain-of-thought (CoT) prompting
              • Prompting strategies
            • Special tokens in language models
            • Transformers, see "Transformer architectures"
            • Why do we need a start-of-sequence token?
              • Augmented transformers
                • Contextualized late interaction over BERT (ColBERT)
              • Relating REALM, DPR, and RAG
                • Retrieval-augmented generation (RAG)
              • Decoder-only transformers
                • Generative Pre-trained Transformer (GPT-1)
                • Radford, et al. (2018)
            • ELMo vs BERT vs GPT
                • Kitaev, Kaiser, and Levskaya (2020)
                • Reformer model
                • BERT variants
                • Sentence BERT (SBERT, S-BERT)
                • BERT embedding sequence structure
                • BERT pre-training and fine-tuning
                • Bidirectional encoder representations from transformers (BERT)
                • Classification token (CLS)
                • Masked language modeling (MLM)
                • Next-sentence prediction (NSP)
              • Encoder-only transformers
              • BLIP-2
              • Bootstrapping Language-Image Pre-training (BLIP)
              • CLIP, BLIP, and BLIP-2
              • Contrastive language-image pre-training (CLIP)
            • Vaswani transformer, see Architectures
          • Natural language processing (NLP)
            • Collaborative filtering through matrix factorization
            • Collaborative filtering using deep learning
            • Collaborative filtering
            • Neural Collaborative Filtering (NCF)
          • Recommendation systems
          • Design matrix and target matrix
          • Features and feature space
          • Sampling has replacement, subsampling does not
          • Dimensionality reduction
          • Embeddings
          • Johnson-Lindenstrauss lemma
          • Principal component analysis (PCA)
          • Random projection
          • Representation learning (learned embeddings)
          • Binning, bucketing, cutting
          • K-means for quantization
          • Levels and codebook
          • Product quantization (PQ)
          • Quantile discretization
          • Quantization
          • Scalar quantization (SQ)
          • Truncation
          • Uniform quantization
          • Vector quantization (VQ)
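
A minimal NumPy sketch of uniform scalar quantization, illustrating the levels-and-codebook idea above. The function name is my own; real product quantization splits each vector into sub-vectors and quantizes each separately:

```python
import numpy as np

def uniform_quantize(x, n_levels=256):
    """Uniform scalar quantization: map floats to integer codes and back."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (n_levels - 1)             # spacing between codebook levels
    codes = np.round((x - lo) / step).astype(np.uint8)
    x_hat = codes * step + lo                     # dequantize via the codebook
    return codes, x_hat

x = np.random.randn(1000)
codes, x_hat = uniform_quantize(x)
print(codes.dtype, np.abs(x - x_hat).max())  # uint8; error is at most half a step
```
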
          • Feature binarization coerces any data to boolean
          • Stemmers and lemmatizers
            • Sentence and document embeddings
            • word2vec vs GloVe
            • word2vec
          • Token representations
          • Tokenization (tokenizer)
          • Cross-validation for time series
          • Cross-validation
          • Grouping in cross-validation
          • Leave-one-out (LOO) and leave-p-out (LPO) cross-validation
          • Shuffle-and-split cross-validation
          • Stratification in cross-validation
          • k-fold cross-validation
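
A minimal scikit-learn sketch contrasting plain and stratified k-fold splits from the entries above; the toy data is my own:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Plain k-fold: each example lands in exactly one validation fold.
for fold, (train_idx, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    print("kfold", fold, val_idx)

# Stratified k-fold additionally preserves the class ratio of y in every fold.
for fold, (train_idx, val_idx) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    print("stratified", fold, val_idx)
```
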
          • Gradient descent
            • Batch gradient descent
            • Mini-batch gradient estimation
            • Stochastic gradient descent
            • 00 See also "mathematics - optimization"
            • 01 Vanilla gradient descent optimizer
            • 02 Gradient descent with momentum
            • 03 Nesterov accelerated gradients (NAG)
            • 04 AdaGrad
            • 05 Root-mean-squared propagation (RMSProp)
            • 06 Adam optimizer
            • Optimization ("optimizers")
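
The optimizer entries above differ only in how the raw gradient is turned into an update. A minimal sketch of vanilla gradient descent versus gradient descent with momentum on f(w) = w^2; hyperparameters are my own:

```python
def gd_step(w, grad, lr=0.1):
    """Vanilla gradient descent: step against the gradient."""
    return w - lr * grad

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Momentum: step against a decaying running sum of past gradients."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
print(w)  # near 0, the minimizer
```
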
          • Grid search
          • Hyperparameter tuning
          • Optuna, see Python libraries - Optuna
          • Randomized hyperparameter search
          • Successive halving
          • Cyclical learning rate
          • Early stopping
          • Learning rate scheduling
          • Learning rate
        • Loss landscape (manifold, surface)
        • Offline vs online metrics
        • Variance=overfitting, bias=underfitting
          • The woman worked as a babysitter (Sheng et al., 2019)
          • Chaudhury 2024, ch. 2 (linear algebra)
          • Chaudhury 2024, ch. 3 (classifiers and vector calculus)
          • Chaudhury 2024, ch. 4 (linear algebraic tools for ML)
          • Chaudhury 2024, ch. 6 (Bayes, information theory)
          • Chaudhury 2024, ch. 7 (neural networks)
          • Chaudhury 2024, ch. 8 (training neural networks)
          • Chaudhury 2024, ch. 9 (loss, optimization, and regularization)
          • Chaudhury, et al. (2024)
        • Complex conjugate
        • Dirac delta "function"
        • Heaviside step function
        • Kronecker delta
        • Gradient of a function
        • Interpretation of eigenvalues and eigenvectors in ordinary differential equations
        • Minimizers and minima
        • Partial derivative notation is (ab)used for gradients in the neural network literature
        • Why the chain rule for derivatives works
          • Collinearity
          • Eigenvalues and eigenvectors
          • Linear combination
          • Linear transformation
          • Normal to a plane
          • Orthogonality
          • Quadratic form
          • Row- and column-major ordering (tensor vectorization)
          • Tensor
          • Vector space
        • Broadcast (algebra)
          • Diagonal matrix
          • Orthogonal matrix
          • Positive (or negative) (semi-)definite matrix
          • Rotation matrix
          • Similar matrices
          • Singular matrix
          • Symmetric matrix
          • Unitary matrix
          • Eigenvalue decomposition for a square matrix
          • Matrix decompositions
          • Matrix diagonalization
          • Singular value decomposition (SVD)
          • Batched matrix multiplication
          • Conjugate transpose of a matrix
          • Determinant of a (square) matrix
          • Dot product of two vectors
          • Element-wise (Hadamard) product
          • Frobenius inner product
          • Frobenius norm
          • Matrix inverse
          • Matrix multiplication (product)
          • Matrix transpose
          • Orthogonal projection
          • Outer product of two vectors
          • Projection (projection matrix)
          • Pseudo-inverse of a matrix (Moore-Penrose)
          • Spectral norm
          • Trace of a (square) matrix
          • Frobenius product of A and B is the trace of A transpose B
          • Matrix for arbitrary rotation in N dimensions (Rodrigues' rotation formula)
          • Minimization (maximization) of a quadratic form
          • Transpose of a matrix product is the reversed product of the two transposes
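
For reference, the identities behind two of the proof entries above, stated without proof:

```latex
% Frobenius product as a trace, and the transpose of a product.
\langle A, B \rangle_F = \sum_{i,j} A_{ij} B_{ij} = \operatorname{tr}\!\left(A^\top B\right),
\qquad
(AB)^\top = B^\top A^\top .
```
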
          • Harmonic mean
        • Contingency table (crosstab)
        • Covariance matrix
          • Chebyshev distance (L-infinity norm)
          • Cosine similarity
          • Curse of dimensionality
          • Distance metrics
          • Euclidean distance (L2 norm)
          • Hamming distance
          • Inner product (dot product) similarity
          • L-p norm (Minkowski distance)
          • Levenshtein distance
          • Mahalanobis distance
          • Manhattan (taxicab) distance (L1 norm)
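
A minimal NumPy sketch computing several of the distance and similarity measures above on a pair of vectors; the function name and test vectors are my own:

```python
import numpy as np

def distances(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return {
        "manhattan (L1)": np.sum(np.abs(u - v)),
        "euclidean (L2)": np.linalg.norm(u - v),
        "chebyshev (L-inf)": np.max(np.abs(u - v)),
        "cosine similarity": u @ v / (np.linalg.norm(u) * np.linalg.norm(v)),
    }

print(distances([1, 0, 2], [2, 2, 0]))
# {'manhattan (L1)': 5.0, 'euclidean (L2)': 3.0,
#  'chebyshev (L-inf)': 2.0, 'cosine similarity': 0.316...}
```
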
        • Distribution notation in probability and information theory
          • 0 Information theory notation
            • Comparing distributions
            • Conditional entropy
            • Cross-entropy is less than or equal to the Shannon entropy of the source distribution
            • Cross-entropy
            • Jensen-Shannon (JS) divergence (JSD)
            • Kullback-Leibler (KL) divergence ("relative entropy")
            • Mutual information
            • Population stability index (PSI, Jeffreys distance)
            • Response of JSD and PSI to a rare event
            • Wasserstein metric (Earth mover's distance, EMD)
            • Describing distributions
            • Information content of a random event
            • Perplexity
            • Shannon entropy
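
A minimal NumPy sketch relating Shannon entropy, cross-entropy, and KL divergence from the entries above; function names and the example distributions are my own:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(p, q); always >= H(p), with equality iff p == q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p), the 'relative entropy'."""
    return cross_entropy(p, q) - entropy(p)

p, q = [0.5, 0.5], [0.9, 0.1]
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))  # 1.0, ~1.737, ~0.737
```
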
          • Information theory
          • Series of Approximations to English
          • Shannon (1948)
          • Gundersen 2020
          • Moment of a function
          • Moment-generating functions
        • Finite state machine
          • An Unbiased Evaluation of Environment Management and Packaging Tools
            • Applying element-wise functions to tensors
            • dotenv
            • jupytext
              • Hugging Face
              • accelerate
              • datasets
              • evaluate
              • sentence-transformers
              • transformers
            • Optuna
              • Forcing Pandas to show all rows just once
              • Pandas
              • Split a dataframe by data type
                • 00 Example PyTorch script overview
                • 01 Data import and preprocessing (PyTorch)
                • 02 Defining a custom PyTorch module
                • 03 PyTorch training function
                • 04 PyTorch test function
                • 05 PyTorch train-test loop
                • Batching in PyTorch
                • Composition of operations in PyTorch
                • PyTorch computational graph (autograd functions)
                • PyTorch transforms
                • Registering parameters in PyTorch
                • Dataset and DataLoader
                • PyTorch tensors
                • nn.Embedding vs nn.Linear
                • nn.Module
                • zero_grad method (Optimizer and Module)
                • PyTorch installation
              • PyTorch
              • LabelEncoder is basically a simplified OrdinalEncoder
              • Process categorical and numerical variables separately
              • SciKit learn lumps hyperparameter tuning with cross-validation
              • SciKit-Learn (SKL) overview and reference
              • SciKit-Learn (SKL, SKLearn)
                • Classifier comparisons
                • Topic extraction with latent Dirichlet allocation
            • markdownify
            • pillow (also Python Imaging Library, PIL)
        • 01 Python
          • Decorator to convert an instance method to a class method
          • Anaconda
          • Jupyter (Lab)
          • Method object (Command)
          • Ousterhout "A Philosophy of Software Design"
          • Hash tables and junk drawers
        • Beam search
        • Best-first search
      • 03 Computer programming
        • Merkle tree
          • (Kleppmann) Designing Data-Intensive Applications
          • Kleppmann ch. 2 -- Data Models and Query Languages
          • Kleppmann ch. 3 -- Storage and Retrieval
          • Kleppmann ch. 4 -- Encoding and Evolution
          • Kleppmann ch. 5 -- Replication
          • Kleppmann ch. 6 -- Partitioning
          • Kleppmann ch. 8 -- Distributed systems
          • Kleppmann ch. 11 -- Stream processing
        • (Xu) System Design Interview, vol. 1
            • Dynamo (DynamoDB)
            • KeySpace
            • GCP Reference architectures
          • Dataflow client on M1 Mac
          • (Kurose and Ross) Computer Networking -- a Top-Down Approach, 6e
            • CS-340 Lecture 1 High-level overview of the Internet
            • CS-340 Lecture 2 Introduction to Routing
            • CS-340 Lecture 3 HTTP and SMTP
            • CS-340 Lecture 4 Cookies, DNS
            • CS-340 Lecture 5 Reliable transport
            • CS-340 Lecture 6 TCP packets
            • CS-340 Lecture 7 TCP congestion control
            • CS-340 Lecture 8 IPv4 addressing
            • CS-340 Lecture 9 NAT and IPv6
            • CS-340 Lecture 10 Router internals
            • CS-340 Lecture 11 BGP routing
          • Friedlander, et al. (2007)
          • Mockapetris and Dunlap (1988)
          • RFC 3833 (Threat analysis of the Domain Name System (DNS))
            • Medium access control (MAC) address
            • IP propagation from local to remote
            • Transmission control protocol (TCP)
            • User datagram protocol (UDP)
            • Secure sockets layer (SSL), see TLS
            • Transport layer security (TLS)
              • DNS lookup from the local host
              • DNS nameserver
              • DNS resolver
              • DNS root server
              • DNS zone
              • Domain name system (DNS)
              • Domain name
              • Top-level domain (TLD)
                • 301 and 308 Moved permanently
                • 302 and 307 Found ("moved temporarily")
              • HTTP message (request or response)
              • HTTP vs HTTPS
              • Hypertext transfer protocol (HTTP)
              • Uniform resource locator (URL)
          • 7-layer, 5-layer, and 4-layer network models
          • Gossip protocol
          • Bandwidth
          • Byzantine generals problem
          • Two generals problem
          • What happens when I navigate to a URL in my browser
        • Cloud provider networks, see Vendors
        • Distributed systems, see 65.3
        • Distributed systems
            • Split-brain (distributed systems)
            • Heartbeat (timeout detection)
            • Timeouts for detecting node failure
          • Faults
            • Network faults
            • Byzantine fault
            • Types of node failures (faults) in distributed systems
        • Globally monotonic identifiers (IDs)
          • Consistent hashing
          • Partition skew and hot spots (hot shards)
          • Partitioning (aka sharding)
          • Partitioning strategies for distributed systems
          • Rehashing (hash mod N)
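
A minimal sketch of consistent hashing with virtual nodes, the technique named above. Class and method names are my own; real systems use more robust hash functions plus replication:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes to smooth out partition skew."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise to the first virtual node at or past hash(key)."""
        i = bisect.bisect(self.hashes, self._hash(key)) % len(self.hashes)
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
# Unlike hash(key) mod N, adding or removing a node remaps only ~1/N of the keys.
```
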
          • 0 Distributed system performance characteristics (FLAT CAD)
          • Accessibility (distributed systems)
          • Availability (uptime)
          • Durability (distributed systems)
          • Fault tolerance
          • Latency (distributed system)
          • Round-trip time (RTT)
          • Throughput
        • Remote Procedure Call (RPC)
        • Replication vs partitioning
            • Dual writes
            • Handling write conflicts in multi-leader replication
            • Failover for leader failure in leader-based replication
            • Leader-based replication
            • Logical (row-based) log replication
            • Standing up new followers
            • Statement-based replication
            • Trigger-based replication
            • Write-ahead log (WAL)-based replication ("physical log replication")
          • Synchronous, asynchronous, and semi-synchronous replication
        • Shared-nothing system
        • API gateway
        • Middleware
        • Rate limiter
        • Buffer
        • Information retrieval
          • Change data capture
          • Event sourcing (event log)
      • 05 Data engineering and information science
          • Apache Cassandra
          • Anti-entropy process
          • CAP theorem
          • Command-query responsibility segregation (CQRS)
          • Integrity checking
          • Log compaction
          • Database indexes
          • Forward and inverted (file) index (flat file index)
          • Primary vs secondary index
        • Databases
          • Key-value store
        • Vector databases, see vector search
          • Dense and sparse vector search
            • Pan, et al. (2024)
            • Sun 2020
          • Nearest-neighbor search (vector search)
            • Vector databases (VDBMS)
            • Data-dependent vs data-independent partitioning schemes
              • Learned partitioning of vectors (learning to hash, L2H)
              • Random partitioning of vectors
              • Spectral hashing of vectors
              • Table-based vector indexes
              • Defeatist search
              • Principal component tree
              • Random projection tree
              • Tree-based vector indexes
            • Vector indexing (hashing, partitioning)
            • Vector search libraries
        • Similarity search systems
          • Elasticsearch
          • Lucene
          • Solr
          • Text search platforms
        • Ephemeral (traditional) message brokers
        • Fanout
        • Message queue (broker)
        • Message topics
        • Persistent and ephemeral message brokers as databases
        • Persistent message brokers are based on partitioned logs
        • Persistent message brokers
        • Persistent message queues don't care if a consumer goes offline
        • Stream (data processing)
        • Stream joins
        • When stream consumers lag producers
        • Fowler 2024
          • Migrating between IMAP providers
        • Information technology (IT)
            • Permanently disable re-open apps
              • Restore Time Machine from NAS
              • Restore files from an orphaned Time Machine backup
              • Time machine
            • BitLocker
          • Comparing AWS options for ML model inference (deployment)
          • Comparing AWS options for ML model training
          • ML on AWS
          • Managed ML on AWS
        • MLOps & LLMOps
          • Model deployment
          • Model lifecycle management
        • Seldon Core
          • OpenTofu (OpenTF) is Terraform with a better license
          • Distinction between marks and bookmarks in Sioyek
        • Design Spotify
        • Design questions
        • Hierarchical classification from text embeddings
        • Session-based recommendation
        • Video search
        • Visual search
        • Group chat service
        • Rate limiter
        • Search autocomplete
        • Social news feed (Facebook, Twitter)
        • URL shortener
        • Universally unique identifier (UUID)
        • Non-violent communication
        • Rules for dealing with difficult people
        • Discussion failure modes
          • 0 LeetCode Log
          • Two-pointer problems
        • Interviews
            • 2024 sabbatical
            • Alt-text hackathon project narrative
            • End-to-end project narrative
            • Feature extraction narrative
            • High stakes discussion with manager absent
            • Open990 technical overview
            • Vertex Matching Engine narrative
            • EvolutionIQ staff MLE
        • Venisa's management questions