Top posts of the month

By hacker news affinity
dblalock 11 likes 09 May 22
These paper summaries are made possible by MosaicML. If you like these papers, you might like our open-source library for faster model training, or even consider joining our team.

Convolutional and Residual Networks Provably Contain Lottery Tickets
For a CNN with skip connections and reasonable initializations + nonlinearities, there exists a wider and slightly deeper sparse CNN that approximates its outputs with high probability. Previous work couldn't handle skip connections or convolutions. They actually instantiate their construction for some MNIST networks and show they can match the target network's accuracy without training—instead just (approximately) solving a large number of subset sum problems to directly identify a sparse subnetwork. Doesn't seem like a technique one would want to use in practice yet, but it's always nice to see a theoretical result that 1) applies to somewhat realistic networks, and 2) works without relying on quantities approaching infinity.
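The subset-sum idea is easy to get a feel for with a toy sketch. This greedy loop is an illustrative assumption, not the paper's actual algorithm: approximate one target weight by a sum over a chosen subset of random initial weights.

```python
import numpy as np

def greedy_subset_sum(target, candidates, tol=1e-3):
    """Greedily pick a subset of `candidates` whose sum approximates `target`.

    A toy stand-in for the approximate subset-sum solves the paper uses to
    decide which random weights to keep in the sparse subnetwork.
    """
    chosen, residual = [], target
    # Visit candidates from largest magnitude to smallest; keep any candidate
    # that shrinks the remaining error.
    for i in np.argsort(-np.abs(candidates)):
        if abs(residual - candidates[i]) < abs(residual):
            chosen.append(i)
            residual -= candidates[i]
        if abs(residual) < tol:
            break
    return chosen, target - residual

rng = np.random.default_rng(0)
target_weight = 0.37                        # a weight of the network we want to approximate
random_init = rng.uniform(-1, 1, size=64)   # candidate weights from a wider random network
idx, approx = greedy_subset_sum(target_weight, random_init)
print(f"approximated {target_weight} with {len(idx)} weights: {approx:.4f}")
```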
dblalock 0 likes 05 May 22
deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks
Proposes significance testing based on the overlap of CDFs of outcomes. Doesn't address the main problem, which is that no one reports multiple runs in the first place, but it's an interesting statistic to look at. Might be an informative way to monitor how distributions of weights, gradients, etc. change over time or across runs.
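A rough sketch of the CDF-comparison idea with made-up per-run scores; this is only an illustration of comparing empirical CDFs of outcomes, not the library's actual test statistic.

```python
import numpy as np

def cdf_overlap_score(scores_a, scores_b, grid_size=512):
    """Crude measure of how often method A's empirical CDF sits below B's.

    A value near 1 means A's score distribution (almost) stochastically
    dominates B's; near 0 means the opposite.
    """
    grid = np.linspace(min(scores_a.min(), scores_b.min()),
                       max(scores_a.max(), scores_b.max()), grid_size)
    cdf_a = np.searchsorted(np.sort(scores_a), grid, side="right") / len(scores_a)
    cdf_b = np.searchsorted(np.sort(scores_b), grid, side="right") / len(scores_b)
    return np.mean(cdf_a <= cdf_b)  # A's CDF below B's => A tends to score higher

# e.g. accuracies from several training runs of two methods (made-up numbers)
runs_a = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
runs_b = np.array([0.78, 0.80, 0.77, 0.79, 0.78])
print(cdf_overlap_score(runs_a, runs_b))
```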
dblalock 0 likes 04 May 22
Reproducibility Issues for BERT-based Evaluation Metrics
People have proposed various BERT-based alternatives to BLEU for natural language generation and machine translation. But it turns out that these are often a reproducibility dumpster fire, thanks to undocumented preprocessing subtleties, missing code, and various other issues. This sometimes leads to inflated results or handicapped baselines. Another paper for my
dblalock 0 likes 04 May 22
A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration
Adding focal loss + a new auxiliary “Multi-Class Difference of Confidence and Accuracy” (MDCA) loss often works better than other calibration methods. No results reporting absolute classification accuracy AFAICT, so it's not clear at what cost this better calibration comes.
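If I'm reading the paper right, the MDCA term compares each class's batch-mean predicted confidence with its empirical frequency in the batch. A minimal PyTorch sketch; the weighting and the pairing with plain cross-entropy (rather than focal loss) below are assumptions.

```python
import torch
import torch.nn.functional as F

def mdca_loss(logits, targets):
    """Multi-class Difference of Confidence and Accuracy (MDCA), roughly: for
    each class, take the gap between its mean predicted probability and its
    frequency in the batch, then average the absolute gaps over classes.
    Intended as an auxiliary term next to the main classification loss.
    """
    probs = logits.softmax(dim=1)                                    # (N, K)
    num_classes = logits.shape[1]
    mean_conf = probs.mean(dim=0)                                    # mean confidence per class
    freq = F.one_hot(targets, num_classes).float().mean(dim=0)       # empirical frequency per class
    return (mean_conf - freq).abs().mean()

# hypothetical usage alongside a focal (or here, cross-entropy) loss
logits = torch.randn(32, 10, requires_grad=True)
targets = torch.randint(0, 10, (32,))
loss = F.cross_entropy(logits, targets) + 1.0 * mdca_loss(logits, targets)
loss.backward()
```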
dblalock 0 likes 04 May 22
Overcoming Oscillations in Quantization-Aware Training
Two changes to raw STE quantization: 1) a regularization term to try to dampen oscillations in quantized weights; 2) “If the oscillation frequency of any weight exceeds a threshold f, that weight gets frozen until the end of training. We apply the freezing in the integer domain, such that potential change in the scales during optimization does not lead to a different rounding.” The damping seems to help quite a bit (reproducing a finding from a 4-bit quantization paper we saw submitted to ICLR a while ago). The freezing helps similarly. Doesn't seem as effective as some other papers that got iso accuracy on ImageNet, but the reproduction of oscillatory behavior hurting performance is a useful datapoint.
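The freezing heuristic lends itself to a small sketch. The EMA tracking, decay, and threshold below are my guesses at a reasonable implementation, not the paper's exact recipe.

```python
import torch

class OscillationFreezer:
    """Rough sketch of oscillation freezing: track how often each weight's
    integer (quantized) value flips between steps via an EMA, and permanently
    freeze weights whose flip frequency exceeds a threshold.
    """
    def __init__(self, shape, threshold=0.02, ema_decay=0.99):
        self.prev_int = None
        self.freq = torch.zeros(shape)                    # EMA of flip frequency per weight
        self.frozen = torch.zeros(shape, dtype=torch.bool)
        self.frozen_val = torch.zeros(shape)
        self.threshold = threshold
        self.ema_decay = ema_decay

    def step(self, int_weights):
        if self.prev_int is not None:
            flipped = (int_weights != self.prev_int).float()
            self.freq = self.ema_decay * self.freq + (1 - self.ema_decay) * flipped
            newly_frozen = (self.freq > self.threshold) & ~self.frozen
            self.frozen_val = torch.where(newly_frozen, int_weights.float(), self.frozen_val)
            self.frozen |= newly_frozen
        self.prev_int = int_weights.clone()
        # frozen weights keep their stored integer value until the end of training
        return torch.where(self.frozen, self.frozen_val, int_weights.float())

freezer = OscillationFreezer(shape=(4,))
for step in range(100):
    fake_int_weights = torch.randint(-8, 8, (4,)).float()  # stand-in for the quantized weights
    q = freezer.step(fake_int_weights)
```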
dblalock 0 likes 04 May 22
Do We Really Need a Learnable Classifier at the End of Deep Neural Network?
I really like this because they did a thing I thought about a long time ago but couldn't figure out the math for; namely, constructing a matrix whose maximum cosine similarity between columns is minimized. Instead of having a trainable softmax classifier at the end, they just use such a matrix. Works worse for class-balanced problems but often better for small, imbalanced ones. They also replace cross-entropy with a different loss based on cosine similarity. I wonder if we could get it to train faster by solving the Procrustes problem to get an initial U matrix that lines up better with the initial average embedding for each class. I've been wanting to have a fixed final layer to make classifying each pixel cheaper in segmentation, and this seems like the most promising initialization to make that happen. For more on the math, see this
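For reference, the maximally spread-out matrix is, as far as I can tell, a simplex equiangular tight frame, which you can construct directly. A small numpy sketch of the standard construction; the paper's exact scaling and usage may differ.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Build a fixed classifier whose columns are unit vectors with the minimal
    possible maximum pairwise cosine similarity, -1/(K-1): a simplex ETF.
    """
    assert dim >= num_classes
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))  # orthonormal columns
    k = num_classes
    return np.sqrt(k / (k - 1)) * u @ (np.eye(k) - np.ones((k, k)) / k)

W = simplex_etf(num_classes=10, dim=64)   # columns are the fixed class vectors
print(np.round(W.T @ W, 3))               # 1s on the diagonal, -1/(K-1) = -0.111 elsewhere
```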
dblalock 0 likes 03 May 22
Training language models to follow instructions with human feedback
The InstructGPT paper. “Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning”. Also: “outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.” So basically, reducing the alignment problem to supervised learning of human preferences works pretty well, as measured by being as “aligned” as a 100x larger model without such training.
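The reward-model step boils down to a pairwise ranking loss over human preferences; a minimal sketch of that objective (the rewards below are made-up numbers, and the full pipeline additionally averages over pairs from each ranked set and then runs RL against the learned reward):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen, reward_rejected):
    """Pairwise loss for training a reward model from human rankings: push the
    scalar reward of the preferred completion above the rejected one.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# hypothetical rewards the model assigned to preferred vs. rejected completions
r_chosen = torch.tensor([1.3, 0.2, 0.9], requires_grad=True)
r_rejected = torch.tensor([0.4, 0.5, -0.1])
loss = reward_ranking_loss(r_chosen, r_rejected)
loss.backward()
```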
dblalock 0 likes 03 May 22
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models
Factorize the experts and reuse the biggest matrix in the factorization across all the experts. Maybe outperforms Switch Transformer-style MoE when grafted onto GPT-2? At the operating point they report, it has about 1.25x more params overall, whereas the Switch-like baseline had 4.67x. Seems to do better than regular MoE on WikiText-2 perplexity.
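A sketch of the sharing idea as I understand it: factor each expert's projection into a small per-expert piece and one larger factor reused by every expert. The shapes, rank, and routing interface below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FactorizedExperts(nn.Module):
    """Each expert's d_model -> d_ff projection is factorized as
    (small per-expert factor) @ (larger shared factor); the larger factor is
    stored once for all experts.
    """
    def __init__(self, d_model=512, d_ff=2048, rank=32, num_experts=16):
        super().__init__()
        # small per-expert factors: num_experts x d_model x rank
        self.per_expert = nn.Parameter(torch.randn(num_experts, d_model, rank) * 0.02)
        # big shared factor: rank x d_ff, reused by every expert
        self.shared = nn.Parameter(torch.randn(rank, d_ff) * 0.02)

    def forward(self, x, expert_idx):
        # x: (num_tokens, d_model); expert_idx: (num_tokens,) chosen by a router
        a = self.per_expert[expert_idx]          # (num_tokens, d_model, rank)
        h = torch.einsum("td,tdr->tr", x, a)     # per-token low-rank projection
        return h @ self.shared                   # (num_tokens, d_ff)

layer = FactorizedExperts()
tokens = torch.randn(8, 512)
routes = torch.randint(0, 16, (8,))
out = layer(tokens, routes)                      # (8, 2048)
```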
dblalock 0 likes 03 May 22
Mixture-of-Experts with Expert Choice Routing
Instead of choosing the top-k experts for each token, you choose the top-k tokens per expert. Seems to work even better. I actually started coding this independently last month (scooped!), and the subtleties are: 1) it makes your routing function super cheap, which is great, but 2) you end up summing different numbers of activation tensors for each token, which is hard to make efficient. You can
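The routing itself is just a top-k along the other axis; a minimal sketch, in which the normalization and capacity handling are illustrative assumptions.

```python
import torch

def expert_choice_routing(tokens, gate_weights, capacity):
    """Expert-choice routing sketch: score every (token, expert) pair, then let
    each *expert* pick its top-`capacity` tokens instead of each token picking
    its top-k experts.
    """
    scores = (tokens @ gate_weights).softmax(dim=-1)   # (num_tokens, num_experts)
    weights, token_idx = scores.topk(capacity, dim=0)  # each expert's top tokens
    return token_idx, weights

tokens = torch.randn(64, 512)
gate = torch.randn(512, 8)                             # router for 8 experts
token_idx, weights = expert_choice_routing(tokens, gate, capacity=16)
# token_idx[:, e] holds the 16 token indices that expert e will process;
# a given token can land with several experts, or with none at all.
```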
dblalock 0 likes 03 May 22
Fantastic Generalization Measures and Where to Find Them
Section 4 summarizes results, which seem to all be on CIFAR10 or SVHN. Confirms that overparameterization helps generalization, that norm-based and classical VC-style measures correlate negatively with generalization, and that flatness (and proxies like final gradient variance) also correlates with generalization. Also, faster initial loss reduction correlated with worse generalization, supporting the explore/exploit model of optimization we’ve seen with cyclic learning rates.
dblalock 0 likes 03 May 22
Locating and Editing Factual Knowledge in GPT
They edit factual knowledge in GPT-J, meaning they, e.g., get the model to generate sentences as if the Eiffel Tower were in Rome rather than Paris. It's find-and-replace for inputs, like the Madry paper, but they do this at the token level instead. The rank-1 update of one layer's weight matrix is nice, but they have to optimize the “replacement” embedding via backprop, so I doubt there's much speedup to be had.
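The rank-1 edit itself is simple to write down. Here's a minimal version that just solves for the smallest (Frobenius-norm) change making one key map to a new value; it ignores the covariance weighting and the backprop-optimized target vector the paper actually uses.

```python
import numpy as np

def rank1_edit(W, k, v_target):
    """Return W' such that W' @ k == v_target while perturbing W as little as
    possible in Frobenius norm: a rank-1 correction along the key direction.
    """
    residual = v_target - W @ k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
k = rng.standard_normal(8)           # "key" activation that triggers the fact
v_target = rng.standard_normal(16)   # desired output ("value") for that key
W_new = rank1_edit(W, k, v_target)
print(np.allclose(W_new @ k, v_target))  # True
```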
dblalock 0 likes 03 May 22
Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods
Messing with data ordering within a batch. No positive results, and such strong negative results that I'm now more inclined to believe the negative results I've gotten myself in the past. Maybe some interesting theory in here, but I didn't really look at it.
dblalock 0 likes 03 May 22
Convolutional Xformers for Vision
Not sure what to make of the overall efficacy of the approach, but they report lifts from 1) switching the optimizer from AdamW to SGD partway through training, and 2) turning off RandAugment near the end of training, both of which seem like actionable (if somewhat mysterious) optimizations.
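A sketch of what those two tricks look like in a training loop; the switch epochs, learning rates, and toy model are all made-up assumptions, not the paper's schedule.

```python
import torch
from torch import nn, optim

model = nn.Linear(32, 10)
total_epochs, switch_epoch, no_aug_epoch = 90, 60, 80
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
use_randaugment = True

for epoch in range(total_epochs):
    if epoch == switch_epoch:
        optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # optimizer swap
    if epoch == no_aug_epoch:
        use_randaugment = False  # rebuild the train transforms without RandAugment here
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))  # stand-in for a real dataloader
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```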
dblalock 0 likes 05 May 22
An Extendable, Efficient and Effective Transformer-based Object Detector
Proposes an object detection approach in the same vein as DETR. There’s a lot of stuff here—a particular neck + head structure, three different types of attention, several loss functions, multi-scale feature map fusion, and more. But they have good results, including tradeoff curves of inference latency vs AP, as well as thorough ablations of different components. Worth digging into if you’re trying to push the limits of object detection (or just turn everything ever into a transformer because it lets you use less competitive baselines).
dblalock 0 likes 03 May 22
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
OpenAI paper where they try to teach neural nets to solve problems with algorithmic solutions. The Bayes error rate for these problems is zero, but the relationships are not as smooth as in most traditional tasks, making them much harder to learn. What they find is that long after the model has memorized the training set, it suddenly starts doing well on the validation set. They refer to this phenomenon as "grokking". Not that actionable, but interesting work that thinks more deeply about the nature of intelligence than the typical deep learning paper.
dblalock 0 likes 03 May 22
SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
Token pruning for transformers. Results didn’t sound especially compelling but who knows.

Resource-Efficient Deep Learning: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques
dblalock 0 likes 03 May 22
A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code
Exactly what it sounds like. Might make for a nice IDE plugin or linting tool.

AdaViT: Adaptive Tokens for Efficient Vision Transformer
Intelligent token dropping that apparently "improves the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop", which might actually be a decent win.
dblalock 0 likes 03 May 22
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Points out that there are many safety metrics, and beats existing methods across all of them via a new data augmentation pipeline. Namely, they combine images with crazy LSD-looking images like fractals.
dblalock 0 likes 03 May 22
OMPQ: Orthogonal Mixed Precision Quantization
They figure out how many bits to use for different layers in <9 seconds by using a proxy objective. In the camp of "read this if and only if you care about quantization."

Should We Be Pre-training? An Argument for End-task Aware Training as an Alternative
dblalock 0 likes 03 May 22
Bag of Tricks for Optimizing Transformer Efficiency
Really nice paper full of practical improvements.

Frustratingly Simple Pretraining Alternatives to Masked Language Modeling
The alternatives mostly just tie MLM, but they have a couple plots in the appendix where predicting the first letter of masked words does way better. Needs more detailed reading.
dblalock 0 likes 03 May 22
"Understanding and Co-designing the Data Ingestion Pipeline for Industry-Scale RecSys Training" (https://arxiv.org/abs/2108.09373) Facebook talks about their system for data preprocessing. "Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study" (
dblalock 0 likes 03 May 22
Just one this week: The MIT Supercloud Dataset - "Artificial intelligence (AI) and Machine learning (ML) workloads are an increasingly larger share of the compute workloads...In this paper we introduce the MIT Supercloud Dataset which aims to foster innovative AI/ML approaches to the analysis of large scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data."
dblalock 0 likes 03 May 22
Dataset Distillation with Infinitely Wide Convolutional Networks. Not sold on the infinitely wide part, but constructing an "optimized" training set once--which can be arbitrary inputs, rather than being constrained to equal a subset of the original data--is an interesting idea.
dblalock 0 likes 16 May 22
These summaries are made possible by MosaicML. If you find them helpful, the best way to thank me is by checking out + starring Composer, our open-source library for faster model training.

⭐ Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
dblalock 0 likes 03 May 22
⭐ Efficient DNN Training with Knowledge-Guided Layer Freezing
They claim 19-43% training speedup at iso accuracy via layer freezing. The freezing is guided by a small proxy model on the CPU that helps estimate each layer's plasticity. They cache the activations for frozen layers to avoid even doing the forward pass through them, and report results on ResNet-50, BERT, and other useful networks.
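A sketch of the freeze-and-cache part of the idea; the CPU proxy model that estimates layer plasticity is omitted, and the tiny model and split point are assumptions. Once early layers are frozen, their outputs can be computed once and cached, so later epochs skip even the forward pass through them.

```python
import torch
from torch import nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 10)
frozen, active = backbone[:2], backbone[2:]   # hypothetical split chosen by the plasticity estimate
for p in frozen.parameters():
    p.requires_grad_(False)

batches = [(torch.randn(8, 32), torch.randint(0, 10, (8,))) for _ in range(10)]
with torch.no_grad():                         # compute frozen-layer activations once
    cache = [(frozen(x), y) for x, y in batches]

opt = torch.optim.SGD(list(active.parameters()) + list(head.parameters()), lr=0.1)
for _ in range(3):                            # remaining epochs start from cached activations
    for h, y in cache:
        loss = nn.functional.cross_entropy(head(active(h)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```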
dblalock 0 likes 05 May 22
⭐ Merging of neural networks
They present evidence that you’re better off training two copies of a model and then merging them than just training one copy for the same amount of time. Results only scale up to ResNet-18 on ImageNet for 150 epochs, and the gain is only about a 0.2% accuracy lift. Probably not worth the complexity if your alternative is a normal training workflow, but it might be an interesting halfway point between fine-tuning only and training from scratch, or even a means of parallelizing large-scale training.
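For intuition, the cheapest possible form of merging is plain parameter averaging of the two trained copies. The sketch below shows only that naive version on a hypothetical TinyNet; the paper's actual merging procedure may be more involved.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        return self.net(x)

def average_models(model_a, model_b, constructor):
    """Merge two identically structured, independently trained copies by
    averaging their parameters elementwise.
    """
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged = constructor()
    merged.load_state_dict({k: 0.5 * (sd_a[k] + sd_b[k]) for k in sd_a})
    return merged

# pretend these two copies were trained separately
model_a, model_b = TinyNet(), TinyNet()
merged = average_models(model_a, model_b, TinyNet)
```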