Large language models (LLMs) power tasks such as email completion and code explanation, but running them today typically requires hardware accelerators beyond what personal devices provide.
Running LLMs on-device gives users greater control over their data and makes it possible to build personalized generation models.
A community of developers is working to enable local LLM inference, empowering creators and researchers to use these models in their own projects.