LLMs for Engineers

The Substack 'LLMs for Engineers' explores scaling large language model (LLM) applications from prototype to production. Key themes include evaluation methods that combine human and automated feedback, optimizing evaluation models, hosting and deploying models, improving LLM chains, coding agents, and fine-tuning open-source LLMs.

Evaluation Methods, Human and Automated Feedback, Hosting and Deploying LLMs, LLM Chains, Coding Agents, Open-Source LLM Fine-Tuning

The hottest Substack posts of LLMs for Engineers, and their main takeaways:
120 HN points • 15 Aug 24
  1. Using latent space techniques can improve the accuracy of evaluations for AI applications without requiring a lot of human feedback. This approach saves time and resources.
  2. Latent space readout (LSR) helps in detecting issues like hallucinations in AI outputs by allowing users to adjust the sensitivity of detection. This means it can catch more errors if needed, even if that results in some false alarms (a rough sketch of the idea follows this list).
  3. Creating customized evaluation rubrics for AI applications is essential. By gathering targeted feedback from users, developers can create more effective evaluation systems that align with specific needs.
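The post's latent space readout isn't reproduced here, but the general pattern, projecting a model's hidden states onto a probe direction and tuning a detection threshold, can be sketched as follows. The probe direction, dimensions, and thresholds are illustrative assumptions, not the post's actual values.

```python
import numpy as np

def latent_space_readout(hidden_state: np.ndarray,
                         probe_direction: np.ndarray,
                         threshold: float) -> bool:
    """Flag a response by projecting its hidden state onto a probe direction.

    hidden_state:    pooled hidden activations for the generated response
    probe_direction: unit vector learned from a small set of labeled examples
    threshold:       lower values catch more issues at the cost of false alarms
    """
    score = float(hidden_state @ probe_direction)
    return score > threshold

# Example: sweep the threshold to trade recall against false positives.
rng = np.random.default_rng(0)
direction = rng.normal(size=4096)
direction /= np.linalg.norm(direction)
state = rng.normal(size=4096)
for threshold in (0.5, 0.0, -0.5):
    print(threshold, latent_space_readout(state, direction, threshold))
```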
79 implied HN points • 12 Jun 24
  1. Pytest is a great tool for evaluating LLM applications, making it easier to set up tests and check their performance. It allows you to program your own evaluation metrics directly in Python without needing complicated configurations.
  2. You can easily collect and analyze data from multiple test runs using Pytest. This helps to understand how consistent the outputs are across different evaluations.
  3. The examples show how to compare different prompts and LLM models, enhancing the flexibility and variety in testing. This allows you to see which setups work best in various scenarios.
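A minimal sketch of the pytest pattern described above, assuming a placeholder `generate` helper and made-up prompt and model names rather than the post's actual code:

```python
import pytest

PROMPTS = {
    "terse": "Answer in one sentence: {question}",
    "detailed": "Answer step by step: {question}",
}
MODELS = ["model-a", "model-b"]  # placeholder model identifiers

def generate(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API client)."""
    return f"[{model}] response to: {prompt}"

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_name", PROMPTS)
def test_answer_mentions_keyword(model, prompt_name):
    question = "What does HTTP stand for?"
    output = generate(model, PROMPTS[prompt_name].format(question=question))
    # A custom metric written directly in Python: keyword presence.
    assert "HTTP" in output
```

Running `pytest -v` then produces one result per prompt/model combination, which makes it easy to compare setups and to check consistency across repeated runs.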
159 implied HN points • 15 Nov 23
  1. Human feedback is still very important for evaluating models, especially in areas like customer support, but it can slow things down and increase costs.
  2. Combining human input with automated, model-based evaluation can help improve efficiency and accuracy, reducing errors significantly.
  3. Using fewer human-labeled examples with smart bootstrapping techniques can still yield good results, making it cheaper and faster to train evaluation models.
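One way to read the bootstrapping point: use a handful of human-labeled examples to calibrate a model-based evaluator, for instance by picking the score threshold that best agrees with the human labels. The scores and labels below are invented for illustration, not the post's data.

```python
def calibrate_threshold(scores, human_labels, candidates=(0.3, 0.5, 0.7)):
    """Pick the evaluator score threshold that best agrees with a few human labels."""
    def accuracy(t):
        preds = [s >= t for s in scores]
        return sum(p == y for p, y in zip(preds, human_labels)) / len(human_labels)
    return max(candidates, key=accuracy)

# A handful of human-labeled examples is often enough to pick a sensible cutoff.
model_scores = [0.82, 0.41, 0.77, 0.35, 0.66, 0.12]     # evaluator confidence
human_labels = [True, False, True, False, True, False]  # human "acceptable?" labels
print(calibrate_threshold(model_scores, human_labels))
```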
59 implied HN points • 30 Jan 24
  1. Fine-tuned open-source models like Llama and Mistral can produce accurate feedback, similar to high-performing custom models. They're a great option for companies needing control over their data.
  2. Using tools like Axolotl and Modal makes it easier to fine-tune these models. They help create customized training jobs and simplify deploying models across multiple GPUs.
  3. Fine-tuning significantly improves the clarity and structure of the model's output. It reduces irrelevant information, allowing for cleaner, more useful results.
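The post uses Axolotl and Modal; as a rough, generic stand-in, here is what attaching LoRA adapters with Hugging Face's transformers and peft looks like. The tiny checkpoint and hyperparameters are illustrative assumptions, not the post's setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# "sshleifer/tiny-gpt2" is a tiny public stand-in so the sketch runs anywhere;
# swap in a Llama or Mistral checkpoint (as the post does via Axolotl) for a real fine-tune.
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

# LoRA trains small adapter matrices instead of all weights, which is what keeps
# fine-tuning 7B-class models affordable and easy to shard across GPUs.
lora_model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
lora_model.print_trainable_parameters()
# From here, any standard training loop (or Axolotl's config-driven one) can train lora_model.
```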
79 implied HN points • 11 Jul 23
  1. Evaluating large language models (LLMs) is important because existing test suites don’t always fit real-world needs. So, developers often create their own tools to measure accuracy in specific applications.
  2. There are four main types of evaluations for LLM applications: metric-based, tools-based, model-based, and involving human experts. Each method has its strengths and weaknesses depending on the context (two of them are sketched after this list).
  3. Understanding how well LLM applications are performing is essential for improving their quality. This allows for better fine-tuning, compiling smaller models, and creating systems that work efficiently together.
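A rough illustration of two of the four evaluation styles listed above: a metric-based check computed in plain Python, and a model-based check delegated to a judge LLM. The `judge` function is a placeholder, not a real API.

```python
def metric_based_eval(output: str, required_terms: list[str]) -> float:
    """Metric-based: a deterministic score computed directly from the output."""
    hits = sum(term.lower() in output.lower() for term in required_terms)
    return hits / len(required_terms)

def model_based_eval(question: str, output: str) -> bool:
    """Model-based: ask a judge LLM whether the output answers the question."""
    verdict = judge(f"Question: {question}\nAnswer: {output}\nIs the answer correct? yes/no")
    return verdict.strip().lower().startswith("yes")

def judge(prompt: str) -> str:
    """Placeholder judge; in practice this would be an LLM call."""
    return "yes"

print(metric_based_eval("HTTPS adds TLS on top of HTTP", ["TLS", "HTTP"]))
print(model_based_eval("What does HTTPS add over HTTP?", "TLS encryption"))
```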
79 implied HN points • 21 Jun 23
  1. Large Language Models (LLMs) are becoming more powerful and can now perform complex tasks with the help of internet data and tools. This could significantly boost productivity for both individuals and corporations.
  2. The evolution of LLMs has progressed through several levels, starting from simple API calls to advanced agents that understand tasks better and can even interact without much human guidance.
  3. While these advancements are exciting, there are still challenges to overcome, such as reliability, cost, and the potential for errors in the output of LLMs.
59 implied HN points • 22 Aug 23
  1. There are many options for hosting Llama-2, including big names like AWS, GCP, and Azure, as well as newer providers like Lambda Labs and CoreWeave. Each has its own pricing and GPU options.
  2. Understanding how much you plan to use Llama-2 is important. This helps you decide whether to use a cloud service provider or a function-based option like Replicate.
  3. Cost-effectiveness varies with different providers. For low usage, function providers can be cheaper, but for higher usage, CSPs might save you money in the long run.
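The low-versus-high-usage trade-off comes down to simple break-even arithmetic. The prices below are made-up placeholders, not the post's figures.

```python
# Hypothetical prices for illustration only (not the post's numbers).
per_request_cost = 0.01   # $ per request on a function/serverless provider
gpu_hourly_cost = 2.00    # $ per hour for a dedicated GPU instance
hours_per_month = 730

dedicated_monthly = gpu_hourly_cost * hours_per_month
break_even_requests = dedicated_monthly / per_request_cost
print(f"Dedicated GPU: ${dedicated_monthly:.0f}/month")
print(f"Serverless is cheaper below ~{break_even_requests:,.0f} requests/month")
```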
39 implied HN points • 31 Oct 23
  1. TogetherAI was found to perform the best overall in terms of cost, speed, and accuracy, closely followed by MosaicML.
  2. It's important to understand your specific needs when choosing an API, like cost and speed requirements, to find the best fit.
  3. Experimenting with system prompts can lead to major improvements in performance, so don't hesitate to try different settings!
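Experimenting with system prompts can be as simple as looping over candidates and scoring each one. The `call_api` and `score` functions below are placeholders for whichever provider client and metric you actually use.

```python
SYSTEM_PROMPTS = [
    "You are a concise assistant. Answer in one sentence.",
    "You are a careful assistant. Think step by step, then answer.",
]

def call_api(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a real inference API call (TogetherAI, MosaicML, etc.)."""
    return f"({system_prompt[:20]}...) answer to: {user_prompt}"

def score(output: str) -> float:
    """Placeholder metric; replace with your own accuracy or quality check."""
    return float(len(output) < 120)

question = "Summarize what an LLM evaluation harness does."
results = {sp: score(call_api(sp, question)) for sp in SYSTEM_PROMPTS}
print("Best system prompt:", max(results, key=results.get))
```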
59 implied HN points • 03 May 23
  1. Keep an eye on the costs when using LLM chains. Each call adds to the total, and this can add up quickly with many queries.
  2. Use clear and meaningful names for API parameters. This helps improve the accuracy and reliability of LLM-powered applications.
  3. Make sure your LLM chains actually call the necessary tools. Sometimes, the system might pretend to do it without following through, which can lead to problems.
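Both the cost concern and the "did it actually call the tool?" concern can be caught with a thin wrapper around the chain's tools. The chain and tool below are placeholders for illustration.

```python
class ToolCallTracker:
    """Wrap a tool so a test can assert the chain really invoked it, and tally cost."""
    def __init__(self, tool_fn, cost_per_call: float = 0.0):
        self.tool_fn = tool_fn
        self.cost_per_call = cost_per_call
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.tool_fn(*args, **kwargs)

    @property
    def total_cost(self) -> float:
        return self.calls * self.cost_per_call

# Placeholder tool and chain for illustration.
def search_tool(query: str) -> str:
    return f"results for {query}"

tracked_search = ToolCallTracker(search_tool, cost_per_call=0.002)

def run_chain(question: str) -> str:
    return tracked_search(question)  # a real chain would also call the LLM

answer = run_chain("latest Llama release")
assert tracked_search.calls > 0, "chain claimed to search but never called the tool"
print(answer, f"(tool cost: ${tracked_search.total_cost:.3f})")
```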
19 implied HN points • 31 Aug 23
  1. LLM coding agents have advanced from simple code completion to creating entire code repositories. This shows how technology is evolving to assist with more complex software development tasks.
  2. Evaluating these agents relies on benchmarks like HumanEval and MBPP, which test their coding accuracy. These tests are important to see how well the agents are performing (the core check is sketched after this list).
  3. While there are new tools and benchmarks for LLM coding agents, users might still need to create specific evaluations for their own needs to get the best results. It's essential to tailor assessments to fit unique projects.
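HumanEval-style benchmarks boil down to running generated code against hidden test asserts. A bare-bones version of that check, without the sandboxing and timeouts a real harness needs, might look like this.

```python
def passes_tests(generated_code: str, test_code: str) -> bool:
    """Execute generated code plus its tests; any exception counts as a failure.
    Real harnesses run this in an isolated sandbox with timeouts."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True

# pass@1 over a benchmark is simply the fraction of candidate solutions that pass.
```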
19 implied HN points • 03 Aug 23
  1. Llama-2 makes it easier for anyone to run and own their LLM applications. This means people can create their own models at home while keeping their data private.
  2. Self-hosting Llama-2 helps improve performance and reduces delays. This makes the model more efficient for specific tasks and can even reach higher accuracy levels.
  3. There are guides and tools available to help users set up Llama-2 quickly. Users can try it out or integrate it with other platforms, making it more accessible for everyone.
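A minimal self-hosting sketch using vLLM as one example serving library (the post does not prescribe a specific stack); it assumes a GPU with enough memory and that you have accepted Meta's license for the Llama-2 weights on Hugging Face.

```python
from vllm import LLM, SamplingParams

# Assumes access to the gated meta-llama weights and a suitable GPU.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain what self-hosting an LLM involves."], params)
print(outputs[0].outputs[0].text)
```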
0 implied HN points • 13 Oct 23
  1. Developers need to create clear evaluation standards for large language model apps. This helps them understand what makes an app 'good' and improves user experience.
  2. The tool llmeval offers a systematic way to evaluate LLM applications using different methods like metrics, tools, and models. It helps teams quickly test and monitor their apps.
  3. Testing LLMs can be tricky because they often give different answers for the same input. Using sampling and setting thresholds in testing can help manage this unpredictability.
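The sampling-and-thresholds idea: run the same input several times and require a minimum pass rate instead of expecting identical output on every run. The `app` and `passes` functions below are placeholders standing in for a real LLM application and its evaluation.

```python
def passes(output: str) -> bool:
    """Placeholder check; replace with your own evaluation."""
    return "refund" in output.lower()

def app(prompt: str, attempt: int) -> str:
    """Placeholder LLM app; the attempt index simulates run-to-run variability."""
    return "You can request a refund within 30 days." if attempt % 5 else "Please contact support."

def pass_rate(prompt: str, n_samples: int = 10) -> float:
    return sum(passes(app(prompt, i)) for i in range(n_samples)) / n_samples

# Accept the behavior if at least 80% of sampled runs pass, rather than demanding
# the exact same answer every time.
assert pass_rate("How do I get a refund?") >= 0.8
```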