machinelearninglibrarian

The hottest Substack posts of machinelearninglibrarian

And their main takeaways

Metadata for model discovery

39 implied HN points • 18 Oct 23

Filtering by metadata fields is crucial for finding suitable machine-learning models.
Using metadata can help narrow down search results on platforms like the Hugging Face Hub.
Combining semantic search and metadata filters can enhance the model discovery process.
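
As a rough illustration of the last takeaway, the sketch below first filters a catalogue by exact metadata values and then ranks the survivors by cosine similarity. The catalogue, model ids, field names, and three-dimensional embeddings are all made up for the example; a real setup would take embeddings from an embedding model and metadata from the Hub.

```python
import math

# Hypothetical catalogue: each entry pairs metadata with a toy embedding.
MODELS = [
    {"id": "org/sentiment-en", "task": "text-classification", "language": "en",
     "embedding": [0.9, 0.1, 0.0]},
    {"id": "org/sentiment-fr", "task": "text-classification", "language": "fr",
     "embedding": [0.8, 0.2, 0.1]},
    {"id": "org/ner-en", "task": "token-classification", "language": "en",
     "embedding": [0.1, 0.9, 0.2]},
    {"id": "org/topic-en", "task": "text-classification", "language": "en",
     "embedding": [0.3, 0.3, 0.9]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_embedding, filters, models=MODELS, top_k=3):
    """Keep models whose metadata matches every filter, then rank by similarity."""
    candidates = [m for m in models if all(m.get(k) == v for k, v in filters.items())]
    ranked = sorted(
        candidates,
        key=lambda m: cosine(query_embedding, m["embedding"]),
        reverse=True,
    )
    return [m["id"] for m in ranked[:top_k]]


# Only English text-classification models are ranked against the query vector.
results = search([1.0, 0.0, 0.0], {"task": "text-classification", "language": "en"})
```

Filtering before ranking keeps the similarity step cheap and guarantees that every result satisfies the hard constraints, whatever its similarity score.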

Fixing search with AI?

1 HN point • 09 Oct 23

Machine learning can improve search functionality.
One limitation of AI models is that they can become outdated.
A model trained on a small dataset can still filter search results effectively.

The role of libraries and librarians in artificial intelligence: What is a Machine Learning Librarian?

1 HN point • 19 Sep 23

Libraries are considering the impact of AI and machine learning on their services.
Metadata is crucial for a healthy machine learning ecosystem.
Machine Learning Librarians can help improve metadata through automation and tools.

Label Studio x Hugging Face datasets hub

0 implied HN points • 07 Sep 22

Using Label Studio and Hugging Face datasets helps in annotating data more efficiently for machine learning tasks. This makes it easier to move back and forth between annotating, training a model, and refining the process.
The Hugging Face hub allows for easier management of large datasets due to its Git-based structure, which also supports versioning. This means you can track changes and update your dataset as you annotate more data.
Creating a loading script for your dataset helps integrate the data into your machine learning pipeline. You can share the dataset easily while ensuring you only load the necessary data based on your annotations.

Searching for machine learning models using semantic search

0 implied HN points • 26 Jul 22

🕹 Technology Machine Learning Data science Artificial Intelligence Model Evaluation

Many machine learning models are available on platforms like Hugging Face, but finding the right one can be tricky. You may need to search through different tags and descriptions to find one that fits your needs.
Using semantic search can help you find models based on what they can do rather than just their names. This way, you can discover models that are similar even if they use different terms.
Documenting models in README files is important because it helps others understand how to use them. However, not all models have detailed documentation, which can make finding the right one harder.

Using 🤗 datasets for image search

0 implied HN points • 13 Jan 22

🕹 Technology Artificial Intelligence Machine Learning Data science Computer Vision Software Development

You can use the Hugging Face datasets library to build an image search application with little code.
The library supports different ways to handle images, like reading from file paths or NumPy arrays, which makes it flexible for usage.
It's important to consider potential biases and performance variability when deploying models for image searches, especially with varied datasets.

Synthetic dataset generation techniques: Self-Instruct

0 implied HN points • 15 May 24

🕹 Technology Machine Learning Artificial Intelligence Data generation Natural Language Processing Computational linguistics

Self-Instruct helps create large sets of instructional data by using language models to generate instructions from initial examples. This saves a lot of time compared to writing everything by hand.
The process involves generating new instructions from a seed dataset, filtering them, and ensuring diversity to avoid repetitive prompts. This way, the dataset expands effectively.
The method is widely adopted in both research and practical applications, showing that using machine-generated data can improve instruction-following models without extensive manual input.
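
The filtering step described above can be sketched as follows. The Self-Instruct paper filters on ROUGE-L overlap with the existing pool; this sketch swaps in `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, so the scores are not comparable to ROUGE-L. The function names and the sample instructions are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for ROUGE-L overlap."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_new_instructions(pool, candidates, max_similarity=0.7):
    """Keep candidates that are not too similar to the pool or to each other."""
    kept = []
    for candidate in candidates:
        text = candidate.strip()
        if not text:
            continue  # drop empty generations
        if all(similarity(text, existing) < max_similarity for existing in pool + kept):
            kept.append(text)
    return kept


seed_pool = [
    "Translate the sentence into French.",
    "Summarize the following paragraph.",
]
generated = [
    "Translate the sentence into French!",  # near-duplicate of a seed task
    "Write a haiku about autumn.",
    "Write a haiku about autumn.",  # exact repeat of the previous candidate
    "",
]
accepted = filter_new_instructions(seed_pool, generated)
```

Comparing each candidate against both the seed pool and the candidates already accepted is what keeps the growing dataset from filling up with repeated prompts.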

Tracing Text Generation Inference calls

0 implied HN points • 05 Apr 24

🕹 Technology Artificial Intelligence Machine Learning Open Source Software Development Data science

To trace text generation calls, you can use Langfuse with OpenAI integration in your code. This allows you to monitor how your text generation model is performing.
You'll need to set up your secret keys and environment variables to connect to the Langfuse service. Make sure to store your sensitive keys securely.
The example provided shows how to make a chat completion call and receive responses from a model. It's a handy way to see how AI can generate text based on user input.

How to load a Hugging Face dataset into Qdrant?

0 implied HN points • 08 Nov 23

🕹 Technology Machine Learning Data Management Software Development Artificial Intelligence Programming

You can easily load a Hugging Face dataset into Qdrant using simple Python code. Just install the necessary libraries and use the load_dataset function.
Once your dataset is loaded, you can create a Qdrant collection to store and manage your data. This lets you perform tasks like searching for similar articles based on their embeddings.
There are ways to optimize the process of adding data and searching within Qdrant. For example, batching the data can make it faster and smoother.
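
Batching, mentioned in the last takeaway, can be illustrated without touching Qdrant at all. The helper below is a generic sketch that splits any iterable into fixed-size lists; the `records` structure is hypothetical, and each batch would be handed to whatever upload call your client provides.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical records: an id plus a toy embedding vector.
records = [{"id": i, "vector": [float(i), 0.0]} for i in range(10)]
batches = list(batched(records, 4))
```

Because the helper consumes an iterator lazily, it also works on streamed datasets that do not fit in memory.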

A (very brief) intro to exploring metadata on the Hugging Face Hub

0 implied HN points • 16 Jan 23

🕹 Technology Machine Learning Data science Programming APIs Open Source

The Hugging Face Hub is a key place for sharing machine learning models and datasets. Finding the right model or dataset can be tough as the number grows, but using metadata can help make the search easier.
You can interact with the Hugging Face Hub programmatically using the `huggingface_hub` library. This library allows you to list datasets and models easily, and it has various features that can help developers.
Exploring tags associated with models and datasets on the Hub is important. Tags provide additional information about the purpose and compatibility of models, but counting them can be misleading without considering their context.
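
To show why raw tag counts need context, here is a small sketch over made-up records. The ids and tags are hypothetical stand-ins for what a Hub listing might return; the point is to report the number of untagged models and each tag's share alongside the raw counts.

```python
from collections import Counter

# Hypothetical records standing in for listing results from the Hub.
models = [
    {"id": "a/model-1", "tags": ["pytorch", "text-classification", "en"]},
    {"id": "a/model-2", "tags": ["pytorch", "token-classification"]},
    {"id": "b/model-3", "tags": []},
    {"id": "b/model-4", "tags": ["tf", "text-classification", "en"]},
]

# Raw frequency of every tag across all models.
tag_counts = Counter(tag for m in models for tag in m["tags"])

# Models with no tags never show up in tag_counts, so count them separately.
untagged = sum(1 for m in models if not m["tags"])

# Share of models carrying each tag, which is easier to compare than raw counts.
tag_share = {tag: count / len(models) for tag, count in tag_counts.items()}
```

In this toy data a framework tag such as `pytorch` and a task tag such as `text-classification` have the same count, even though they describe different things, which is one way a bare count can mislead.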

Training an object detection model using Hugging Face

0 implied HN points • 16 Aug 22

🕹 Technology Machine Learning Computer Vision Artificial Intelligence Data science Programming

Object detection helps identify and locate objects in images. It goes beyond just knowing if something is present; it tells us where and how many of those things are there.
Hugging Face offers tools for training object detection models easily, especially using the DETR architecture. This lets users leverage pre-trained models and datasets for better performance.
Using the datasets library simplifies the data handling process during training. It allows for quick loading and preparation of data, which is very helpful when tweaking and iterating on models.

Using the 🤗 Hub for model storage

0 implied HN points • 30 Dec 21

🕹 Technology Machine Learning Software Development Data Storage Open Source

The 🤗 hub is a useful space for sharing and finding machine learning models. It's great for avoiding duplicate work and helps others use or adapt models easily.
Using the huggingface_hub library can simplify working with models stored on the 🤗 hub. It allows for downloading, updating, and managing models more efficiently than using GitHub alone.
You can also upload models directly to the 🤗 hub, making the process smoother after training. Additionally, creating revision branches for models helps manage different versions better.

Using ColPali with Qdrant to index and search a UFO document dataset

0 implied HN points • 02 Oct 24

🕹 Technology Machine Learning Database Data retrieval Artificial Intelligence Open Source

ColPali is a new way to search documents that considers both pictures and text, making it better for complex layouts compared to traditional methods.
Qdrant is a vector database that supports fast search over high-dimensional vectors, and it can store multiple vectors to represent a single item.
Using techniques like quantization, Qdrant helps save memory and speed up searches, making it a powerful tool for managing large datasets like UFO documents.
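
To give a feel for what quantization trades away, the sketch below applies simple symmetric scalar quantization to a toy vector, mapping each float to an integer code in the range -127 to 127. This is one illustrative form of quantization written from scratch; it is an assumption for the example and not a description of how Qdrant implements it internally.

```python
def quantize(vector):
    """Map floats to integer codes in [-127, 127] using a single shared scale."""
    # Fall back to a scale of 1.0 when the vector is all zeros.
    scale = max(abs(x) for x in vector) / 127 or 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize(vector)
restored = dequantize(codes, scale)

# The reconstruction error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
```

Each code fits in one byte, compared with four bytes for a 32-bit float, at the cost of the small reconstruction error measured by `max_error`.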

Generating a dataset of queries for training and fine-tuning ColPali models on a UFO dataset

0 implied HN points • 23 Sep 24

🕹 Technology Machine Learning Artificial Intelligence Data science Software Development Computer Vision

ColPali is a new model that combines text and images to improve how we find documents. It looks at both the words and the visual parts of a page, making it smarter than older text-only methods.
To train ColPali, we need a dataset that pairs document images with questions about what those documents contain. This helps the model learn how to match questions with the right visual information.
Using a vision language model, Qwen2-VL, we can generate specific and relevant queries from document images. This can refine the dataset further by making sure the questions are useful for retrieving information.

Extracting Insights from Model Cards Using Open Large Language Models

0 implied HN points • 27 Nov 23

🕹 Technology Machine Learning Data science Artificial Intelligence Open Source Software Development

Model Cards are important for sharing details about machine learning models, but they can vary greatly in format and focus. This makes it hard to know how to find or categorize the information they contain.
There are over 400,000 models on the Hugging Face Hub, and extracting specific details, like the datasets used or evaluation metrics mentioned, could help create clearer guidelines and metadata.
Using open large language models can help annotate and discover key concepts from the diverse data in Model Cards, making it easier to analyze and understand various models and their attributes.

How to do groupby for Hugging Face datasets

0 implied HN points • 18 Sep 23

🕹 Technology Machine Learning Data science Programming Software Development Data Analysis

Hugging Face's datasets don't have built-in groupby features, but you can use Polars to handle this. You can load datasets with Polars and perform group operations easily.
Polars allows you to work with large datasets efficiently using lazy evaluation. This means you can process data without needing to load everything into memory all at once.
You can visualize data comparisons after grouping by specific columns, making it easier to understand patterns or insights from the data.
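
For readers unfamiliar with the operation itself, here is what a group-by aggregation computes, written with the standard library only. This is a stand-in to show the logic and deliberately does not use the Polars API; the rows and column names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, e.g. text length per label in a classification dataset.
rows = [
    {"label": "positive", "length": 120},
    {"label": "negative", "length": 80},
    {"label": "positive", "length": 100},
    {"label": "negative", "length": 60},
]


def group_mean(rows, key, value):
    """Group rows by `key` and return the mean of `value` for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


mean_length = group_mean(rows, "label", "length")
```

A dataframe library performs the same split, apply, and combine steps, but over columnar data and, with lazy evaluation, without materializing every row at once.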

Dynamically updating a Hugging Face hub organization README

0 implied HN points • 07 Mar 23

🕹 Technology Machine Learning Data science Software Development Web Development Automation

You can use the huggingface_hub library to automatically create and update a README for your Hugging Face organization. This helps keep your information organized without needing to make manual changes.
Listing and grouping datasets by task makes it easy to see which datasets are available for each activity. This organization helps others find the resources they need quickly.
Using a templating engine like Jinja2 allows you to create a polished and updated README format. It makes the information visually appealing and easier to understand.
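
The group-then-render flow can be sketched as below. To keep the example dependency-free it uses `string.Template` from the standard library in place of Jinja2, and the dataset ids, task names, and README layout are hypothetical. In practice the dataset list would come from the Hub rather than a hard-coded list.

```python
from collections import defaultdict
from string import Template

# Hypothetical dataset listing for an organization.
datasets = [
    {"id": "my-org/reviews", "task": "text-classification"},
    {"id": "my-org/pages", "task": "image-classification"},
    {"id": "my-org/tweets", "task": "text-classification"},
]

# One README section per task: a heading followed by a bullet list.
SECTION = Template("## $task\n$items\n")


def render_readme(datasets, title="Datasets by task"):
    """Group dataset ids by task and render them as a markdown README."""
    by_task = defaultdict(list)
    for dataset in datasets:
        by_task[dataset["task"]].append(dataset["id"])
    sections = [
        SECTION.substitute(task=task, items="\n".join(f"- {i}" for i in sorted(ids)))
        for task, ids in sorted(by_task.items())
    ]
    return f"# {title}\n\n" + "\n".join(sections)


readme = render_readme(datasets)
```

Sorting both the tasks and the ids makes the output deterministic, so regenerating the README only produces a diff when the underlying datasets change.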

Running a Local Vision Language Model with LM Studio to sort out my screenshot mess

0 implied HN points • 23 Oct 24

🕹 Technology Machine Learning Artificial Intelligence Software Development Data processing

Using a local Vision Language Model (VLM) can help organize your messy screenshots effectively. It allows you to categorize images based on their content, making it easier to find them later.
Running local models has become simpler, especially with tools like LM Studio. It includes features like headless mode for background processing and support for both text and images.
Structured outputs from models can enforce formats for responses, making it easier to process and utilize the data generated. This way, tasks like sorting images become more consistent and manageable.
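
On the consuming side, a structured reply is only useful if it is checked before being acted on. The sketch below parses a JSON reply and validates one field against a fixed set of labels. The `category` field and the label set are hypothetical, and no model is called here; the reply is a hard-coded string.

```python
import json

# Hypothetical set of screenshot categories the model is allowed to return.
ALLOWED_CATEGORIES = {"code", "chat", "document", "other"}


def parse_label(raw_reply):
    """Parse a JSON reply and return its category, or raise ValueError."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as err:
        raise ValueError(f"reply is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    return category


label = parse_label('{"category": "code", "description": "terminal output"}')
```

Raising a single exception type for every failure mode lets the calling loop retry or skip a screenshot without inspecting the details.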

Synthetic dataset generation techniques: generating custom sentence similarity data

0 implied HN points • 23 May 24

🕹 Technology Artificial Intelligence Data science Machine Learning Natural Language Processing Synthetic Data

Large Language Models (LLMs) can help create synthetic datasets for training models, especially where there's a lack of real data. This approach makes it easier to gather specific information needed for tasks like text classification.
Generating sentence similarity data helps in comparing how alike two sentences are. This is useful in areas like information retrieval and clustering.
A structured approach to generating data can improve the quality and relevance of the data produced. Using prompts to control the output can help generate more accurate results for specific training needs.

Exploring language metadata for datasets on the Hugging Face Hub

0 implied HN points • 07 Jun 23

🕹 Technology Machine Learning Data science Programming Artificial Intelligence Software Development

Datasets on the Hugging Face Hub can be filtered by their language metadata. This metadata helps identify which datasets cover a specific language.
There are many languages represented in the datasets, with a total of 1719 unique languages noted. This diversity is important for developing models that support different languages.
Visual tools like bar charts and word clouds can effectively represent language frequencies in datasets. These visuals make it easier to understand the distribution and popularity of different languages.

Combining Hugging Face datasets with dask

0 implied HN points • 20 Jun 22

🕹 Technology Machine Learning Data science Software Development Data processing Programming

Hugging Face datasets help you load, process, and share data easily, but they can be tricky for exploring data. Using Dask together with Hugging Face makes data analysis smoother, especially for larger datasets.
Dask allows you to run operations in parallel, which is useful if your data can't fit into memory. You can use Dask's different collection types, like Dask Bag, to process data efficiently by breaking it into smaller chunks.
Dask dataframes work like pandas dataframes, making it easier to perform complex operations. This includes grouping data and calculating averages, which you can visualize just like you would with pandas.

flyswot

0 implied HN points • 22 Dec 21

🕹 Technology Machine Learning Computer Vision Data Management Image Processing Software Development

The project aims to use computer vision to find and correct mislabeled images in a library's digitized manuscript collection. This will help ensure that images are accurately categorized for future use.
A command line tool called 'flyswot' has been developed to check images for fake labels based on specific filename patterns. This tool helps automate the identification process.
Throughout the project, important lessons were learned about practical machine learning deployment, such as dealing with domain drift and using data version control effectively.

Metadata for machine learning: models

0 implied HN points • 27 Sep 23

Metadata in machine learning provides information about data to make tracking and working with it easier.
Machine learning artifacts include a machine learning model and a dataset for training.
Metadata for machine learning models helps enable model reuse by providing information about the model structure, weights, and configuration.

Using Hugging Face AutoTrain to train an image classifier without writing any code.

0 implied HN points • 22 Feb 23

🕹 Technology Machine Learning Artificial Intelligence Data science Computer Vision Software Development

You can train an image classifier with Hugging Face AutoTrain without needing to write any code. This makes it easier for people who aren't programmers to use machine learning.
Image classification is useful for organizing images into categories, like sorting book covers into 'useful' or 'not useful'.
The success of your model often depends more on having good training data than on the model itself. Adjusting and improving your training data can lead to better results.