This post describes the potential applications of open source machine learning tools from the Hugging Face ecosystem for working with web archive collections, specifically focusing on the Collaborative Art Archive (CARTA) collection. The aim is to explore the collection through image search, image classification, and model training.
The Hugging Face Hub is a repository that provides access to a wide range of open source machine learning models, datasets, and demos. With over 150,000 models available, users can select models that suit their specific needs instead of relying on a single model.
ARCH (Archives Research Compute Hub) offers access to 16 research-ready datasets generated from web archive collections, including image datasets in CSV format. Because large image collections are difficult to review by hand, the post turns to machine learning tools to understand a collection at scale.
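As a rough illustration of a first pass over such a CSV, the sketch below counts images per MIME type using only the Python standard library. The column names (`url`, `mime_type`) and the sample rows are hypothetical placeholders, not real ARCH data; an actual export should be checked for its own header before reusing this.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for an image-dataset CSV export.
# Column names and rows are illustrative assumptions, not real ARCH data.
SAMPLE_CSV = """url,mime_type
https://example.org/a.jpg,image/jpeg
https://example.org/b.png,image/png
https://example.org/c.jpg,image/jpeg
"""


def count_by_mime_type(csv_file):
    """Count rows per value of the (assumed) `mime_type` column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["mime_type"] for row in reader)


counts = count_by_mime_type(io.StringIO(SAMPLE_CSV))
print(counts.most_common())
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`.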
Gradio, an open source library supported by Hugging Face, is used together with Hugging Face Spaces to build a user interface for interacting with the machine learning system, datasets, and models. The interface supports tasks such as browsing images, searching them, and classifying them.
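A Gradio interface of this kind might look roughly like the sketch below. It is not the post's actual demo: the keyword search over URLs, the function names, and the example URLs are all hypothetical stand-ins, and `build_demo` assumes Gradio is installed (`pip install gradio`).

```python
def search_by_keyword(query, urls):
    """Return the URLs that contain the query text (case-insensitive)."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [url for url in urls if needle in url.lower()]


def build_demo(urls):
    """Wrap the search function in a simple Gradio UI."""
    import gradio as gr  # imported lazily so the helper above works without Gradio

    return gr.Interface(
        fn=lambda query: search_by_keyword(query, urls),
        inputs=gr.Textbox(label="Search term"),
        outputs=gr.Gallery(label="Matching images"),
    )


# Hypothetical image URLs, for illustration only.
EXAMPLE_URLS = [
    "https://example.org/exhibit/poster-1968.jpg",
    "https://example.org/exhibit/sculpture.png",
    "https://example.org/press/Poster-launch.jpg",
]

# build_demo(EXAMPLE_URLS).launch()  # uncomment to start the local web UI
```

Keeping the search logic in a plain function, separate from the UI wrapper, makes it easy to swap the keyword match for a different search method later without touching the interface.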
For image search, an embedding model maps both text and images to embeddings in a shared vector space, so that a query can be compared against the collection to find similar images. The post suggests the CLIP variant clip-ViT-B-16, hosted on the Hugging Face Hub, for this purpose.
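The comparison step can be sketched as a cosine-similarity ranking over embeddings. The toy three-dimensional vectors and file names below are made up for illustration; real CLIP embeddings are much longer. The `embed_with_clip` helper is an assumption about how the embeddings could be produced with the `sentence-transformers` library and is not called here, since it needs that library and a model download.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, vector))
        for name, vector in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def embed_with_clip(texts_or_images):
    """Assumed helper: embed strings or PIL images with clip-ViT-B-16.

    Requires `pip install sentence-transformers` and network access.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-16")
    return model.encode(texts_or_images)


# Made-up toy embeddings standing in for real model output.
TOY_EMBEDDINGS = {
    "painting.jpg": [1.0, 0.0, 0.0],
    "sculpture.jpg": [0.0, 1.0, 0.0],
    "sketch.jpg": [0.9, 0.1, 0.0],
}

results = rank_images([1.0, 0.0, 0.0], TOY_EMBEDDINGS, top_k=2)
print(results)
```

Because text and images share one embedding space, the same ranking function serves both text-to-image and image-to-image search; only the source of `query_embedding` changes.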
Image classification is another task that can be performed using Hugging Face models. The post mentions the availability of over 3,000 image classification models on the Hub. These models can be tested against the dataset to evaluate label accuracy and identify potential errors.
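Testing a model against the dataset could be sketched as follows: compare predicted labels with expected ones, report accuracy, and list the disagreements for manual review. The labels below are invented for illustration. `classify_with_hub_model` is an assumed helper built on the `transformers` image-classification pipeline; it is not called here because it needs that library and a model download, and the `model_id` argument is left to the reader.

```python
def top_label(pipeline_output):
    """Pick the highest-scoring label from a list of {"label", "score"} dicts."""
    return max(pipeline_output, key=lambda item: item["score"])["label"]


def evaluate_predictions(predicted, expected):
    """Return (accuracy, mismatches) where mismatches are (index, predicted, expected)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0, []
    mismatches = [
        (index, pred, exp)
        for index, (pred, exp) in enumerate(zip(predicted, expected))
        if pred != exp
    ]
    accuracy = 1 - len(mismatches) / len(expected)
    return accuracy, mismatches


def classify_with_hub_model(image_paths, model_id):
    """Assumed helper: label images with an image classification model from the Hub.

    Requires `pip install transformers` plus an image backend such as Pillow.
    """
    from transformers import pipeline

    classifier = pipeline("image-classification", model=model_id)
    return [top_label(classifier(path)) for path in image_paths]


# Invented labels, for illustration only.
predicted = ["photo", "drawing", "photo", "painting"]
expected = ["photo", "painting", "photo", "painting"]

accuracy, mismatches = evaluate_predictions(predicted, expected)
print(accuracy, mismatches)
```

The mismatch list is often the more useful output: a disagreement may point to a model error, but it may equally reveal a mistake in the reference labels.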
If no suitable pre-trained model is found, the post suggests training a custom computer vision model with AutoTrain. The labeled dataset this requires can be created with Label Studio, an open source data annotation tool.
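One preparatory step, turning annotations into (image, label) pairs, could be sketched as below. This assumes a Label Studio JSON export for single-choice image classification, with tasks shaped as `data` / `annotations` / `result` / `value.choices`; the `image` key depends on the labeling configuration, and the sample tasks are invented. How the pairs are then packaged for AutoTrain is not covered here.

```python
import json

# Invented sample shaped like a Label Studio JSON export (an assumption;
# the "image" key in particular depends on the labeling configuration).
SAMPLE_EXPORT = json.loads("""
[
  {"data": {"image": "https://example.org/a.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["painting"]}}]}]},
  {"data": {"image": "https://example.org/b.jpg"},
   "annotations": []},
  {"data": {"image": "https://example.org/c.jpg"},
   "annotations": [{"result": [{"type": "choices",
                                "value": {"choices": ["photo"]}}]}]}
]
""")


def first_choice(task):
    """Return the first chosen label found in a task, or None if unlabeled."""
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            choices = result.get("value", {}).get("choices", [])
            if choices:
                return choices[0]
    return None


def extract_labels(tasks, image_key="image"):
    """Collect (image, label) pairs, skipping tasks with no annotation."""
    pairs = []
    for task in tasks:
        image = task.get("data", {}).get(image_key)
        label = first_choice(task)
        if image is not None and label is not None:
            pairs.append((image, label))
    return pairs


pairs = extract_labels(SAMPLE_EXPORT)
print(pairs)
```

Skipping unlabeled tasks rather than assigning a default keeps incomplete annotation work from leaking into the training set.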
The post concludes by inviting readers to explore the ARCH Image Dataset Explorer Demo, participate in the upcoming hackathon organized by Internet Archive and Hugging Face, and duplicate and modify the provided Spaces to further explore the dataset.