GSoC 2024: MusicBrainz – Internet Archive integration: Saving external links in the Wayback Machine

Introduction

Greetings, Everyone!

I am Ashutosh Aswal (IRC nick yellowhatpro), a Computer Science grad from PEC University, India. This is my second time contributing to MetaBrainz as a GSoC contributor. Unlike last time, when I contributed to the ListenBrainz Android app, this year I took on the challenge of learning a new language and database (Rust and Postgres) to create this delightful project, Melba, which stands for MusicBrainz’s External Links wayBack machine Archiver.

As the name suggests, the project saves external webpages linked in the MusicBrainz database to the Internet Archive using the Wayback Machine API. Let me walk you through the making of Melba.

Let’s begin!! ( •̀ ω •́ )✧

Project Description

MusicBrainz sees a lot of edits daily, and many of these edits and their edit notes contain external links. Melba continuously polls the MusicBrainz database, extracts the links from edits and edit notes, and archives them in the Internet Archive using the Wayback Machine API. Since webpages change and are often taken down, preserving these links quickly lets us know the exact content of a webpage at the time the edit or edit note was made.

Coding Journey

The project is currently accessible as a public repository under the MetaBrainz organization at GitHub: https://github.com/metabrainz/melba

Here are the things I worked on (see my commit history):

Different tasks for different purposes

Melba is written in Rust and uses Postgres features such as LISTEN/NOTIFY for asynchronous communication between tasks. The project consists of the following tasks, which run concurrently:

  1. POLLER
    • This task is responsible for polling the edit_data and edit_note tables from the musicbrainz schema. It extracts links from new rows and stores them in a table called internet_archive_urls, which has the following schema:
      CREATE TABLE external_url_archiver.internet_archive_urls (
          id serial,
          url text,
          job_id text, -- response returned when we make the URL save request
          from_table VARCHAR, -- table from where URL is taken
          from_table_id INTEGER, -- id of the row from where the URL is taken
          created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
          retry_count INTEGER, -- keeps track of number of retries made for the URL
          status INTEGER DEFAULT 1, -- not started
          status_message text -- keeps the status message of archival of URL
      );
    • Extracting links from edit_note required searching for links in the text column of the edit_note table. I used the linkify crate to find links in plain text (see the linkify sketch after this list).
    • To extract links from edit_data, I first check that the editor is not marked as a spammer; if not, I use the edit’s type to decide how to filter the URLs out of edit_data:
      pub fn extract_urls_from_json(json: &JsonValue, edit_type: i16) -> Vec<String> {
          match edit_type {
              90 => extract_url_from_add_relationship(json)
                  .map(|url| vec![url])
                  .unwrap_or_default(),
              91 => extract_url_from_edit_relationship(json)
                  .map(|url| vec![url])
                  .unwrap_or_default(),
              92 => extract_url_from_remove_relationship(json)
                  .map(|url| vec![url])
                  .unwrap_or_default(),
              101 => extract_url_from_edit_url(json)
                  .map(|url| vec![url])
                  .unwrap_or_default(),
              _ => extract_url_from_any_annotation(json).unwrap_or_default(),
          }
      }
  2. ARCHIVAL
    Archival of links is done by two concurrent tasks:
    1. NOTIFIER
      • This task iterates over rows in the internet_archive_urls table and sends them to the LISTENER task using Postgres’ LISTEN/NOTIFY feature.
      • With LISTEN/NOTIFY, we create a notification channel for asynchronous communication between the two tasks: the notifier sends payloads to the channel, and the listener receives them in FIFO (First In, First Out) order (see the LISTEN/NOTIFY sketch after this list).
    2. LISTENER
      • The listener uses the reqwest HTTP client to make archival requests to the Wayback Machine API.
      • Once we make the network request to archive a link, we get a job_id in return. We then make status requests, with retries, to check whether the link got archived.
      • We make up to 3 status requests, 2 minutes apart, and if the status is still pending or an error, we update the corresponding row’s status accordingly (see the archival sketch after this list).
  3. RETRY_AND_CLEANUP
    • This task runs periodically, once a day.
    • Once we have archived a link, we don’t need to keep its row in internet_archive_urls indefinitely.
    • Also, some links may fail to be saved, so there must be a way to retry archiving them.
    • This task iterates over the whole internet_archive_urls table to check the status of each link. If the link was archived, we remove its row from the table; if the error was transient, we retry archiving the link by sending it to the notification channel that the LISTENER task is listening on.
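
As a small illustration of the edit note extraction mentioned above, here is a minimal sketch using the linkify crate (the function name and its use here are my own, not the exact code in the repository):

    use linkify::{LinkFinder, LinkKind};

    // Collect every URL found in a plain-text edit note.
    fn extract_links_from_text(text: &str) -> Vec<String> {
        let mut finder = LinkFinder::new();
        finder.kinds(&[LinkKind::Url]); // only URLs, not email addresses
        finder.links(text).map(|link| link.as_str().to_string()).collect()
    }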
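
To give a feel for how the NOTIFIER and LISTENER communicate, here is a hedged sketch of Postgres LISTEN/NOTIFY using sqlx; the channel name and payload are illustrative assumptions, not Melba’s actual ones:

    use sqlx::postgres::{PgListener, PgPool};

    // NOTIFIER side: publish a payload (e.g. a row id) on the channel.
    async fn notify(pool: &PgPool, payload: &str) -> Result<(), sqlx::Error> {
        // "archive_urls" is a hypothetical channel name.
        sqlx::query("SELECT pg_notify('archive_urls', $1)")
            .bind(payload)
            .execute(pool)
            .await?;
        Ok(())
    }

    // LISTENER side: receive payloads in FIFO order.
    async fn listen(database_url: &str) -> Result<(), sqlx::Error> {
        let mut listener = PgListener::connect(database_url).await?;
        listener.listen("archive_urls").await?;
        loop {
            let notification = listener.recv().await?;
            println!("received: {}", notification.payload());
            // ...trigger the archival request for this row here...
        }
    }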
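
Finally, here is a rough sketch of the archival round trip with reqwest. The endpoints follow the public Save Page Now API, but the request parameters, authentication (omitted here), and response fields are simplified assumptions rather than Melba’s exact code:

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct SaveResponse {
        job_id: String, // assumed field name in the save response
    }

    #[derive(Deserialize)]
    struct StatusResponse {
        status: String, // assumed values: "pending", "success", "error"
    }

    // Ask the Wayback Machine to archive a URL; returns a job_id to poll.
    async fn save_url(client: &reqwest::Client, url: &str) -> Result<String, reqwest::Error> {
        let response: SaveResponse = client
            .post("https://web.archive.org/save")
            .form(&[("url", url)])
            .send()
            .await?
            .json()
            .await?;
        Ok(response.job_id)
    }

    // Poll the status of an archival job by its job_id.
    async fn check_status(client: &reqwest::Client, job_id: &str) -> Result<String, reqwest::Error> {
        let response: StatusResponse = client
            .get(format!("https://web.archive.org/save/status/{job_id}"))
            .send()
            .await?
            .json()
            .await?;
        Ok(response.status)
    }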

Dedicated database schema

  • To isolate these new features, we created a new Postgres schema called external_url_archiver. The related scripts are under the folder scripts/sql/.

CLI feature

  • To manually queue a link, a piece of edit_data, or an edit_note, and to check the status of any job_id, there is a CLI application as well (a sketch of such an interface follows).
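
For illustration only, a CLI along these lines could be defined with the clap crate; the subcommand and argument names below are hypothetical, not Melba’s actual interface:

    use clap::{Parser, Subcommand};

    #[derive(Parser)]
    #[command(name = "melba-cli")]
    struct Cli {
        #[command(subcommand)]
        command: Command,
    }

    #[derive(Subcommand)]
    enum Command {
        /// Queue a single URL for archival (hypothetical subcommand).
        QueueUrl { url: String },
        /// Check the archival status of a job (hypothetical subcommand).
        Status { job_id: String },
    }

    fn main() {
        let cli = Cli::parse();
        match cli.command {
            Command::QueueUrl { url } => println!("queueing {url}"),
            Command::Status { job_id } => println!("checking status of job {job_id}"),
        }
    }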

Tests

“No project is complete without tests”, said no one ever.

I have added unit tests for the different tasks, testing the database methods, helper functions, and network requests. I also added a couple of integration tests to exercise the overall flow of the tasks.

Maintainability

To ensure everything works properly, whether in development or production, here are a couple of things I took care of:

  1. Dockerized the project, which simplifies the project development and deployment process.
  2. Added GitHub workflows. One checks for mistakes, shouts at smelly code, and runs the unit and integration tests. Another workflow pushes the Docker image to the Docker Hub registry.
  3. Integrated a pre-commit hook, which checks formatting and lints before we git commit, ensuring no smelly code gets committed.
  4. Added Prometheus for metrics collection and monitoring, along with Grafana to show the metrics in a cool dashboard.
  5. Configured Sentry for tracking issues and errors in the application.
  6. Used the config crate to define configuration variables used all over the application, so they can easily be adjusted for the development and production environments (see the sketch after this list).
  7. Added documentation about the installation, architecture of the application, and steps to run it properly.
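
As an example of point 6, a minimal, environment-aware configuration loader with the config crate might look like this (the file layout and key names are illustrative assumptions):

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Settings {
        database_url: String,       // hypothetical key
        poll_interval_seconds: u64, // hypothetical key
    }

    fn load_settings() -> Result<Settings, config::ConfigError> {
        // RUN_ENV selects e.g. config/development.toml or config/production.toml.
        let env = std::env::var("RUN_ENV").unwrap_or_else(|_| "development".into());
        config::Config::builder()
            .add_source(config::File::with_name(&format!("config/{env}")))
            // Environment variables such as APP_DATABASE_URL override file values.
            .add_source(config::Environment::with_prefix("APP"))
            .build()?
            .try_deserialize()
    }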

Next steps

The project works as expected. But since I am still a beginner in Rust, and the project has more potential than what I have coded so far, there are a lot of improvements in and around the project that we should make:

  1. Support archiving external links from previous edits and edit notes.
  2. Show archived versions of the links on the MusicBrainz website.
  3. Coding in Rust was a delightful experience, but unfortunately I couldn’t fully utilize its power. I would love to learn more Rust features to make the project better: more idiomatic code, better error handling, less repetitive code, and much more.
  4. Add more documentation, so that new contributors can understand the project better.
  5. Integrate more MetaBrainz projects, like BookBrainz, which could use Melba to archive external webpages linked in edit notes.
  6. Add a feature to regularly re-archive linked webpages whose content can change later, such as artist pages on Spotify or iTunes.
  7. Further improve the naming. The project started with a weird name, mb-exurl-ia-service, which was neither clear nor peachy, so we renamed it to melba after it was transferred to the MetaBrainz organization on GitHub.

Acknowledgement

Contributing to MetaBrainz has always been fun for me. Thanks to this program, I got to know all these geeks who give me constant motivation to keep learning and have made me more confident in my skills.

I would like to thank yvanzo, bitmap, and reosarevok (my mentors, for always helping and supporting me whenever I needed it), Jade Ellis (my co-contributor, for providing the best references), atj (a fellow MetaBrainz member, who gave me different ideas and approaches), and rustynova (a community member I know from IRC, who helped me develop a sense for idiomatic Rust practices). Thanks to you all, I learned a lot and grew as an engineer.

At last, I am grateful to all of the MetaBrainz team for the awesome experience I had this summer.

That’s it from my side. Thank you for having me !! ヾ(≧▽≦*)o
