Introduction
Greetings, Everyone!
I am Ashutosh Aswal (IRC nick yellowhatpro), a Computer Science grad from PEC University, India. This is my second time contributing to MetaBrainz as a GSoC contributor, and unlike last time, when I contributed to the ListenBrainz Android app, this year I took on the challenge of learning a new language and a new stack (Rust and Postgres) to create this delightful project, Melba, which stands for MusicBrainz’s External Links wayBack machine Archiver.
As the name suggests, the project saves external webpages linked in the MusicBrainz database to the Internet Archive using Wayback Machine API. Let me walk you through the making of Melba.
Let’s begin!! ( •̀ ω •́ )✧
Project Description
MusicBrainz sees a lot of edits daily, and many of these edits and edit notes contain external links. With Melba, we can continuously poll the MusicBrainz database, extract the links from edits and edit notes, and archive them in the Internet Archive using the Wayback Machine API. Since webpages change and are often taken down, preserving these links quickly lets us know the exact content of a webpage at the time the edit or edit note was made.
Coding Journey
The project is currently accessible as a public repository under the MetaBrainz organization at GitHub: https://github.com/metabrainz/melba
Here are the things I worked on (see my commit history):
Different tasks for different purposes
Melba is written in Rust and uses the features of Postgres, like LISTEN/NOTIFY for asynchronous communication between tasks. The project consists of the following tasks, which run concurrently:
- POLLER
  - This task is responsible for polling the `edit_data` and `edit_note` tables from the `musicbrainz` schema. It extracts the links they contain and stores them in a table called `internet_archive_urls`, which has the following schema:

```sql
CREATE TABLE external_url_archiver.internet_archive_urls (
    id serial,
    url text,
    job_id text, -- response returned when we make the URL save request
    from_table VARCHAR, -- table from where the URL is taken
    from_table_id INTEGER, -- id of the row from where the URL is taken
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    retry_count INTEGER, -- keeps track of the number of retries made for the URL
    status INTEGER DEFAULT 1, -- not started
    status_message text -- keeps the status message of the archival of the URL
);
```

  - Extracting links from `edit_note` requires searching for links in the `text` column of the `edit_note` table. I used the linkify crate to detect the links in plain text.
  - For extracting links from `edit_data`, I first check whether the user is marked as a spammer, and if not, I use the `edit.type` of the edit and, based on the edit type, filter the URLs out of `edit_data`:

```rust
pub fn extract_urls_from_json(json: &JsonValue, edit_type: i16) -> Vec<String> {
    match edit_type {
        90 => extract_url_from_add_relationship(json)
            .map(|url| vec![url])
            .unwrap_or_default(),
        91 => extract_url_from_edit_relationship(json)
            .map(|url| vec![url])
            .unwrap_or_default(),
        92 => extract_url_from_remove_relationship(json)
            .map(|url| vec![url])
            .unwrap_or_default(),
        101 => extract_url_from_edit_url(json)
            .map(|url| vec![url])
            .unwrap_or_default(),
        _ => extract_url_from_any_annotation(json).unwrap_or_default(),
    }
}
```
- ARCHIVAL
  - Archival of links is done using 2 concurrent tasks, namely NOTIFIER and LISTENER.
  - NOTIFIER
    - This task iterates over the rows of the `internet_archive_urls` table and sends them to the LISTENER task using Postgres’ `LISTEN/NOTIFY` feature.
    - With LISTEN/NOTIFY, we create a notification channel for asynchronous communication between the 2 tasks: when the notifier sends a payload to the channel, the listener receives it from the channel in a FIFO (First In, First Out) manner.
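To make the FIFO behaviour concrete, here is a small analogy using a plain Rust channel (illustrative only: melba uses an actual Postgres notification channel, not `mpsc`):

```rust
use std::sync::mpsc;
use std::thread;

// Conceptual analogy for Postgres LISTEN/NOTIFY: payloads sent by the
// notifier arrive at the listener in FIFO order, like a channel between
// two concurrent tasks.
fn send_and_receive(row_ids: &[i32]) -> Vec<String> {
    let (notifier, listener) = mpsc::channel::<String>();
    let ids = row_ids.to_vec();
    let producer = thread::spawn(move || {
        // NOTIFIER side: send each internet_archive_urls row id as a payload.
        for id in ids {
            notifier.send(id.to_string()).unwrap();
        }
    });
    producer.join().unwrap();
    // LISTENER side: payloads arrive in the order they were sent.
    listener.iter().collect()
}

fn main() {
    assert_eq!(send_and_receive(&[1, 2, 3]), vec!["1", "2", "3"]);
}
```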
  - LISTENER
    - The listener makes use of the `reqwest` network client to make archival requests to the Wayback Machine API.
    - Once we make the network request to archive a link, we get a `job_id` in return. We then make status requests, with retries, to check whether the link got archived.
    - We make 3 retry status requests at an interval of 2 minutes, and if the status is still pending or is an error, we simply update the status of the corresponding row.
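The retry policy above can be sketched as a small, testable function (the names and types here are my assumptions, not melba’s actual code); the status check and the sleep are injected so the sketch needs no real network calls or waiting:

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
enum ArchivalStatus {
    Success,
    Pending,
    Error,
}

// Poll the archival job status up to `max_retries` times, sleeping
// `interval` between attempts. Returns the last status seen; if it is
// still Pending (or an Error), the caller just updates the row's status.
fn wait_for_archival(
    mut check_status: impl FnMut() -> ArchivalStatus,
    max_retries: u32,
    interval: Duration,
    sleep: impl Fn(Duration),
) -> ArchivalStatus {
    let mut last = ArchivalStatus::Pending;
    for _ in 0..max_retries {
        last = check_status();
        if last != ArchivalStatus::Pending {
            return last;
        }
        sleep(interval);
    }
    last
}

fn main() {
    // Simulate a job that succeeds on the second status check.
    let mut calls = 0;
    let status = wait_for_archival(
        || {
            calls += 1;
            if calls < 2 {
                ArchivalStatus::Pending
            } else {
                ArchivalStatus::Success
            }
        },
        3,
        Duration::from_secs(120), // 2-minute interval, as in melba
        |_| {},                   // no-op sleep for the demo
    );
    assert_eq!(status, ArchivalStatus::Success);
    assert_eq!(calls, 2);
}
```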
- RETRY_AND_CLEANUP
  - This task runs periodically, at an interval of 1 day.
  - Once we have archived a link, we don’t need to keep its row in `internet_archive_urls` indefinitely.
  - Also, there might be some links that could not be saved, so there must be a way to retry archiving them.
  - This task iterates over the whole `internet_archive_urls` table to check the status of the links. If a link got archived, we remove its row from the table; if the error was a transient one, we retry archiving the link by sending it to the notification channel that the LISTENER task is listening to.
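The per-row decision this task makes can be sketched like this (the status values and retry limit below are hypothetical; the schema above only documents `1 = not started`):

```rust
// Hypothetical row states; melba's real numeric status codes may differ.
#[derive(Debug, PartialEq)]
enum RowStatus {
    NotStarted,
    Success,
    Failed,
}

#[derive(Debug, PartialEq)]
enum CleanupAction {
    /// Archived: the row can be deleted from internet_archive_urls.
    Remove,
    /// Transient failure: re-send the row to the notification channel.
    Retry,
    /// Nothing to do on this pass.
    Leave,
}

// Decide what RETRY_AND_CLEANUP should do with one row.
fn cleanup_action(status: &RowStatus, retry_count: i32, max_retries: i32) -> CleanupAction {
    match status {
        RowStatus::Success => CleanupAction::Remove,
        RowStatus::Failed if retry_count < max_retries => CleanupAction::Retry,
        _ => CleanupAction::Leave,
    }
}

fn main() {
    assert_eq!(cleanup_action(&RowStatus::Success, 0, 3), CleanupAction::Remove);
    assert_eq!(cleanup_action(&RowStatus::Failed, 1, 3), CleanupAction::Retry);
    assert_eq!(cleanup_action(&RowStatus::Failed, 3, 3), CleanupAction::Leave);
    assert_eq!(cleanup_action(&RowStatus::NotStarted, 0, 3), CleanupAction::Leave);
}
```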
Dedicated database schema
- To isolate these new features, we created a new Postgres schema called `external_url_archiver`. The related scripts are under the `scripts/sql/` folder.
CLI feature
- To be able to manually queue a link, an `edit_data` row, or an `edit_note` row, and to check the status of any `job_id`, we have a CLI application as well.
melba cli
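For illustration, here is a hypothetical sketch of such a CLI surface (the subcommand names are my assumptions, not necessarily melba’s real ones):

```rust
use std::env;

// Dispatch on a hypothetical subcommand: queue a URL or query a job_id.
fn run(args: &[&str]) -> String {
    match args {
        ["save-url", url] => format!("queued URL for archival: {url}"),
        ["status", job_id] => format!("checking status of job {job_id}"),
        _ => "usage: melba-cli <save-url URL | status JOB_ID>".to_string(),
    }
}

fn main() {
    let owned: Vec<String> = env::args().skip(1).collect();
    let args: Vec<&str> = owned.iter().map(String::as_str).collect();
    println!("{}", run(&args));
}
```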
Tests
“No project is complete without tests”, said no one ever.
I have added unit tests for different tasks, testing the database methods, helper functions, and network requests. A couple of integration tests are also added to test the overall flow of the tasks.
Maintainability
To ensure everything works properly, whether in development or production, here are a couple of things I took care of:
- Dockerized the project, which simplifies the project development and deployment process.
- Added GitHub workflows. One of them lints the code, shouting at smelly code, and runs the unit and integration tests. Another workflow facilitates pushing the Docker image to the Docker Hub registry.
- Integrated pre-commit hook, which checks for formatting and lints before we git commit, ensuring no smelly code is committed.
- Added Prometheus for metrics collection and monitoring, along with Grafana to show the metrics in a cool dashboard.
- Configured Sentry for tracking issues and errors in the application.
- Used the config crate to define configuration variables used all over the application. The variables can thus be configured easily for the development and production environments.
- Added documentation about the installation, architecture of the application, and steps to run it properly.
Grafana dashboard showing melba in action
Next steps
The project works as expected. But since I am still a beginner in Rust, and the project can be even more useful than what I have coded so far, there are a lot of improvements in and around the project that we should make:
- Support archiving external links from previous edits and edit notes.
- Show archived versions of the links on the MusicBrainz website.
- Coding in Rust was a delightful experience, but unfortunately I couldn’t fully utilize its power. I would love to learn more of the Rust features that I can use to make the project better, which includes more idiomatic coding, better error handling, less repetitive code, and much more.
- Add more documentation, so that new contributors can understand the project better.
- Integrate more MetaBrainz projects, like BookBrainz, which could make use of this project to archive external webpages linked in edit notes.
- Add a feature to regularly re-archive linked webpages whose content can later be changed by artists, such as Spotify or iTunes pages.
- Further improve the naming. We have worked on this already: the project started with the weird name `mb-exurl-ia-service`, which was neither clear nor peachy, so we renamed it to `melba` after it was transferred to the MetaBrainz organization on GitHub.
Acknowledgement
Contributing to MetaBrainz has always been fun to me. Thanks to this program, I got to know all these geeks who give me constant motivation to keep learning and have made me more confident in my skills.
I would love to thank yvanzo, bitmap, and reosarevok (my mentors, for always helping and supporting me when I needed it), Jade Ellis (my co-contributor, for providing the best references), atj (a fellow MetaBrainz member, who gave me different ideas and approaches), and rustynova (a community member I know from IRC, who helped me develop a sense for idiomatic Rust code practices). Thanks to you all, I got to learn a lot and grow as an engineer.
At last, I am grateful to all of the MetaBrainz team for the awesome experience I had this summer.
That’s it from my side. Thank you for having me !! ヾ(≧▽≦*)o
Thanks Ashutosh for your commitment to completing your project even though the bar was higher this year. It has been successfully running on test.musicbrainz.org for some time already. We are looking forward to deploying it to the main MusicBrainz servers.
Wow! How cool!!
Thanks yvanzo. It could not have been possible without you.
Thanks jesus2099, it sure is cool
I can’t believe this is actually finally happening. MB has needed this for over a decade.
I’m legit *really* looking forward to this being implemented!