Most of us use and love the jq command. It works on Linux or Unix-like systems to extract data from JSON documents. Recently I found htmlq, which is like jq but for HTML, and is written in Rust. Imagine being able to sed or grep for HTML data. We can search, slice, and filter HTML data with htmlq. Let us see how to install and use this handy tool on Linux or Unix and play with HTML data.
What is the htmlq tool?
It is like jq, but for HTML. It uses CSS selectors to extract bits of content from HTML files. In CSS, selectors target the HTML elements on a web page that we want to style; htmlq uses the same selector syntax to pick the elements to print. For example, we can easily extract image or link URLs using this tool.
Installing htmlq on Linux or Unix
Here is how to install cargo and rustc on Ubuntu or Debian Linux using the apt command/apt-get command:
sudo apt install cargo
Then you would run:
cargo install htmlq
macOS installing cargo
Open the Terminal app and then run the port command as follows:
sudo port install cargo
Or you can install Homebrew on macOS to use the brew package manager as follows:
brew install rustup
# installs both cargo and rustc
rustup-init
rustc --version
FreeBSD install cargo
I am going to use the pkg command as follows to install rustc:
sudo pkg install rust
See how to install Rust for other operating systems. Now that I have both the rustc and cargo tools, I type the following simple command to get htmlq on my development system:
cargo install htmlq
Have you installed Rust lang? Now install htmlq for fun and profit using the cargo command.
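After the install, a quick sanity check can confirm that the shell can actually see the tools. The following is a minimal sketch, not part of htmlq itself; the helper name have_cmd is made up for illustration.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds when the named
# command can be found by the shell, and fails otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of each tool used in this tutorial.
for tool in rustc cargo htmlq; do
    if have_cmd "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found on PATH"
    fi
done
```

If htmlq is reported as not found right after a successful cargo install, the usual cause is that $HOME/.cargo/bin is missing from PATH.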
Setting up your PATH
Make sure you add $HOME/.cargo/bin to your PATH variable so that you can run the installed binaries. Use the export command (or setenv for csh-style shells):
# sh/bash/ksh etc
export PATH="$PATH:$HOME/.cargo/bin"
# tcsh/csh etc (braces stop csh from treating ":" as a variable modifier)
setenv PATH "${PATH}:${HOME}/.cargo/bin"
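If the export line ends up in a startup file that is sourced more than once, PATH can accumulate duplicate entries. One way to avoid that, sketched here for sh-compatible shells only, is to append the directory only when it is not already listed. The variable name cargo_bin is just an illustrative choice.

```shell
#!/bin/sh
# Append $HOME/.cargo/bin to PATH only if it is not already there.
cargo_bin="$HOME/.cargo/bin"

case ":$PATH:" in
    *":$cargo_bin:"*)
        # Already present, so leave PATH untouched.
        ;;
    *)
        PATH="$PATH:$cargo_bin"
        export PATH
        ;;
esac
```

Wrapping PATH in colons before matching means the directory is detected whether it sits at the start, middle, or end of the list.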
How to use htmlq to extract content from HTML files on Linux or Unix
Let us use the curl command to find part of a page by ID:
curl -s url | htmlq '#css-selector'
curl -s url2 | htmlq '#css-selector'
curl -s https://www.cyberciti.biz/faq/ | htmlq --pretty '#content' | more
Let us find all the links in a page. For example:
curl -s https://www.nixcraft.com | htmlq --attribute href a
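To experiment with selectors without hitting a live site, the same commands can be pointed at a local file. The sketch below writes a small sample page (its contents are invented for this example) and queries it using only the flags listed in the htmlq help output. The htmlq calls are skipped if htmlq is not on PATH.

```shell
#!/bin/sh
# Create a throwaway sample page; the markup is made up for this demo.
sample="$(mktemp "${TMPDIR:-/tmp}/htmlq-demo.XXXXXX")"
cat > "$sample" <<'EOF'
<html>
  <body>
    <div id="content">
      <h1>Sample page</h1>
      <a href="/faq/">FAQ</a>
      <a href="https://www.nixcraft.com">nixCraft</a>
    </div>
  </body>
</html>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Select the element with id="content" and pretty-print it.
    htmlq --filename "$sample" --pretty '#content'
    # Print only the href attribute of every <a> element.
    htmlq --filename "$sample" --attribute href a
    # Print only the text nodes inside the <h1> element.
    htmlq --filename "$sample" --text h1
else
    echo "htmlq not found on PATH; skipping the demo" >&2
fi
```

The sample file is left in place so you can keep trying other selectors against it; remove it with rm -f "$sample" when you are done.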
Getting help
Simply run:
htmlq --help
htmlq 0.0.1
Michael Maclean <michael@mgdm.net>
Runs CSS selectors on HTML
USAGE:
htmlq [FLAGS] [OPTIONS] <selector>...
FLAGS:
-h, --help Prints help information
-w, --ignore-whitespace When printing text nodes, ignore those that consist entirely of whitespace
-p, --pretty Pretty-print the serialised output
-t, --text Output only the contents of text nodes inside selected elements
-V, --version Prints version information
OPTIONS:
-a, --attribute <attribute> Only return this attribute (if present) from selected elements
-f, --filename <FILE> The input file. Defaults to stdin
-o, --output <FILE> The output file. Defaults to stdout
ARGS:
<selector>... The CSS expression to select
Summing up
htmlq is a lovely tool indeed, and I liked it very much. Do check out the source code on GitHub. Try it out and let me know what you like about it in the comment section below.