How to use htmlq to extract content from HTML files on Linux - nixCra…

archived 23 Jul 2022 07:27:36 UTC
Most of us love and use the jq command. It works on Linux and Unix-like systems to extract data from JSON documents. Recently I found htmlq, which is like jq but for HTML, and is written in Rust. Imagine being able to sed or grep for HTML data. We can search, slice, and filter HTML data with htmlq. Let us see how to install and use this handy tool on Linux or Unix and play with HTML data.

What is the htmlq tool?

It is like jq, but for HTML. It uses CSS selectors to extract bits of content from HTML files. In CSS, selectors target the HTML elements on a web page that we want to style; htmlq uses the same syntax to pick the elements we want to print. For example, we can easily extract image URLs or link URLs using this tool.
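
To make the idea of selecting by CSS selector concrete, here is a minimal sketch that pulls image URLs out of a local file. The sample page, its file name, and the workdir variable are made up for illustration, and the snippet only calls htmlq when it is already installed:

```shell
# Build a small sample page in a temporary directory (made-up content).
workdir=$(mktemp -d)
cat > "$workdir/sample.html" <<'EOF'
<html>
  <body>
    <h1 id="title">Sample page</h1>
    <img src="/images/logo.png" alt="Logo">
    <img src="/images/banner.jpg" alt="Banner">
    <a href="https://example.com/docs">Docs</a>
  </body>
</html>
EOF

# 'img' is a CSS type selector; --attribute prints that attribute
# for every element the selector matches.
if command -v htmlq >/dev/null 2>&1; then
    htmlq --attribute src img < "$workdir/sample.html"
else
    echo "htmlq is not installed yet" >&2
fi
```

With htmlq installed, this should print the two src values from the sample page, one per line.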

Installing htmlq on Linux or Unix

Here is how to install cargo and rustc on Ubuntu or Debian Linux using the apt command/apt-get command:
sudo apt install cargo
Then you would run:
cargo install htmlq

Installing cargo on macOS

Open the Terminal app and then run the port command as follows:
sudo port install cargo
Or you can install Homebrew on macOS to use the brew package manager as follows:
brew install rustup # installs both cargo and rustc
rustup-init
rustc --version

Installing cargo on FreeBSD

I am going to use the pkg command as follows to install rustc:
sudo pkg install rust
See how to install Rust for other operating systems. Now that I have both the rustc and cargo tools, I type the following simple command to get htmlq on my development system:
cargo install htmlq
Have you installed Rust lang? Now install htmlq for fun and profit using the cargo command.

Setting up your PATH

Make sure you add $HOME/.cargo/bin to your PATH variable so that you can run the installed binaries. Use the export command (or setenv for csh-style shells):
# sh/bash/ksh etc
export PATH="$PATH:$HOME/.cargo/bin" 
 
# tcsh/csh etc
setenv PATH $PATH:$HOME/.cargo/bin
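
If the export line goes into a shell startup file, it can end up adding the same directory more than once. Here is a small sketch for sh/bash/ksh that appends the directory only when it is missing and then checks that the shell can find htmlq. The add_to_path name is a hypothetical helper of my own, not something cargo or htmlq provides:

```shell
# add_to_path is a hypothetical helper: append a directory to PATH
# only if it is not already listed there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, do nothing
        *) PATH="$PATH:$1" ;;     # append, like the export line above
    esac
}

add_to_path "$HOME/.cargo/bin"
export PATH

# Confirm that the shell can now find the binary.
if command -v htmlq >/dev/null 2>&1; then
    htmlq --version
else
    echo "htmlq not found; check that 'cargo install htmlq' finished" >&2
fi
```

Calling add_to_path a second time with the same directory leaves PATH unchanged, so the snippet is safe to source repeatedly.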

How to use htmlq to extract content from HTML files on Linux or Unix

Let us use the curl command to fetch a page and find part of it by ID. Replace url and #css-selector with your own values:
curl -s url | htmlq '#css-selector'
curl -s url2 | htmlq '#css-selector'
curl -s https://www.cyberciti.biz/faq/ | htmlq --pretty '#content' | more

Let us find all the links in a page. For example:
curl -s https://www.nixcraft.com | htmlq --attribute href a
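
The raw list usually mixes absolute URLs, relative paths, and duplicates. Since htmlq prints one value per line, the usual text tools can tidy it up. The following is only a sketch; unique_absolute and page_links are hypothetical helper names I made up, built from grep, sort, and the htmlq options shown above:

```shell
# unique_absolute is a hypothetical helper: read one URL per line on stdin,
# keep only http(s) links, and drop duplicates.
unique_absolute() {
    grep -E '^https?://' | sort -u
}

# page_links is a hypothetical helper: fetch a page and list its unique
# absolute links. It needs network access and htmlq on PATH when called.
page_links() {
    curl -s "$1" | htmlq --attribute href a | unique_absolute
}
```

For example, run page_links https://www.nixcraft.com to get the de-duplicated list of absolute links from that page.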

Getting help

Simply run:
htmlq --help
htmlq 0.0.1
Michael Maclean <michael@mgdm.net>
Runs CSS selectors on HTML
 
USAGE:
    htmlq [FLAGS] [OPTIONS] <selector>...
 
FLAGS:
    -h, --help                 Prints help information
    -w, --ignore-whitespace    When printing text nodes, ignore those that consist entirely of whitespace
    -p, --pretty               Pretty-print the serialised output
    -t, --text                 Output only the contents of text nodes inside selected elements
    -V, --version              Prints version information
 
OPTIONS:
    -a, --attribute <attribute>    Only return this attribute (if present) from selected elements
    -f, --filename <FILE>          The input file. Defaults to stdin
    -o, --output <FILE>            The output file. Defaults to stdout
 
ARGS:
    <selector>...    The CSS expression to select

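Here is a small sketch that combines several of the options listed in the help text: reading from a file, printing only text, skipping whitespace-only text nodes, and writing to an output file. The page content, the file names, and the workdir variable are made up for illustration, and htmlq is only called when it is installed:

```shell
# Create a tiny sample page to work on (made-up content).
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<div id="content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
EOF

if command -v htmlq >/dev/null 2>&1; then
    # Read from a file, keep only text nodes, skip whitespace-only ones,
    # and write the result to a file instead of stdout.
    htmlq --filename "$workdir/page.html" \
          --output "$workdir/content.txt" \
          --text --ignore-whitespace '#content p'
    cat "$workdir/content.txt"
else
    echo "htmlq is not installed yet" >&2
fi
```

The selector '#content p' matches every p element inside the element whose ID is content, so only the paragraph text ends up in content.txt.
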
Summing up

htmlq is a lovely tool indeed, and I like it very much. Do check out the source code on GitHub. Try it out and let me know what you like about it in the comment section below.

1 comment
  • Steve Knoblock Mar 3, 2022 @ 15:10
    This is a good find. I can imagine accessing HTML documents like xPath.