Examples
The -H
flag in the examples below signifies that the file has a header row which is used for naming columns.
The -t
flag is just a shortcut for saying that the file is a tab-separated file (any delimiter is supported - Use the -d
flag).
Queries are given using upper case for clarity, but actual query keywords such as SELECT and WHERE are not really case sensitive.
- Examples
- Example 1 - COUNT DISTINCT values of specific field (uuid of clicks data)
- Example 2 - Filter numeric data, controlling ORDERing and LIMITing output
- Example 3 - Illustrate GROUP BY
- Example 4 - More complex GROUP BY (group by time expression)
- Example 5 - Read input from standard input
- Example 6 - Use column names from header row
- Example 7 - JOIN two files
Example 1 - COUNT DISTINCT values of specific field (uuid of clicks data)
q -H -t "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"
Output 1:
229
Example 2 - Filter numeric data, controlling ORDERing and LIMITing output
Note that q
understands that the column is numeric and filters according to its numeric value (real numeric value comparison, not string comparison).
q -H -t "SELECT request_id,score FROM ./clicks.csv WHERE score > 0.7 ORDER BY score DESC LIMIT 5"
Output 2:
2cfab5ceca922a1a2179dc4687a3b26e 1.0
f6de737b5aa2c46a3db3208413a54d64 0.986665809568
766025d25479b95a224bd614141feee5 0.977105183282
2c09058a1b82c6dbcf9dc463e73eddd2 0.703255121794
Example 3 - Illustrate GROUP BY
q -t -H "SELECT hashed_source_machine,count(*) FROM ./clicks.csv GROUP BY hashed_source_machine"
Output 3:
47d9087db433b9ba.domain.com 400000
Example 4 - More complex GROUP BY (group by time expression)
q -t -H "SELECT strftime('%H:%M',date_time) hour_and_minute,count(*) FROM ./clicks.csv GROUP BY hour_and_minute"
Output 4:
07:00 138148
07:01 140026
07:02 121826
Example 5 - Read input from standard input
Calculates the total size per user/group in the /tmp subtree.
sudo find /tmp -ls | q "SELECT c5,c6,sum(c7)/1024.0/1024 AS total FROM - GROUP BY c5,c6 ORDER BY total desc"
Output 5:
mapred hadoop 304.00390625
root root 8.0431451797485
smith smith 4.34389972687
Example 6 - Use column names from header row
Calculate the top 3 user ids with the largest number of owned processes, sorted in descending order.
Note the usage of the autodetected column name UID in the query.
ps -ef | q -H "SELECT UID,COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 3"
Output 6:
root 152
harel 119
avahi 2
The following command joins an ls output (exampledatafile) and a file containing rows of group-name,email (group-emails-example) and provides a row of filename,email for each of the emails of the group. For brevity of output, there is also a filter for a specific filename called ppp which is achieved using a WHERE clause.
q "SELECT myfiles.c8,emails.c2 FROM exampledatafile myfiles JOIN group-emails-example emails ON (myfiles.c4 = emails.c1) WHERE myfiles.c8 = 'ppp'"
Output 7:
ppp dip.1@otherdomain.com
ppp dip.2@otherdomain.com
You can see that the ppp
filename appears twice, each time matched to one of the emails of the group dip to which it belongs. Take a look at the files exampledatafile
and group-emails-example
for the data.
Column name detection is supported for JOIN scenarios as well. Just specify -H
in the command line and make sure that the source files contain the header rows.