「比較」を「結合」に置き換えて考えるとシンプルになる.
How do I compare one colum of a file with another colum of another file using awk? – StackOverflow
1 2 3 4 5 6 | %%writefile file12018-03-14 13:23:00 CID [72883359]2018-03-14 13:23:00 CID [275507537]2018-03-14 13:23:00 CID [275507539]2018-03-14 13:23:00 CID [207101094]2018-03-14 13:23:00 CID [141289821] |
1 2 3 4 5 6 7 | %%writefile file22018-03-14 13:23:00 CID [207101072]2018-03-14 13:23:00 CID [275507524]2018-03-14 13:23:00 CID [141289788]2018-03-14 13:23:00 CID [72883352]2018-03-14 13:23:01 CID [72883359]2018-03-14 13:23:00 CID [275507532] |
1 2 3 4 | %%bashtime {awk 'NR==FNR{a[$4]=$0;next} ($4 in a){print a[$4]FS$1FS$2}' file1 file2} |
2018-03-14 13:23:00 CID [72883359] 2018-03-14 13:23:01
real 0m0.004s
user 0m0.002s
sys 0m0.002s
Pythonの場合:
いかにも,Pandasネタだなと思ったので.
3列目をキーにマージすれば,デフォルトはhow=’inner’なので,
共通部分(積集合)だけが残る(マージされる).
1 2 3 4 5 | import pandas as pddf1 = pd.read_table('file1', sep='\s+', header=None)df2 = pd.read_table('file2', sep='\s+', usecols=[0, 1, 3], header=None)print(df1.merge(df2, on=3).to_csv(sep=' ', header=None, index=None)) |
2018-03-14 13:23:00 CID [72883359] 2018-03-14 13:23:01
ついでに.比較の為のファイルと,比較したいファイルの場合.
「pandas.DataFrame.isin()」が便利.
unix – compare columns of two files – StackOverflow
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | import ioimport pandas as pdstrings = """col1,col2,col3,col4,col5,col6abc,abc,111,xyz,xyz,xyzabc,abc,222,xyz,xyz,xyzabc,abc,333,xyz,xyz,xyzabc,abc,444,xyz,xyz,xyz"""key = [111, 222, 333]df = pd.read_csv(io.StringIO(strings))print(df[~df['col3'].isin(key)].to_csv(index=None)) |
col1,col2,col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
簡単な処理であればある程,awkの手軽さが際立つけど.
1 2 3 4 5 6 7 8 9 | %%bashtime {echo """col1,col2,col3,col4,col5,col6abc,abc,111,xyz,xyz,xyzabc,abc,222,xyz,xyz,xyzabc,abc,333,xyz,xyz,xyzabc,abc,444,xyz,xyz,xyz""" |awk -F',' '$3!~/111|222|333/'} |
col1,col2,col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
real 0m0.004s
user 0m0.002s
sys 0m0.000s
広告