「比較」を「結合」に置き換えて考えるとシンプルになる.
How do I compare one colum of a file with another colum of another file using awk? – StackOverflow
1 2 3 4 5 6 | % % writefile file1 2018 - 03 - 14 13 : 23 : 00 CID [ 72883359 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 275507537 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 275507539 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 207101094 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 141289821 ] |
1 2 3 4 5 6 7 | % % writefile file2 2018 - 03 - 14 13 : 23 : 00 CID [ 207101072 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 275507524 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 141289788 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 72883352 ] 2018 - 03 - 14 13 : 23 : 01 CID [ 72883359 ] 2018 - 03 - 14 13 : 23 : 00 CID [ 275507532 ] |
1 2 3 4 | % % bash time { awk 'NR==FNR{a[$4]=$0;next} ($4 in a){print a[$4]FS$1FS$2}' file1 file2 } |
2018-03-14 13:23:00 CID [72883359] 2018-03-14 13:23:01
real 0m0.004s
user 0m0.002s
sys 0m0.002s
Pythonの場合:
いかにも,Pandasネタだなと思ったので.
3列目をキーにマージすれば,デフォルトはhow=’inner’なので,
共通部分(積集合)だけが残る(マージされる).
1 2 3 4 5 | import pandas as pd df1 = pd.read_table( 'file1' , sep = '\s+' , header = None ) df2 = pd.read_table( 'file2' , sep = '\s+' , usecols = [ 0 , 1 , 3 ], header = None ) print (df1.merge(df2, on = 3 ).to_csv(sep = ' ' , header = None , index = None )) |
2018-03-14 13:23:00 CID [72883359] 2018-03-14 13:23:01
ついでに.比較の為のファイルと,比較したいファイルの場合.
「pandas.DataFrame.isin()」が便利.
unix – compare columns of two files – StackOverflow
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | import io import pandas as pd strings = """col1,col2,col3,col4,col5,col6 abc,abc,111,xyz,xyz,xyz abc,abc,222,xyz,xyz,xyz abc,abc,333,xyz,xyz,xyz abc,abc,444,xyz,xyz,xyz""" key = [ 111 , 222 , 333 ] df = pd.read_csv(io.StringIO(strings)) print (df[~df[ 'col3' ].isin(key)].to_csv(index = None )) |
col1,col2,col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
簡単な処理であればある程,awkの手軽さが際立つけど.
1 2 3 4 5 6 7 8 9 | % % bash time { echo """col1,col2,col3,col4,col5,col6 abc,abc,111,xyz,xyz,xyz abc,abc,222,xyz,xyz,xyz abc,abc,333,xyz,xyz,xyz abc,abc,444,xyz,xyz,xyz""" | awk - F ',' '$3!~/111|222|333/' } |
col1,col2,col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
real 0m0.004s
user 0m0.002s
sys 0m0.000s
広告