Contents
In [86]: df
Out[86]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2634 entries, 0 to 2633
Data columns (total 8 columns):
0 2634 non-null values
1 2634 non-null values
2 2634 non-null values
3 2634 non-null values
4 2634 non-null values
5 2634 non-null values
6 2634 non-null values
diff 2634 non-null values
dtypes: float64(5), int64(1), object(2)
In [87]: rows = random.sample(df.index.tolist(), 10)
In [88]: rows
Out[88]: [844, 1585, 115, 1977, 2252, 1550, 1092, 451, 2319, 1908]
In [89]: df.ix[rows]
Out[89]:
0 1 2 3 4 5 6 diff
844 2012.09.12 00:00 77.791 77.809 77.701 77.763 5148 0.108
1585 2013.02.28 16:00 92.403 92.468 92.002 92.146 11489 0.466
115 2012.03.30 12:00 82.012 82.226 81.854 82.199 8176 0.372
1977 2013.05.28 20:00 101.992 102.494 101.980 102.339 13841 0.514
2252 2013.07.30 04:00 97.866 97.967 97.671 97.902 6822 0.296
1550 2013.02.21 00:00 93.674 93.768 93.467 93.700 9570 0.301
1092 2012.11.07 00:00 80.212 80.384 80.203 80.361 5639 0.181
451 2012.06.14 16:00 79.393 79.427 79.278 79.281 9019 0.149
2319 2013.08.14 00:00 98.183 98.334 97.979 98.222 9134 0.355
1908 2013.05.13 18:00 101.753 102.038 101.539 101.779 9948 0.499
移動平均(期間5):
In [357]: df2
Out[357]:
close
0 98.308
1 98.184
2 98.452
3 98.430
4 98.320
5 98.854
6 98.744
7 98.913
8 98.613
9 98.450
10 98.026
11 97.761
12 97.838
13 98.055
14 98.044
15 97.916
16 97.623
17 97.752
18 97.856
19 97.719
In [358]: df2['ma5'] = pd.rolling_mean(df2['close'], 5)
In [359]: df2
Out[359]:
close ma5
0 98.308 NaN
1 98.184 NaN
2 98.452 NaN
3 98.430 NaN
4 98.320 98.3388
5 98.854 98.4480
6 98.744 98.5600
7 98.913 98.6522
8 98.613 98.6888
9 98.450 98.7148
10 98.026 98.5492
11 97.761 98.3526
12 97.838 98.1376
13 98.055 98.0260
14 98.044 97.9448
15 97.916 97.9228
16 97.623 97.8952
17 97.752 97.8780
18 97.856 97.8382
19 97.719 97.7732
In [420]: df
Out[420]:
100M 遠投 球速 懸垂 握力
0 14.2 62 100 71 80
1 12.2 90 130 32 79
2 14.7 53 95 20 32
3 12.7 88 141 28 68
4 12.9 79 128 30 50
5 11.9 88 118 30 70
6 10.9 83 108 33 56
7 15.0 53 87 10 38
8 12.8 92 120 33 62
9 14.2 70 110 10 43
10 11.7 92 134 38 78
11 11.3 82 127 31 76
12 13.5 87 112 20 47
13 14.5 73 130 18 65
14 13.7 58 99 15 72
15 14.7 69 102 22 50
16 12.5 77 123 12 48
17 13.8 78 110 35 39
18 14.0 68 120 25 43
19 12.9 81 128 38 52
In [421]: df.corr()
Out[421]:
100M 遠投 球速 懸垂 握力
100M 1.000000 -0.758706 -0.572941 -0.261440 -0.520474
遠投 -0.758706 1.000000 0.772470 0.239126 0.429345
球速 -0.572941 0.772470 1.000000 0.143382 0.440535
懸垂 -0.261440 0.239126 0.143382 1.000000 0.514426
握力 -0.520474 0.429345 0.440535 0.514426 1.000000
http://pandas.pydata.org/pandas-docs/stable/computation.html#correlation
DataFrame.to_csv(path_or_buf, sep=', ', na_rep='', float_format=None, cols=None, header=True, index=True, index_label=None, mode='w', nanRep=None, encoding=None, quoting=None, line_terminator='n', chunksize=None, tupleize_cols=False, **kwds)
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_csv.html
dataframe.var()は不偏分散を返す。(オプションのddof初期値が1となっている。)
普通の分散を求める場合はddof=0を指定する。:
dataframe.var(ddof = 0)
Warning
ndarray.var()が普通の分散を返すのとは異なることに注意
Return unbiased standard deviation over requested axis Normalized by N-1
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.std.html
Warning
ndarray.std()が普通の標準偏差を返すのとは異なることに注意
ndarray.std()で不偏分散を使用した標準偏差を求める場合は次のようにする。:
ndarray.std(ddof=1)
http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html
pandas で扱うデータ形式として「Series」「DataFrame」の2種に大別されます。
Note
Seriesは1次元配列のオブジェクト
In [17]: dat = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
In [18]: dat
Out[18]:
a 0
b 1
c 2
d 3
e 4
In [19]: dat = pd.Series(range(5))
In [20]: dat
Out[20]:
0 0
1 1
2 2
3 3
4 4
Note
indexを省略して作成すると自動で番号が振られる。
In [31]: dat.index
Out[31]: Int64Index([0, 1, 2, 3, 4], dtype=int64)
In [32]: dat.index = ['a', 'b', 'c', 'd', 'e']
In [34]: dat.index
Out[34]: Index([a, b, c, d, e], dtype=object)
In [22]: type(dat)
Out[22]: pandas.core.series.Series
In [36]: dat
Out[36]:
a 0
b 1
c 2
d 3
e 4
In [37]: dat.dtype
Out[37]: dtype('int32')
In [38]: dic = {'a':0, 'b':1, 'c':2}
In [39]: dic
Out[39]: {'a': 0, 'b': 1, 'c': 2}
In [40]: pd.Series(dic)
Out[40]:
a 0
b 1
c 2
In [57]: dic
Out[57]: {'a': 0, 'b': 1, 'c': 2}
In [59]: pd.Series(dic, index = ['c','a','d','a','r'])
Out[59]:
c 2
a 0
d NaN
a 0
r NaN
Note
キーが無い場合は、値としてNaNが自動的に代入されています。
NaNはpandasで標準のmissing data
In [63]: pd.Series(3, index=['a','b','c'])
Out[63]:
a 3
b 3
c 3
In [64]: pd.Series(5.5, index=['a','b','c'])
Out[64]:
a 5.5
b 5.5
c 5.5
In [66]: pd.Series('x', index=['a','b','c'])
Out[66]:
a x
b x
c x
In [71]: pd.Series([1, 2], index=['a','b','c'])
-------------------------------------------------------------------
#~略~
AssertionError: Index length did not match values
#同数なら当然ok
In [72]: pd.Series([1, 2, 3], index=['a','b','c'])
Out[72]:
a 1
b 2
c 3
In [87]: s
Out[87]:
0 0.755976
1 1.268044
2 0.485728
3 -1.127213
4 0.842730
In [88]: s[0]
Out[88]: 0.75597621501748491
In [89]: s[3]
Out[89]: -1.1272125283930412
In [90]: s.median()
Out[90]: 0.7559762150174849
In [91]: s[s > s.median()]
Out[91]:
1 1.268044
4 0.842730
In [92]: s[[4, 3, 1]]
Out[92]:
4 0.842730
3 -1.127213
1 1.268044
In [96]: s
Out[96]:
a 0.954394
b 1.021587
c 1.641359
d -0.787615
e -1.337450
In [97]: s['a']
Out[97]: 0.95439370011445235
In [98]: s['d']
Out[98]: -0.78761480364250935
In [99]: s['d'] = 12
In [100]: s
Out[100]:
a 0.954394
b 1.021587
c 1.641359
d 12.000000
e -1.337450
In [101]: 'd' in s
Out[101]: True
In [102]: 'g' in s
Out[102]: False
In [10]: s = pd.Series([1,2,3])
In [11]: s
Out[11]:
0 1
1 2
2 3
In [12]: s*2
Out[12]:
0 2
1 4
2 6
In [13]: s+s
Out[13]:
0 2
1 4
2 6
In [14]: np.exp(s)
Out[14]:
0 2.718282
1 7.389056
2 20.085537
In [15]: s-s
Out[15]:
0 0
1 0
2 0
In [25]: s
Out[25]:
0 1
1 2
2 3
In [26]: s2
Out[26]:
0 1
1 3
In [27]: s - s2
Out[27]:
0 0
1 -1
2 NaN
ndarray だとエラーとなるが、Seriesだと不足箇所がNaNとなる
In [38]: s = pd.Series([1,2,3])
In [39]: s
Out[39]:
0 1
1 2
2 3
In [40]: s.name
In [41]: s.name = 'test'
In [42]: s
Out[42]:
0 1
1 2
2 3
Name: test
In [43]: s.name
Out[43]: 'test'
DataFrameでいうところのカラム名
2次元データ構造のオブジェクト
In [82]: d = {'one': pd.Series([1,2,3]), 'two': pd.Series([4,5,6])}
In [83]: df = pd.DataFrame(d)
In [84]: df
Out[84]:
one two
0 1 4
1 2 5
2 3 6
In [86]: pd.DataFrame(df, index=(2,1,0))
Out[86]:
one two
2 3 6
1 2 5
0 1 4
In [87]: df
Out[87]:
one two
0 1 4
1 2 5
2 3 6
In [88]: df.index = ['a', 'b', 'c']
In [89]: df
Out[89]:
one two
a 1 4
b 2 5
c 3 6
In [138]: df
Out[138]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [139]: df.columns = ['aa', 'bb', 'cc']
In [140]: df
Out[140]:
aa bb cc
0 1 4 7
1 2 5 8
2 3 6 9
In [135]: df
Out[135]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [137]: pd.DataFrame(df, columns=['c', 'd'])
Out[137]:
c d
0 7 NaN
1 8 NaN
2 9 NaN
In [141]: df
Out[141]:
aa bb cc
0 1 4 7
1 2 5 8
2 3 6 9
In [142]: pd.DataFrame(df, index=[2,3])
Out[142]:
aa bb cc
2 3 6 9
3 NaN NaN NaN
In [169]: df
Out[169]:
a b
0 1 4
1 2 5
2 3 6
In [170]: df['a * b'] = df['a'] * df['b']
In [171]: df
Out[171]:
a b a * b
0 1 4 4
1 2 5 10
2 3 6 18
In [177]: df
Out[177]:
a b a * b
0 1 4 4
1 2 5 10
2 3 6 18
In [178]: del df['a * b']
In [179]: df
Out[179]:
a b
0 1 4
1 2 5
2 3 6
In [188]: df
Out[188]:
a b
0 1 4
1 2 5
2 3 6
In [189]: df['b'] = 0
In [190]: df
Out[190]:
a b
0 1 0
1 2 0
2 3 0
In [191]: df['b'] = 'change'
In [192]: df
Out[192]:
a b
0 1 change
1 2 change
2 3 change
In [201]: df
Out[201]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [202]: df.insert(1, 'a-b', [14, 25, 36])
In [203]: df
Out[203]:
a a-b b c
0 1 14 4 7
1 2 25 5 8
2 3 36 6 9
カラム名で選択: | df[col] |
---|---|
index名で選択: | df.loc[label] |
index(番号)で選択: | df.iloc[loc] |
slice(row): | df[5: 10] |
select by boolean vector: | |
df[bool_vec] |
カラム名で選択:
In [32]: df
Out[32]:
col1 col2 col3
id1 1 2 3
id2 1 2 3
id3 1 2 3
In [33]: df['col2']
Out[33]:
id1 2
id2 2
id3 2
Name: col2, dtype: int64
index名で選択:
In [34]: df
Out[34]:
col1 col2 col3
id1 1 2 3
id2 1 2 3
id3 1 2 3
In [35]: df.loc['id3']
Out[35]:
col1 1
col2 2
col3 3
Name: id3, dtype: int64
index番号で選択:
In [25]: df
Out[25]:
col1 col2 col3
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
In [26]: df.iloc[3]
Out[26]:
col1 1
col2 2
col3 3
Name: 3, dtype: int64
スライス(slice):
In [30]: df
Out[30]:
col1 col2 col3
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
In [31]: df[2:5]
Out[31]:
col1 col2 col3
2 1 2 3
3 1 2 3
4 1 2 3
In [32]: df.loc[2:5]
Out[32]:
col1 col2 col3
2 1 2 3
3 1 2 3
4 1 2 3
In [34]: df = pd.DataFrame(np.random.randn(10,4))
In [35]: df
Out[35]:
0 1 2 3
0 -0.836425 -0.836683 -0.128338 0.185924
1 0.393783 0.051175 -0.134918 1.100155
2 1.895460 0.689592 1.480001 0.343219
3 0.776787 -0.012831 -0.484880 -0.496969
4 -1.446777 1.242913 -0.609033 1.030562
5 1.954871 -0.162846 1.797987 -0.187672
6 -0.035405 0.429107 1.247279 -1.960676
7 0.285284 0.883484 -1.651580 -0.383613
8 0.100606 -0.942300 1.009609 -0.127016
9 -0.844169 -2.120701 -0.846613 0.039404
In [36]: df2 = pd.DataFrame(np.random.randn(7,3))
In [37]: df2
Out[37]:
0 1 2
0 -0.887233 0.281578 0.089743
1 -1.085461 -0.403206 -0.608542
2 -0.805530 0.382693 0.416210
3 -1.738029 1.021369 -1.179627
4 -0.774503 -0.609078 -1.922216
5 0.464220 -1.999902 2.742095
6 -0.468957 0.347906 0.468270
In [38]: df + df2
Out[38]:
0 1 2 3
0 -1.723657 -0.555106 -0.038596 NaN
1 -0.691678 -0.352032 -0.743460 NaN
2 1.089930 1.072285 1.896211 NaN
3 -0.961242 1.008538 -1.664507 NaN
4 -2.221280 0.633835 -2.531249 NaN
5 2.419091 -2.162748 4.540082 NaN
6 -0.504362 0.777013 1.715549 NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
In [54]: df
Out[54]:
0 1 2 3
0 -0.836425 -0.836683 -0.128338 0.185924
1 0.393783 0.051175 -0.134918 1.100155
2 1.895460 0.689592 1.480001 0.343219
3 0.776787 -0.012831 -0.484880 -0.496969
4 -1.446777 1.242913 -0.609033 1.030562
5 1.954871 -0.162846 1.797987 -0.187672
6 -0.035405 0.429107 1.247279 -1.960676
7 0.285284 0.883484 -1.651580 -0.383613
8 0.100606 -0.942300 1.009609 -0.127016
9 -0.844169 -2.120701 -0.846613 0.039404
In [55]: df - df.iloc[0]
Out[55]:
0 1 2 3
0 0.000000 0.000000 0.000000 0.000000
1 1.230208 0.887858 -0.006580 0.914232
2 2.731885 1.526275 1.608340 0.157295
3 1.613211 0.823852 -0.356542 -0.682893
4 -0.610352 2.079596 -0.480695 0.844638
5 2.791295 0.673837 1.926326 -0.373595
6 0.801020 1.265791 1.375617 -2.146600
7 1.121708 1.720167 -1.523241 -0.569536
8 0.937030 -0.105617 1.137948 -0.312940
9 -0.007744 -1.284018 -0.718275 -0.146520
In [110]: df
Out[110]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
6 18 19 20
7 21 22 23
8 24 25 26
9 27 28 29
In [111]: df.T
Out[111]:
0 1 2 3 4 5 6 7 8 9
a 0 3 6 9 12 15 18 21 24 27
b 1 4 7 10 13 16 19 22 25 28
c 2 5 8 11 14 17 20 23 26 29
In [132]: df
Out[132]:
A B C
a -0.393860 -0.844211 1.493558
b -0.854690 0.125359 -0.683376
c -0.166663 -0.476496 0.296576
d -1.165772 -0.769898 -0.382250
e -0.071999 -1.309287 1.608671
f -0.076891 -1.005731 0.967189
g -0.444061 -1.766110 0.962084
h -0.355894 -0.133935 -1.260297
i 0.302985 -1.322559 -0.937835
j 0.704741 0.256036 1.006929
In [133]: df.index
Out[133]: Index([a, b, c, d, e, f, g, h, i, j], dtype=object)
In [134]: df.columns
Out[134]: Index([A, B, C], dtype=object)
In [138]: df
Out[138]:
A B C
a -0.393860 -0.844211 1.493558
b -0.854690 0.125359 -0.683376
c -0.166663 -0.476496 0.296576
d -1.165772 -0.769898 -0.382250
e -0.071999 -1.309287 1.608671
f -0.076891 -1.005731 0.967189
g -0.444061 -1.766110 0.962084
h -0.355894 -0.133935 -1.260297
i 0.302985 -1.322559 -0.937835
j 0.704741 0.256036 1.006929
In [139]: df.describe()
Out[139]:
A B C
count 10.000000 10.000000 10.000000
mean -0.252210 -0.724683 0.307125
std 0.533305 0.664502 1.049826
min -1.165772 -1.766110 -1.260297
25% -0.431511 -1.233398 -0.608095
50% -0.261278 -0.807054 0.629330
75% -0.073222 -0.219575 0.996994
max 0.704741 0.256036 1.608671
version 0.10から
In [118]: df
Out[118]:
0 1 2 3 4 5 6 \
0 0.211410 0.763776 -0.264011 1.329344 -0.382833 1.148313 -0.711445
1 0.910559 -0.238801 -0.787309 -0.282730 -1.202036 0.531731 -0.327447
2 -0.449601 -0.053122 1.574387 -0.373806 -0.630876 -0.502817 -0.031028
7 8 9
0 1.005495 0.197810 -2.365171
1 -0.644915 -0.113493 0.447467
2 2.531369 1.408145 0.084981
デフォルトで80のところを40に指定:
In [120]: pd.set_option('line_width', 40)
In [121]: df = pd.DataFrame(np.random.randn(3,10))
In [122]: df
Out[122]:
0 1 2 \
0 0.197120 -0.001250 0.934579
1 0.154917 -1.176137 1.835669
2 0.633252 0.576388 -0.277348
3 4 5 \
0 1.072795 -1.232976 -0.399841
1 -0.504809 -0.758600 0.868639
2 -0.687534 0.083170 -0.922135
6 7 8 \
0 0.278137 1.125170 0.423181
1 0.216331 -0.066807 0.245962
2 1.045445 1.096144 1.059333
9
0 -0.064375
1 -1.360624
In [23]: df
Out[23]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [24]: df['b']
Out[24]:
0 1
1 4
2 7
3 10
4 13
Name: b, dtype: int32
In [25]: df.b
Out[25]:
0 1
1 4
2 7
3 10
4 13
Name: b, dtype: int32
In [38]: s = pd.Series([1, 2, 3, None, 5])
In [39]: s
Out[39]:
0 1
1 2
2 3
3 NaN
4 5
dtype: float64
None は NaN に変換される。
In [40]: dates = pd.date_range('20131001', periods=6)
In [41]: dates
Out[41]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:00:00, ..., 2013-10-06 00:00:00]
Length: 6, Freq: D, Timezone: None
In [42]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('abcd'))
In [43]: df
Out[43]:
a b c d
2013-10-01 1.266797 -0.307188 0.015516 -1.346249
2013-10-02 0.243138 -0.420187 -0.630685 -1.023720
2013-10-03 -0.017072 0.302387 0.951089 -0.476144
2013-10-04 -0.893489 0.377465 0.232587 -0.412145
2013-10-05 -1.245619 0.111652 0.226033 -0.947730
2013-10-06 -1.102173 2.315227 -0.173461 0.718281
OK事例:
In [97]: dic2
Out[97]: {'a': [1], 'b': [2], 'c': [3]}
In [98]: pd.DataFrame(dic2)
Out[98]:
a b c
0 1 2 3
ダメな事例(2Dでないとダメ):
In [101]: dic3
Out[101]: {'a': 1, 'b': 2, 'c': 3}
In [102]: pd.DataFrame(dic3)
ValueError: If use all scalar values, must pass index
1個でも2D形式なら他は複製されて作成される。:
In [105]: dic4
Out[105]: {'a': 1, 'b': 2, 'c': [3, 4]}
In [106]: pd.DataFrame(dic4)
Out[106]:
a b c
0 1 2 3
1 1 2 4
先頭の5行を表示:
In [117]: df
Out[117]:
0 1 2
0 -0.393860 -0.844211 1.493558
1 -0.854690 0.125359 -0.683376
2 -0.166663 -0.476496 0.296576
3 -1.165772 -0.769898 -0.382250
4 -0.071999 -1.309287 1.608671
5 -0.076891 -1.005731 0.967189
6 -0.444061 -1.766110 0.962084
7 -0.355894 -0.133935 -1.260297
8 0.302985 -1.322559 -0.937835
9 0.704741 0.256036 1.006929
先頭の任意行を表示:
In [121]: df.head(3)
Out[121]:
0 1 2
0 -0.393860 -0.844211 1.493558
1 -0.854690 0.125359 -0.683376
2 -0.166663 -0.476496 0.296576
In [118]: df.head()
Out[118]:
0 1 2
0 -0.393860 -0.844211 1.493558
1 -0.854690 0.125359 -0.683376
2 -0.166663 -0.476496 0.296576
3 -1.165772 -0.769898 -0.382250
4 -0.071999 -1.309287 1.608671
最後の5行を表示:
In [120]: df.tail()
Out[120]:
0 1 2
5 -0.076891 -1.005731 0.967189
6 -0.444061 -1.766110 0.962084
7 -0.355894 -0.133935 -1.260297
8 0.302985 -1.322559 -0.937835
9 0.704741 0.256036 1.006929
昇順ソート:
In [172]: df2
Out[172]:
0 1 2
3 9 10 11
4 12 13 14
1 3 4 5
2 6 7 8
In [173]: df2.sort(columns=2)
Out[173]:
0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [174]: df2.sort(columns=2, ascending=True)
Out[174]:
0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
降順ソート:
In [178]: df2
Out[178]:
0 1 2
3 9 10 11
4 12 13 14
1 3 4 5
2 6 7 8
In [179]: df2.sort(columns=1, ascending=False)
Out[179]:
0 1 2
4 12 13 14
3 9 10 11
2 6 7 8
1 3 4 5
スライス選択で任意行を選択できる:
In [227]: df
Out[227]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [228]: df[1:4]
Out[228]:
a b c
1 3 4 5
2 6 7 8
3 9 10 11
行を番号で指定:
In [261]: df
Out[261]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [262]: df.iloc[2:4]
Out[262]:
a b c
2 6 7 8
3 9 10 11
列を番号で指定:
In [259]: df2
Out[259]:
0 1 2 3 4
a 0 3 6 9 12
b 1 4 7 10 13
c 2 5 8 11 14
In [260]: df2.iloc[:,2:4]
Out[260]:
2 3
a 6 9
b 7 10
c 8 11
df.loc[]:
In [270]: df
Out[270]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [271]: df.loc[2,['a', 'c']]
Out[271]:
a 6
c 8
Name: 2, dtype: int32
In [286]: df.loc[2:4, ['a', 'c']]
Out[286]:
a c
2 6 8
3 9 11
4 12 14
df.iloc[]:
In [304]: df
Out[304]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [305]: df.iloc[[1,3], [2]]
Out[305]:
c
1 5
3 11
df.iat[row,cols]:
In [363]: df
Out[363]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [365]: df.iat[1,2]
Out[365]: 5
普通のリストでは、存在しない値を指定してもエラーにはならない。:
In [375]: r
Out[375]: ['a', 'b', 'c']
In [376]: r[2:5]
Out[376]: ['c']
In [377]: r[5:]
Out[377]: []
df[]のスライス指定でも、存在しない値を指定してもエラーにはならない。:
In [386]: df
Out[386]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [387]: df[3:10]
Out[387]:
a b c
3 9 10 11
4 12 13 14
In [388]: df[5:]
Out[388]:
Empty DataFrame
Columns: [a, b, c]
Index: []
Warning
df.iloc[]では、存在しない値を指定するとエラーとなる。
In [400]: df
Out[400]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [401]: df[3:10]
Out[401]:
a b c
3 9 10 11
4 12 13 14
In [402]: df.iloc[3:10]
IndexError: out-of-bounds on slice (end)
In [413]: df
Out[413]:
a b c
0 0.491715 0.712848 -0.027067
1 -0.171917 -0.850121 0.397433
2 -0.770142 0.964528 -0.812055
3 -0.360034 0.451987 -0.444576
4 -0.250321 0.099953 -1.661135
5 1.193116 0.223073 -0.256779
6 0.372960 -0.843722 -0.276419
7 0.329680 1.347930 -0.979296
8 0.503391 0.054636 -1.356218
9 0.410911 -0.050290 -0.376301
In [414]: df[df.a>0]
Out[414]:
a b c
0 0.491715 0.712848 -0.027067
5 1.193116 0.223073 -0.256779
6 0.372960 -0.843722 -0.276419
7 0.329680 1.347930 -0.979296
8 0.503391 0.054636 -1.356218
9 0.410911 -0.050290 -0.376301
In [416]: df[df > 0]
Out[416]:
a b c
0 0.491715 0.712848 NaN
1 NaN NaN 0.397433
2 NaN 0.964528 NaN
3 NaN 0.451987 NaN
4 NaN 0.099953 NaN
5 1.193116 0.223073 NaN
6 0.372960 NaN NaN
7 0.329680 1.347930 NaN
8 0.503391 0.054636 NaN
9 0.410911 NaN NaN
In [428]: df
Out[428]:
a b c
0 0.491715 0.712848 -0.027067
1 -0.171917 -0.850121 0.397433
2 -0.770142 0.964528 -0.812055
3 -0.360034 0.451987 -0.444576
4 -0.250321 0.099953 -1.661135
5 1.193116 0.223073 -0.256779
6 0.372960 -0.843722 -0.276419
7 0.329680 1.347930 -0.979296
8 0.503391 0.054636 -1.356218
9 0.410911 -0.050290 -0.376301
In [429]: df['f'] = 'add'
In [430]: df
Out[430]:
a b c f
0 0.491715 0.712848 -0.027067 add
1 -0.171917 -0.850121 0.397433 add
2 -0.770142 0.964528 -0.812055 add
3 -0.360034 0.451987 -0.444576 add
4 -0.250321 0.099953 -1.661135 add
5 1.193116 0.223073 -0.256779 add
6 0.372960 -0.843722 -0.276419 add
7 0.329680 1.347930 -0.979296 add
8 0.503391 0.054636 -1.356218 add
9 0.410911 -0.050290 -0.376301 add
In [467]: df
Out[467]:
a b c
0 a 2 3
1 a 5 6
2 a 3 6
3 b 5 55
4 b 53 53
5 b 532 352
In [468]: df[df['a']=='a']
Out[468]:
a b c
0 a 2 3
1 a 5 6
2 a 3 6
In [469]: df[df['a']=='b']
Out[469]:
a b c
3 b 5 55
4 b 53 53
5 b 532 352
各列の平均:
In [475]: df
Out[475]:
a b c
0 a 2 3
1 a 5 6
2 a 3 6
3 b 5 55
4 b 53 53
5 b 532 352
In [476]: df.mean()
Out[476]:
b 100.000000
c 79.166667
dtype: float64
各行の平均:
In [488]: df
Out[488]:
a b c
0 a 2 3
1 a 5 6
2 a 3 6
3 b 5 55
4 b 53 53
5 b 532 352
In [489]: df.mean(1)
Out[489]:
0 2.5
1 5.5
2 4.5
3 30.0
4 53.0
5 442.0
dtype: float64
df.apply():
In [497]: df
Out[497]:
b c
0 2 3
1 5 6
2 3 6
3 5 55
4 53 53
5 532 352
In [498]: df.apply(lambda x: x.max() - x.min())
Out[498]:
b 530
c 349
dtype: int64
In [523]: s
Out[523]:
0 2
1 5
2 3
3 5
4 53
5 532
Name: b, dtype: int64
In [524]: s.value_counts()
Out[524]:
5 2
532 1
53 1
3 1
2 1
dtype: int64
In [548]: df
Out[548]:
店名 日付 最高気温 客数
0 A 1 29 326
1 A 2 29 364
2 A 3 30 283
3 A 4 32 369
4 A 5 33 417
5 A 6 32 436
6 A 7 31 438
7 A 8 26 296
8 A 9 28 263
9 A 10 31 389
10 X 1 29 312
11 X 2 30 348
12 X 3 29 284
13 X 4 32 369
14 X 5 33 420
15 X 6 27 275
16 X 7 28 294
17 X 8 32 368
18 X 9 34 451
19 X 10 32 405
In [549]: df.groupby('店名').sum()
Out[549]:
日付 最高気温 客数
店名
A 55 301 3581
X 55 306 3526
時間リスト作成:
In [64]: d = pd.date_range('10/1/2013', periods=10)
In [65]: d
Out[65]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:00:00, ..., 2013-10-10 00:00:00]
Length: 10, Freq: D, Timezone: None
In [66]: d.values
Out[66]:
array(['2013-10-01T09:00:00.000000000+0900',
'2013-10-02T09:00:00.000000000+0900',
'2013-10-03T09:00:00.000000000+0900',
'2013-10-04T09:00:00.000000000+0900',
'2013-10-05T09:00:00.000000000+0900',
'2013-10-06T09:00:00.000000000+0900',
'2013-10-07T09:00:00.000000000+0900',
'2013-10-08T09:00:00.000000000+0900',
'2013-10-09T09:00:00.000000000+0900',
'2013-10-10T09:00:00.000000000+0900'], dtype='datetime64[ns]')
code:
pandas.read_table('file_name')
code:
df = pandas.read_csv('file_name')
pythonのpandasモジュールからsqliteを操る方法について学ぶ
read from sqlite3:
import pandas as pd
import sqlite3
con = sqlite3.connect("test.sqlite")
df1 = sql.read_frame("select * from dat", con)
write to sqlite3:
sql.write_frame(df1, name='dat', con=con)
http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries
Parameters: | frame: DataFrame :name: name of SQL table :con: an open SQL database connection object :flavor: {‘sqlite’, ‘mysql’, ‘oracle’}, default ‘sqlite’ :if_exists: {‘fail’, ‘replace’, ‘append’}, default ‘fail’ :fail: If table exists, do nothing. replace: If table exists, drop it, recreate it, and insert data. append: If table exists, insert data. Create if does not exist. |
---|
Note
書込み時の注意事項