DA
From Jack's Lab
Revision as of 22:57, 22 February 2020 (Saturday)
1 Overview
2 Descriptive statistics
2.1 Measures of location
An intuitive example:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
>>> np.mean(d); stats.trim_mean(d, 0.2); np.median(d)
15.625
4.0
3.0
>>> plt.plot(d, 'o'); plt.show()
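To make clearer what stats.trim_mean does, the sketch below reproduces the trimmed mean by hand on the same array: sort the data, drop the lowest and highest 20% of the values, and average what remains. The variable names (k, kept, manual) are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

d = np.array([1, 2, 2, 100, 3, 3, 6, 8])

# Number of values to cut from each end: 20% of 8, rounded down
k = int(0.2 * len(d))

# Sort, then drop the k smallest and k largest values
kept = np.sort(d)[k:len(d) - k]

# The mean of the remaining values is the trimmed mean
manual = kept.mean()
print(kept, manual, stats.trim_mean(d, 0.2))
```

The outlier 100 pulls the plain mean up to 15.625, while the trimmed mean and the median stay close to the bulk of the data.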
A real-world example:
import pandas as pd
from scipy import stats
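# pd.to_datetime() reads a bare number as epoch nanoseconds; adding 28800000000000 ns (8 hours) shifts the timestamps to UTC+8
p = pd.read_csv('../DA/data/da01-press.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))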
p = p.drop(columns=['name'])
p.mean()
Press 3685.248525
stats.trim_mean(p, 0.1) # stats.trimboth(p['Press'],0.1).mean()
array([3680.07826531])
p.median()
Press 3677.105
p.describe()
Press
count 122.000000
mean 3685.248525
std 123.990939
min 3484.480000
25% 3618.402500
50% 3677.105000
75% 3747.742500
max 4672.060000
2.2 Measures of dispersion
>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> meanl = np.array([np.mean(d)]*len(d))
>>> trimmeanl = np.array([stats.trim_mean(d, 0.2)]*len(d))
>>> medianl = np.array([np.median(d)]*len(d))
>>> iqrv = np.array([stats.iqr(d)]*len(d))
>>> down = medianl - iqrv; up = medianl + iqrv
>>> plt.plot(d, 'o', color='C1')
>>> plt.plot(meanl, ':C2', label='Mean')
>>> plt.plot(trimmeanl, ':r', label='Trim mean')
>>> plt.plot(medianl, '-g', label='Median')
>>> plt.plot(up, '-C1'); plt.plot(down, '-C1')
>>> plt.legend(); plt.grid(); plt.show()
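As a small numeric companion to the plot, the following sketch computes two dispersion measures for the same array: the sample standard deviation and the interquartile range (IQR). It also checks that stats.iqr agrees with the difference between the 75th and 25th percentiles. The names std, q25, q75 and iqr are chosen here for illustration.

```python
import numpy as np
from scipy import stats

d = np.array([3, 1, 5, 3, 15, 6, 7, 2])

# Sample standard deviation (ddof=1), sensitive to the outlier 15
std = np.std(d, ddof=1)

# Interquartile range: distance between the 75th and 25th percentiles
q25, q75 = np.percentile(d, [25, 75])
iqr = q75 - q25

print(std, q25, q75, iqr, stats.iqr(d))
```

Because the IQR only looks at the middle half of the data, it is much less affected by the single large value than the standard deviation is.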
2.3 Correlation estimation
>>> t1 = pd.read_csv('../DA/data/da02-temp-0948.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> t2 = pd.read_csv('../DA/data/da02-temp-0019.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> plt.plot(t1.index, t1['Temp'], label='t1')
>>> plt.plot(t2.index, t2['Temp'], label='t2')
>>> # Note: t3 is not defined in this snippet; it must be a sequence aligned with t1's index
>>> plt.plot(t1['Temp'].index, t3, label='t3')
>>> plt.legend(); plt.show()
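The snippet above only plots the series. One way to put a number on how closely two series move together is the Pearson correlation coefficient, sketched below. The CSV files are not available here, so the two temperature series are synthetic stand-ins generated for illustration (a shared sine-shaped signal plus independent noise), not the data used on this page.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for t1['Temp'] and t2['Temp'] (made-up data)
rng = np.random.default_rng(42)
idx = pd.date_range('2020-01-01', periods=200, freq='D')
base = 20 + 5 * np.sin(np.linspace(0, 8 * np.pi, 200))
t1 = pd.Series(base + rng.normal(0, 0.5, 200), index=idx, name='Temp')
t2 = pd.Series(base + 1.0 + rng.normal(0, 0.5, 200), index=idx, name='Temp')

# Pearson correlation via pandas (aligns the two series on their index)
r = t1.corr(t2)

# The same coefficient via NumPy's correlation matrix
r_np = np.corrcoef(t1.values, t2.values)[0, 1]
print(r, r_np)
```

Note that the constant offset of 1.0 between the two series does not lower the correlation, since Pearson's coefficient is insensitive to shifts and scaling.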
3 Exploring data distributions
3.1 Frequency counts
>>> import pandas as pd
>>> a = pd.Series([0.1, 1.2, 1.2, 2.1, 2.1, 3, 2,])
>>> a.value_counts()
2.1    2
1.2    2
0.1    1
2.0    1
3.0    1
>>> a.value_counts(normalize=True)
2.1    0.285714
1.2    0.285714
0.1    0.142857
2.0    0.142857
3.0    0.142857
More advanced: use pandas.cut() to count values by interval:
>>> ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> grp = pd.cut(ag, bins=bins, labels=labels)
>>> grp
0 child
1 child
2 child
3 child
4 child
5 child
6 preteen
7 teen
8 teen
9 teen
10 military_age
11 military_age
12 adult
13 adult
14 adult
15 adult
16 adult
dtype: category
Categories (5, object): [child < preteen < teen < military_age < adult]
>>> grp.value_counts()
child 6
adult 5
teen 3
military_age 2
preteen 1
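Two details of the example above are easy to miss: pandas.cut() uses right-closed intervals by default, so the value 10 falls into (0, 10], and value_counts() sorts by count rather than by category order. The sketch below, reusing the same data, shows the counts in category order and how they change with right=False. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
bins = [0, 10, 13, 18, 21, np.inf]
labels = ['child', 'preteen', 'teen', 'military_age', 'adult']

# Default: intervals are closed on the right, e.g. (0, 10]
right_closed = pd.cut(ag, bins=bins, labels=labels)

# right=False: intervals are closed on the left, e.g. [0, 10)
left_closed = pd.cut(ag, bins=bins, labels=labels, right=False)

# sort_index() orders the counts by category instead of by frequency
counts_right = right_closed.value_counts().sort_index()
counts_left = left_closed.value_counts().sort_index()
print(counts_right)
print(counts_left)
```

With right=False the boundary values 10 and 18 move up one group, so it is worth deciding explicitly which side of each interval should be closed.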
3.2 Histogram
>>> import pandas as pd
>>> a = pd.Series([1,2,2,3,3,4,5,6])
>>> a.value_counts()
3    2
2    2
6    1
5    1
4    1
1    1
# Histogram of how often each value occurs
>>> a.plot.hist(bins=6, rwidth=0.9)
# Histogram of each value's probability (count / total)
>>> a.value_counts(normalize=True)
3    0.250
2    0.250
6    0.125
5    0.125
4    0.125
1    0.125
>>> a.plot.hist(bins=6, rwidth=0.9, density=True)  # normalized, similar to pandas value_counts(normalize=True)
>>> plt.show()
>>> c = pd.Series(np.random.gamma(10, size=1000)**1.5)
>>> c.plot.hist(grid=True, bins=20, rwidth=0.9)  # plt.hist(c, bins=20, rwidth=0.9)
>>> plt.grid(axis='y', alpha=0.75)
>>> plt.show()
For more information, refer to matplotlib.pyplot.hist.
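One subtlety with density=True is that the bar heights are scaled so that the total area is 1, which is not quite the same as the proportions from value_counts(normalize=True). The sketch below uses np.histogram (which computes the same kind of binning without drawing anything) on the small series from above to show the relationship: height times bin width gives the proportion.

```python
import numpy as np

a = np.array([1, 2, 2, 3, 3, 4, 5, 6])

# Raw counts per bin and the bin edges
counts, edges = np.histogram(a, bins=6)

# Density: counts divided by (total count * bin width)
density, _ = np.histogram(a, bins=6, density=True)
widths = np.diff(edges)

# Multiplying density by bin width recovers the per-bin proportions
proportions = density * widths
print(counts, density, proportions)
```

Here each bin is 5/6 wide, so a proportion of 0.125 appears as a bar of height 0.15 in the density histogram.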
3.3 KDE
Kernel Density Estimate
>>> np.random.normal(loc=(10,20),scale=(4,2),size=(5,2))
array([[15.87305077, 20.3740753 ],
[14.40449246, 20.73788215],
[12.51111038, 20.81289712],
[ 9.55461887, 21.48781844],
[-0.72336527, 18.81365079]])
>>> dist = pd.DataFrame(np.random.normal(loc=(10,20), scale=(4,2), size=(1000, 2)), columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
>>> fig, ax = plt.subplots()
>>> dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
>>> dist.plot.hist(density=True, ax=ax)
>>> ax.set_ylabel('Probability')
>>> ax.grid(axis='y')
>>> ax.set_facecolor('#d8dcd6')
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
p = pd.read_csv('./data/da03-press.csv',index_col='time')
pp = p['Press']
pp.plot.hist(bins=150, rwidth=.9, density=True, color='C2', alpha=0.8)
pp.plot.kde(bw_method=0.1737, color='C1')
plt.ylabel('Probability'); plt.xlim(xmin=3200,xmax=4200); plt.xlabel('hPa')
plt.grid(linewidth=0.8)
plt.show()
#sns.distplot(pp)
#plt.show()
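A common choice for bw_method is n^(-1/5), where n is the number of samples (Scott's rule for one-dimensional data).
Further reading: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#Notes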
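The n^(-1/5) rule can be checked against scipy.stats.gaussian_kde, whose default bandwidth factor for one-dimensional data follows Scott's rule. The sketch below uses a synthetic normal sample in place of the pressure data, which is an assumption made only so that the example is self-contained.

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for the real measurements (made-up data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=4, size=1000)

# Scott's rule for 1-D data: n ** (-1/5)
n = sample.size
scott = n ** (-1 / 5)

# Default bandwidth factor chosen by gaussian_kde
kde_default = stats.gaussian_kde(sample)

# Passing a scalar fixes the factor explicitly, as done with bw_method=0.1737 above
kde_fixed = stats.gaussian_kde(sample, bw_method=0.1737)
print(scott, kde_default.factor, kde_fixed.factor)
```

A smaller factor gives a wigglier curve that follows the histogram more closely, while a larger one gives a smoother curve.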
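3.4 Bar chart
Counting how many times event A occurs each day is already a frequency count: each bin is a fixed 24-hour window, and the number of bins grows as the days go by. Data of this kind can be shown directly as a bar chart: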
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate
hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn-hb
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()
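Showing the Hubei and non-Hubei bars side by side:
xhb_cf = xhb['Confirmed'].values  # assumed definition; xhb_cf is not defined in the original snippet
plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()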
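The CSV files above already contain one row per day. When the raw data is a list of individual event timestamps instead, the daily counts described at the start of this section can be produced with resample(), as in the sketch below. The timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical event timestamps (made-up data); the value 1 marks one event
events = pd.Series(
    1,
    index=pd.to_datetime([
        '2020-01-09 08:15', '2020-01-09 17:40',
        '2020-01-10 09:05',
        '2020-01-12 11:30', '2020-01-12 13:00', '2020-01-12 22:45',
    ]),
)

# One bin per calendar day; days without events get a count of 0
daily = events.resample('D').size()
print(daily)
```

The resulting series has a daily DatetimeIndex and can be passed to plt.bar(daily.index, daily.values) in the same way as the examples above.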
3.5 Reverse operation of value_counts()
>>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
>>> cc = col.value_counts()
>>> cc
3.0    3
1.0    2
2.0    1
>>> np.repeat(cc.index, cc)
Float64Index([3.0, 3.0, 3.0, 1.0, 1.0, 2.0], dtype='float64')
>>> pd.Series(np.repeat(cc.index, cc))
0    3.0
1    3.0
2    3.0
3    1.0
4    1.0
5    2.0
For multiple columns, you can use:
df.loc[df.index.repeat(df['Count'])]
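A minimal sketch of the multi-column case follows. The DataFrame and its column names Name and Value are hypothetical; only the Count column name comes from the expression above. Each row is repeated as many times as its Count says, after which the Count column is no longer needed.

```python
import pandas as pd

# Hypothetical aggregated table: one row per distinct record plus its count
df = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'Value': [1.0, 2.0, 3.0],
    'Count': [2, 1, 3],
})

# Repeat each index label Count times, then select those rows
expanded = df.loc[df.index.repeat(df['Count'])]

# Drop the helper column and renumber the rows
expanded = expanded.drop(columns='Count').reset_index(drop=True)
print(expanded)
```

This relies on the index labels being unique, as they are with the default RangeIndex.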
4 Time series analysis
>>> x = pd.date_range('2020-1-9','2020-2-15',freq='1d')
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
'2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
'2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
'2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
'2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
'2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
'2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
'2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
'2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
'2020-02-14', '2020-02-15'],
dtype='datetime64[ns]', freq='D')
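One possible use of such a date range is to put sparse observations onto a complete daily index, so that missing days become explicit. The sketch below assumes a small made-up series of observations and fills missing days with 0; whether 0 is the right fill value depends on what the data represents.

```python
import pandas as pd

# Complete daily index, as generated above
days = pd.date_range('2020-1-9', '2020-2-15', freq='1d')

# Hypothetical sparse observations (made-up data)
obs = pd.Series(
    [3, 5, 2],
    index=pd.to_datetime(['2020-01-09', '2020-01-11', '2020-02-15']),
)

# Align to the full range; days without an observation get 0
daily = obs.reindex(days, fill_value=0)
print(daily.head())
```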
5 Reference
- Numpy API reference
- Pandas API reference
- matplotlib Gallery
- Change the Colors Changes to the default style
- matplotlib.pyplot.plot()
- matplotlib.pyplot.figure()
- Time Series Analysis Example
- Introduction to Data Science
- Data Visualization tutorial
- FlowingData Tutorials