
== Exploring data distributions ==

=== Frequency counts ===

<source lang=python>
>>> import pandas as pd
>>> a = pd.Series([0.1, 1.2, 1.2, 2.1, 2.1, 3, 2])
>>> a.value_counts()
2.1    2
1.2    2
0.1    1
2.0    1
3.0    1
>>> a.value_counts(normalize=True)
2.1    0.285714
1.2    0.285714
0.1    0.142857
2.0    0.142857
3.0    0.142857
</source>

For more advanced use, pandas.cut() bins the values into intervals so that each interval can be counted:

<source lang=python>
>>> import numpy as np
>>> ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> grp = pd.cut(ag, bins=bins, labels=labels)
>>> grp
0            child
1            child
2            child
3            child
4            child
5            child
6          preteen
7             teen
8             teen
9             teen
10    military_age
11    military_age
12           adult
13           adult
14           adult
15           adult
16           adult
dtype: category
Categories (5, object): [child < preteen < teen < military_age < adult]
>>> grp.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
</source>
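By default pandas.cut() uses right-closed intervals, which is why the age 10 above lands in 'child' rather than 'preteen'. The following is a minimal sketch, using made-up ages that sit on the bin edges, of how <code>right=False</code> shifts those boundary values and how <code>sort_index()</code> lists the counts in category order instead of by frequency:

```python
import numpy as np
import pandas as pd

# Illustrative ages chosen to sit on the bin edges (not data from this page)
ages = pd.Series([1, 10, 12, 13, 18, 21, 52])
bins = [0, 10, 13, 18, 21, np.inf]
labels = ['child', 'preteen', 'teen', 'military_age', 'adult']

# Default: intervals are (0, 10], (10, 13], ... so an edge value goes to the lower bin
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 10), [10, 13), ... so an edge value goes to the upper bin
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# sort_index() orders the counts by category (child < preteen < ...), not by frequency
counts_in_order = right_closed.value_counts().sort_index()

print(pd.DataFrame({'age': ages, 'right_closed': right_closed, 'left_closed': left_closed}))
print(counts_in_order)
```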

<br>

=== Histogram ===

<source lang=python>
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> a = pd.Series([1,2,2,3,3,4,5,6])
>>> a.value_counts()
3    2
2    2
6    1
5    1
4    1
1    1
# Histogram of how many times each value occurs
>>> a.plot.hist(bins=6,rwidth=0.9)

# Histogram of relative frequency (count / total)
>>> a.value_counts(normalize=True)
3    0.250
2    0.250
6    0.125
5    0.125
4    0.125
1    0.125
>>> a.plot.hist(bins=6, rwidth=0.9, density=True)  # normalized so the total bar area is 1; comparable to Series.value_counts(normalize=True)

>>> plt.show()
</source>
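Note that <code>density=True</code> and <code>value_counts(normalize=True)</code> are related but not identical: the histogram divides each relative frequency by the bin width so that the total area equals 1. A small sketch of that relationship, using numpy.histogram (which computes the same binning that the plot draws) on the series from the example above:

```python
import numpy as np

a = np.array([1, 2, 2, 3, 3, 4, 5, 6])

# Raw counts per bin and the bin edges (6 equal-width bins over [1, 6])
counts, edges = np.histogram(a, bins=6)

# Density heights: count / (total * bin width)
density, _ = np.histogram(a, bins=6, density=True)

widths = np.diff(edges)
relative_freq = counts / counts.sum()

# Height * width recovers the relative frequency of each bin
print(relative_freq)
print(density * widths)
```

Because the bins here are narrower than 1, the density heights are larger than the relative frequencies, even though the areas match.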

<source lang=python>
>>> import numpy as np
>>> c = pd.Series(np.random.gamma(10, size=1000)**1.5)
>>> c.plot.hist(grid=True,bins=20,rwidth=0.9)   # plt.hist(c,bins=20,rwidth=0.9)
>>> plt.grid(axis='y',alpha=0.75)
>>> plt.show()
</source>

For more information, see [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html matplotlib.pyplot.hist].

<br>

=== KDE ===

Kernel density estimation (KDE) estimates an unknown probability density function from samples. It is a non-parametric estimation method.
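To make the idea concrete, here is a minimal sketch of a one-dimensional Gaussian KDE written by hand: one Gaussian bump of width <code>bandwidth</code> is centred on every sample and the bumps are averaged. The helper name <code>gaussian_kde_manual</code>, the sample data, and the bandwidth of 1.0 are assumptions made for illustration. The result is compared against scipy.stats.gaussian_kde, whose scalar <code>bw_method</code> is a factor applied to the sample standard deviation.

```python
import numpy as np
from scipy import stats

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample (hypothetical helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=10, scale=4, size=500)   # illustrative data
grid = np.linspace(-10, 30, 401)

bandwidth = 1.0                                   # kernel standard deviation, picked by hand
manual = gaussian_kde_manual(samples, grid, bandwidth)

# scipy scales the sample standard deviation (ddof=1) by the scalar bw_method
factor = bandwidth / samples.std(ddof=1)
reference = stats.gaussian_kde(samples, bw_method=factor)(grid)

print(np.allclose(manual, reference))
```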

<source lang=python>
>>> np.random.normal(loc=(10,20),scale=(4,2),size=(5,2))
array([[15.87305077, 20.3740753 ],
       [14.40449246, 20.73788215],
       [12.51111038, 20.81289712],
       [ 9.55461887, 21.48781844],
       [-0.72336527, 18.81365079]])
>>> dist = pd.DataFrame(np.random.normal(loc=(10,20), scale=(4,2), size=(1000, 2)), columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
>>> fig, ax = plt.subplots()
>>> dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
>>> dist.plot.hist(density=True, ax=ax)
>>> ax.set_ylabel('Probability')
>>> ax.grid(axis='y')
>>> ax.set_facecolor('#d8dcd6')
</source>

<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

p = pd.read_csv('./data/da03-press.csv',index_col='time')
pp = p['Press']

pp.plot.hist(bins=150, rwidth=.9, density=True, color='C2', alpha=0.8)
pp.plot.kde(bw_method=0.1737, color='C1')

plt.ylabel('Probability')
plt.xlim(3200, 4200)
plt.xlabel('hPa')
plt.grid(linewidth=0.8)
plt.show()
#sns.histplot(pp, kde=True, stat='density', color="#ff8000")  # sns.distplot is deprecated since seaborn 0.11
#plt.show()
</source>

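'''bw_method''' is usually taken as n^(-1/5) for one-dimensional data, where n is the number of samples. This is Scott's rule, the default in scipy.stats.gaussian_kde.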
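As a rough check of that rule, the sketch below computes n^(-1/5) by hand and compares it with the factor that scipy.stats.gaussian_kde picks by default. The sample size of 6325 is an assumption chosen only because it gives a factor close to the 0.1737 used above; the actual size of that data set is not stated here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = rng.normal(size=6325)       # illustrative data; n chosen for this example
n = samples.size

# Scott's rule for d-dimensional data is n ** (-1 / (d + 4)); here d = 1
scott_factor = n ** (-1 / 5)

# With no bw_method given, gaussian_kde uses Scott's rule
kde = stats.gaussian_kde(samples)

print(round(scott_factor, 4), round(kde.factor, 4))
```

A smaller factor follows the data more closely and produces a bumpier curve; a larger one smooths more.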

Further reading: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#Notes

<source lang=python>
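>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> s1 = np.random.normal(-1.0, 1, 320)
>>> s2 = np.random.normal(2.0, 0.6, 32)
>>> s = np.hstack([s1, s2])
>>> pdf = stats.gaussian_kde(s)   # the stats.kde namespace is deprecated
>>> x = np.linspace(-5, 5, 200)
>>> plt.plot(x, pdf(x), 'r')
>>> plt.hist(s, density=True, alpha=0.45, color='purple')   # 'normed' was removed from Matplotlib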
>>> plt.show()
</source>

stats.norm.rvs(), ppf(), pdf(), cdf():  https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
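A short sketch of what these four methods return for the standard normal distribution. The variable names are illustrative only:

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)            # standard normal distribution

samples = dist.rvs(size=5, random_state=0)   # rvs: draw random samples
density_at_zero = dist.pdf(0)                # pdf: density at x, 1/sqrt(2*pi) at 0
prob_below_zero = dist.cdf(0)                # cdf: P(X <= x)
q975 = dist.ppf(0.975)                       # ppf: inverse of cdf (quantile function)

print(samples)
print(density_at_zero, prob_below_zero, q975)
```

<code>ppf()</code> and <code>cdf()</code> are inverses of each other, so <code>dist.cdf(dist.ppf(p))</code> returns <code>p</code> up to floating-point error.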

<br>

=== Bar chart ===

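Counting how many times event A occurs each day is already a frequency count: each bin is a 24-hour window, and the number of bins grows naturally as days pass. Data of this kind can be shown directly as a bar chart: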

<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn - hb   # nationwide minus Hubei = cases outside Hubei
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()
</source>
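When the raw data is a list of event timestamps rather than ready-made daily totals, the daily counts described above can be produced with a resample. A minimal sketch, using made-up timestamps:

```python
import pandas as pd

# Illustrative event timestamps (not data from the CSV files above)
events = pd.to_datetime([
    '2020-02-01 08:15', '2020-02-01 21:40',
    '2020-02-02 09:05',
    '2020-02-04 00:30', '2020-02-04 12:00', '2020-02-04 23:59',
])

# One row per event, then sum inside each calendar-day bin;
# days without events are kept with a count of 0
daily = pd.Series(1, index=events).resample('D').sum()

print(daily)
```

The resulting series can be passed to <code>plt.bar(daily.index, daily.values)</code> in the same way as the columns read from CSV.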

To show the Hubei and non-Hubei bars side by side:

<source lang=python>
plt.bar(xhb.index, xhb['Confirmed'].values, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
</source>

* [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html Pandas KDE]
* [https://matplotlib.org/tutorials/introductory/lifecycle.html#sphx-glr-tutorials-introductory-lifecycle-py X-axis label formatting]
* [https://matplotlib.org/gallery/statistics/histogram_cumulative.html?highlight=cdf Using histograms to plot a cumulative distribution]

<br>

=== Reverse operation of value_counts() ===

<source lang=python>
>>> import numpy as np
>>> import pandas as pd
>>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
>>> cc = col.value_counts()
>>> cc
3.0    3
1.0    2
2.0    1
>>> np.repeat(cc.index, cc)
Float64Index([3.0, 3.0, 3.0, 1.0, 1.0, 2.0], dtype='float64')
>>> pd.Series(np.repeat(cc.index, cc))
0    3.0
1    3.0
2    3.0
3    1.0
4    1.0
5    2.0
</source>

For multiple columns you can use:

<source lang=python>
>>> df.loc[df.index.repeat(df['Count'])]
</source>
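A self-contained sketch of the multi-column case. The frame and its <code>city</code> and <code>grade</code> columns are made up for illustration; only the <code>Count</code> column name comes from the snippet above:

```python
import pandas as pd

# Hypothetical aggregated table: each row stands for 'Count' identical records
df = pd.DataFrame({
    'city':  ['A', 'B', 'C'],
    'grade': [1, 2, 3],
    'Count': [2, 0, 3],
})

# Repeat each index label 'Count' times, select those rows, then tidy up
expanded = (
    df.loc[df.index.repeat(df['Count'])]
      .drop(columns='Count')
      .reset_index(drop=True)
)

print(expanded)
```

Rows with a count of 0 disappear from the expanded frame, and the total number of rows equals <code>df['Count'].sum()</code>.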

<br>