
== Overview ==

* [https://blog.csdn.net/qq_32412759/article/details/77774286 Statistical analysis with Python]
* [https://www.jianshu.com/p/933f45ba36fb Statistical data analysis in Python]
* [https://www.jianshu.com/p/ffa8c60ccbc3 Descriptive statistics]
* [https://www.jianshu.com/p/98061b97e485 Sampling methods and sampling distributions]
* [https://www.jianshu.com/p/44041c4ba9e4 Parameter estimation]
* [https://www.jianshu.com/p/3e094e12c906 Analysis of variance]
* [https://www.jianshu.com/p/f899312ee01d Cluster analysis]
* [https://www.jianshu.com/p/59e685d96970 Principal component analysis]
* [https://www.jianshu.com/p/bdebc1700ceb Linear regression]

* [https://www.coursera.org/learn/exploratory-data-analysis EDA coursera]

<br>

== Descriptive statistics ==

=== Measures of location ===

An intuitive example:

<source lang=python>
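>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> d = np.array([1, 2, 2, 100, 3, 3, 6, 8])   # 100 is an outlier
>>> np.mean(d)                # pulled upward by the outlier
15.625
>>> stats.trim_mean(d, 0.2)   # drops the smallest and the largest value before averaging
4.0
>>> np.median(d)
3.0

>>> plt.plot(d, 'o'); plt.show()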
</source>
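
To make the trimmed mean less of a black box, here is a minimal sketch of the same calculation in plain NumPy. The helper name <code>trimmed_mean</code> is made up for this illustration and is not part of SciPy. It assumes the number of values cut from each end is rounded down, which matches the result of <code>stats.trim_mean(d, 0.2)</code> shown above.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after cutting `proportion` of the sorted values from each end."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(x))      # values dropped per end, rounded down
    return x[k:len(x) - k].mean()

d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
print(trimmed_mean(d, 0.2))   # 4.0: drops 1 and 100, averages [2, 2, 3, 3, 6, 8]
print(trimmed_mean(d, 0.0))   # 15.625: no trimming, same as np.mean(d)
```

With 8 values and a proportion of 0.2, only one value is removed per end (0.2 * 8 = 1.6, rounded down to 1), which is enough to discard the outlier 100.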


A real-world example:

<source lang=python>
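>>> import pandas as pd
>>> from scipy import stats

>>> # pd.to_datetime() reads the float as epoch nanoseconds; adding 28800000000000 ns shifts it by 8 hours.
>>> # Note: the date_parser argument is deprecated since pandas 2.0.
>>> p = pd.read_csv('../DA/data/da01-press.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> p = p.drop(columns=['name'])
>>> p.mean()
Press    3685.248525

>>> stats.trim_mean(p, 0.1)   # same result as stats.trimboth(p['Press'], 0.1).mean()
array([3680.07826531])

>>> p.median()
Press    3677.105

>>> p.describe()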
             Press
count   122.000000
mean   3685.248525
std     123.990939
min    3484.480000
25%    3618.402500
50%    3677.105000
75%    3747.742500
max    4672.060000
</source>

* [https://numpy.org/doc/1.18/reference/routines.statistics.html NumPy Statistics]
* [http://docs.scipy.org/doc/scipy/reference/stats.html SciPy Statistical functions]
* [https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=statistics#descriptive-statistics Pandas Statistics]

<br>

=== Measures of dispersion ===

<source lang=python>
>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> # Repeat each statistic len(d) times so that it plots as a horizontal line
>>> meanl = np.array([np.mean(d)]*len(d)); trimmeanl = np.array([stats.trim_mean(d, 0.2)]*len(d)); medianl = np.array([np.median(d)]*len(d))
>>> iqrv = np.array([stats.iqr(d)]*len(d))
>>> down = medianl - iqrv; up = medianl + iqrv   # band of one IQR on each side of the median
>>> plt.plot(d,'o',color='C1'); plt.plot(meanl, ':C2', label='Mean'); plt.plot(trimmeanl, ':r', label='Trim mean'); plt.plot(medianl, '-g', label='Median')

>>> plt.plot(up, '-C1'); plt.plot(down, '-C1')

>>> plt.legend(); plt.grid(); plt.show()
</source>
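
The plot above only shows dispersion visually. As a complement, the following sketch computes three common dispersion measures for the same array using NumPy only. The choice of these three measures is an assumption of this example, not something the original notes prescribe.

```python
import numpy as np

d = np.array([3, 1, 5, 3, 15, 6, 7, 2])

q1, q3 = np.percentile(d, [25, 75])
iqr = q3 - q1                                # interquartile range
mad = np.median(np.abs(d - np.median(d)))    # median absolute deviation
std = np.std(d, ddof=1)                      # sample standard deviation

print(iqr)   # 3.5
print(mad)   # 2.0
print(std)   # larger than the IQR here, because the value 15 inflates it
```

The IQR and the median absolute deviation are based on ranks, so the single large value 15 affects them much less than it affects the standard deviation.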

<br>

=== Correlation estimation ===

<source lang=python>
>>> t1 = pd.read_csv('../DA/data/da02-temp-0948.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> t2 = pd.read_csv('../DA/data/da02-temp-0019.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> plt.plot(t1.index, t1['Temp'], label='t1')
>>> plt.plot(t2.index, t2['Temp'], label='t2')
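>>> # t3 is not defined in this snippet; it must hold one value per row of t1 before this line runs
>>> plt.plot(t1['Temp'].index,t3, label='t3')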
>>> plt.legend(); plt.show()
</source>
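
The block above compares the series by eye. A numeric estimate can be obtained with <code>Series.corr()</code>, which computes the Pearson coefficient by default and aligns the two series on their index. The sketch below uses made-up temperature data in place of the two CSV files, so the resulting value says nothing about the real sensors.

```python
import numpy as np
import pandas as pd

# Made-up daily temperatures standing in for the two sensor files
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', periods=30, freq='D')
base = 20 + 5 * np.sin(np.linspace(0, 4 * np.pi, len(idx)))

t1 = pd.Series(base, index=idx, name='Temp')
t2 = pd.Series(base + rng.normal(0, 1, len(idx)), index=idx, name='Temp')  # t1 plus noise

r = t1.corr(t2)                                        # Pearson correlation
r_np = np.corrcoef(t1.to_numpy(), t2.to_numpy())[0, 1] # same quantity via NumPy
print(round(r, 3), round(r_np, 3))
```

A value close to 1 means the two series rise and fall together; a value close to 0 means no linear relationship.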

<br>

== Exploring data distributions ==

=== Frequency counts ===

<source lang=python>
>>> import pandas as pd
>>> a = pd.Series([0.1, 1.2, 1.2, 2.1, 2.1, 3, 2])
>>> a.value_counts()
2.1    2
1.2    2
0.1    1
2.0    1
3.0    1
>>> a.value_counts(normalize=True)
2.1    0.285714
1.2    0.285714
0.1    0.142857
2.0    0.142857
3.0    0.142857
</source>

A more advanced approach is to use pandas.cut() to count values per interval:

<source lang=python>
>>> import numpy as np
>>> ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)   # right-closed intervals: (0, 10], (10, 13], ..., (21, inf]
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> grp = pd.cut(ag, bins=bins, labels=labels)
>>> grp
0            child
1            child
2            child
3            child
4            child
5            child
6          preteen
7             teen
8             teen
9             teen
10    military_age
11    military_age
12           adult
13           adult
14           adult
15           adult
16           adult
dtype: category
Categories (5, object): [child < preteen < teen < military_age < adult]
>>> grp.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
</source>
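
In the output above, the age 10 is counted as <code>child</code> and 18 as <code>teen</code>, because <code>pandas.cut()</code> closes each interval on the right by default. The sketch below, which reuses the same bins and labels on a few made-up edge values, shows how <code>right=False</code> moves values that sit exactly on a bin edge into the next group.

```python
import numpy as np
import pandas as pd

bins = (0, 10, 13, 18, 21, np.inf)
labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
edge_ages = pd.Series([10, 13, 18])   # values sitting exactly on bin edges

right_closed = pd.cut(edge_ages, bins=bins, labels=labels)               # (0, 10], (10, 13], ...
left_closed = pd.cut(edge_ages, bins=bins, labels=labels, right=False)   # [0, 10), [10, 13), ...

print(right_closed.tolist())   # ['child', 'preteen', 'teen']
print(left_closed.tolist())    # ['preteen', 'teen', 'military_age']
```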

<br>

=== Histogram ===

<source lang=python>
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> a = pd.Series([1,2,2,3,3,4,5,6])
>>> a.value_counts()
3    2
2    2
6    1
5    1
4    1
1    1
>>> a.plot.hist(bins=6,rwidth=0.9)  # histogram of raw counts per bin

>>> a.value_counts(normalize=True)
3    0.250
2    0.250
6    0.125
5    0.125
4    0.125
1    0.125
>>> a.plot.hist(bins=6, rwidth=0.9, density=True)  # density histogram: bar height = count / (total * bin width), so the bar areas sum to 1; equals value_counts(normalize=True) only when the bin width is 1

>>> plt.show()
</source>
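
The relationship between a density histogram and the relative frequencies can be checked numerically with <code>np.histogram</code>, on which matplotlib's histogram is built. This is a small sketch on the same eight values; the variable names are chosen for this example only.

```python
import numpy as np

a = np.array([1, 2, 2, 3, 3, 4, 5, 6])

counts, edges = np.histogram(a, bins=6)               # raw counts per bin
density, _ = np.histogram(a, bins=6, density=True)    # normalized so the total area is 1
widths = np.diff(edges)                               # every bin is 5/6 wide here

prob = counts / counts.sum()     # relative frequency per bin
print(counts)                    # [1 2 2 1 1 1]
print(prob)
print(density * widths)          # same values as prob
```

Because each bin is 5/6 wide rather than 1, the density heights are 1.2 times the relative frequencies. To plot relative frequencies directly, one option is to pass <code>weights=np.ones(len(a)) / len(a)</code> to the histogram call instead of <code>density=True</code>.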

<source lang=python>
>>> c = pd.Series(np.random.gamma(10,size=1000))
>>> c.plot.hist(grid=True,bins=20,rwidth=0.9)   # plt.hist(c,bins=20,rwidth=0.9)
>>> plt.grid(axis='y',alpha=0.75)
>>> plt.show()
</source>

<br>

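=== Bar chart ===

Counting how many times event A occurs each day is already a frequency computation: each bin is a 24-hour window, and the number of bins grows naturally as the days pass. Data of this kind can be shown directly as a bar chart: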

<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn-hb   # nationwide minus Hubei = outside Hubei
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))   # show dates as month-day
# Hubei only:
#plt.bar(hb.index, hb['Confirmed'].values)
# Outside Hubei:
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()
</source>

To show the Hubei and non-Hubei bars side by side:

<source lang=python>
xhb_cf = xhb['Confirmed'].values   # assumed definition: confirmed counts outside Hubei
plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
</source>
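
The remark at the start of this section, that daily counts are a frequency computation with 24-hour bins, can be sketched with <code>resample('D')</code>. The event timestamps below are made up for illustration and are unrelated to the CSV files used above.

```python
import pandas as pd

# Hypothetical timestamps of event A (made-up data)
events = pd.to_datetime([
    '2020-02-01 08:15', '2020-02-01 21:40',
    '2020-02-02 10:05',
    '2020-02-04 00:30', '2020-02-04 12:00', '2020-02-04 23:59',
])

s = pd.Series(1, index=events)    # one row per event
daily = s.resample('D').sum()     # one bin per calendar day; days without events count as 0
print(daily.tolist())             # [2, 1, 0, 3]
```

The resulting <code>daily</code> series has one value per day, so it can be passed to <code>plt.bar(daily.index, daily.values)</code> in the same way as the confirmed counts above.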


* [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html Pandas KDE]
* [https://matplotlib.org/tutorials/introductory/lifecycle.html#sphx-glr-tutorials-introductory-lifecycle-py X-axis label formatting]

<br>

== Time series analysis ==

* [https://www.jianshu.com/p/b91e3ae940ec pandas date handling]

<source lang=python>
>>> x = pd.date_range('2020-1-9','2020-2-15',freq='1d')
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15'],
              dtype='datetime64[ns]', freq='D')
</source>
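
A <code>DatetimeIndex</code> like the one above becomes useful once it indexes a series. The sketch below attaches made-up values (0 to 37) to the same date range and shows three common operations. Which operations matter in practice depends on the data, so treat this as an illustration rather than a recipe.

```python
import numpy as np
import pandas as pd

x = pd.date_range('2020-1-9', '2020-2-15', freq='1d')
s = pd.Series(np.arange(len(x)), index=x)   # made-up values 0..37

feb = s.loc['2020-02']            # partial-string indexing: every day in February
weekly = s.resample('W').sum()    # weekly totals, weeks ending on Sunday
smooth = s.rolling(7).mean()      # 7-day moving average, NaN for the first 6 days

print(len(x), len(feb))           # 38 15
print(smooth.iloc[6])             # 3.0, the mean of 0..6
```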

<br>

== Reference ==

* [https://numpy.org/doc/1.18/reference/ Numpy API reference]
* [https://pandas.pydata.org/pandas-docs/stable/reference/index.html Pandas API reference]
* [https://matplotlib.org/gallery/index.html matplotlib Gallery]
* [https://matplotlib.org/users/dflt_style_changes.html Change the Colors] Changes to the default style
* [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html matplotlib.pyplot.plot()]
* [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html matplotlib.pyplot.figure()]


* [https://www.kaggle.com/kp4920/s-p-500-stock-data-time-series-analysis Time Series Analysis Example]
* [https://www.kaggle.com/usengecoder/introduction-to-data-science Introduction to Data Science]
* [https://www.kaggle.com/residentmario/welcome-to-data-visualization/ Data Visualization tutorial]
* [https://flowingdata.com/category/tutorials/ FlowingData Tutorials]


* [https://datascienceguide.github.io/outline Data Science Guide]

<br><br>