DA

2020年3月19日 (四) 18:31的版本

1 Overview

EDA coursera

2 描述性统计

2.1 位置估计

直观的：

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
np.mean(d); stats.trim_mean(d, 0.2); np.median(d)
15.625
4.0
3.0

>>> plt.plot(d, 'o'); plt.show()

实际的：

import pandas as pd
from scipy import stats

p = pd.read_csv('../DA/data/da01-press.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
p = p.drop(columns=['name'])
p.mean()
Press    3685.248525

stats.trim_mean(p, 0.1)   # stats.trimboth(p['Press'],0.1).mean()
array([3680.07826531])

p.median()
Press    3677.105

p.describe()
             Press
count   122.000000
mean   3685.248525
std     123.990939
min    3484.480000
25%    3618.402500
50%    3677.105000
75%    3747.742500
max    4672.060000

2.2 变异性估计

>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> meanl = np.array([np.mean(d)]*len(d)); trimmeanl = np.array([stats.trim_mean(d, 0.2)]*len(d)); medianl = np.array([np.median(d)]*len(d))
>>> iqrv = np.array([stats.iqr(d)]*len(d))
>>> down = medianl -iqrv; up = medianl+iqrv
>>> plt.plot(d,'o',color='C1'); plt.plot(meanl, ':C2', label='Mean'); plt.plot(trimmeanl, ':r', label='Trim mean'); plt.plot(medianl, '-g', label='Meidan')

>>> plt.plot(up, '-C1'); plt.plot(down, '-C1')

>>> plt.legend(); plt.grid(); plt.show()

2.3 相关性估计

>>> t1 = pd.read_csv('../DA/data/da02-temp-0948.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> t2 = pd.read_csv('../DA/data/da02-temp-0019.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> plt.plot(t1.index, t1['Temp'], label='t1')
>>> plt.plot(t2.index, t2['Temp'], label='t2')
>>> plt.plot(t1['Temp'].index,t3, label='t3')
>>> plt.legend(); plt.show()

2.3.1 Pearson

>>> t1 = np.array([1,2,3,4,3,2,1])
>>> t2 = np.array([2,4,6,8,6,4,2])
>>> t3 = np.random.normal(4, 1, 7)
>>> stats.pearsonr(t1, t2)
(0.9999999999999998, 1.411088991461081e-39)
>>> stats.pearsonr(t2, t3)
(0.13788121813127208, 0.7681442360425068)
>>> stats.pearsonr(t1, t3)
(0.13788121813127208, 0.7681442360425068)
>>> t4 = np.array([1,2,3,4,3,2,1])
>>> stats.pearsonr(t1, t4)
(0.9999999999999998, 1.411088991461081e-39)

stats.pearsonr() 返回两个值，一个为皮尔逊相关系数 (Pearson's correlation)，另一个为 p-value（表示相关系数不能表示其相关性的概率，即：失效的概率）

scipy.stats.pearsonr()

p-value: Two-tailed p-value

2.3.2 Spearman

斯皮尔曼等级相关系数 (Spearman's correlation coefficient for ranked data)

>>> print(stats.spearmanr([1,2,3,4,5], [5,6,7,8,7]))
SpearmanrResult(correlation=0.8207826816681233, pvalue=0.08858700531354381)

scipy.stats.spearmanr()

p-value: The two-sided p-value, null hypothesis is that two sets of data are uncorrelated

3 探索数据分布

3.1 频数统计

>>> import pandas as pd
>>> a = pd.Series([0.1, 1.2, 1.2, 2.1, 2.1, 3, 2,])
>>> a.value_counts()
2.1    2
1.2    2
0.1    1
2.0    1
3.0    1
>>> a.value_counts(normalize=True)
2.1    0.285714
1.2    0.285714
0.1    0.142857
2.0    0.142857
3.0    0.142857

高级的，使用 pandas.cut() 进行区间统计:

>>> ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> grp = pd.cut(ag, bins=bins, labels=labels)
>>> grp
0            child
1            child
2            child
3            child
4            child
5            child
6          preteen
7             teen
8             teen
9             teen
10    military_age
11    military_age
12           adult
13           adult
14           adult
15           adult
16           adult
dtype: category
Categories (5, object): [child < preteen < teen < military_age < adult]
>>> grp.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1

3.2 直方图 (histogram)

>>> import pandas as pd
>>> a = pd.Series([1,2,2,3,3,4,5,6])
>>> a.value_counts()
3    2
2    2
6    1
5    1
4    1
1    1
# 各数出现频次统计直方图
>>> a.plot.hist(bins=6,rwidth=0.9)

# 各数出现概率 （频次/总数）直方图
>>> a.value_counts(normalize=True)
3    0.250
2    0.250
6    0.125
5    0.125
4    0.125
1    0.125
>>> a.plot.hist(bins=6, rwidth=0.9, density=True)  # normalize，与 pandas.value_counts(normalize=True) 类似

>>> plt.show()

>>> c = pd.Series(np.random.gamma(10,size=1000)**1.5)
>>> c.plot.hist(grid=True,bins=20,rwidth=0.9)   # plt.hist(c,bins=20,rwidth=0.9)
>>> plt.grid(axis='y',alpha=0.75)
>>> plt.show()

more info please refere to: matplotlib.pyplot.hist

3.3 KDE

核密度估计 (Kernel Density Estimate, KDE), 用来估计未知密度函数，属于非参数检验方法之一

>>> np.random.normal(loc=(10,20),scale=(4,2),size=(5,2))
array([[15.87305077, 20.3740753 ],
       [14.40449246, 20.73788215],
       [12.51111038, 20.81289712],
       [ 9.55461887, 21.48781844],
       [-0.72336527, 18.81365079]])
>>> dist = pd.DataFrame(np.random.normal(loc=(10,20), scale=(4,2), size=(1000, 2)), columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
>>> fig, ax = plt.subplots()
>>> dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
>>> dist.plot.hist(density=True, ax=ax)
>>> ax.set_ylabel('Probability')
>>> ax.grid(axis='y')
>>> ax.set_facecolor('#d8dcd6')

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

p = pd.read_csv('./data/da03-press.csv',index_col='time')
pp = p['Press']

pp.plot.hist(bins=150, rwidth=.9, density=True, color='C2', alpha=0.8)
pp.plot.kde(bw_method=0.1737, color='C1')

plt.ylabel('Probability'); plt.xlim(xmin=3200,xmax=4200); plt.xlabel('hPa')
plt.grid(linewidth=0.8)
plt.show()
#sns.distplot(pp, color="#ff8000")
#plt.show()

bw_method 一般取 n^(-1/5)

>>> s1 = np.random.normal(-1.0, 1, 320)
>>> s2 = np.random.normal(2.0, 0.6, 32)
>>> s = np.hstack([s1, s2])
>>> pdf = stats.kde.gaussian_kde(s)
>>> x = np.linspace(-5, 5, 200)
>>> plt.plot(x, pdf(x), 'r')
>>> plt.hist(s, normed=1, alpha=0.45, color='purple')
>>> plt.show()

stats.norm.rvs(), ppf(), pdf(), cdf(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html

3.4 柱状图 (bar)

每天统计事件 A 发生的次数，其实已经做了单个窗口是 24 小时、bins 持续自然增长的频数运算。这类数据直接用柱状图 (bar) 显示一下即可：

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn-hb
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()

同时显示湖北和非湖北柱状图：

plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()

3.5 Reverse operation of value_counts()

>>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
>>> cc =col.value_counts()
>>> cc
3.0    3
1.0    2
2.0    1
>>> np.repeat(cc.index, cc)
Float64Index([3.0, 3.0, 3.0, 1.0, 1.0, 2.0], dtype='float64')
>>> pd.Series(np.repeat(cc.index, cc))
0    3.0
1    3.0
2    3.0
3    1.0
4    1.0
5    2.0

For multiple columns you can use:

>>> df.loc[df.index.repeat(df['Count'])]

4 假设检验

要解决的问题：在一个样本中观察到的效应是否也会出现在更大规模的总体中？

方法：

Fisher 原假设检验
Neyman-Pearson 决策理论
贝叶斯推理

这三种方法还有一个子集：经典假设检验 (Classical Hypothesis Testing)

经典假设检验 (CHT) 要回答的问题是：在一个样本中观察到的效应，其是偶然出现的概率是多少？步骤：

选一个检验统计量 (Test Statistic)，量化观测到的效应
定义原假设 (Null Hypothesis)：观测到的效应为假。即观测的效应是偶然产生的
计算 p 值 (p-value)，p 值为原假设为真的概率。即一个效应偶然出现的概率
解释结果。如果 p 值很低（一般小于 5%），说明原假设为真的概率很低，效应偶然出现的概率很低，即：效应是显著的，称为统计显著 (Statistically Significant)

本质就是反证法。。。p-value 实际求得是检验统计量 (Test Statistic) 在其分布两端 (Two-tailed) 的概率

4.1 正态检验

4.1.1 QQ 图

>>> np.random.seed(12345678)
>>> x = np.random.normal(5,3,100)
>>> stats.probplot(x, plot=plt); plt.show()

scipy.stats.probplot()

4.1.2 Shapiro-Wilk

Shapiro-Wilk W 检验，基于观测值的排序统计量的协方差矩阵的检验，可以被用于小于等于 50 的样本量下

scipy.stats.shapiro()

返回值 [W, p-value]

>>> np.random.seed(12345678)
>>> x = np.random.normal(5, 3, 100)
>>> np.random.seed()
>>> y = np.random.normal(5, 3, 100)

>>> stats.shapiro(x)
(0.9772805571556091, 0.08144091814756393)
>>> stats.shapiro(y)
(0.9933551549911499, 0.9085326790809631)

p-value: for the hypothesis test

4.1.3 Kolmogorov-Smirnov

科尔莫戈罗夫检验(Kolmogorov-Smirnov test)，检验样本数据是否服从某一分布，仅适用于连续分布的检验。下例中用它检验正态分布。

>>> stats.kstest(x,'norm')
KstestResult(statistic=0.8801115630229508, pvalue=1.7157931366221766e-92)
>>> stats.kstest(y,'norm')
KstestResult(statistic=0.8168376836753909, pvalue=1.7239988712511043e-73)

scipy.stats.kstest()

p-value: One-tailed or two-tailed p-value

4.1.4 Pearson omnibus

D'Agostino-Pearson omnibus 检验

stats.normaltest()

>>> stats.normaltest(x)
NormaltestResult(statistic=6.528044509508757, pvalue=0.03823430021917039)
>>> stats.normaltest(y)
NormaltestResult(statistic=0.7706971982031684, pvalue=0.6802134730639648)

p-value: A 2-sided chi squared probability for the hypothesis test

5 时序数据分析

5.1 datetime

>>> d = {'date': 20200318, 'positive': 7731, 'death': 112}
>>> d['date']
20200318
>>> pd.to_datetime(d['date'], format='%Y%m%d')
Timestamp('2020-03-18 00:00:00')

5.2 datetime range

>>> x = pd.date_range('2020-1-9','2020-2-15',freq='1d')
>>> x.astype(str).tolist()  # 转字符串 list
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15'],
              dtype='datetime64[ns]', freq='D')

>>> ii = np.arange('2020-01-15',5,1,dtype='M8[D]')
array(['2020-01-15', '2020-01-16', '2020-01-17', '2020-01-18',
       '2020-01-19'], dtype='datetime64[D]')
>>> iii = np.datetime_as_string(ii, unit='D')  # 转字符串 list
array(['2020-01-15', '2020-01-16', '2020-01-17', '2020-01-18',
       '2020-01-19'], dtype='<U28')

>>> from datetime import datetime
>>> [datetime.strptime(d, '%Y-%m-%d').date() for d in iii]
[datetime.date(2020, 1, 15), datetime.date(2020, 1, 16), datetime.date(2020, 1, 17)
, datetime.date(2020, 1, 18), datetime.date(2020, 1, 19)]

6 Pandas

插入一行：

>>> us.loc[pd.to_datetime(d['date'],format='%Y%m%d')] = [d['positive'], , d['death'], 0]

7 Reference

Numpy API reference
Pandas API reference
matplotlib Gallery
Change the Colors Changes to the default style
matplotlib.pyplot.plot()
matplotlib.pyplot.figure()

Flourish 也有折线图版本：Line chart race

宏观数据库：https://www.ceicdata.com/zh-hans
国家统计局数据：http://data.stats.gov.cn/ https://mp.weixin.qq.com/s/6t5Wz1PTbG_ZKD88QAFH5g
- 新中国六十年统计资料汇编
- 各省市国民经济与社会发展统计公报
- 中国统计年鉴，各省市统计年鉴
- 各地区财政预算执行情况与财政预算公告

Data Science Guide

@@ 第504行： / 第504行： @@
 == Pandas ==
-<source lang=python>
+插入一行：
+<source lang=python>
+>>> us.loc[pd.to_datetime(d['date'],format='%Y%m%d')] = [d['positive'], , d['death'], 0]
 </source>

DA