DA
Latest revision as of 23:44, 25 March 2020 (Wednesday)
1 Overview
2 Descriptive statistics
2.1 Location estimates
An intuitive example:
<source lang=python>
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
>>> np.mean(d); stats.trim_mean(d, 0.2); np.median(d)
15.625
4.0
3.0
>>> plt.plot(d, 'o'); plt.show()
</source>
A real-world example:
<source lang=python>
>>> import pandas as pd
>>> from scipy import stats
>>> # The parser adds 28800000000000 ns (8 hours) to each epoch timestamp.
>>> # Note: date_parser is deprecated since pandas 2.0.
>>> p = pd.read_csv('../DA/data/da01-press.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> p = p.drop(columns=['name'])
>>> p.mean()
Press    3685.248525
>>> stats.trim_mean(p, 0.1)    # stats.trimboth(p['Press'],0.1).mean()
array([3680.07826531])
>>> p.median()
Press    3677.105
>>> p.describe()
             Press
count   122.000000
mean   3685.248525
std     123.990939
min    3484.480000
25%    3618.402500
50%    3677.105000
75%    3747.742500
max    4672.060000
</source>
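To make the trimmed mean concrete, here is a minimal sketch in plain NumPy. The helper name `trimmed_mean` is hypothetical; it assumes the same convention as `stats.trim_mean`, where `int(proportion * n)` observations are dropped from each end of the sorted data.

```python
import numpy as np

def trimmed_mean(values, proportion):
    """Mean after dropping int(proportion * n) points from each sorted end."""
    data = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * data.size)   # number of points cut per side
    return data[k:data.size - k].mean()

d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
print(np.mean(d))            # pulled up by the outlier 100
print(trimmed_mean(d, 0.2))  # drops 1 and 100, averages the rest
print(np.median(d))          # unaffected by the size of the outlier
```

With eight values and a proportion of 0.2, one value is cut from each side, which is why the outlier 100 no longer influences the result.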
2.2 Variability estimates
<source lang=python>
>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> meanl = np.array([np.mean(d)]*len(d))
>>> trimmeanl = np.array([stats.trim_mean(d, 0.2)]*len(d))
>>> medianl = np.array([np.median(d)]*len(d))
>>> iqrv = np.array([stats.iqr(d)]*len(d))
>>> down = medianl - iqrv; up = medianl + iqrv    # band of one IQR around the median
>>> plt.plot(d, 'o', color='C1')
>>> plt.plot(meanl, ':C2', label='Mean'); plt.plot(trimmeanl, ':r', label='Trim mean'); plt.plot(medianl, '-g', label='Median')
>>> plt.plot(up, '-C1'); plt.plot(down, '-C1')
>>> plt.legend(); plt.grid(); plt.show()
</source>
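The plot above only draws the interquartile range. As a rough sketch of the numbers behind it, the following computes three common variability estimates for the same array with NumPy alone. It assumes NumPy's default (linear) percentile interpolation, which is also what `stats.iqr` uses by default.

```python
import numpy as np

d = np.array([3, 1, 5, 3, 15, 6, 7, 2])

std = d.std(ddof=1)                          # sample standard deviation
q1, q3 = np.percentile(d, [25, 75])
iqr = q3 - q1                                # interquartile range
mad = np.median(np.abs(d - np.median(d)))    # median absolute deviation

print(f"std={std:.3f}  IQR={iqr}  MAD={mad}")
```

The standard deviation is inflated by the single value 15, while the IQR and the median absolute deviation barely react to it.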
2.3 Correlation estimates
<source lang=python>
>>> t1 = pd.read_csv('../DA/data/da02-temp-0948.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> t2 = pd.read_csv('../DA/data/da02-temp-0019.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> plt.plot(t1.index, t1['Temp'], label='t1')
>>> plt.plot(t2.index, t2['Temp'], label='t2')
>>> plt.plot(t1['Temp'].index, t3, label='t3')    # t3 is not defined in this snippet
>>> plt.legend(); plt.show()
</source>
2.3.1 Pearson
<source lang=python>
>>> t1 = np.array([1,2,3,4,3,2,1])
>>> t2 = np.array([2,4,6,8,6,4,2])
>>> t3 = np.random.normal(4, 1, 7)
>>> stats.pearsonr(t1, t2)
(0.9999999999999998, 1.411088991461081e-39)
>>> stats.pearsonr(t2, t3)
(0.13788121813127208, 0.7681442360425068)
>>> stats.pearsonr(t1, t3)
(0.13788121813127208, 0.7681442360425068)
>>> t4 = np.array([1,2,3,4,3,2,1])
>>> stats.pearsonr(t1, t4)
(0.9999999999999998, 1.411088991461081e-39)
</source>
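stats.pearsonr() returns two values: the Pearson correlation coefficient and a p-value. The p-value is the probability, assuming the two samples are actually uncorrelated, of obtaining a correlation at least as strong as the one observed. A small p-value therefore means the observed correlation is unlikely to be a chance result.
See scipy.stats.pearsonr(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
p-value: Two-tailed p-value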
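For reference, the coefficient itself can be written out in a few lines. This is a sketch of the textbook formula, not of SciPy's internals; `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Sum of co-deviations divided by the product of the two spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

t1 = np.array([1, 2, 3, 4, 3, 2, 1])
t2 = 2 * t1                        # exact linear function of t1
print(pearson_r(t1, t2))           # very close to 1
print(np.corrcoef(t1, t2)[0, 1])   # NumPy's built-in gives the same value
```

Because the coefficient only measures linear association, scaling one series (as with t2 = 2 * t1) leaves it unchanged, which matches the identical results for t1 and t2 above.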
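2.3.2 Spearman
Spearman's rank correlation coefficient (Spearman's correlation coefficient for ranked data)
<source lang=python>
>>> print(stats.spearmanr([1,2,3,4,5], [5,6,7,8,7]))
SpearmanrResult(correlation=0.8207826816681233, pvalue=0.08858700531354381)
</source>
See scipy.stats.spearmanr(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
p-value: The two-sided p-value; the null hypothesis is that the two sets of data are uncorrelated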
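Spearman's coefficient is the Pearson coefficient applied to ranks instead of raw values. The sketch below reproduces the correlation value shown above with NumPy only; `average_ranks` is a hypothetical helper that gives tied values the mean of their positions.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 7]

rx, ry = average_ranks(x), average_ranks(y)
rho = np.corrcoef(rx, ry)[0, 1]    # Pearson correlation of the ranks
print(ry)
print(rho)
```

The two 7s in y share rank 3.5, which is what keeps the coefficient below 1 even though y mostly rises with x.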
3 Exploring data distributions
3.1 Frequency counts
<source lang=python>
>>> import pandas as pd
>>> a = pd.Series([0.1, 1.2, 1.2, 2.1, 2.1, 3, 2,])
>>> a.value_counts()
2.1    2
1.2    2
0.1    1
2.0    1
3.0    1
>>> a.value_counts(normalize=True)
2.1    0.285714
1.2    0.285714
0.1    0.142857
2.0    0.142857
3.0    0.142857
</source>
A more advanced option is to count values by interval with pandas.cut():
<source lang=python>
>>> ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> grp = pd.cut(ag, bins=bins, labels=labels)
>>> grp
0            child
1            child
2            child
3            child
4            child
5            child
6          preteen
7             teen
8             teen
9             teen
10    military_age
11    military_age
12           adult
13           adult
14           adult
15           adult
16           adult
dtype: category
Categories (5, object): [child < preteen < teen < military_age < adult]
>>> grp.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
</source>
3.2 Histogram
<source lang=python>
>>> import pandas as pd
>>> a = pd.Series([1,2,2,3,3,4,5,6])
>>> a.value_counts()
3    2
2    2
6    1
5    1
4    1
1    1
# Histogram of how often each value occurs
>>> a.plot.hist(bins=6,rwidth=0.9)
# Histogram of each value's relative frequency (count / total)
>>> a.value_counts(normalize=True)
3    0.250
2    0.250
6    0.125
5    0.125
4    0.125
1    0.125
>>> a.plot.hist(bins=6, rwidth=0.9, density=True)    # normalized, similar to pandas value_counts(normalize=True)
>>> plt.show()
</source>
<source lang=python>
>>> c = pd.Series(np.random.gamma(10,size=1000)**1.5)
>>> c.plot.hist(grid=True,bins=20,rwidth=0.9)    # plt.hist(c,bins=20,rwidth=0.9)
>>> plt.grid(axis='y',alpha=0.75)
>>> plt.show()
</source>
For more information, see matplotlib.pyplot.hist: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html
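The difference between raw counts and `density=True` can be checked numerically with `np.histogram`, which computes the same bins without drawing anything. This is a small sketch on the same eight values.

```python
import numpy as np

a = np.array([1, 2, 2, 3, 3, 4, 5, 6])

counts, edges = np.histogram(a, bins=6)               # raw counts per bin
density, _ = np.histogram(a, bins=6, density=True)    # normalized heights
widths = np.diff(edges)

print(counts)                      # how many values fall in each bin
print(counts / counts.sum())       # relative frequency per bin
print((density * widths).sum())    # the density integrates to 1
```

Note that a density bar's height equals the relative frequency divided by the bin width, so it only matches the probability directly when the bins are one unit wide.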
3.3 KDE
Kernel density estimation (Kernel Density Estimate, KDE) estimates an unknown probability density function. It is a nonparametric estimation method.
<source lang=python>
>>> np.random.normal(loc=(10,20),scale=(4,2),size=(5,2))
array([[15.87305077, 20.3740753 ],
       [14.40449246, 20.73788215],
       [12.51111038, 20.81289712],
       [ 9.55461887, 21.48781844],
       [-0.72336527, 18.81365079]])
>>> dist = pd.DataFrame(np.random.normal(loc=(10,20), scale=(4,2), size=(1000, 2)), columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
>>> fig, ax = plt.subplots()
>>> dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
>>> dist.plot.hist(density=True, ax=ax)
>>> ax.set_ylabel('Probability')
>>> ax.grid(axis='y')
>>> ax.set_facecolor('#d8dcd6')
</source>
<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

p = pd.read_csv('./data/da03-press.csv',index_col='time')
pp = p['Press']
pp.plot.hist(bins=150, rwidth=.9, density=True, color='C2', alpha=0.8)
pp.plot.kde(bw_method=0.1737, color='C1')
plt.ylabel('Probability'); plt.xlim(3200, 4200); plt.xlabel('hPa')    # positional limits work across matplotlib versions
plt.grid(linewidth=0.8)
plt.show()
#sns.distplot(pp, color="#ff8000")
#plt.show()
</source>
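bw_method is usually taken as n^(-1/5), where n is the number of samples. This is Scott's rule for one-dimensional data.
Further reading: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#Notes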
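To show what the bandwidth factor does, here is a minimal hand-written Gaussian KDE in NumPy. It is a sketch under stated assumptions rather than SciPy's implementation: the data are synthetic, `kde` is a hypothetical helper, and the kernel width is taken as the factor n^(-1/5) times the sample standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.hstack([rng.normal(-1.0, 1.0, 320), rng.normal(2.0, 0.6, 32)])

n = s.size
factor = n ** (-1 / 5)        # Scott's factor for 1-D data
h = factor * s.std(ddof=1)    # kernel bandwidth in data units

def kde(x):
    """Average of Gaussian bumps of width h centred on each sample."""
    z = (x[:, None] - s[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 801)
pdf = kde(x)
area = np.sum((pdf[1:] + pdf[:-1]) / 2 * np.diff(x))    # trapezoidal rule
print(f"factor={factor:.4f}  bandwidth={h:.4f}  area={area:.4f}")
```

A smaller factor gives a bumpier curve that follows individual samples, while a larger one smooths the two modes together. The area under the estimate stays close to 1 either way.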
<source lang=python>
>>> s1 = np.random.normal(-1.0, 1, 320)
>>> s2 = np.random.normal(2.0, 0.6, 32)
>>> s = np.hstack([s1, s2])
>>> pdf = stats.gaussian_kde(s)    # the stats.kde namespace is deprecated
>>> x = np.linspace(-5, 5, 200)
>>> plt.plot(x, pdf(x), 'r')
>>> plt.hist(s, density=True, alpha=0.45, color='purple')    # density replaces the removed normed argument
>>> plt.show()
</source>
For stats.norm.rvs(), ppf(), pdf(), and cdf(), see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
3.4 Bar chart
Counting how many times event A occurs each day already amounts to a frequency count in which each bin is a 24-hour window and the number of bins grows naturally over time. Data of this kind can be shown directly as a bar chart:
<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn-hb    # nationwide figures minus Hubei
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()
</source>
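Showing the Hubei and non-Hubei bars side by side:
<source lang=python>
xhb_cf = xhb['Confirmed'].values    # assumed definition; xhb_cf is not defined in the original snippet
plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
</source>
- Pandas KDE: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html
- X-axis label format: https://matplotlib.org/tutorials/introductory/lifecycle.html#sphx-glr-tutorials-introductory-lifecycle-py
- Using histograms to plot a cumulative distribution: https://matplotlib.org/gallery/statistics/histogram_cumulative.html?highlight=cdf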
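The daily counting described at the start of this section can be sketched with pandas resampling. The event log here is hypothetical and stands in for raw timestamps of event A.

```python
import pandas as pd

# Hypothetical event log: each timestamp is one occurrence of event A.
events = pd.to_datetime([
    "2020-02-01 08:15", "2020-02-01 21:40",
    "2020-02-03 00:05", "2020-02-03 12:00", "2020-02-03 23:59",
])

# One bin per calendar day; days without events get a count of 0.
daily = pd.Series(1, index=events).resample("D").sum()
print(daily)
```

The resulting series has one value per day, so it can be passed to a bar chart as is, for example with `daily.plot.bar()`.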
3.5 Reverse operation of value_counts()
<source lang=python>
>>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
>>> cc = col.value_counts()
>>> cc
3.0    3
1.0    2
2.0    1
>>> np.repeat(cc.index, cc)
Float64Index([3.0, 3.0, 3.0, 1.0, 1.0, 2.0], dtype='float64')
>>> pd.Series(np.repeat(cc.index, cc))
0    3.0
1    3.0
2    3.0
3    1.0
4    1.0
5    2.0
</source>
For multiple columns you can use:
<source lang=python>
>>> df.loc[df.index.repeat(df['Count'])]
</source>
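As a runnable version of the multi-column case, the sketch below expands a small counted table back into one row per observation. The DataFrame and its column names other than `Count` are made up for illustration.

```python
import pandas as pd

# Hypothetical counted table: one row per (city, grade) with a Count column.
df = pd.DataFrame({
    "city": ["A", "A", "B"],
    "grade": [1, 2, 1],
    "Count": [2, 1, 3],
})

expanded = (
    df.loc[df.index.repeat(df["Count"])]    # repeat each row Count times
      .drop(columns="Count")
      .reset_index(drop=True)
)
print(expanded)
```

Resetting the index at the end is optional. Without it the repeated rows keep their original labels, which can be handy for tracing them back to the counted table.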
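4 Hypothesis testing
The question to answer: will an effect observed in a sample also appear in the larger population?
Approaches:
- Fisher's null hypothesis testing
- Neyman-Pearson decision theory
- Bayesian inference
A subset common to these three approaches is classical hypothesis testing (Classical Hypothesis Testing).
Classical hypothesis testing (CHT) asks: how likely is it that an effect observed in a sample arose by chance? The steps are:
- Choose a test statistic (Test Statistic) that quantifies the observed effect.
- Define the null hypothesis (Null Hypothesis): the observed effect is not real, i.e. it arose by chance.
- Compute the p-value: the probability, assuming the null hypothesis is true, of seeing an effect at least as large as the one observed. It is not the probability that the null hypothesis is true.
- Interpret the result. If the p-value is low (conventionally below 5%), the effect is unlikely to have arisen by chance alone, and it is called statistically significant (Statistically Significant).
In essence this is proof by contradiction. The p-value is the tail probability of the test statistic under its null distribution (both tails for a two-tailed test).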
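The four steps can be sketched as a permutation test, which needs nothing beyond NumPy. This is one illustrative way to obtain a p-value, not the only one; the two groups are synthetic and the helper `mean_diff` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)
group_a = rng.normal(0.0, 1.0, 50)    # synthetic data, e.g. a control group
group_b = rng.normal(1.5, 1.0, 50)    # synthetic data with a shifted mean

def mean_diff(a, b):
    """Test statistic: absolute difference between the group means."""
    return abs(a.mean() - b.mean())

observed = mean_diff(group_a, group_b)

# Null hypothesis: the group labels do not matter, so shuffling them
# should produce differences as large as the observed one fairly often.
pooled = np.concatenate([group_a, group_b])
n_a, n_iter, hits = len(group_a), 2000, 0
for _ in range(n_iter):
    rng.shuffle(pooled)
    if mean_diff(pooled[:n_a], pooled[n_a:]) >= observed:
        hits += 1

p_value = (hits + 1) / (n_iter + 1)    # add-one keeps the estimate above zero
print(f"observed={observed:.3f}  p-value={p_value:.4f}")
```

Using the absolute difference makes this a two-tailed test: shuffled differences count as extreme whichever group ends up with the larger mean.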
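4.1 Normality tests
4.1.1 Q-Q plot
<source lang=python>
>>> np.random.seed(12345678)
>>> x = np.random.normal(5,3,100)
>>> stats.probplot(x, plot=plt); plt.show()
</source>
See scipy.stats.probplot(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html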
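4.1.2 Shapiro-Wilk
The Shapiro-Wilk W test is based on the covariance matrix of the order statistics of the observations. It was originally intended for small samples of 50 or fewer observations.
See scipy.stats.shapiro(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html
Return value: (W, p-value)
<source lang=python>
>>> np.random.seed(12345678)
>>> x = np.random.normal(5, 3, 100)
>>> np.random.seed()    # reseed without a fixed value, so y differs between runs
>>> y = np.random.normal(5, 3, 100)

>>> stats.shapiro(x)
(0.9772805571556091, 0.08144091814756393)
>>> stats.shapiro(y)
(0.9933551549911499, 0.9085326790809631)
</source>
p-value: the p-value for the hypothesis test, where the null hypothesis is that the data were drawn from a normal distribution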
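4.1.3 Kolmogorov-Smirnov
The Kolmogorov-Smirnov test checks whether sample data follow a given distribution. It applies only to continuous distributions. The example uses it to test against a normal distribution. Note that kstest(x, 'norm') compares against the standard normal N(0, 1). Because x and y were drawn from N(5, 3), the tiny p-values in the example reject N(0, 1), not normality in general. To test against N(5, 3), pass the parameters explicitly, e.g. stats.kstest(x, 'norm', args=(5, 3)).
<source lang=python>
>>> stats.kstest(x,'norm')
KstestResult(statistic=0.8801115630229508, pvalue=1.7157931366221766e-92)
>>> stats.kstest(y,'norm')
KstestResult(statistic=0.8168376836753909, pvalue=1.7239988712511043e-73)
</source>
See scipy.stats.kstest(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html
p-value: One-tailed or two-tailed p-value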
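The KS statistic is the largest vertical gap between the empirical CDF of the sample and the CDF of the reference distribution. The sketch below computes only that statistic (no p-value) for a normal reference, to show how much the choice of reference parameters matters. `ks_statistic` is a hypothetical helper and the sample is synthetic.

```python
import math
import numpy as np

def ks_statistic(sample, mean, std):
    """Largest gap between the empirical CDF and the N(mean, std) CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    cdf = np.array([0.5 * (1 + math.erf((v - mean) / (std * math.sqrt(2))))
                    for v in x])
    upper = np.max(np.arange(1, n + 1) / n - cdf)    # ECDF above the model
    lower = np.max(cdf - np.arange(0, n) / n)        # ECDF below the model
    return max(upper, lower)

rng = np.random.default_rng(0)
x = rng.normal(5, 3, 100)

print(ks_statistic(x, 0, 1))    # against N(0, 1): large distance
print(ks_statistic(x, 5, 3))    # against N(5, 3): small distance
```

The first value is large for the same reason the statistics in the example above are around 0.8: the sample is normal, just not standard normal.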
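4.1.4 Pearson omnibus
D'Agostino-Pearson omnibus test
See stats.normaltest(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html?highlight=omnibus
<source lang=python>
>>> stats.normaltest(x)
NormaltestResult(statistic=6.528044509508757, pvalue=0.03823430021917039)
>>> stats.normaltest(y)
NormaltestResult(statistic=0.7706971982031684, pvalue=0.6802134730639648)
</source>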
p-value: A 2-sided chi squared probability for the hypothesis test
5 Time series analysis
- pandas date handling: https://www.jianshu.com/p/b91e3ae940ec
- Numpy datetime as str: https://docs.scipy.org/doc/numpy/reference/generated/numpy.datetime_as_string.html
5.1 datetime
<source lang=python>
>>> d = {'date': 20200318, 'positive': 7731, 'death': 112}
>>> d['date']
20200318
>>> pd.to_datetime(d['date'], format='%Y%m%d')
Timestamp('2020-03-18 00:00:00')
</source>
5.2 datetime range
<source lang=python>
>>> x = pd.date_range('2020-1-9','2020-2-15',freq='1d')
>>> x.astype(str).tolist()    # convert to a list of strings
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15'],
              dtype='datetime64[ns]', freq='D')

>>> ii = np.arange('2020-01-15',5,1,dtype='M8[D]')
>>> ii
array(['2020-01-15', '2020-01-16', '2020-01-17', '2020-01-18',
       '2020-01-19'], dtype='datetime64[D]')
>>> iii = np.datetime_as_string(ii, unit='D')    # convert to an array of strings
>>> iii
array(['2020-01-15', '2020-01-16', '2020-01-17', '2020-01-18',
       '2020-01-19'], dtype='<U28')

>>> from datetime import datetime
>>> [datetime.strptime(d, '%Y-%m-%d').date() for d in iii]
[datetime.date(2020, 1, 15), datetime.date(2020, 1, 16), datetime.date(2020, 1, 17),
 datetime.date(2020, 1, 18), datetime.date(2020, 1, 19)]
</source>
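One pitfall is worth a short sketch: an integer such as 20200318 is read as nanoseconds since 1970-01-01 when no format is given, which is presumably how the 1970 index entry in the Pandas section came about. The example converts the integer to a string before parsing. That conversion is a cautious choice made for this sketch, not something the original snippet does.

```python
import pandas as pd

raw = 20200318    # date stored as an integer, as in d['date']

wrong = pd.to_datetime(raw)    # no format: read as nanoseconds since 1970-01-01
right = pd.to_datetime(str(raw), format="%Y%m%d")

print(wrong)    # a timestamp in 1970
print(right)    # 2020-03-18

# A short daily range rendered back to strings.
days = pd.date_range(right, periods=3, freq="D").strftime("%Y-%m-%d").tolist()
print(days)
```

`strftime` on a DatetimeIndex is an alternative to `astype(str)` when a specific output format is needed.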
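6 Pandas
Insert a row:
<source lang=python>
>>> idx = pd.to_datetime(d['date'],format='%Y%m%d')
>>> # NOTE: the second list element is missing in the source, so the next line does not parse as written
>>> us.loc[idx] = [d['positive'], , d['death'], 0]
</source>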
Delete rows:
<source lang=python>
>>> us.index[-1]
Timestamp('2020-03-18 00:00:00')
>>> us.index[[-1, -2]]
Index([2020-03-18 00:00:00, 1970-01-01 00:00:00.020200318], dtype='object', name='Date')

>>> us.drop(us.index[-2], inplace=True)         # drop the second-to-last row (the mis-parsed 1970 entry)
>>> us.drop(us.index[[-1,-2]], inplace=True)    # drop the last two rows
</source>
Add a column:
<source lang=python>
>>> u.columns
Index(['Confirmed', 'New Confirmed', 'Deaths', 'New Deaths'], dtype='object')
>>> u['New Col'] = 2
>>> u.columns
Index(['Confirmed', 'New Confirmed', 'Deaths', 'New Deaths', 'New Col'], dtype='object')
>>> u.tail()
                     Confirmed  New Confirmed  Deaths  New Deaths  New Col
Date
2020-03-15 00:00:00       3173            723      60          11        2
2020-03-16 00:00:00       4019            846      71          11        2
</source>
Delete a column:
<source lang=python>
>>> u.drop(['New Col'], axis=1, inplace=True)
>>> u.columns
Index(['Confirmed', 'New Confirmed', 'Deaths', 'New Deaths'], dtype='object')

>>> del u['Deaths']
>>> u.columns
Index(['Confirmed', 'New Confirmed', 'New Deaths'], dtype='object')
</source>
<source lang=python>
>>> nd = u.pop('New Deaths')
>>> nd.tail()
Date
2020-03-15    11
2020-03-16    11
Name: New Deaths, dtype: int64
>>> u.tail()
                     Confirmed  New Confirmed
Date
2020-03-15 00:00:00       3173            723
2020-03-16 00:00:00       4019            846
</source>
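Since the snippets above depend on DataFrames that are not shown, here is a self-contained sketch of the same four operations. The table is hypothetical: the column names and the first two rows follow the examples above, and the inserted row (5000, 90) is made up for illustration.

```python
import pandas as pd

# Hypothetical table; the inserted values below are invented.
u = pd.DataFrame(
    {"Confirmed": [3173, 4019], "Deaths": [60, 71]},
    index=pd.to_datetime(["2020-03-15", "2020-03-16"]),
)

u.loc[pd.Timestamp("2020-03-17")] = [5000, 90]    # insert a row by label
u["New Col"] = 2                                  # add a constant column
u = u.drop(u.index[-1])                           # drop the last row
deaths = u.pop("Deaths")                          # remove and return a column

print(u)
print(deaths)
```

`drop` returns a new DataFrame unless `inplace=True` is passed, whereas `pop` and `del` always modify the DataFrame in place.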
7 Reference
- Numpy API reference
- Pandas API reference
- matplotlib Gallery
- Change the Colors Changes to the default style
- matplotlib.pyplot.plot()
- matplotlib.pyplot.figure()
- Time Series Analysis Example
- Introduction to Data Science
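- Data Visualization tutorial: https://www.kaggle.com/residentmario/welcome-to-data-visualization/
- FlowingData Tutorials: https://flowingdata.com/category/tutorials/
  - How to Make Animated Line Charts in R: https://flowingdata.com/2017/07/17/how-to-make-animated-line-charts-in-r/
  - Relationships: The First Time…: https://flowingdata.com/2017/02/23/the-first-time/
- Flourish (https://flourish.studio/examples/), which also offers a line-chart version: Line chart race
- Macroeconomic database: https://www.ceicdata.com/zh-hans
- National Bureau of Statistics of China data: http://data.stats.gov.cn/ https://mp.weixin.qq.com/s/6t5Wz1PTbG_ZKD88QAFH5g
  - Statistical compilation for the sixty years of New China (新中国六十年统计资料汇编)
  - Statistical communiqués on national economic and social development for each province and city
  - China Statistical Yearbook and the provincial and municipal statistical yearbooks
  - Regional reports on budget implementation and budget announcements
- U.S. Census Bureau, Current Population Survey
- Race and Ethnicity of the U.S. Population: https://www.equityinhighered.org/indicators/u-s-population-trends-and-educational-attainment/race-and-ethnicity-of-the-u-s-population/
- Data Science Guide: https://datascienceguide.github.io/outline
- Principles of Epidemiology in Public Health Practice, Third Edition: https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section1.html (PDF: https://www.cdc.gov/csels/dsepd/ss1978/SS1978.pdf)
- Student's distribution original paper: http://seismo.berkeley.edu/~kirchner/eps_120/Odds_n_ends/Students_original_paper.pdf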