DA
== Overview ==
* [https://blog.csdn.net/qq_32412759/article/details/77774286 Statistical analysis with Python]
* [https://www.jianshu.com/p/933f45ba36fb Statistical data analysis in Python]
* [https://www.jianshu.com/p/ffa8c60ccbc3 Descriptive statistics]
* [https://www.jianshu.com/p/98061b97e485 Sampling methods and sampling distributions]
* [https://www.jianshu.com/p/44041c4ba9e4 Parameter estimation]
* [https://www.jianshu.com/p/3e094e12c906 Analysis of variance]
* [https://www.jianshu.com/p/f899312ee01d Cluster analysis]
* [https://www.jianshu.com/p/59e685d96970 Principal component analysis]
* [https://www.jianshu.com/p/bdebc1700ceb Linear regression]
* [https://www.coursera.org/learn/exploratory-data-analysis EDA coursera]
<br>

== Descriptive statistics ==
=== Estimates of location ===
A toy example:
<source lang=python>
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> d = np.array([1, 2, 2, 100, 3, 3, 6, 8])   # 100 is an outlier
>>> np.mean(d); stats.trim_mean(d, 0.2); np.median(d)
15.625
4.0
3.0
>>> plt.plot(d, 'o'); plt.show()
</source>
A real data set:
<source lang=python>
>>> import pandas as pd
>>> from scipy import stats
>>> # The parser adds 28800000000000 ns (8 hours) to each epoch timestamp before conversion.
>>> # Note: date_parser is deprecated since pandas 2.0.
>>> p = pd.read_csv('../DA/data/da01-press.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> p = p.drop(columns=['name'])
>>> p.mean()
Press    3685.248525
>>> stats.trim_mean(p, 0.1)   # same as stats.trimboth(p['Press'], 0.1).mean()
array([3680.07826531])
>>> p.median()
Press    3677.105
>>> p.describe()
             Press
count   122.000000
mean   3685.248525
std     123.990939
min    3484.480000
25%    3618.402500
50%    3677.105000
75%    3747.742500
max    4672.060000
</source>
* [https://numpy.org/doc/1.18/reference/routines.statistics.html NumPy Statistics]
* [http://docs.scipy.org/doc/scipy/reference/stats.html SciPy Statistical functions]
* [https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=statistics#descriptive-statistics Pandas Statistics]
<br>

=== Estimates of variability ===
<source lang=python>
>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> # Constant arrays so that each statistic is drawn as a horizontal line
>>> meanl = np.array([np.mean(d)]*len(d))
>>> trimmeanl = np.array([stats.trim_mean(d, 0.2)]*len(d))
>>> medianl = np.array([np.median(d)]*len(d))
>>> iqrv = np.array([stats.iqr(d)]*len(d))
>>> down = medianl - iqrv; up = medianl + iqrv   # band of median ± IQR
>>> plt.plot(d, 'o', color='C1'); plt.plot(meanl, ':C2', label='Mean'); plt.plot(trimmeanl, ':r', label='Trim mean'); plt.plot(medianl, '-g', label='Median')
>>> plt.plot(up, '-C1'); plt.plot(down, '-C1')
>>> plt.legend(); plt.grid(); plt.show()
</source>
<br>

=== Estimates of correlation ===
<source lang=python>
>>> t1 = pd.read_csv('../DA/data/da02-temp-0948.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> t2 = pd.read_csv('../DA/data/da02-temp-0019.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> plt.plot(t1.index, t1['Temp'], label='t1')
>>> plt.plot(t2.index, t2['Temp'], label='t2')
>>> # t3 is not defined anywhere on this page; it must be defined before this line can run
>>> plt.plot(t1['Temp'].index, t3, label='t3')
>>> plt.legend(); plt.show()
</source>
<br>

== Exploring the data distribution ==
=== Frequency counts ===
<source lang=python>
>>> import pandas as pd
>>> a = pd.Series([0.1, 1.2, 1.2, 2.1, 2.1, 3, 2])
>>> a.value_counts()
2.1    2
1.2    2
0.1    1
2.0    1
3.0    1
>>> a.value_counts(normalize=True)
2.1    0.285714
1.2    0.285714
0.1    0.142857
2.0    0.142857
3.0    0.142857
</source>
A more advanced approach is to use pandas.cut() to count values per interval:
<source lang=python>
>>> ag = pd.Series([1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)   # intervals are closed on the right: (0, 10], (10, 13], ...
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> grp = pd.cut(ag, bins=bins, labels=labels)
>>> grp
0            child
1            child
2            child
3            child
4            child
5            child
6          preteen
7             teen
8             teen
9             teen
10    military_age
11    military_age
12           adult
13           adult
14           adult
15           adult
16           adult
dtype: category
Categories (5, object): [child < preteen < teen < military_age < adult]
>>> grp.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
</source>
<br>

=== Histogram ===
<source lang=python>
>>> import pandas as pd
>>> a = pd.Series([1,2,2,3,3,4,5,6])
>>> a.value_counts()
3    2
2    2
6    1
5    1
4    1
1    1
# Histogram of how often each value occurs
>>> a.plot.hist(bins=6, rwidth=0.9)
# Histogram of relative frequency (count / total)
>>> a.value_counts(normalize=True)
3    0.250
2    0.250
6    0.125
5    0.125
4    0.125
1    0.125
# density=True normalizes the histogram so that its total area is 1. This is similar in spirit to
# value_counts(normalize=True), but bar heights equal the relative frequencies only when the bin width is 1.
>>> a.plot.hist(bins=6, rwidth=0.9, density=True)
>>> plt.show()
</source>
<source lang=python>
>>> c = pd.Series(np.random.gamma(10, size=1000)**1.5)
>>> c.plot.hist(grid=True, bins=20, rwidth=0.9)   # equivalent: plt.hist(c, bins=20, rwidth=0.9)
>>> plt.grid(axis='y', alpha=0.75)
>>> plt.show()
</source>
For more information, see [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html matplotlib.pyplot.hist]
<br>

=== KDE ===
Kernel density estimation (KDE) estimates an unknown probability density function from samples. It is a non-parametric method.
<source lang=python>
>>> np.random.normal(loc=(10,20), scale=(4,2), size=(5,2))
array([[15.87305077, 20.3740753 ],
       [14.40449246, 20.73788215],
       [12.51111038, 20.81289712],
       [ 9.55461887, 21.48781844],
       [-0.72336527, 18.81365079]])
>>> dist = pd.DataFrame(np.random.normal(loc=(10,20), scale=(4,2), size=(1000, 2)), columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
>>> fig, ax = plt.subplots()
>>> dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
>>> dist.plot.hist(density=True, ax=ax)
>>> ax.set_ylabel('Probability')
>>> ax.grid(axis='y')
>>> ax.set_facecolor('#d8dcd6')
</source>
<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

p = pd.read_csv('./data/da03-press.csv', index_col='time')
pp = p['Press']
pp.plot.hist(bins=150, rwidth=.9, density=True, color='C2', alpha=0.8)
pp.plot.kde(bw_method=0.1737, color='C1')
# Limits are passed positionally; the xmin/xmax keywords were removed from recent matplotlib releases
plt.ylabel('Probability'); plt.xlim(3200, 4200); plt.xlabel('hPa')
plt.grid(linewidth=0.8)
plt.show()
#sns.distplot(pp, color="#ff8000")   # distplot is deprecated in recent seaborn releases
#plt.show()
</source>
'''bw_method''' is typically set to n^(-1/5), where n is the number of samples (Scott's rule for one-dimensional data). For details, see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#Notes
<source lang=python>
>>> s1 = np.random.normal(-1.0, 1, 320)
>>> s2 = np.random.normal(2.0, 0.6, 32)
>>> s = np.hstack([s1, s2])
>>> pdf = stats.gaussian_kde(s)   # the stats.kde namespace is deprecated
>>> x = np.linspace(-5, 5, 200)
>>> plt.plot(x, pdf(x), 'r')
>>> plt.hist(s, density=True, alpha=0.45, color='purple')   # density replaces the removed normed argument
>>> plt.show()
</source>
stats.norm.rvs(), ppf(), pdf(): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
<br>

=== Bar chart ===
Counting how many times an event A occurs each day is already a frequency count: each window is 24 hours long, and the number of bins grows naturally as days pass. Data of this kind can be shown directly as a bar chart:
<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn - hb   # China total minus Hubei = outside Hubei
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()
</source>
To show Hubei and the rest of China side by side:
<source lang=python>
xhb_cf = xhb['Confirmed'].values   # assumed definition; the original page does not define xhb_cf
plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
</source>
* [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.kde.html Pandas KDE]
* [https://matplotlib.org/tutorials/introductory/lifecycle.html#sphx-glr-tutorials-introductory-lifecycle-py X-axis label formatting]
* [https://matplotlib.org/gallery/statistics/histogram_cumulative.html?highlight=cdf Using histograms to plot a cumulative distribution]
<br>

=== Reverse operation of value_counts() ===
<source lang=python>
>>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
>>> cc = col.value_counts()
>>> cc
3.0    3
1.0    2
2.0    1
>>> np.repeat(cc.index, cc)
Float64Index([3.0, 3.0, 3.0, 1.0, 1.0, 2.0], dtype='float64')
>>> pd.Series(np.repeat(cc.index, cc))
0    3.0
1    3.0
2    3.0
3    1.0
4    1.0
5    2.0
</source>
For multiple columns you can use:
<source lang=python>
>>> df.loc[df.index.repeat(df['Count'])]
</source>
<br>

== Time series analysis ==
* [https://www.jianshu.com/p/b91e3ae940ec Date handling in pandas]
* [https://docs.scipy.org/doc/numpy/reference/generated/numpy.datetime_as_string.html Numpy datetime as str]
<source lang=python>
>>> x = pd.date_range('2020-1-9', '2020-2-15', freq='1d')
>>> x.astype(str).tolist()   # convert to a list of strings
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15'],
              dtype='datetime64[ns]', freq='D')
>>> ii = np.arange('2020-01-15', 5, 1, dtype='M8[D]')
>>> ii
array(['2020-01-15', '2020-01-16', '2020-01-17', '2020-01-18',
       '2020-01-19'], dtype='datetime64[D]')
>>> np.datetime_as_string(ii, unit='D')   # convert to an array of strings
>>> # From here on, ii holds date strings at a 5-day step; the page does not show how it was rebuilt
>>> ii
array(['2020-01-20', '2020-01-25', '2020-01-30', '2020-02-04',
       '2020-02-09', '2020-02-14', '2020-02-19'], dtype='<U28')
>>> from datetime import datetime
>>> [datetime.strptime(d, '%Y-%m-%d').date() for d in ii]
[datetime.date(2020, 1, 20), datetime.date(2020, 1, 25), datetime.date(2020, 1, 30), datetime.date(2020, 2, 4), datetime.date(2020, 2, 9), datetime.date(2020, 2, 14), datetime.date(2020, 2, 19)]
</source>
<br>

== Reference ==
* [https://numpy.org/doc/1.18/reference/ Numpy API reference]
* [https://pandas.pydata.org/pandas-docs/stable/reference/index.html Pandas API reference]
* [https://matplotlib.org/gallery/index.html matplotlib Gallery]
* [https://matplotlib.org/users/dflt_style_changes.html Change the Colors] Changes to the default style
* [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html matplotlib.pyplot.plot()]
* [https://matplotlib.org/api/_as_gen/matplotlib.pyplot.figure.html matplotlib.pyplot.figure()]
* [https://www.kaggle.com/kp4920/s-p-500-stock-data-time-series-analysis Time Series Analysis Example]
* [https://www.kaggle.com/usengecoder/introduction-to-data-science Introduction to Data Science]
* [https://www.kaggle.com/residentmario/welcome-to-data-visualization/ Data Visualization tutorial]
* [https://flowingdata.com/category/tutorials/ FlowingData Tutorials]
* [https://datascienceguide.github.io/outline Data Science Guide]
<br><br>
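The location-estimate section (位置估计) calls stats.trim_mean(d, 0.2) on the toy array and reports 4.0. As a rough illustration of what that call does, the sketch below sorts the values, drops int(proportion * n) of them from each end, and averages the rest. The helper name trimmed_mean is hypothetical and not part of SciPy; it is only meant to mirror the documented behaviour for one-dimensional input.

```python
import numpy as np
from scipy import stats


def trimmed_mean(values, proportion):
    """Hypothetical helper: mean after dropping int(proportion * n) values from each end."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)          # number of values cut from each tail
    return ordered[k:ordered.size - k].mean()


d = np.array([1, 2, 2, 100, 3, 3, 6, 8])        # same toy array as in the page
manual = trimmed_mean(d, 0.2)                    # drops 1 and 100, averages the other six
reference = stats.trim_mean(d, 0.2)              # SciPy's result for comparison
print(manual, reference)
```

With eight values and a proportion of 0.2, one value is cut from each tail, so the outlier 100 no longer affects the result.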
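The correlation section (相关性估计) loads two temperature series and plots them, but it does not compute a correlation coefficient. The sketch below shows one way to do that with the Pearson coefficient. The two series are synthetic stand-ins, since the CSV files are not available here; the sine-shaped signal, the noise level, and the date range are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two temperature sensors (hypothetical data).
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', periods=200, freq='D')
base = 20 + 5 * np.sin(np.linspace(0, 4 * np.pi, len(idx)))   # shared underlying signal
t1 = pd.Series(base + rng.normal(0, 0.5, len(idx)), index=idx, name='Temp')
t2 = pd.Series(base + rng.normal(0, 0.5, len(idx)), index=idx, name='Temp')

# Pearson correlation via pandas; the two series are aligned on their shared index.
r_pandas = t1.corr(t2)

# The same coefficient computed with NumPy for comparison.
r_numpy = np.corrcoef(t1.to_numpy(), t2.to_numpy())[0, 1]

print(f'Pearson r = {r_pandas:.3f}')
```

Series.corr aligns on the index before computing the coefficient, which matters when two sensors report at slightly different timestamps; np.corrcoef assumes the arrays are already aligned.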
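The KDE section notes that bw_method is typically n^(-1/5). For one-dimensional data this matches the default of scipy.stats.gaussian_kde, which uses Scott's rule with factor n^(-1/(d+4)). The sketch below checks that on synthetic data; the sample size and the normal distribution parameters are assumptions and do not come from the page's CSV file.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1000
samples = rng.normal(loc=3685.0, scale=124.0, size=n)   # synthetic 1-D data

kde_default = stats.gaussian_kde(samples)                # Scott's rule is the default
scott = n ** (-1 / 5)                                    # n^(-1/(d+4)) with d = 1
kde_manual = stats.gaussian_kde(samples, bw_method=scott)

# The bandwidth factor chosen by default should equal the hand-computed value.
print(kde_default.factor, scott)

# Both estimators should therefore produce the same density curve.
x = np.linspace(samples.min(), samples.max(), 200)
max_diff = np.max(np.abs(kde_default(x) - kde_manual(x)))
print(max_diff)
```

A scalar bw_method smaller than this factor gives a bumpier curve that follows the histogram more closely; a larger one gives a smoother curve.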