DA
From Jack's Lab
Revision as of 23:39, 17 February 2020 (Monday)
1 Overview
2 Descriptive statistics
2.1 Estimates of location
A toy example:
<source lang=python>
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> d = np.array([1, 2, 2, 100, 3, 3, 6, 8])    # 100 is an obvious outlier
>>> np.mean(d); stats.trim_mean(d, 0.2); np.median(d)
15.625
4.0
3.0
>>> plt.plot(d, 'o'); plt.show()
</source>
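To make the trimmed mean less of a black box, here is a minimal sketch of what it computes for a one-dimensional array: sort the values, cut the same fraction from each end, and average what is left. The helper name `manual_trim_mean` is made up for illustration and is not part of SciPy.

```python
import numpy as np

def manual_trim_mean(values, proportion):
    """Drop `proportion` of the sorted values from each end, then average."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * ordered.size)   # number of points cut from each tail
    return ordered[k:ordered.size - k].mean()

d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
plain = np.mean(d)                    # pulled upward by the outlier 100
trimmed = manual_trim_mean(d, 0.2)    # 8 values, so 1 is cut from each end
middle = np.median(d)
print(plain, trimmed, middle)
```

With 8 values and a proportion of 0.2, one value is removed from each tail (1 and 100), which is why the trimmed mean lands close to the median while the plain mean does not.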
A real-world example:
<source lang=python>
>>> import pandas as pd
>>> from scipy import stats
>>> # 28800000000000 ns = 8 hours; the parser shifts the raw nanosecond timestamps by that amount.
>>> # Note: the date_parser argument is deprecated since pandas 2.0.
>>> p = pd.read_csv('../DA/data/da01-press.csv', index_col='time',
...                 date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))
>>> p = p.drop(columns=['name'])
>>> p.mean()
Press    3685.248525
>>> stats.trim_mean(p, 0.1)    # stats.trimboth(p['Press'],0.1).mean()
array([3680.07826531])
>>> p.median()
Press    3677.105
>>> p.describe()
             Press
count   122.000000
mean   3685.248525
std     123.990939
min    3484.480000
25%    3618.402500
50%    3677.105000
75%    3747.742500
max    4672.060000
</source>
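The timestamp handling above can also be done after loading instead of inside `read_csv`. The sketch below assumes the `time` column holds integer nanoseconds since the Unix epoch and that the 8-hour shift is meant as a UTC+8 offset; both are guesses from the code, and the raw values are made up.

```python
import pandas as pd

# Hypothetical raw values: nanoseconds since the Unix epoch, stored as integers.
raw = pd.Series([0, 3_600_000_000_000], name="time")

# Interpret the integers as nanoseconds, then shift by 8 hours
# (the same amount as the 28800000000000 ns constant in the original code).
local = pd.to_datetime(raw, unit="ns") + pd.Timedelta(hours=8)
print(local.tolist())
```

Working with integers also avoids going through `float`, which cannot represent every nanosecond value exactly for present-day epochs.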
2.2 Estimates of variability
<source lang=python>
>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> # Constant lists, so each estimate is drawn as a horizontal line across the data points
>>> meanl = [np.mean(d)]*len(d); trimmeanl = [stats.trim_mean(d, 0.2)]*len(d); medianl = [np.median(d)]*len(d)
>>> plt.plot(d,'o',color='C1'); plt.plot(meanl, '-C2', label='Mean'); plt.plot(trimmeanl, ':r', label='Trim mean'); plt.plot(medianl, ':g', label='Median')
>>> plt.legend(); plt.grid(); plt.show()
</source>
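The plot above only shows where the location estimates sit relative to the points. As a possible companion, the sketch below computes a few common variability estimates for the same array with NumPy alone. The choice of estimates is mine, not the page author's.

```python
import numpy as np

d = np.array([3, 1, 5, 3, 15, 6, 7, 2])

sample_var = np.var(d, ddof=1)                    # sample variance (n - 1 denominator)
sample_std = np.std(d, ddof=1)                    # sample standard deviation
mean_abs_dev = np.mean(np.abs(d - np.mean(d)))    # mean absolute deviation from the mean
mad = np.median(np.abs(d - np.median(d)))         # median absolute deviation from the median
q1, q3 = np.percentile(d, [25, 75])
iqr = q3 - q1                                     # interquartile range

print(sample_var, sample_std, mean_abs_dev, mad, iqr)
```

The variance and standard deviation are sensitive to the value 15, while the median absolute deviation and the interquartile range are much less affected by it, mirroring the mean versus median behaviour in the previous section.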
3 Exploring the data distribution
3.1 Bar chart
<source lang=python>
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

# Daily figures for Hubei (hb) and for the whole country (cn), indexed by date
hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)

# Whole country minus Hubei = outside Hubei
xhb = cn-hb

# Show dates on the x axis as month-day
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()
</source>
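The subtraction `cn-hb` relies on pandas aligning the two frames by their date index before subtracting. A small sketch with made-up placeholder numbers (not real case counts) shows what that means when one frame is missing a date:

```python
import pandas as pd

dates = pd.date_range("2020-02-01", periods=3, freq="D")

# Placeholder values for illustration only
cn = pd.DataFrame({"Confirmed": [100, 120, 150]}, index=dates)
hb = pd.DataFrame({"Confirmed": [60, 70]}, index=dates[:2])   # last date missing

# Rows are matched by index label, not by position
xhb = cn - hb
print(xhb)
```

Dates present in only one of the frames come out as NaN, so it may be worth checking `xhb.isna().sum()` before plotting if the two CSV files do not cover exactly the same days.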
Showing the Hubei and non-Hubei bars side by side:
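<source lang=python>
xhb_cf = xhb['Confirmed'].values    # confirmed cases outside Hubei
# align='edge' with a positive width draws the bar to the right of the tick,
# a negative width draws it to the left, so the two series sit next to each other
plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
</source>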
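For trying the side-by-side layout without the CSV files, the following self-contained sketch uses made-up placeholder values and the off-screen `Agg` backend. It assumes nothing about the real data beyond a daily date index.

```python
import matplotlib
matplotlib.use("Agg")   # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2020-02-01", periods=3, freq="D")
hubei = [60, 70, 90]      # placeholder values for illustration only
outside = [40, 50, 60]    # placeholder values for illustration only

fig, ax = plt.subplots()
# Negative width + align='edge': bar ends at the tick; positive width: bar starts at it
left = ax.bar(dates, hubei, width=-0.4, align="edge", label="Hubei")
right = ax.bar(dates, outside, width=0.4, align="edge", label="Outside Hubei")
ax.legend()
fig.autofmt_xdate()
fig.canvas.draw()         # render once; swap for plt.show() in an interactive session
```

On a date axis the bar width is measured in days, so two bars of width 0.4 occupy 0.8 of each one-day slot.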
4 Time series analysis
<source lang=python>
>>> x = pd.date_range('2020-1-9','2020-2-15',freq='1d')
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15'],
              dtype='datetime64[ns]', freq='D')
</source>
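Once such a `DatetimeIndex` exists, it can serve as the index of a Series, which is what enables date-based selection and windowed statistics. The sketch below attaches placeholder values (simply 0 to 37) to the same date range. The operations shown are common first steps, not something the page prescribes.

```python
import numpy as np
import pandas as pd

x = pd.date_range("2020-1-9", "2020-2-15", freq="1d")

# Placeholder values 0..37, one per day, for illustration only
s = pd.Series(np.arange(len(x), dtype=float), index=x)

feb = s.loc["2020-02"]          # partial-string indexing: every day in February 2020
weekly = s.rolling(7).mean()    # 7-day moving average (NaN until 7 days are available)
daily_change = s.diff()         # day-over-day difference

print(len(x), len(feb))
print(weekly.iloc[5:8])
```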