DA

From Jack's Lab
Revision as of 23:39, 17 February 2020 (Monday)


1 Overview


2 Descriptive statistics

2.1 Estimates of location

An intuitive example:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
>>> np.mean(d); stats.trim_mean(d, 0.2); np.median(d)
15.625
4.0
3.0

>>> plt.plot(d, 'o'); plt.show()
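
To make the trimmed mean less of a black box, here is a minimal sketch that reproduces it by hand: sort the values, drop a fixed fraction from each end, and average what is left. The helper name manual_trim_mean is hypothetical (it is not part of NumPy or SciPy), and it assumes the number of values cut at each end is rounded down, which matches the 4.0 reported by stats.trim_mean(d, 0.2) for this array.

```python
import numpy as np

def manual_trim_mean(values, proportiontocut):
    """Hypothetical helper: sort, drop a fraction from both ends, then average."""
    a = np.sort(np.asarray(values, dtype=float))
    k = int(proportiontocut * a.size)   # values dropped at each end (rounded down)
    return a[k:a.size - k].mean()

d = np.array([1, 2, 2, 100, 3, 3, 6, 8])
trimmed = manual_trim_mean(d, 0.2)      # drops 1 and 100, averages the other six
print(np.mean(d), trimmed, np.median(d))
```

With the single outlier 100 removed, the trimmed mean lands close to the median, while the plain mean is pulled far above both.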


A real-world example:

import pandas as pd
from scipy import stats

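# 28800000000000 ns = 8 h: shifts the (presumably UTC, nanosecond) timestamps to UTC+8.
# Note: date_parser is deprecated in pandas 2.0 and later.
p = pd.read_csv('../DA/data/da01-press.csv', index_col='time', date_parser=lambda x: pd.to_datetime(float(x)+28800000000000))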
p = p.drop(columns=['name'])
p.mean()
Press    3685.248525

stats.trim_mean(p, 0.1)   # same value as stats.trimboth(p['Press'], 0.1).mean()
array([3680.07826531])

p.median()
Press    3677.105

p.describe()
             Press
count   122.000000
mean   3685.248525
std     123.990939
min    3484.480000
25%    3618.402500
50%    3677.105000
75%    3747.742500
max    4672.060000


2.2 Estimates of variability

>>> d = np.array([3, 1, 5, 3, 15, 6, 7, 2])
>>> meanl = [np.mean(d)]*len(d); trimmeanl = [stats.trim_mean(d, 0.2)]*len(d); medianl = [np.median(d)]*len(d)
>>> plt.plot(d,'o',color='C1'); plt.plot(meanl, '-C2', label='Mean'); plt.plot(trimmeanl, ':r', label='Trim mean'); plt.plot(medianl, ':g', label='Median')
[<matplotlib.lines.Line2D object at 0x000000A9F50E60A0>]
[<matplotlib.lines.Line2D object at 0x000000A9F50E66A0>]
[<matplotlib.lines.Line2D object at 0x000000A9F50E6820>]
[<matplotlib.lines.Line2D object at 0x000000A9F50E66D0>]
>>> plt.legend(); plt.grid(); plt.show()
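
The plot above draws location estimates over the data. For the variability estimates this section is named after, a minimal sketch on the same array could look like the following. The choice of sample standard deviation, median absolute deviation (MAD), and interquartile range (IQR) is an assumption about what the section is meant to cover, not something stated on the page.

```python
import numpy as np

d = np.array([3, 1, 5, 3, 15, 6, 7, 2])

# Sample standard deviation (ddof=1): sensitive to the outlier 15
std = np.std(d, ddof=1)

# Median absolute deviation: median distance from the median, robust to outliers
mad = np.median(np.abs(d - np.median(d)))

# Interquartile range: spread of the middle 50% of the data
q25, q75 = np.percentile(d, [25, 75])
iqr = q75 - q25

print(std, mad, iqr)
```

For this array the standard deviation is about 4.43, the MAD is 2.0, and the IQR is 3.5. The two robust measures are barely affected by the single large value 15.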

3 Exploring the data distribution

3.1 bar

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdate

hb = pd.read_csv("../DA/data/ncp-hb-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
cn = pd.read_csv("../DA/data/ncp-cn-new.csv", index_col='Date', parse_dates=True, skipinitialspace=True)
xhb = cn-hb
plt.gca().xaxis.set_major_formatter(mdate.DateFormatter('%m-%d'))
#plt.bar(hb.index, hb['Confirmed'].values)
plt.bar(xhb.index, xhb['Confirmed'].values)
plt.show()

Showing the Hubei and non-Hubei bars side by side:

xhb_cf = xhb['Confirmed'].values   # assumed definition: confirmed cases outside Hubei
plt.bar(xhb.index, xhb_cf, align='edge', width=0.3, label='Outside Hubei')
plt.bar(hb.index, hb['Confirmed'].values, align='edge', width=-0.4, label='Hubei')
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()


4 Time series analysis

>>> x = pd.date_range('2020-1-9','2020-2-15',freq='1d')
>>> print(x)
DatetimeIndex(['2020-01-09', '2020-01-10', '2020-01-11', '2020-01-12',
               '2020-01-13', '2020-01-14', '2020-01-15', '2020-01-16',
               '2020-01-17', '2020-01-18', '2020-01-19', '2020-01-20',
               '2020-01-21', '2020-01-22', '2020-01-23', '2020-01-24',
               '2020-01-25', '2020-01-26', '2020-01-27', '2020-01-28',
               '2020-01-29', '2020-01-30', '2020-01-31', '2020-02-01',
               '2020-02-02', '2020-02-03', '2020-02-04', '2020-02-05',
               '2020-02-06', '2020-02-07', '2020-02-08', '2020-02-09',
               '2020-02-10', '2020-02-11', '2020-02-12', '2020-02-13',
               '2020-02-14', '2020-02-15'],
              dtype='datetime64[ns]', freq='D')
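
As a minimal sketch of what a DatetimeIndex like the one above makes possible, the example below attaches it to a Series and applies a few common time series operations. The values are synthetic (a simple daily counter) and are an assumption made purely for illustration; they are not data from this page.

```python
import numpy as np
import pandas as pd

x = pd.date_range('2020-1-9', '2020-2-15', freq='D')

# Synthetic values (assumption for illustration): 0, 1, 2, ... one per day
s = pd.Series(np.arange(len(x), dtype=float), index=x)

daily_change = s.diff()             # difference from the previous day
weekly_mean = s.rolling(7).mean()   # 7-day moving average; first 6 entries are NaN
february = s.loc['2020-02']         # partial-string indexing selects one month

print(len(x), daily_change.iloc[1], weekly_mean.iloc[6], len(february))
```

Because the index carries real dates, selection by month and window-based smoothing need no manual bookkeeping of positions.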


5 Reference




