Pythonic “Data Science” Specialization

jasperyang, posted 2019-07-24 17:58 / 1200 reads


Why The "Data Science" Specialization

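Review statistics in preparation for deeper study
In his 2015 GTC talk, Andrew Ng said that deep learning is black magic: we understand 50% of it, but don't know how the other 50% works. Sitting in the audience, I thought that of the 50% that can be understood, I seem to grasp only 5%.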

Learn from a "standard, efficient" workflow
Mine: Emacs Org mode + Emacs Magit + Bitbucket + Python. There must be some room for improvement.

How

The courses use R. I don't want to learn yet another similar language, so I will look for the corresponding NumPy and SciPy solutions.

Getting and Cleaning Data

Sources of raw data

Website APIs

Databases

Json

Raw texts

The data analysis workflow

Raw data --> Processing scripts --> tidy data (often ignored in the classes but really important)

Record the meta data

Record the recipes

--> data analysis (covered in machine learning classes)

--> data communication

What is tidy data

Each variable you measure should be in one column.

There should be one table for each "kind" of variable; generally, data should be saved in one file per table. Why? Wouldn't that be a hassle to manage?

If you have multiple tables, they should include a column that allows them to be linked. See DataFrame.merge and DataFrame.join in pandas.
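A minimal sketch of linking two tables through a shared column with DataFrame.merge. The table contents and the column name subject_id are made up for illustration and do not come from the course.

```python
import pandas as pd

# Two hypothetical tables, one per "kind" of variable.
subjects = pd.DataFrame({"subject_id": [1, 2, 3], "age": [34, 51, 28]})
visits = pd.DataFrame({"subject_id": [1, 1, 3], "weight_kg": [70.5, 71.0, 64.2]})

# The shared subject_id column is what allows the tables to be linked.
linked = visits.merge(subjects, on="subject_id", how="left")
print(linked)
```

Keeping the tables separate avoids repeating the subject's age on every visit row in storage; the merge rebuilds the wide view only when it is needed.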

The code book

A code book? (⊙o⊙)…

Info about the variables (including units!)
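Units matter! A measurement without units has no physical meaning!
However, the course never mentions significant figures, which must be considered when taking measurements. Perhaps that is because Python and R handle numeric precision well, so there is no need to choose between float and double as in C? In some extreme cases a library like SymPy would probably still be needed.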
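A small sketch bearing on the precision question above. A Python float is a double-precision binary number, so rounding error does not go away; the standard-library decimal module is one option when exact decimal arithmetic matters. This only illustrates floating-point precision and is not a treatment of significant figures in measurements.

```python
import math
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
total = 0.1 + 0.2
exact_match = (total == 0.3)
close_enough = math.isclose(total, 0.3)

# Decimal built from strings performs exact decimal arithmetic.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(exact_match, close_enough, decimal_total)
```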

Info about the summary choice you made

Info about the experimental study design you used

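The code book plays a role similar to the lab notebook in a wet lab. I am glad I learned about Emacs Org mode early on; it fits well here. But I had overlooked the importance of "Info about the variables".

When there are many features and the features themselves carry deep meaning, they need to be chosen carefully. I remember a talk in which a financial company used decision trees to build portfolios. The algorithm itself was nothing special, but the lecturer kept tight-lipped about which features were actually used.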

"There are many stages to the design and analysis of a successful study. The last of these steps is the calculation of an inferential statistic such as a P value, and the application of a "decision rule" to it (for example, P < 0.05). In practice, decisions that are made earlier in data analysis have a much greater impact on results — from experimental design to batch effects, lack of adjustment for confounding factors, or simple measurement error. Arbitrary levels of statistical significance can be achieved by changing the ways in which data are cleaned, summarized or modelled."

Leek, Jeffrey T., and Roger D. Peng. "Statistics: P values are just the tip of the iceberg." Nature 520.7549 (2015): 612-612.

Downloading Files

I usually just use wget, but that is hard to integrate into a script. Some Python functions that are likely to be useful when downloading:

# set up the env
os.path.dirname(os.path.realpath(__file__))
os.getcwd()
os.path.join()
os.chdir()
os.path.exists()
os.makedirs()

# download
urllib.request.urlretrieve()
urllib.request.urlopen()

# to tag your downloaded files
datetime.timezone()
datetime.datetime.now()

# an example
import shutil
import ssl
import urllib.request as ur

def download(myurl):
    """
    download to the current directory
    """
    fn = myurl.split("/")[-1]
    # NOTE: this disables TLS certificate verification; use only for hosts you trust
    context = ssl._create_unverified_context()
    with ur.urlopen(myurl, context=context) as response, open(fn, "wb") as out_file:
        shutil.copyfileobj(response, out_file)

    return fn
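The functions listed above for setting up the environment and tagging files could be combined roughly as follows. This is a sketch under assumptions: the helper name tagged_path and the timestamp format are hypothetical choices, not part of the original notes.

```python
import os
from datetime import datetime, timezone


def tagged_path(directory, filename, when=None):
    """Return directory/<UTC timestamp>_<filename>, creating directory if needed."""
    when = when or datetime.now(timezone.utc)
    # exist_ok=True makes the call safe when the directory already exists.
    os.makedirs(directory, exist_ok=True)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return os.path.join(directory, f"{stamp}_{filename}")
```

For example, tagged_path("data", "restaurants.xml") returns a path inside a data directory whose file name starts with the download time, which records when the raw data was fetched.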

Loading flat files

pandas.read_csv()
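A minimal illustration of pandas.read_csv. The CSV content and column names are invented for the example; in practice a file path or URL would be passed instead of an in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded flat file.
csv_text = "name,zipcode\nalpha,21231\nbeta,21202\n"

# dtype keeps zip codes as strings so that leading zeros would not be lost.
flat = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})
print(flat)
```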

Reading XML

Here is a very good introduction

Below are my summaries:

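The Python standard library ships with xml.etree.ElementTree for parsing XML. In it, ElementTree represents the whole XML document and Element represents a single node.

The first element in every XML document is called the root element. An XML document can have only one root, so a document with two or more top-level elements is not well-formed XML.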
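To illustrate the single-root rule, the sketch below parses one well-formed string and one string with two top-level elements. Both XML snippets are made up for the example.

```python
import xml.etree.ElementTree as ET

good = "<root><a/><b/></root>"   # exactly one root element
bad = "<a/><b/>"                 # two top-level elements

root = ET.fromstring(good)

try:
    ET.fromstring(bad)
    bad_is_well_formed = True
except ET.ParseError:
    # The parser rejects content after the first root element.
    bad_is_well_formed = False

print(root.tag, bad_is_well_formed)
```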

Recursive traversal

# an exercise
# find all elements whose zipcode equals 21231
import xml.etree.ElementTree as ET

xml_fn = download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
tree = ET.parse(xml_fn)
# iter() walks the whole tree recursively, starting from the root
for child in tree.iter():
    if child.tag == "zipcode" and child.text == "21231":
        print(child)

JSON

JSON stands for JavaScript Object Notation

lightweight data storage

To the eye, JSON looks like a nested Python dict. Python's built-in json module is used much like pickle.
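A short sketch of the json module's pickle-like interface, using a made-up record. json.dumps and json.loads work on strings; json.dump and json.load are the file-based counterparts, mirroring pickle.dump and pickle.load.

```python
import json

# Hypothetical nested record; JSON maps naturally onto dicts and lists.
record = {"name": "alpha", "tags": ["a", "b"], "location": {"zipcode": "21231"}}

text = json.dumps(record)     # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(restored["location"]["zipcode"])
```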

Pattern Matching

Python makes a distinction between matching and searching. Matching looks only at the start of the target string, whereas searching looks for the pattern anywhere in the target.
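The difference between matching and searching can be seen with re.match and re.search on the same target string (the string itself is just an example):

```python
import re

target = "data science"

# match only succeeds if the pattern occurs at the start of the string.
at_start = re.match(r"science", target)

# search scans the whole string for the first occurrence.
anywhere = re.search(r"science", target)

print(at_start, anywhere.span())
```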

Always use raw strings for regexes.

Character sets
Something like r"[A-Za-z_]" would match an underscore or any uppercase or lowercase ASCII letter.

Most characters that have special meanings elsewhere in a regular expression lose them within square brackets. The exceptions are ^, which negates the set only when it is the first character after the opening bracket; -, which denotes a range when placed between two characters; ], which closes the set; and the backslash, which still escapes.
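A few character-set cases, sketched with re.findall on made-up input strings:

```python
import re

# Letters and underscore only; the digit and the hyphen are skipped.
identifier_chars = re.findall(r"[A-Za-z_]", "a_1-B")

# Inside brackets, . and + are ordinary characters.
literal_dot_plus = re.findall(r"[.+]", "1.5+x")

# A leading ^ negates the set.
not_digits = re.findall(r"[^0-9]", "a1b2")

# A ^ that is not first is an ordinary character.
caret_literal = re.findall(r"[0-9^]", "2^3")

print(identifier_chars, literal_dot_plus, not_digits, caret_literal)
```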

Summarizing Data

import pandas as pd

# placeholder data for illustration; replace with your own DataFrame
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})

# Look at a bit of the data
df.head()
df.tail()

# summary
df.describe()
df.quantile()

# cov and corr
# DataFrame's corr and cov methods return a full correlation or covariance matrix as a DataFrame, respectively
df.corr()
df.cov()

# to calculate pairwise correlation between a DataFrame's columns or rows and a Series or another DataFrame
# ("a" is a placeholder column; the original left the column name blank)
df.corrwith(df["a"])

# you can write your own analysis function and apply it to the DataFrame, for example:
f = lambda x: x.max() - x.min()
df.apply(f, axis=1)
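The axis argument decides whether the function sees columns or rows. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical scores, small enough to check by hand.
scores = pd.DataFrame({"a": [1, 5, 3], "b": [10, 2, 8]})

value_range = lambda x: x.max() - x.min()

# axis=0 (the default) passes each column to the function.
per_column = scores.apply(value_range)

# axis=1 passes each row to the function.
per_row = scores.apply(value_range, axis=1)

print(per_column)
print(per_row)
```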

Check for missing values

df.dropna()
df.fillna(0)
# to modify in place
df.fillna(0, inplace=True)

# fill the NaN with the column mean
# or use a naive Bayes prediction
df.fillna(df.mean())
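Before dropping or filling, it helps to count how many values are actually missing. A sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 2.0]})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = data.isna().sum()

# fillna with a Series fills each column with its own mean.
filled = data.fillna(data.mean())

print(missing_per_column)
print(filled)
```

Note that dropna and fillna return new objects unless inplace=True is passed, so the original DataFrame stays available for comparison.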

Exploratory Data Analysis

Analytic graphics

Principles of Analytic Graphics

Show comparisons
If you build a predictive model, report the performance of random guessing alongside it as a baseline.

Show causality, mechanism, explanation, systematic structure

Show multivariate data
The world is inherently multivariate

Integration of evidence

Describe and document the evidence with appropriate labels, scales, sources, etc.

Simple Summaries of Data

Two dimensions

scatterplots

smooth scatterplots

> 2 dimensions

Overlaid/multiple 2-D plots; coplots

Use color, size, shape to add dimensions

Spinning plots

Actual 3-D plots (not very useful)

Graphics File Devices

pdf: useful for line-type graphics, resizes well, not efficient if a plot has many objects/points

svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots

png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression, most web browsers can read this format natively, does not resize well

jpeg: good for photographs or natural scenes, uses lossy compression, does not resize well

tiff: bitmapped format, supports lossless compression
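The course covers these formats through R's graphics devices. As a rough Python counterpart, and assuming matplotlib is the plotting library in use (the notes do not name one), the same figure can be written to several formats with savefig. The plotted data is made up.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("x")
ax.set_ylabel("y")

# Write into a temporary directory to avoid cluttering the working directory.
out_dir = tempfile.mkdtemp()
saved = []
for ext in ("pdf", "svg", "png"):
    path = os.path.join(out_dir, f"line.{ext}")
    fig.savefig(path)  # the output format is inferred from the file extension
    saved.append(path)

plt.close(fig)
```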

Simulation in R

rnorm: generate random Normal variates with a given mean and standard deviation

dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)

pnorm: evaluate the cumulative distribution function for a Normal distribution

d for density

r for random number generation

p for cumulative distribution

q for quantile function

Setting the random number seed with set.seed ensures reproducibility

> set.seed(1)
> rnorm(5)
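In keeping with the plan to find NumPy and SciPy counterparts, the R functions above map approximately as sketched below. The pairing is my own reading rather than something stated in the course, and the seeded values will not match R's, since the two use different random number generators.

```python
import numpy as np
from scipy.stats import norm

# Seeding plays the role of set.seed(1): same seed, same draws on every run.
rng = np.random.default_rng(1)
samples = rng.normal(loc=0.0, scale=1.0, size=5)   # like rnorm(5)

density_at_zero = norm.pdf(0.0)   # like dnorm(0)
cdf_at_zero = norm.cdf(0.0)       # like pnorm(0)
q975 = norm.ppf(0.975)            # like qnorm(0.975)

print(samples, density_at_zero, cdf_at_zero, q975)
```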

