Machine Learning Pipeline

April 23, 2019 | July 8, 2019 | 布鲁斯.L | PIPELINE, machine learning

The word "pipeline" probably comes from Linux. Command-line tools in the Linux world support pipelines, i.e. the pipe mechanism, for example:

cat xxx | awk '{xxxx}' | sort | uniq

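This is a good interface convention: tools expose their functionality through a common interface, so work flows step by step like an assembly line. Machine learning processing can be a pipeline too. In fact, scikit-learn has built a complete pipeline mechanism and packaged it in the sklearn.pipeline namespace. First, let's see what this module contains: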

pipeline.FeatureUnion(transformer_list[, …])    Concatenates results of multiple transformer objects.
pipeline.Pipeline(steps[, memory])  Pipeline of transforms with a final estimator.
pipeline.make_pipeline(*steps, **kwargs)    Construct a Pipeline from the given estimators.
pipeline.make_union(*transformers, **kwargs)    Construct a FeatureUnion from the given trans

Clearly the most important ones are FeatureUnion and Pipeline. Let's look at what these two objects can do.

Pipeline

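sklearn abstracts the steps of machine learning processing as estimators. Every estimator has a fit method, which "feeds" it data for initialization or training.
There are two kinds of estimator:
1. Feature transformer (transformer)
Think of it as feature engineering: feature standardization, feature normalization, feature discretization, feature smoothing, one-hot encoding, and so on.
Every transformer has a transform method, which is applied to new input data after fit to perform the feature transformation.

2. Predictor (predictor)
These are the various models. After a model is trained with fit, it is run on the test set to predict, so all models share a common predict method.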

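The benefit of this abstraction is that it makes machine learning pipelines possible. Feature transformations can clearly be run side by side, which FeatureUnion provides, and the same transformations must be applied consistently to the training and test sets, so a pipeline is a natural way to modularize the workflow. Here is an NLP example: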

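from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# split into training and test data (X: raw text documents, y: labels, both assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# define the pipeline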
pipeline = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
y_pred = pipeline.predict(X_test)

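On the surface, training with a pipeline only needs fit and predict. Internally, as data passes from step to step, the pipeline calls fit/transform on each step automatically.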
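To make the internal fit/transform calls concrete, here is a minimal sketch that runs the same three steps once through a Pipeline and once by hand. The tiny corpus and the variable names are made up for illustration, and the fixed random_state is only there so both versions train the same forest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, purely for illustration.
docs_train = ["good movie", "bad movie", "great film", "awful film"]
labels_train = [1, 0, 1, 0]
docs_test = ["good film", "bad film"]

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(docs_train, labels_train)
pipe_pred = pipe.predict(docs_test)

# Manual version of fit: fit_transform on each transformer in order,
# then fit on the final estimator.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
counts_train = vect.fit_transform(docs_train)
weights_train = tfidf.fit_transform(counts_train)
clf.fit(weights_train, labels_train)

# Manual version of predict: transform only (no refitting) on new data,
# then predict with the final estimator.
manual_pred = clf.predict(tfidf.transform(vect.transform(docs_test)))
```

Both versions should produce the same predictions, which is the point: the pipeline is bookkeeping around calls you would otherwise write yourself, and it guarantees the test data goes through exactly the transformations fitted on the training data.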

FeatureUnion

As noted above, feature transformations often need to run in parallel, which is what FeatureUnion implements. Straight to the example:

pipeline = Pipeline([
('features', FeatureUnion([
    ('text_pipeline', Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer())
    ])),
    ('findName', FineNameExtractor())
])),

('clf', RandomForestClassifier())
])

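It turns out a pipeline can nest another pipeline, and the whole machine learning workflow runs like workers on an assembly line. The example above uses a custom pipeline component, FineNameExtractor, which is a transformer. Writing a custom transformer is actually simple: create a class that inherits from BaseEstimator and TransformerMixin. Here is the code: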

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FineNameExtractor(BaseEstimator, TransformerMixin):

    def find_name(self, text):
        # placeholder: always reports that a name was found
        return True

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.find_name)
        return pd.DataFrame(X_tagged)

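Running a pipeline still feels like it is missing something; add automatic hyperparameter tuning and it is complete. In sklearn, tuning is done with GridSearchCV, and pipeline + grid search is a perfect match. GridSearchCV itself also has fit and predict methods, so you will find that sklearn's machine learning abstractions are highly consistent and the code can be written very concisely.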
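As a rough sketch of how a pipeline and GridSearchCV fit together: parameters of a pipeline step are addressed as "<step name>__<parameter name>". The corpus, the grid values, and cv=2 below are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, purely for illustration.
docs = ["good movie", "bad movie", "great film", "awful film",
        "good film", "bad film", "great movie", "awful movie"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Each key is "<step name>__<parameter name>" of the pipeline above.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__n_estimators": [10, 20],
}

# GridSearchCV is itself an estimator: fit runs the cross-validated search
# and refits the best pipeline; predict delegates to that best pipeline.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)

best_params = search.best_params_
predictions = search.predict(["good film"])
```

Because the whole pipeline is refitted inside every cross-validation split, the vectorizer and TF-IDF weights are learned only from the training part of each split.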

python pandas key words for etl

April 18, 2019 | July 8, 2019 | 布鲁斯.L | airbnb, pandas

ETL
Extract, Transform, Load

Extract data from different sources

* csv files
* json files
* APIs
pd.read_csv('../data/population_data.csv',skiprows=4)
df.isnull().sum()
df.drop(['col'],axis=1)
pd.read_json('population_data.json',orient='records')

Transform data

* combining data from different sources
pd.concat([df1,df2])

* data cleaning
df.drop_duplicates()
df.apply()

* data types
df['totalamt'].sum()
pd.to_numeric()

* parsing dates
pd.to_datetime()

* file encodings
pip install chardet
chardet.detect()

* missing data
df.dropna()
df.apply()

* duplicate data
df['col'].nunique()
df[df['col'].str.contains('xx')]

* dummy variables
pd.get_dummies()

* remove outliers
df['col'].quantile()
boxplot
linearRegression

* scaling features
normalization

* engineering features
col1 */-+** col2
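The transform keywords above can be strung together roughly as follows. This is only a sketch on a made-up DataFrame; the column names, the "n/a" marker, and the 1.5 * IQR outlier rule are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, a bad value, and an outlier.
raw = pd.DataFrame({
    "city": ["Boston", "Boston", "Austin", "Austin", "Denver", "Denver"],
    "price": ["100", "100", "120", "n/a", "110", "5000"],
    "date": ["2019-04-01", "2019-04-01", "2019-04-02",
             "2019-04-03", "2019-04-04", "2019-04-05"],
})

df = raw.drop_duplicates().copy()                           # duplicate data
df["price"] = pd.to_numeric(df["price"], errors="coerce")   # data types ("n/a" becomes NaN)
df["date"] = pd.to_datetime(df["date"])                     # parsing dates
df = df.dropna(subset=["price"])                            # missing data

# remove outliers: keep prices within 1.5 * IQR of the quartiles
q1, q3 = df["price"].quantile(0.25), df["price"].quantile(0.75)
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# scaling features: min-max normalization to [0, 1]
price_range = df["price"].max() - df["price"].min()
df["price_scaled"] = (df["price"] - df["price"].min()) / price_range

# engineering features: derive a new column from an existing one
df["weekday"] = df["date"].dt.dayofweek

# dummy variables: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])
```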

Load

* send the transformed data to a database
* ETL Pipeline
* code an ETL pipeline
df.to_json()
df.to_csv()
df.to_sql()
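Putting the three stages together, a minimal ETL pipeline might look like the sketch below. The CSV content, the table name, and the in-memory SQLite database are stand-ins chosen so the example is self-contained; a real pipeline would read from files or APIs and write to an actual database.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a real source file.
CSV_TEXT = "country,year,population\nA,2018,100\nA,2018,100\nB,2018,\n"


def extract(source):
    """Extract: read raw rows from a CSV source (path or file-like object)."""
    return pd.read_csv(source)


def transform(df):
    """Transform: drop duplicate rows and rows without a population value."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["population"])
    return df.astype({"population": "int64"})


def load(df, conn, table):
    """Load: write the cleaned rows to a database table."""
    df.to_sql(table, conn, if_exists="replace", index=False)


conn = sqlite3.connect(":memory:")
load(transform(extract(io.StringIO(CSV_TEXT))), conn, "population")
rows = conn.execute(
    "SELECT country, year, population FROM population"
).fetchall()
```

Keeping each stage as its own function makes it easy to swap df.to_sql for df.to_csv or df.to_json without touching the extract and transform code.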

Exploration on Airbnb Boston data

April 12, 2019 | June 17, 2019 | 布鲁斯.L

Project details:

https://github.com/ahomer/airbnb_bst

Business and Data Understanding

As described on the Airbnb Kaggle dataset page, the following Airbnb activity is included in this Boston dataset:

Calendar, including listing id and the price and availability for that day
Listings, including full descriptions and average review score
Reviews, including unique id for each reviewer and detailed comments

Let us take a look at these three csv files.

Calendar

It shows that hosts are not available every day and that prices may change during the busiest seasons.

What is the most expensive season in Boston?
Which hosts are the most popular?

Listings

Summary information on listings in Boston. It contains location, host information, cleaning and guest fees, amenities and so on.
We may find some important factors that affect price.

Which factors are most strongly related to price?
How can we predict price?

Reviews

We can find many interesting opinions, such as:

What are the most attractive features? Is it a big bed, a large room or the location?
What leads to a bad impression?

Data preparation

Clean Calendar

So we can see that the most expensive season runs from August to November, especially September and October.
You can get the lowest prices if you go to Boston in February.
The most expensive listing_id is 447826. Go to Boston and experience one night.

	301
id	447826
listing_url	https://www.airbnb.com/rooms/447826
scrape_id	20160906204935
host_url	https://www.airbnb.com/users/show/2053557
name	Sweet Little House in JP, Boston
bedrooms	1
accommodates	2
bathrooms	1
amenities	{TV,"Cable TV",Internet,"Wireless Internet",Ki...

Sweet Little House in JP, Boston

Clean Listings

Let us calculate the mean/std of 'Price'.

Assuming that prices follow a normal distribution,
the price should lie between mean - 2*std and mean + 2*std.
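The mean ± 2*std rule can be sketched in pandas like this. The prices below are invented for illustration and are not taken from the Boston listings file; the column name price is also an assumption.

```python
import pandas as pd

# Made-up prices with one extreme value, purely for illustration.
listings = pd.DataFrame(
    {"price": [100.0, 120.0, 90.0, 110.0, 105.0, 95.0, 115.0, 4000.0]}
)

mean = listings["price"].mean()
std = listings["price"].std()

# Keep only rows whose price lies within two standard deviations of the mean.
lower, upper = mean - 2 * std, mean + 2 * std
cleaned = listings[listings["price"].between(lower, upper)]
```

Note that the extreme value itself inflates both the mean and the standard deviation, so this rule only removes points that are far out; a quantile-based cut is a common alternative when the distribution is heavily skewed.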

Clean Reviews

Looking through the reviews.csv file, you will find comments in several different languages. We only need to keep the English comments.
The 'langdetect' library can be used for this.

Modeling and evaluation

Let's try to predict the price based on the columns in the listing we selected.

Which factors are most strongly related to price?

Top 6 factors most strongly related to price:

bedrooms
room type : Private room
number of reviews
accommodates
bathrooms
review scores rating

Deployment

Typically, the model is deployed to a production environment behind an RPC server or HTTP server.
You can deploy the model with Tornado (a Python web framework).

Cross-Industry standard process for data mining.

April 8, 2019 | July 8, 2019 | 布鲁斯.L | data mining, machine learning

What is the methodology for data mining? The answer: CRISP-DM

It breaks down into roughly six steps:
1. Business understanding (domain knowledge)
2. Data understanding (what the metrics mean)
3. Data preparation
4. Modeling (training)
5. Evaluation
6. Deployment

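Developers often start directly at step 3 and skip steps 1 and 2, of which step 2 is the grunt work. At the evaluation stage, improving a model usually means going back to step 2 and adjusting. "Data preparation" deserves a closer look; it roughly covers:
0. Feature column selection
1. Null handling for columns and rows (drop directly, replace with a summary statistic, or predict with a linear model)
2. Categorical variable handling, one-hot encoding
3. Normalization
4. Binning continuous variables (e.g. age, income)
...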
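The data preparation steps listed above could be sketched in pandas as follows. The DataFrame, its column names, the median fill, and the age bin edges are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical user table, purely for illustration.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [23, 35, None, 51, 68],
    "income": [3000.0, 5200.0, 4100.0, None, 8000.0],
    "gender": ["F", "M", "M", None, "F"],
})

# 0. feature column selection: drop the identifier, keep candidate features
features = users[["age", "income", "gender"]].copy()

# 1. null handling: drop rows missing the categorical value,
#    replace missing numeric values with a summary statistic (the median)
features = features.dropna(subset=["gender"])
features["age"] = features["age"].fillna(features["age"].median())
features["income"] = features["income"].fillna(features["income"].median())

# 4. binning a continuous variable (age) into labeled groups
features["age_group"] = pd.cut(
    features["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)

# 3. normalization: min-max scale income to [0, 1]
income_range = features["income"].max() - features["income"].min()
features["income_scaled"] = (
    features["income"] - features["income"].min()
) / income_range

# 2. categorical variable handling: one-hot encode gender
features = pd.get_dummies(features, columns=["gender"])
```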

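From a practical standpoint, steps 5 and 6 are not always necessary; model results can be delivered directly as user segments, through an operations system or a written report.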

Win10 Windows Subsystem for Linux (Ubuntu): how to change the default user and password

April 8, 2019 | April 8, 2019 | 七喜小姐

All the other answers were helpful, but there can be other scenarios too; follow the one that matches yours. Mine was Ubuntu 16.04, so I used the following:

ubuntu1604 config --default-user

If you installed Ubuntu 18.04:

ubuntu1804 config --default-user

If you used the default one, then:

ubuntu config --default-user

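Why I am writing this down: I spent an afternoon googling before I found the answer. I had installed the Ubuntu 16.04 version, and the "ubuntu config --default-user" command given online kept failing with the following error: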

“
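ubuntu : The term 'ubuntu' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path
is correct and try again.
At line:1 char:1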
+ ubuntu config --default-user
+ ~~~~~~
+ CategoryInfo : ObjectNotFound: (ubuntu:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException”

Finally I found the correct answer.

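Cause:
When Ubuntu is installed it prompts you to set a username and password, so the root password ends up being random. You need to set root as the default account and set its password before su and sudo can be used normally.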

How do you view the accounts and passwords of the Ubuntu subsystem?

Accounts and passwords are stored in the shadow and shadow- files under %userprofile%\AppData\Local\Packages\CanonicalGroupLimited.Ubuntu16.04onWindows_79rhkp1fndgsc\LocalState\rootfs\etc.
