python pandas key words for etl

ETL
Extract, Transform, Load

Extract data from different sources

* csv files
* json files
* APIs
pd.read_csv('../data/population_data.csv',skiprows=4)
df.isnull().sum()
df.drop(['col'],axis=1)
pd.read_json('population_data.json',orient='records')
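A minimal sketch tying these extract keywords together. The file contents here are made up stand-ins (read from in-memory strings so the snippet is self-contained); with real files you would pass paths as in the lines above.

```python
import io
import pandas as pd

# Hypothetical CSV with 2 metadata lines to skip (stands in for population_data.csv)
csv_text = """Source,World Bank
Updated,2018
country,year,population
China,2017,1386395000
India,2017,1339180127
"""
df_csv = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# The same records as JSON, orient='records' (one dict per row)
json_text = ('[{"country":"China","year":2017,"population":1386395000},'
             '{"country":"India","year":2017,"population":1339180127}]')
df_json = pd.read_json(io.StringIO(json_text), orient='records')

print(df_csv.isnull().sum())   # null count per column, a quick post-extract check
```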

Transform data

* combining data from different sources
pd.concat([df1,df2])

* data cleaning
df.drop_duplicates()
df.apply()
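A short sketch of the combining and cleaning keywords above, on two made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'country': ['China', 'India'], 'population': [1386395000, 1339180127]})
df2 = pd.DataFrame({'country': ['USA', 'India'], 'population': [325719178, 1339180127]})

# Combine data from different sources
combined = pd.concat([df1, df2], ignore_index=True)

# Data cleaning: drop exact duplicate rows, then normalize a column with apply()
cleaned = combined.drop_duplicates()
cleaned = cleaned.assign(country=cleaned['country'].apply(str.upper))
```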

* data types
df['totalamt'].sum()
pd.to_numeric()

* parsing dates
pd.to_datetime()

* file encodings
pip install chardet
chardet.detect()
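A sketch of encoding detection with chardet. In practice you would pass bytes read from a file opened in 'rb' mode; here the bytes are built inline so the snippet stands alone:

```python
import chardet  # pip install chardet

# Bytes whose encoding is "unknown" (in real use: open(path, 'rb').read())
raw = 'Boston 数据探索:价格、评论与房源信息'.encode('utf-8')

result = chardet.detect(raw)          # dict with 'encoding' and 'confidence' keys
text = raw.decode(result['encoding'])
```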

* missing data
df.dropna()
df.apply()

* duplicate data
df['col'].nunique()
df[df['col'].str.contains('xx')]
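The missing-data and duplicate-data keywords above, in one runnable sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'project': ['dam', 'dam', 'road', None],
    'country': ['China', 'China', 'India', 'India'],
})

# Missing data: drop rows that contain any null value
no_missing = df.dropna()

# Duplicate data: count distinct values, then inspect suspicious rows
print(df['country'].nunique())                       # distinct countries
suspects = df[df['project'].str.contains('dam', na=False)]  # na=False skips nulls
```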

* dummy variables
pd.get_dummies()
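A sketch of dummy (one-hot) encoding; the 'room_type' column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({'room_type': ['Private room', 'Entire home', 'Private room']})

# One-hot encode the categorical column into indicator columns
dummies = pd.get_dummies(df['room_type'], prefix='room_type')
df = pd.concat([df.drop('room_type', axis=1), dummies], axis=1)
```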

* remove outliers
df['col'].quantile()
boxplot (spot outliers visually)
LinearRegression (check their influence on a fitted model)
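A quantile-based trim, as the keyword above suggests (the cutoff percentiles are illustrative):

```python
import pandas as pd

prices = pd.DataFrame({'price': [80, 95, 100, 110, 120, 4000]})  # 4000 is an outlier

# Keep only values between the 1st and 99th percentiles
low, high = prices['price'].quantile([0.01, 0.99])
trimmed = prices[(prices['price'] >= low) & (prices['price'] <= high)]
```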

* scaling features
normalization

* engineering features
col1 + col2, col1 - col2, col1 * col2, col1 / col2, col1 ** col2
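Scaling and feature engineering together in one sketch (min-max normalization for the scaling step; the derived-column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [10.0, 20.0, 40.0]})

# Scaling: min-max normalization to the [0, 1] range
df['col1_scaled'] = (df['col1'] - df['col1'].min()) / (df['col1'].max() - df['col1'].min())

# Feature engineering: derive new columns from arithmetic combinations
df['sum_feat'] = df['col1'] + df['col2']
df['ratio_feat'] = df['col2'] / df['col1']
df['inter_feat'] = df['col1'] * df['col2']
```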

Load

* send the transformed data to a database
* ETL Pipeline
* code an ETL pipeline
df.to_json()
df.to_csv()
df.to_sql()
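The load step in one sketch. An in-memory SQLite database stands in for the real target so the snippet is self-contained; the table name is arbitrary:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'country': ['China', 'India'], 'population': [1386395000, 1339180127]})

# Load: serialize the transformed data, and push it to a database
csv_out = df.to_csv(index=False)           # to_csv with no path returns a string
json_out = df.to_json(orient='records')

conn = sqlite3.connect(':memory:')         # in-memory SQLite for the sketch
df.to_sql('population', conn, index=False)
back = pd.read_sql('SELECT * FROM population', conn)
conn.close()
```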

Exploration on Airbnb Boston data

项目详见:

https://github.com/ahomer/airbnb_bst

Business and Data Understanding

As described on the Airbnb Kaggle dataset page, the following Airbnb activity is included in this Boston dataset:

  • Calendar, including listing id and the price and availability for that day
  • Listings, including full descriptions and average review score
  • Reviews, including unique id for each reviewer and detailed comments

Let us take a look at these three CSV files.

Calendar

It shows that hosts are not available every day, and prices may change during the busiest seasons.

  • What is the most expensive season in Boston?
  • Which hosts are the most popular?

Listings

Summary information on listings in Boston. It contains location, host information, cleaning and guest fees, amenities, and so on.
We may find some important factors that influence price.

  • What are the top factors strongly related to price?
  • How can we predict the price?

Reviews

We can find many interesting opinions, such as

  • What are the most attractive facilities? Is it a big bed, a large room, or the location?
  • What leads to a bad impression?

Data preparation

Clean Calendar


  • We can see that the most expensive season runs from August to November, especially September and October.

  • You can get the lowest price if you go to Boston in February.

  • The most expensive listing_id is 447826. Go to Boston and experience a night there.

Listing at index 301:
id 447826
listing_url https://www.airbnb.com/rooms/447826
scrape_id 20160906204935
host_url https://www.airbnb.com/users/show/2053557
name Sweet Little House in JP, Boston
bedrooms 1
accommodates 2
bathrooms 1
amenities {TV,"Cable TV",Internet,"Wireless Internet",Ki...

Sweet Little House in JP, Boston

Clean Listings

Let us calculate the mean and standard deviation of 'price'.

  • Assuming that prices obey a normal distribution
  • The price should fall between mean - 2·std and mean + 2·std
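The filter described above can be sketched like this (the prices are made up; the column name 'price' follows the listings file):

```python
import pandas as pd

listings = pd.DataFrame({'price': [50.0, 80.0, 100.0, 120.0, 150.0, 4000.0]})

mean, std = listings['price'].mean(), listings['price'].std()
# Keep prices within mean ± 2*std, assuming a roughly normal distribution
kept = listings[(listings['price'] >= mean - 2 * std) &
                (listings['price'] <= mean + 2 * std)]
```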


Clean Reviews

Reviewing the reviews.csv file, you will find comments in different languages. We only need to keep the English comments.
For that we need the 'langdetect' library.

Modeling and evaluation

Let's try to predict the price based on the columns we selected from the listings.

  • What are the top factors strongly related to price?


Top 6 factors strongly related to price:

  • bedrooms
  • room type : Private room
  • number of reviews
  • accommodates
  • bathrooms
  • review scores rating
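A minimal sketch of the kind of linear model behind this ranking. The data is synthetic (the real project fits the listings features above); the coefficients of the fitted model indicate each feature's relation to price:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for three of the selected listing features
X = np.column_stack([
    rng.integers(1, 5, n),     # bedrooms
    rng.integers(1, 3, n),     # bathrooms
    rng.integers(1, 7, n),     # accommodates
])
price = 40 * X[:, 0] + 25 * X[:, 1] + 15 * X[:, 2] + rng.normal(0, 5, n)

model = LinearRegression().fit(X, price)
print(dict(zip(['bedrooms', 'bathrooms', 'accommodates'], model.coef_.round(1))))
```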

Deployment

Usually, the model will be deployed in a production environment behind an RPC or HTTP server.
You can deploy the model with Tornado (a Python web framework).


Cross-Industry standard process for data mining.

Ask for a data-mining methodology, and the answer is: CRISP-DM.

It can be roughly divided into six steps:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

Developers often jump straight to step 3 and skip 1 and 2, even though 2 is where the heavy lifting happens. Judging from model evaluation, optimization usually loops back to step 2 for adjustments. It is worth stressing that "data preparation" roughly includes:
0. Selecting feature columns
1. Handling nulls by column and by row (drop directly, replace with a statistic, or predict linearly)
2. Handling categorical variables with one-hot encoding
3. Normalization
4. Binning continuous variables (e.g. age, income)
...
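The data-preparation steps above can be sketched in pandas (the columns and thresholds are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'age': [23, None, 45, 31, 52],
    'gender': ['M', 'F', 'F', 'M', None],
    'income': [3000, 5000, 12000, 8000, 20000],
})

# 1. Null handling: fill a numeric column with a statistic, drop rows still incomplete
df['age'] = df['age'].fillna(df['age'].median())
df = df.dropna()

# 2. Categorical variables: one-hot encoding
df = pd.concat([df.drop('gender', axis=1),
                pd.get_dummies(df['gender'], prefix='gender')], axis=1)

# 3. Normalization (min-max)
df['income_norm'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# 4. Binning a continuous variable
df['age_band'] = pd.cut(df['age'], bins=[0, 30, 45, 100], labels=['young', 'mid', 'senior'])
```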


From a practical standpoint, steps 5 and 6 are not always required; model results can be delivered directly as user segments, through an operations system or a written report.


Windows 10 Subsystem for Linux (Ubuntu): how to change the default user and password

All the other answers were helpful, but there can be other scenarios too; follow the one that matches yours. Mine was Ubuntu 16.04, so I used the following:

ubuntu1604 config --default-user

If you installed Ubuntu 18.04:

ubuntu1804 config --default-user

If you used the default distribution, then:

ubuntu config --default-user

Why this post: I googled for an entire afternoon to find the answer. I had installed Ubuntu 16.04, and the commonly cited command "ubuntu config --default-user" kept failing with the error below:


ubuntu : The term 'ubuntu' is not recognized as the name of a cmdlet, function, script
file, or operable program. Check the spelling of the name, or if a path was included,
verify that the path is correct and try again.
At line:1 char:1
+ ubuntu config --default-user
+ ~~~~~~
    + CategoryInfo          : ObjectNotFound: (ubuntu:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

I finally found the correct answer.

Cause:
Ubuntu prompts you to set a username and password during installation, so the root password is random. You need to make root the default user and set its password before su and sudo work normally.

  • How do you view the accounts and passwords of the Ubuntu subsystem?

The accounts and passwords are stored in the shadow and shadow- files under %userprofile%\AppData\Local\Packages\CanonicalGroupLimited.Ubuntu16.04onWindows_79rhkp1fndgsc\LocalState\rootfs\etc.


Designing and building user profiles

Once your app is instrumented with event tracking, you will continuously receive all kinds of user behavior data. To run user operations well, you need to distill the raw data into "user tags"; this tagging of users is what building a user profile means. Traditional user profiles can be roughly divided into three types by how they are generated:

  • Basic attributes: extracted directly from user-reported attributes, e.g. demographic attributes such as age, gender, and region
  • Statistical attributes: aggregated from user behavior data by clustering and counting, weighted by probability and decayed over time
  • Value attributes: predicted by algorithms from fused multi-dimensional features, yielding high-potential attributes such as likely-to-pay or likely-to-churn; usually probabilistic models

How are the user profiles described above generated? From a data-processing perspective, there are three stages:

  • Base data processing: building the ODS layer
  • ETL of intermediate profile data: extracting behavioral preferences
  • Building the profile wide table: the final user-profile data

Breaking these stages down further, we get:
1. Design the user tag taxonomy: engineers and senior business staff design tag categories from business characteristics, i.e. from the goals of fine-grained operations
2. Organize tracking data: once tags are designed, data engineers and front-end engineers define the tracking-data spec together; user behavior data is only as complete as the tracking
3. Link multiple data sources: different sources need a unified user id, e.g. the WeChat mini-program openid
4. Data fusion: extract multiple kinds of behavior data
5. Rule-based tag extraction and generation
6. Tag extraction based on clustering results
7. Algorithmic feature mining and profile modeling
8. Merge tags: combine multiple tag results into one wide table
9. Tag quality analysis and monitoring: detect missing tags in time and keep tag quality under control
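Steps 5 and 8 can be sketched in pandas. The user stats, tag names, and thresholds here are all invented for illustration:

```python
import pandas as pd

# Hypothetical per-user behavior stats aggregated from tracking events
users = pd.DataFrame({
    'openid': ['u1', 'u2', 'u3'],
    'orders_30d': [5, 0, 1],
    'days_since_last_visit': [2, 40, 10],
})

# Rule-based tag extraction (step 5): thresholds are illustrative
users['tag_active_buyer'] = (users['orders_30d'] >= 3).astype(int)
users['tag_churn_risk'] = (users['days_since_last_visit'] > 30).astype(int)

# Merge tags into one wide table keyed by the unified user id (step 8)
profile_wide = users[['openid', 'tag_active_buyer', 'tag_churn_risk']]
```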

In the end, tags are just a way of progressively segmenting users. You still need a system that can reach users, and marketing strategies that can move them. Only by segmenting users and applying the right strategy through efficient reach channels can you achieve precise, targeted operations.
