ETL
Extract, Transform, Load
Extract data from different sources
* csv files
* json files
* APIs
df = pd.read_csv('../data/population_data.csv', skiprows=4)
df.isnull().sum()
df = df.drop(['col'], axis=1)
pd.read_json('population_data.json',orient='records')
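A minimal sketch of the extract step, using in-memory text instead of real files so it runs anywhere. The sample rows, the `Unused` column, and the JSON keys are made up for illustration; only `skiprows=4` and `orient='records'` come from the notes above.

```python
import io
import pandas as pd

# Hypothetical CSV text: four metadata lines precede the real header row.
csv_text = (
    "Data Source,Example Indicators\n"
    "Last Updated,2018-01-01\n"
    "Note,hypothetical sample\n"
    "Note,values are made up\n"
    "Country Name,Country Code,2016,2017,Unused\n"
    "Aruba,ABW,100,110,\n"
    "Angola,AGO,200,,\n"
)

# skiprows=4 skips the metadata so the fifth line becomes the header.
df = pd.read_csv(io.StringIO(csv_text), skiprows=4)

# Count missing values per column, then drop a column that is entirely empty.
missing_per_column = df.isnull().sum()
df = df.drop(['Unused'], axis=1)

# Hypothetical JSON in "records" layout: a list of one dict per row.
json_text = '[{"country": "Aruba", "population": 110}, {"country": "Angola", "population": 210}]'
df_json = pd.read_json(io.StringIO(json_text), orient='records')
```

Note that `drop` returns a new DataFrame, so the result has to be assigned back (or `inplace=True` used) for the column to actually disappear.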
Transform data
* combining data from different sources
pd.concat([df1,df2])
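A small sketch of stacking two frames with `pd.concat`. The frames and column names are hypothetical; the point is that columns present in only one frame are filled with NaN in the other's rows.

```python
import pandas as pd

# Two hypothetical sources with overlapping but not identical columns.
df1 = pd.DataFrame({'country': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'country': ['C'], 'value': [3], 'year': [2017]})

# Stack rows; ignore_index=True rebuilds a clean 0..n-1 index.
combined = pd.concat([df1, df2], ignore_index=True)
```

Without `ignore_index=True` the original row labels are kept, so the combined frame would contain the label 0 twice.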
* data cleaning
df.drop_duplicates()
df.apply()
* data types
df['totalamt'].sum()
pd.to_numeric()
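A sketch of why the data type matters before calling `sum()`: amounts stored as text (here with thousands separators) have to be converted first. The values are hypothetical; `totalamt` is the column name used in the notes above.

```python
import pandas as pd

# Hypothetical amounts read in as strings with thousands separators.
df = pd.DataFrame({'totalamt': ['1,000', '2,500', '300']})

# Strip the separators, then convert the text to a numeric dtype.
df['totalamt'] = pd.to_numeric(df['totalamt'].str.replace(',', '', regex=False))

# Now sum() adds numbers instead of operating on text.
total = df['totalamt'].sum()
```

If some values cannot be parsed, `pd.to_numeric(..., errors='coerce')` turns them into NaN instead of raising.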
* parsing dates
pd.to_datetime()
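A minimal sketch of date parsing with an explicit format. The column name and the day/month/year format are assumptions for illustration; `errors='coerce'` turns unparseable entries into NaT rather than raising.

```python
import pandas as pd

# Hypothetical date strings in day/month/year order, plus one bad value.
df = pd.DataFrame({'date': ['28/06/2018', '01/07/2018', 'not a date']})

# Parse with an explicit format; invalid entries become NaT.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y', errors='coerce')

# The .dt accessor exposes date parts once the column is datetime.
df['month'] = df['date'].dt.month
```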
* file encodings
pip install chardet
chardet.detect()
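A sketch of what goes wrong with the wrong encoding and how passing the right one to `read_csv` fixes it. To stay dependency-free, the encoding is hard-coded here as an assumption; in practice it could come from `chardet.detect(raw)['encoding']`. The sample bytes are made up.

```python
import io
import pandas as pd

# Hypothetical file content saved as Latin-1 rather than UTF-8.
raw = "name,city\nJosé,São Paulo\n".encode('latin-1')

# Reading with the wrong encoding fails on the accented bytes.
try:
    pd.read_csv(io.BytesIO(raw), encoding='utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumed encoding (would normally be detected, not hard-coded).
df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
```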
* missing data
df.dropna()
df.apply()
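A sketch of the two usual options for missing data: drop the rows, or fill the gaps. The fill shown here uses a group mean via `groupby(...).transform(...)`, which is one common choice and not necessarily what the notes intended with `df.apply()`. Column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing gdp value.
df = pd.DataFrame({
    'region': ['X', 'X', 'Y', 'Y'],
    'gdp': [1.0, None, 3.0, 5.0],
})

# Option 1: drop rows where gdp is missing.
dropped = df.dropna(subset=['gdp'])

# Option 2: fill each gap with the mean gdp of the same region.
filled = df.copy()
filled['gdp'] = filled.groupby('region')['gdp'].transform(
    lambda s: s.fillna(s.mean())
)
```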
* duplicate data
df['col'].nunique()
df[df['col'].str.contains('xx')]
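A sketch of hunting for near-duplicates: count the distinct values, search for a suspicious substring, then normalize and deduplicate. The company names and the `company_clean` helper column are hypothetical.

```python
import pandas as pd

# Hypothetical names where the same entity is spelled differently.
df = pd.DataFrame({'company': ['Acme Corp', 'ACME corp ', 'Beta Ltd', 'Beta Ltd']})

# Number of distinct raw spellings.
n_unique = df['company'].nunique()

# Rows whose name contains "acme", ignoring case.
matches = df[df['company'].str.contains('acme', case=False)]

# Normalize whitespace and case, then keep one row per cleaned name.
df['company_clean'] = df['company'].apply(lambda s: s.strip().lower())
deduped = df.drop_duplicates(subset='company_clean')
```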
* dummy variables
pd.get_dummies()
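A minimal sketch of turning a categorical column into dummy (one-hot) columns. The `sector` and `amount` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({'sector': ['energy', 'health', 'energy'], 'amount': [10, 20, 30]})

# Replace 'sector' with one indicator column per category.
encoded = pd.get_dummies(df, columns=['sector'], prefix='sector')
```

For a linear model, `drop_first=True` can be passed to avoid perfectly collinear dummy columns.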
* remove outliers
df['col'].quantile()
boxplot
LinearRegression (scikit-learn)
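A sketch of quantile-based outlier removal using the interquartile range (IQR) rule, the same 1.5 x IQR convention that boxplot whiskers follow. The notes only mention `quantile()`, so the IQR rule itself is an assumed choice, and the data is made up.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
df = pd.DataFrame({'gdp': [10, 12, 11, 13, 12, 100]})

# First and third quartiles and the interquartile range.
q1 = df['gdp'].quantile(0.25)
q3 = df['gdp'].quantile(0.75)
iqr = q3 - q1

# Keep only rows within 1.5 * IQR of the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
no_outliers = df[(df['gdp'] >= lower) & (df['gdp'] <= upper)]
```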
* scaling features
normalization
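A sketch of two common ways to scale a feature: min-max normalization to the range [0, 1], and standardization to zero mean and unit standard deviation. The notes only name normalization; standardization is added here for comparison. Column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical feature column.
df = pd.DataFrame({'gdp': [10.0, 20.0, 30.0]})

# Min-max normalization: (x - min) / (max - min).
gdp_min = df['gdp'].min()
gdp_max = df['gdp'].max()
df['gdp_minmax'] = (df['gdp'] - gdp_min) / (gdp_max - gdp_min)

# Standardization: (x - mean) / std (pandas uses the sample std by default).
df['gdp_standardized'] = (df['gdp'] - df['gdp'].mean()) / df['gdp'].std()
```

The min and max (or mean and std) should be computed once and reused for any new data, so all rows are scaled consistently.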
* engineering features
df['new_col'] = df['col1'] / df['col2']   (also *, -, +, **)
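A short sketch of engineering new features from arithmetic on existing columns. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical inputs.
df = pd.DataFrame({'gdp': [100.0, 200.0], 'population': [10, 50]})

# Ratio of two columns.
df['gdp_per_capita'] = df['gdp'] / df['population']

# Polynomial term of a single column.
df['population_squared'] = df['population'] ** 2
```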
Load
* send the transformed data to a database
* ETL Pipeline
* code an ETL pipeline
df.to_json()
df.to_csv()
df.to_sql()
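A minimal end-to-end sketch of a pipeline, under stated assumptions: the input is a small made-up CSV, the target is an in-memory SQLite database, and the function names `extract`, `transform`, `load` and the table name `population` are hypothetical.

```python
import io
import sqlite3
import pandas as pd

def extract(csv_source):
    """Read the raw data (a path or file-like object)."""
    return pd.read_csv(csv_source)

def transform(df):
    """Drop duplicate rows and rows without a population, then fix the dtype."""
    cleaned = df.drop_duplicates().dropna(subset=['population']).copy()
    cleaned['population'] = cleaned['population'].astype(int)
    return cleaned

def load(df, conn, table_name):
    """Write the frame to a SQL table, replacing it if it already exists."""
    df.to_sql(table_name, conn, if_exists='replace', index=False)

# Hypothetical raw data: one duplicate row and one missing value.
raw_csv = io.StringIO("country,population\nA,100\nA,100\nB,\nC,300\n")

conn = sqlite3.connect(':memory:')
cleaned = transform(extract(raw_csv))
load(cleaned, conn, 'population')

# Read the table back to check what was stored.
result = pd.read_sql('SELECT * FROM population', conn)

# The same frame can also be written out as CSV text.
csv_out = cleaned.to_csv(index=False)
```

Keeping each stage in its own function makes it easy to swap the source or the destination, for example replacing `to_sql` with `to_json` or `to_csv`.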