Python pandas keywords for ETL

ETL
Extract, Transform, Load

Extract data from different sources

* csv files
* json files
* APIs
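
df = pd.read_csv('../data/population_data.csv', skiprows=4)  # skip the 4 lines above the header row
df.isnull().sum()  # count missing values per column
df = df.drop(['col'], axis=1)  # drop a column that is not needed
df = pd.read_json('population_data.json', orient='records')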
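
A minimal sketch of the extract calls above, run against in-memory text instead of the real files. The CSV/JSON contents and the column names are made up for illustration; only the pandas calls come from the notes.

```python
import io
import pandas as pd

# Hypothetical CSV text: four metadata lines sit above the real header row.
csv_text = (
    "Data Source,World Development Indicators\n"
    "Last Updated Date,2018-06-28\n"
    "Notes,illustrative sample only\n"
    "Contact,not a real file\n"
    "Country Name,Country Code,2016,2017\n"
    "Aruba,ABW,104822,105264\n"
    "Angola,AGO,28813463,\n"
)

# skiprows=4 skips the metadata lines so the fifth line becomes the header.
df_csv = pd.read_csv(io.StringIO(csv_text), skiprows=4)

# Count missing values per column (Angola has no 2017 value here).
missing_per_column = df_csv.isnull().sum()

# Drop a column that is not needed downstream.
df_trimmed = df_csv.drop(["Country Code"], axis=1)

# Hypothetical JSON in "records" orientation: a list of row objects.
json_text = (
    '[{"country": "Aruba", "population": 104822},'
    ' {"country": "Angola", "population": 28813463}]'
)
df_json = pd.read_json(io.StringIO(json_text), orient="records")
```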

Transform data

* combining data from different sources
pd.concat([df1,df2])
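
A small sketch of stacking two frames with the same columns. The frames and their values are invented; `ignore_index=True` is an optional extra that renumbers the rows.

```python
import pandas as pd

# Two hypothetical sources that share the same columns.
df1 = pd.DataFrame({"country": ["Aruba", "Angola"], "population": [105264, 29784193]})
df2 = pd.DataFrame({"country": ["Albania"], "population": [2873457]})

# Stack the rows; ignore_index=True gives a fresh 0..n-1 index.
combined = pd.concat([df1, df2], ignore_index=True)
```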

* data cleaning
df.drop_duplicates()
df.apply()
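
One way these two calls might be combined, assuming the mess is inconsistent spacing and casing in a text column. The column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical messy column: same country written three ways.
df = pd.DataFrame({"country": [" aruba", "Aruba ", "ANGOLA"]})

# apply() runs a function on every value: trim spaces, normalize casing.
df["country"] = df["country"].apply(lambda s: s.strip().title())

# After normalizing, exact duplicates can be dropped.
cleaned = df.drop_duplicates().reset_index(drop=True)
```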

* data types
df['totalamt'].sum()
pd.to_numeric()
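
A sketch of why the type matters before calling `sum()`: amounts stored as text need converting first. The sample values are made up, and `errors="coerce"` is one possible choice that turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical amounts stored as text, with a thousands separator and a bad value.
df = pd.DataFrame({"totalamt": ["1,000", "250", "n/a"]})

# Remove the separator, then convert; values that cannot be parsed become NaN.
df["totalamt"] = pd.to_numeric(
    df["totalamt"].str.replace(",", "", regex=False), errors="coerce"
)

# sum() now adds numbers and skips NaN by default.
total = df["totalamt"].sum()
```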

* parsing dates
pd.to_datetime()
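
A short sketch of parsing a text column into dates. The column names `boardapprovaldate` and `approved` and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical date strings, one of them unusable.
df = pd.DataFrame({"boardapprovaldate": ["2018-06-28", "2017-01-15", "unknown"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
df["approved"] = pd.to_datetime(df["boardapprovaldate"], errors="coerce")

# Once parsed, date parts are available through the .dt accessor.
df["approved_year"] = df["approved"].dt.year
```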

* file encodings
pip install chardet
chardet.detect()
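
A rough sketch of guessing an encoding with chardet and passing it to pandas. It assumes chardet is installed; the bytes are invented sample data, and the detected encoding is only a guess with a confidence score, so it is worth checking the result by eye.

```python
import io
import chardet
import pandas as pd

# Hypothetical file contents as raw bytes (normally read with open(path, "rb")).
raw = (
    "country,capital\n"
    "Côte d'Ivoire,Yamoussoukro\n"
    "São Tomé and Príncipe,São Tomé\n"
    "Curaçao,Willemstad\n"
).encode("utf-8")

# detect() returns a dict that includes 'encoding' and 'confidence'.
guess = chardet.detect(raw)

# Fall back to UTF-8 if chardet could not decide (an assumption of this sketch).
encoding = guess["encoding"] or "utf-8"

df = pd.read_csv(io.BytesIO(raw), encoding=encoding)
```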

* missing data
df.dropna()
df.apply()
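
Two common options for missing values, sketched on made-up data: drop the incomplete rows, or fill the gaps. Filling with the column mean via `fillna` is just one possible strategy and is not taken from the notes.

```python
import pandas as pd

# Hypothetical frame with one missing gdp value.
df = pd.DataFrame({"country": ["A", "B", "C"], "gdp": [1.0, None, 3.0]})

# Option 1: drop rows where gdp is missing.
dropped = df.dropna(subset=["gdp"])

# Option 2: fill the gap with the column mean (mean() skips NaN).
filled = df.fillna({"gdp": df["gdp"].mean()})
```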

* duplicate data
df['col'].nunique()
df[df['col'].str.contains('xx')]
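
A sketch of how these calls can help spot duplicates: exact repeats, and near-duplicates where the same entity appears under different names. The country names are sample values only.

```python
import pandas as pd

# Hypothetical column with an exact repeat and two spellings of one entity.
df = pd.DataFrame({
    "country": [
        "Yugoslavia",
        "Socialist Federal Republic of Yugoslavia",
        "Kenya",
        "Kenya",
    ]
})

# Number of distinct values in the column.
distinct = df["country"].nunique()

# Rows whose value contains a substring, useful for finding name variants.
variants = df[df["country"].str.contains("Yugoslavia")]

# Exact duplicate rows can be removed directly.
deduped = df.drop_duplicates()
```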

* dummy variables
pd.get_dummies()
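
A minimal sketch of turning a categorical column into 0/1 indicator columns. The `sector` column is hypothetical; `dtype=int` is used here so the output is 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"sector": ["Energy", "Health", "Energy"]})

# One indicator column per category, named with the given prefix.
dummies = pd.get_dummies(df["sector"], prefix="sector", dtype=int)

# Attach the indicators and drop the original text column.
encoded = pd.concat([df.drop("sector", axis=1), dummies], axis=1)
```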

* remove outliers
df['col'].quantile()
boxplot
LinearRegression
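
A sketch of one common quantile-based rule (the 1.5 x IQR rule that box plot whiskers usually follow). The data and the 1.5 factor are assumptions; the notes only name `quantile()`.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
df = pd.DataFrame({"gdp": [10, 12, 11, 13, 12, 100]})

# First and third quartiles and the interquartile range.
q1 = df["gdp"].quantile(0.25)
q3 = df["gdp"].quantile(0.75)
iqr = q3 - q1

# Keep rows within 1.5 * IQR of the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = df[(df["gdp"] >= lower) & (df["gdp"] <= upper)]
```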

* scaling features
normalization
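
A sketch of min-max normalization, which rescales a column to the range 0 to 1. This is one of several scaling methods; the column and values are made up.

```python
import pandas as pd

# Hypothetical numeric column.
df = pd.DataFrame({"gdp": [10.0, 20.0, 30.0]})

# Min-max normalization: (x - min) / (max - min).
col_min, col_max = df["gdp"].min(), df["gdp"].max()
df["gdp_scaled"] = (df["gdp"] - col_min) / (col_max - col_min)
```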

* engineering features
new columns from arithmetic on existing ones, e.g. df['col1'] / df['col2'] (also *, -, +, **)
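
A small sketch of building new features from column arithmetic. The columns `gdp` and `population` and the derived names are hypothetical.

```python
import pandas as pd

# Hypothetical base columns.
df = pd.DataFrame({"gdp": [100.0, 300.0], "population": [10, 20]})

# Ratio of two columns as a new feature.
df["gdp_per_capita"] = df["gdp"] / df["population"]

# Power of a single column as another feature.
df["population_squared"] = df["population"] ** 2
```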

Load

* send the transformed data to a database
* ETL Pipeline
* code an ETL pipeline
df.to_json()
df.to_csv()
df.to_sql()
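
A rough end-to-end sketch tying the three stages together, using an in-memory SQLite database as the target. The function names, table name, and sample data are all assumptions; a real pipeline would read from files or APIs and write to a persistent database.

```python
import io
import sqlite3
import pandas as pd


def extract(csv_text):
    # Extract: read raw CSV text into a DataFrame.
    return pd.read_csv(io.StringIO(csv_text))


def transform(df):
    # Transform: remove duplicate rows and rows without a population value.
    df = df.drop_duplicates().dropna(subset=["population"])
    df = df.assign(population=pd.to_numeric(df["population"]))
    return df.reset_index(drop=True)


def load(df, conn, table):
    # Load: write the frame to a database table, replacing any old copy.
    df.to_sql(table, conn, if_exists="replace", index=False)


# Hypothetical input: one duplicate row and one row with a missing value.
csv_text = "country,population\nAruba,105264\nAruba,105264\nAngola,\n"

conn = sqlite3.connect(":memory:")
load(transform(extract(csv_text)), conn, "population")

# Read the table back to check what was stored.
stored = pd.read_sql("SELECT * FROM population", conn)

# The same frame can also be exported as JSON or CSV text.
as_json = stored.to_json(orient="records")
as_csv = stored.to_csv(index=False)
```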