Tabular feature engineering tricks
Some useful tricks for feature engineering on tabular datasets
One of the most important challenges when working with a tabular dataset, before feeding it into a machine learning model, is preprocessing the data: the quality of the data and the useful information that can be derived from it directly affect the model's ability to learn. In real-world applications, raw datasets tend to be messy and need some skill to clean.
This blog will cover some tricks that I found quite useful when dealing with messy data.
Most of the time, when working with a tabular dataset, we import the data into a pandas DataFrame. Knowing how to handle pandas well saves a lot of computation time and gives better code readability and versatility.
Pandas profiling is an open-source Python module with which we can quickly do exploratory data analysis in just a few lines of code.
from pandas_profiling import ProfileReport
profile = ProfileReport(dataframe)
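If we want to keep the report, it can also be exported as a standalone HTML file (a small addition using the standard pandas_profiling API):
profile.to_file("report.html")  # writes the full report to an HTML file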
We often have to work with a DataFrame containing many columns, which get collapsed when displaying it. If we want to see all columns of the DataFrame, we can configure the pandas display options:
import pandas as pd

pd.options.display.max_columns = 500
# or
pd.set_option('display.max_columns', 500)
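If we only need the wider display occasionally, pandas also provides a context manager so the change does not leak outside the block (a small extra example, not from the original post):
with pd.option_context('display.max_columns', None):
    print(df)  # df stands for any wide DataFrame we want to inspect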
df = pd.DataFrame({'Name_Age': ['Smith_32', 'Nadal', 'Federer_36']})
df
df['Name_Age'].str.split('_', expand=True)
If we want to keep only one element of the split, for example the second one (the age):
df['Name_Age'].str.split('_').str[1]
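If we want to put both parts back into the DataFrame as separate columns, we can combine the split with expand=True and a column assignment (a small sketch; the column names Name and Age are just illustrative):
df[['Name', 'Age']] = df['Name_Age'].str.split('_', expand=True)
df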
Another handy method is .str.get_dummies, which turns a delimited multi-label column into one indicator column per label:
import numpy as np

df = pd.Series(['Sector 1;Sector 2', 'Sector 1;Sector 3', np.nan, 'Sector 2;Sector 4'], dtype=str)
df
df.str.get_dummies(';')
df = pd.DataFrame({'Location' : [
'Washington, D.C. 20003',
'Brooklyn, NY 11211-1755',
'Omaha, NE 68154',
'Pittsburgh, PA 15211'
]})
df
If we want to separate out the three city/state/ZIP components neatly into DataFrame columns, we can pass a regex with named groups to .str.extract():
regex = (r'(?P<city>[A-Za-z ]+), '      # one or more letters and spaces
         r'(?P<state>[A-Z]{2}) '        # two capital letters
         r'(?P<zip>\d{5}(?:-\d{4})?)')  # 5 digits with an optional 4-digit extension
df['Location'].str.replace('.', '', regex=False).str.extract(regex)
.str is the accessor for string (object) data. It maps to the StringMethods class, which contains a lot of methods such as cat, split, rsplit, replace, extract, ...
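For instance, a quick illustration of two of these methods on the Location column above (not part of the original post):
# concatenate all values into one string
df['Location'].str.cat(sep=' | ')
# split once from the right, separating the ZIP code from the rest
df['Location'].str.rsplit(' ', n=1)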
We often use a lambda function with apply to process each row of a DataFrame. Sometimes this is not efficient in terms of computation time. We can actually boost the performance by using the Cython extension.
df = pd.DataFrame({'a': np.random.randn(1000),
'b': np.random.randn(1000),
'N': np.random.randint(100, 1000, (1000)),
'x': 'x'})
def f(x):
return x * (x - 1)
def integrate_f(a, b, N):
s = 0
dx = (b - a) / N
for i in range(N):
s += f(a + i * dx)
return s * dx
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
Now we compile the same functions with the Cython extension and call them from the same lambda.
%load_ext Cython
%%cython
def f(x):
return x * (x - 1)
def integrate_f(a, b, N):
s = 0
dx = (b - a) / N
for i in range(N):
s += f(a + i * dx)
return s * dx
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
By simply compiling these functions with the Cython extension, we already cut the computation time significantly.
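If we go one step further and declare C types for the arguments and the loop variable, following the same pattern as the pandas performance guide (this typed variant is a sketch, not part of the original post), the speedup is usually much larger:
%%cython
# typed helper: operates on C doubles instead of Python objects
cdef double f_typed(double x):
    return x * (x - 1)

# cpdef makes the typed function callable from Python as well
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s = 0
    cdef double dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
%timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)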
Groupby is one of the most powerful pandas methods. It allows us to split the data into groups based on some criteria, compute aggregate statistics on each group, apply transformations such as standardizing the data within a group, and much more.
data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
        'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
        'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
        'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(data)
df
df.groupby('Year')['Points'].agg(['mean', 'sum', 'min', 'max', 'std', 'var', 'count'])
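If we also want to control the names of the resulting columns, pandas named aggregation makes this explicit (a small sketch; the output column names below are just illustrative):
df.groupby('Team').agg(points_mean=('Points', 'mean'),
                       points_max=('Points', 'max'),
                       n_rows=('Points', 'size'))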
We can also aggregate using NumPy functions such as np.size, np.mean, np.max, ...
# size of each group (row count), per column
df.groupby('Team').agg(np.size)
# standardize each numeric column within its team (then scale by 10)
df.groupby('Team').transform(lambda x: (x - x.mean()) / x.std() * 10)
# keep only the teams that appear at least 3 times
df.groupby('Team').filter(lambda x: len(x) >= 3)
One of the things I love most about scikit-learn is its Pipeline. I find it very neat to use, easy to understand and particularly helpful for production.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# OrdinalEncoder and TargetEncoder are assumed to come from the category_encoders package
from category_encoders import OrdinalEncoder, TargetEncoder
from lightgbm import LGBMClassifier

pipeline = Pipeline([
    ('preprocessing', ColumnTransformer(
        transformers=[
            # TF-IDF on the text column (TfidfVectorizer expects a single column of text)
            ('text', Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=['nan'])),
            ]), TEXT_COLUMNS),
            # two encodings of the categorical columns, concatenated
            ('cat', FeatureUnion([
                ('ordinal', OrdinalEncoder()),
                ('target_encoder', TargetEncoder())
            ]), CAT_COLUMNS),
            # dimensionality reduction of the numerical columns
            ('num', TruncatedSVD(n_components=100), NUM_COLUMNS)
        ],
        remainder='drop')),
    ('model', LGBMClassifier())
])
pipeline.fit(Xtrain, y_train)
Looking at the pipeline lets us understand right away what we want to do with our data. The example above can be interpreted as the schema below:
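If we are working in a notebook, recent versions of scikit-learn can also render such a diagram for us (assuming scikit-learn >= 0.23):
from sklearn import set_config
set_config(display='diagram')  # the pipeline's repr is now drawn as an interactive diagram
pipeline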
We can also pass the whole pipeline into a hyperparameter search such as RandomizedSearchCV or GridSearchCV:
params_grid = {
    'model__colsample_bytree': [0.3, 0.5, 0.7, 0.9],
    'model__n_estimators': [2000, 5000, 8000],
    'model__learning_rate': [0.01, 0.02, 0.05, 0.1, 0.2],
    'model__max_depth': [3, 5, 7],
    'preprocessing__num__n_components': [100, 50, 70],
    # the two 'select' entries assume a feature-selection step (e.g. SelectFromModel)
    # inside the text sub-pipeline, which is not shown in the pipeline above
    'preprocessing__text__select__estimator__C': [1e-2, 1e-1, 1],
    'preprocessing__text__select__max_features': [10, 20, 50, None],
    'preprocessing__text__tfidf__binary': [False, True],
    'preprocessing__text__tfidf__ngram_range': [(1, 1), (1, 2)],
    'preprocessing__text__tfidf__max_df': [0.2, 0.4, 0.6],
    'preprocessing__text__tfidf__min_df': [20, 50]
}
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(estimator=pipeline,
                            param_distributions=params_grid,
                            n_iter=100,
                            n_jobs=1,
                            cv=5,
                            verbose=5,
                            scoring='roc_auc')
search.fit(Xtrain, y_train)
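Once the search has finished, the best score, the best parameters and the refitted pipeline are available directly on the search object (standard RandomizedSearchCV attributes):
print(search.best_score_)               # best cross-validated ROC AUC
print(search.best_params_)              # winning hyperparameter combination
best_pipeline = search.best_estimator_  # pipeline refitted on the whole training set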