One of the most important challenges when working with tabular datasets is preprocessing the data before feeding it into a machine learning model, because the quality of the data and the useful information that can be derived from it directly affect the model's ability to learn. In real-world applications, raw datasets tend to be messy, and cleaning them takes some skill.

This blog will cover some tricks that I found quite useful when dealing with messy data.

Pandas tricks

Most of the time when working with tabular datasets, we import the data into a pandas DataFrame. Knowing how to handle pandas well saves us a lot of computation time while keeping the code readable and versatile.

Pandas Profiling for data exploration

Pandas profiling is an open source Python module with which we can quickly do an exploratory data analysis with just a few lines of code.

from pandas_profiling import ProfileReport

profile = ProfileReport(dataframe)
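
Depending on the pandas-profiling version, the report can then be rendered inside a notebook or exported to a standalone HTML file; a minimal sketch (the file name is just an example):

# Render the report inline in a Jupyter notebook
profile.to_notebook_iframe()

# Or export it to a shareable HTML file
profile.to_file('report.html')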

Configure Options & Settings

We often have to work with a DataFrame containing many columns, which get truncated when we display it. If we want to see all the columns of the DataFrame, we can configure the display options at interpreter startup.

import pandas as pd

pd.options.display.max_columns = 500

# or

pd.set_option('display.max_columns', 500)
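
If we only need the full view occasionally, pd.option_context overrides the settings temporarily inside a with block instead of changing them globally; a small sketch (df stands for any DataFrame):

# Show all columns only for this one display
with pd.option_context('display.max_columns', None):
    print(df)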

Split a column into multiple columns

Some DataFrame columns contain concatenated data that should be split into multiple columns.

df = pd.DataFrame({'Name_Age': ['Smith_32', 'Nadal', 
                           'Federer_36']})
df
Name_Age
0 Smith_32
1 Nadal
2 Federer_36
df['Name_Age'].str.split('_', expand=True)
0 1
0 Smith 32
1 Nadal None
2 Federer 36

If we want to keep only the second part of each split (here, the age), we can index into the result:

df['Name_Age'].str.split('_').str[1]
0     32
1    NaN
2     36
Name: Name_Age, dtype: object
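
If we want to keep both parts as named columns of the original DataFrame, we can assign the expanded split back in one step; a minimal sketch (the column names Name and Age are just a choice for this example):

# Attach the two split parts as new columns
df[['Name', 'Age']] = df['Name_Age'].str.split('_', expand=True)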

Extract dummy variables from string columns

import numpy as np

df = pd.Series(['Sector 1;Sector 2', 'Sector 1;Sector 3', np.nan, 'Sector 2;Sector 4'], dtype=str)
df
0    Sector 1;Sector 2
1    Sector 1;Sector 3
2                  NaN
3    Sector 2;Sector 4
dtype: object
df.str.get_dummies(';')
Sector 1 Sector 2 Sector 3 Sector 4
0 1 1 0 0
1 1 0 1 0
2 0 0 0 0
3 0 1 0 1

Regular expressions with text

df = pd.DataFrame({'Location' : [
    'Washington, D.C. 20003',
    'Brooklyn, NY 11211-1755',
    'Omaha, NE 68154',
    'Pittsburgh, PA 15211'
    ]})
df
Location
0 Washington, D.C. 20003
1 Brooklyn, NY 11211-1755
2 Omaha, NE 68154
3 Pittsburgh, PA 15211

If we want to separate the three city/state/ZIP components neatly into DataFrame fields, we can pass a regex with named groups to .str.extract:

regex = (r'(?P<city>[A-Za-z ]+), '      # One or more letters or spaces
         r'(?P<state>[A-Z]{2}) '        # Two capital letters
         r'(?P<zip>\d{5}(?:-\d{4})?)')  # 5 digits, with an optional 4-digit extension
df['Location'].str.replace('.', '').str.extract(regex)
city state zip
0 Washington DC 20003
1 Brooklyn NY 11211-1755
2 Omaha NE 68154
3 Pittsburgh PA 15211

Note: .str is called the accessor for string (object) data. It maps to the StringMethods class, which contains many methods such as cat, split, rsplit, replace, extract...
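
For instance, a couple of these accessor methods applied to the Location column above (just an illustrative sketch, output omitted):

# Flag the rows whose state code is NY
df['Location'].str.contains(', NY ')

# Concatenate all locations into a single string
df['Location'].str.cat(sep=' | ')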

Enhance performance with Cython

We often apply a lambda function to each row of a DataFrame. Sometimes this is not efficient in terms of computation time; we can boost the performance by using the Cython extension.

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': np.random.randn(1000),
                   'N': np.random.randint(100, 1000, (1000)),
                   'x': 'x'})
def f(x):
    return x * (x - 1)
def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
167 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now we will compile the same functions with the Cython extension.

%load_ext Cython
%%cython
def f(x):
    return x * (x - 1)
def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
%timeit df.apply(lambda x: integrate_f(x['a'], x['b'], x['N']), axis=1)
120 ms ± 777 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

By simply compiling the functions with the Cython extension, we already save a noticeable amount of computation time.
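
If we go one step further and add static type declarations inside the Cython cell, in the spirit of the pandas performance guide, the gain is usually much larger than the modest one above; a sketch (timings will vary by machine):

%%cython
# Typed version: declare C types for the arguments and the loop variables
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s = 0, dx
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
%timeit df.apply(lambda x: integrate_f_typed(x['a'], x['b'], x['N']), axis=1)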

Groupby

Groupby is one of the most powerful pandas methods: it allows us to split the data into groups based on some criteria, compute aggregate statistics on each group, apply transformations such as standardizing the data within a group, and much more.

data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(data)
df
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
df.groupby('Year')['Points'].agg(['mean', 'sum', 'min', 'max', 'std', 'var', 'count'])
mean sum min max std var count
Year
2014 795.25 3181 701 876 87.439026 7645.583333 4
2015 769.50 3078 673 812 65.035888 4229.666667 4
2016 725.00 1450 694 756 43.840620 1922.000000 2
2017 739.00 1478 690 788 69.296465 4802.000000 2

We can also aggregate using NumPy functions such as np.size, np.mean, np.max...

df.groupby('Team').agg(np.size)
Rank Year Points
Team
Devils 2 2 2
Kings 3 3 3
Riders 4 4 4
Royals 2 2 2
kings 1 1 1
df.groupby('Team').transform(lambda x: (x - x.mean()) / x.std()*10)
Rank Year Points
0 -15.000000 -11.618950 12.843272
1 5.000000 -3.872983 3.020286
2 -7.071068 -7.071068 7.071068
3 7.071068 7.071068 -7.071068
4 11.547005 -10.910895 -8.608621
5 NaN NaN NaN
6 -5.773503 2.182179 -2.360428
7 -5.773503 8.728716 10.969049
8 5.000000 3.872983 -7.705963
9 7.071068 -7.071068 -7.071068
10 -7.071068 7.071068 7.071068
11 5.000000 11.618950 -8.157595
df.groupby('Team').filter(lambda x: len(x) >= 3)
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
4 Kings 3 2014 741
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
11 Riders 2 2017 690
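
Another handy variant, available since pandas 0.25, is named aggregation, which lets us name the output columns directly; a small sketch on the same df:

# Named aggregation: output column name = (input column, aggregation function)
df.groupby('Team').agg(mean_points=('Points', 'mean'),
                       best_rank=('Rank', 'min'),
                       n_seasons=('Year', 'count'))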

Sklearn pipeline

One of the things I love most about sklearn is its Pipeline class. I find it very neat to use, easy to understand, and particularly helpful for production.

Note: Definition of the Pipeline class according to scikit-learn:
Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must implement fit and transform methods and the final estimator only needs to implement fit.
# Imports for the pipeline below (assumption: OrdinalEncoder and TargetEncoder come
# from the category_encoders package; TEXT_COLUMNS, CAT_COLUMNS and NUM_COLUMNS are
# column selections defined elsewhere, with TEXT_COLUMNS being a single column name
# since TfidfVectorizer expects one column of raw text)
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from category_encoders import OrdinalEncoder, TargetEncoder
from lightgbm import LGBMClassifier

pipeline = Pipeline([
    ('preprocessing', ColumnTransformer(
        transformers=[
            ('text', Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=['nan'])),
            ]), TEXT_COLUMNS),
            ('cat', FeatureUnion([
                ('ordinal', OrdinalEncoder()),
                ('target_encoder', TargetEncoder())
            ]), CAT_COLUMNS),
            ('num', TruncatedSVD(n_components=100), NUM_COLUMNS)
        ],
        remainder='drop')),
    ('model', LGBMClassifier())
])

pipeline.fit(Xtrain, y_train)
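
Because the preprocessing lives inside the pipeline, inference on new data is a single call; a minimal sketch (X_test here is hypothetical hold-out data with the same columns as Xtrain):

# The fitted pipeline applies the same transformers before the model
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]  # probability of the positive class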

Looking at the pipeline lets us understand right away what we want to do with our data. The example above can be read as the schema below:

[Pipeline schema: the text, categorical and numeric branches of the ColumnTransformer feed into the LGBM classifier]

We can also pass the whole pipeline into a hyperparameter search such as RandomizedSearchCV or GridSearchCV.

params_grid = {
    'model__colsample_bytree': [0.3, 0.5, 0.7, 0.9],
    'model__n_estimators' : [2000, 5000, 8000],
    'model__learning_rate': [0.01, 0.02, 0.05, 0.1, 0.2],
    'model__max_depth' : [3, 5, 7],
    'preprocessing__num__n_components' : [100, 50, 70],
    # Note: the 'select' parameters assume an extra feature-selection step
    # (e.g. SelectFromModel) in the text sub-pipeline, not shown above
    'preprocessing__text__select__estimator__C' : [1e-2, 1e-1, 1],
    'preprocessing__text__select__max_features' : [10, 20, 50, None],
    'preprocessing__text__tfidf__binary' : [False, True],
    'preprocessing__text__tfidf__ngram_range' : [(1, 1), (1, 2)],
    'preprocessing__text__tfidf__max_df': [0.2, 0.4, 0.6],
    'preprocessing__text__tfidf__min_df': [20, 50]
}
search = RandomizedSearchCV(estimator=pipeline,
                            param_distributions=params_grid,
                            n_iter=100,
                            n_jobs=1,
                            cv = 5,
                            verbose=5,
                            scoring='roc_auc')

search.fit(Xtrain, y_train)
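
After fitting, the search object exposes the best hyperparameters and their cross-validated score, and best_estimator_ is the pipeline refitted on the full training data with those settings; a quick sketch:

# Inspect the outcome of the search
print(search.best_params_)
print(search.best_score_)                # mean cross-validated roc_auc
best_pipeline = search.best_estimator_   # already refitted on all of Xtrain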
