Polars Basic Syntax and Data Analysis Sample (with Scikit-learn)

2023. 3. 16. 22:54


0. Introduction

Polars is a data analysis and processing library implemented in Rust that offers fast processing and low memory usage on large-scale data. It provides a DataFrame API similar to Pandas, but its core is written in Rust on top of the Apache Arrow columnar memory format, which is where much of the performance comes from. Polars is available from both Python and Rust, so the same engine can be used from either language.

1. Comparison between Polars and Pandas

Polars is optimized for processing large-scale data, and on such workloads it is generally faster and uses less memory than Pandas. It has a strict type system that includes nested types such as List and Struct, and it runs queries in parallel across CPU cores, whereas most Pandas operations run on a single thread. This parallelism is a large part of why it can process large-scale data faster.


2. Using Scikit-learn with Polars

Scikit-learn estimators are built around NumPy arrays rather than Polars DataFrames. However, a Polars DataFrame can be converted to a NumPy array with to_numpy(), so it can be used with machine learning libraries such as Scikit-learn. Polars also offers statistical functions (for example mean, standard deviation, and correlation), which are enough for some simple analysis tasks. This makes Polars a good fit for the large-scale data processing step in front of machine learning libraries such as Scikit-learn.

3. Example of using Scikit-learn with Polars

Here's an example of using Polars and Scikit-learn together on the iris dataset.

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

import polars as pl

# load iris dataset
iris = load_iris()

# convert iris dataset to pandas DataFrame
iris_df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target'],
)

# convert pandas DataFrame to polars DataFrame
iris_pl = pl.from_pandas(iris_df)

# convert polars DataFrame to numpy arrays (features and target)
X = iris_pl.select(pl.exclude('target')).to_numpy()
y = iris_pl['target'].to_numpy()

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create Gradient Boosting classifier
gb = GradientBoostingClassifier()

# train classifier
gb.fit(X_train, y_train)

# evaluate classifier (mean accuracy on the test set)
score = gb.score(X_test, y_test)
print(f'accuracy: {score:.3f}')

4. Basic Syntax of Polars
4.1 Read CSV

import polars as pl

pl_df = pl.read_csv('data.csv')


4.2 Filtering Data with Polars DataFrame (where)

# select columns
pl_df = pl_df.select(['name', 'age'])

# filter rows
pl_df = pl_df.filter(pl.col('age') > 30)

# add a new column
pl_df = pl_df.with_columns((pl.col('age') * 2).alias('age_doubled'))

4.3 Polars DataFrame Aggregation (group by)

# calculate mean
mean_age = pl_df['age'].mean()

# calculate sum
sum_age = pl_df['age'].sum()

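# calculate row count
count_rows = pl_df.height

# group by city (assumes pl_df still has a 'city' column;
# older Polars versions spell this method `groupby`)
grouped_df = pl_df.group_by('city')

# calculate mean age for each group
mean_age_by_city = grouped_df.agg(pl.col('age').mean())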

4.4 Polars DataFrame Joining

4.4.1 DataFrame Example

import polars as pl

left = pl.DataFrame({
    'id': [1, 2, 3],
    'left_value': ['a', 'b', 'c']
})

right = pl.DataFrame({
    'id': [2, 3, 4],
    'right_value': ['d', 'e', 'f']
})

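4.4.2 DataFrame Joins

Inner Join

# perform inner join (the default when `how` is not given)
inner_join = left.join(right, on='id')

# print result
print(inner_join)

Left Join

# perform left join
left_join = left.join(right, on='id', how='left')

# print result
print(left_join)

Right Join

# perform right join
# (how='right' is only available in newer Polars releases;
# on older versions, right.join(left, on='id', how='left') gives the same rows)
right_join = left.join(right, on='id', how='right')

# print result
print(right_join)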

The syntax of Polars is not significantly different from pandas. For AI/ML work, the extra step of converting to NumPy can be a bit inconvenient. Nevertheless, Polars has clear advantages for analyzing large datasets, and it is a good alternative to consider when pandas runs into memory issues.
