Polars Basic Syntax and Data Analysis Sample (with Scikit-learn)

2023. 3. 16. 22:54ㆍit

0. Introduction

Polars is a DataFrame library for data analysis and processing. It is implemented in Rust on top of the Apache Arrow columnar memory format, which gives it high processing performance and low memory usage on large-scale data. Its API covers much of the same ground as Pandas, although it is built around expressions rather than index-based operations. Polars can be used from both Python and Rust.

1. Comparison between Polars and Pandas

Polars is optimized for processing large-scale data, and on such workloads it is typically faster and uses less memory than Pandas. It runs work in parallel across CPU cores by default, whereas most Pandas operations are single-threaded. Polars also has built-in nested data types such as List and Struct.

2. Using Scikit-learn with Polars

Scikit-learn estimators are built around NumPy arrays rather than Polars DataFrames. Since a Polars DataFrame can be converted to a NumPy array with to_numpy(), it can still be used with machine learning libraries such as Scikit-learn. Polars itself offers statistical functions (mean, standard deviation, correlation and so on) for simple analysis, while model training is left to the machine learning library. This makes Polars a good fit for the data preparation step in front of Scikit-learn when the data is large.

3. Example of using Scikit-learn with Polars

Here's an example of using Polars and Scikit-learn together for the iris dataset.

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

import polars as pl

# load iris dataset
iris = load_iris()

# convert iris dataset to pandas DataFrame (features plus a 'target' column)
iris_df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target'],
)

# convert pandas DataFrame to polars DataFrame
iris_pl = pl.from_pandas(iris_df)

# convert polars DataFrame to numpy arrays (features and target)
X = iris_pl.drop('target').to_numpy()
y = iris_pl['target'].to_numpy()

# split data into training and testing sets
# (splitting the numpy arrays works regardless of the scikit-learn version)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create Gradient Boosting classifier
gb = GradientBoostingClassifier()

# train classifier
gb.fit(X_train, y_train)

# evaluate classifier
score = gb.score(X_test, y_test)
print(score)

4. Basic Syntax of Polars
4.1 Read CSV

import polars as pl

pl_df = pl.read_csv('data.csv')

4.2 Filtering Data with Polars DataFrame (where)

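# select columns ('city' is kept because the group by in 4.3 needs it)
pl_df = pl_df.select(['name', 'age', 'city'])

# filter rows
pl_df = pl_df.filter(pl.col('age') > 30)

# add a new column (the new column name is set with alias)
pl_df = pl_df.with_columns((pl.col('age') * 2).alias('age_doubled'))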

4.3 Polars DataFrame Aggregation (group by)

# calculate mean
mean_age = pl_df['age'].mean()

# calculate sum
sum_age = pl_df['age'].sum()

# calculate row count
count_rows = pl_df.height

# group by city (pl_df must still contain a 'city' column)
grouped_df = pl_df.group_by('city')

# calculate mean age for each group
mean_age_by_city = grouped_df.agg(pl.col('age').mean())

4.4 Polars DataFrame Joining

4.4.1 DataFrame Example

import polars as pl

left = pl.DataFrame({
    'id': [1, 2, 3],
    'left_value': ['a', 'b', 'c']
})

right = pl.DataFrame({
    'id': [2, 3, 4],
    'right_value': ['d', 'e', 'f']
})

4.4.2 DataFrame Joins

4.4.2.1 Inner Join

# perform inner join
inner_join = left.join(right, on='id')

# print result
print(inner_join)

4.4.2.2 Left Join

# perform left join
left_join = left.join(right, on='id', how='left')

# print result
print(left_join)

4.4.2.3 Right Join

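# perform right join
# (how='right' needs a recent Polars release; on older versions,
#  right.join(left, on='id', how='left') keeps the same rows)
right_join = left.join(right, on='id', how='right')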

# print result
print(right_join)

Polars syntax is not very different from pandas, so the basics are quick to pick up. In AI/ML work, having to convert to NumPy before handing data to a library such as Scikit-learn can be a bit inconvenient. Even so, Polars has clear advantages for analyzing large datasets, and it is worth considering as an alternative when pandas runs into memory issues.