2023. 3. 16. 22:54ㆍit
0. Introduction
Polars is a DataFrame library for data analysis and processing. It is implemented in Rust and built on the Apache Arrow columnar memory format, and it aims for fast processing and low memory usage on large-scale data. Its API covers much of the same ground as Pandas, although it is built around expressions rather than index-based access. It can be used from both Python and Rust.
1. Comparison between Polars and Pandas
Polars is optimized for processing large-scale data and is generally faster and lighter on memory than Pandas, although the size of the gap depends on the workload. One main reason is that Polars runs operations in parallel across CPU cores by default, whereas most Pandas operations run on a single core. Polars also has native column types for nested data, such as List and Struct.
2. Using Scikit-learn with Polars
Scikit-learn estimators are built around NumPy arrays, so a Polars DataFrame is usually converted before it is passed to them. Polars makes this easy: a DataFrame or Series can be turned into a NumPy array with to_numpy(). Polars also offers statistical functions such as mean, standard deviation, and correlation, which cover simple analysis tasks, while model training is left to libraries such as Scikit-learn. This makes Polars a good fit for the large-scale data preparation that comes before the machine learning step.
3. Example of using Scikit-learn with Polars
Here's an example of using Polars and Scikit-learn together on the iris dataset.
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import polars as pl
# load iris dataset
iris = load_iris()
# convert iris dataset to pandas DataFrame
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
# convert pandas DataFrame to polars DataFrame
iris_pl = pl.from_pandas(iris_df)
# convert polars DataFrame to numpy arrays (features X, label y)
X = iris_pl.drop('target').to_numpy()
y = iris_pl['target'].to_numpy()
# split the numpy arrays into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create Gradient Boosting classifier
gb = GradientBoostingClassifier()
# train classifier
gb.fit(X_train, y_train)
# evaluate classifier
score = gb.score(X_test, y_test)
print(score)
4. Basic Syntax of Polars
4.1 Read CSV
import polars as pl
pl_df = pl.read_csv('data.csv')
4.2 Filtering Data with Polars DataFrame (where)
# select columns (keep 'city' so it can be used for grouping later)
pl_df = pl_df.select(['name', 'age', 'city'])
# filter rows
pl_df = pl_df.filter(pl.col('age') > 30)
# add a new column
pl_df = pl_df.with_columns((pl.col('age') * 2).alias('age_doubled'))
4.3 Polars DataFrame Aggregation (group by)
# calculate mean
mean_age = pl_df['age'].mean()
# calculate sum
sum_age = pl_df['age'].sum()
# calculate row count
count_rows = pl_df.height
# group by city and calculate mean age for each group
# (group_by is spelled groupby in older Polars releases)
mean_age_by_city = pl_df.group_by('city').agg(pl.col('age').mean())
4.4 Polars DataFrame Joining
4.4.1 DataFrame Example
import polars as pl
left = pl.DataFrame({
'id': [1, 2, 3],
'left_value': ['a', 'b', 'c']
})
right = pl.DataFrame({
'id': [2, 3, 4],
'right_value': ['d', 'e', 'f']
})
4.4.2 DataFrame Joins
4.4.2.1 Inner Join
# perform inner join
inner_join = left.join(right, on='id')
# print result
print(inner_join)
4.4.2.2 Left Join
# perform left join
left_join = left.join(right, on='id', how='left')
# print result
print(left_join)
4.4.2.3 Right Join
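# perform right join (needs a Polars release that supports how='right';
# on older releases, right.join(left, on='id', how='left') keeps the same rows)
right_join = left.join(right, on='id', how='right')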
# print result
print(right_join)
Polars syntax is not far from pandas, although it leans on expressions such as pl.col() rather than index-based access. For AI/ML work, the extra conversion to NumPy before calling a library like Scikit-learn can be a little inconvenient. Even so, Polars has clear advantages when analyzing large datasets, and it is worth considering as a way around the memory issues in pandas.