Introduction
Handling large datasets is a common challenge in data analysis and processing tasks. In this blog post, we compare the performance of five frameworks for concatenating CSV files into dataframes and analyze their CPU and memory consumption. The frameworks we evaluated are go-dataframe (Go), gota (Go), pandas (Python), petl (Python), and pyspark (Python).
Task Overview
The task at hand involves three primary steps (a short pandas sketch of the workflow follows the list):
- Reading multiple CSV files and converting them into dataframes
- Concatenating these dataframes into a single dataframe
- Writing the consolidated data back to an output file
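To make the workload concrete, here is a minimal sketch of those three steps using pandas, one of the evaluated frameworks. The input pattern and output path are illustrative only; each framework in the benchmark performed the same three steps with its own API.

```python
# Minimal pandas sketch of the benchmarked task.
# "data/*.csv" and "combined.csv" are placeholder paths, not taken from the benchmark.
import glob
import pandas as pd

# Step 1: read every CSV file into its own dataframe.
csv_files = sorted(glob.glob("data/*.csv"))
frames = [pd.read_csv(path) for path in csv_files]

# Step 2: concatenate the dataframes into a single dataframe.
combined = pd.concat(frames, ignore_index=True)

# Step 3: write the consolidated data back to an output file.
combined.to_csv("combined.csv", index=False)
```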
To compare the performance of these frameworks, we used two benchmarking tools: Hyperfine, which measured the execution time for each framework, and CMDBench, which provided insight into CPU and memory consumption during task execution.
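As an illustration, a single scenario could be benchmarked with a Hyperfine invocation like the one below. The script names are hypothetical placeholders (the post does not list the exact commands); `--export-markdown` produces tables with the same Mean/Min/Max/Relative columns shown in the results that follow.

```sh
# Hypothetical Hyperfine run: 10 timed runs per command, results exported as a Markdown table.
hyperfine --runs 10 --export-markdown results.md \
  'python pandas_concat.py' \
  'go run go_dataframe_concat.go'
```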
CSV File Description
Before diving into the comparison, let's take a brief look at the structure of the CSV files we used for our analysis. Each record contains the columns first_name, last_name, email, ssn, job, country, phone_number, user_name, zipcode, invalid_ssn, credit_card_number, credit_card_provider, credit_card_security_code, and bban.
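These column names line up with providers from the Faker library, so the test data was presumably synthetic. As a hedged illustration only (the post does not say how the files were generated), a CSV with this schema could be produced like this:

```python
# Hypothetical generator for the test CSVs, assuming the Faker library.
# Each column name happens to match a Faker provider method.
import csv
from faker import Faker

fake = Faker("en_US")
COLUMNS = [
    "first_name", "last_name", "email", "ssn", "job", "country",
    "phone_number", "user_name", "zipcode", "invalid_ssn",
    "credit_card_number", "credit_card_provider",
    "credit_card_security_code", "bban",
]

# "people_1.csv" and the record count are placeholders.
with open("people_1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    for _ in range(100_000):
        writer.writerow([getattr(fake, name)() for name in COLUMNS])
```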
CPU Consumption Analysis
We evaluated the CPU consumption of each framework (measured via total execution time with Hyperfine) across scenarios with a varying number of CSV files: a small-file scenario with 10 records per file, and larger scenarios with 100,000 records per file. Below are the summarized results:
2 Small Files (10 runs, 10 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 1.148 ± 0.081 | 1.070 | 1.288 | 7.74 ± 0.62 |
| gota | 0.752 ± 0.096 | 0.690 | 1.021 | 5.07 ± 0.68 |
| pandas | 0.485 ± 0.024 | 0.456 | 0.515 | 3.27 ± 0.20 |
| petl | 0.148 ± 0.006 | 0.137 | 0.156 | 1.00 |
| pyspark | 13.741 ± 2.424 | 11.671 | 20.077 | 92.61 ± 16.70 |
2 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 1.689 ± 0.089 | 1.579 | 1.874 | 1.00 |
| gota | 2.282 ± 0.130 | 2.124 | 2.507 | 1.35 ± 0.10 |
| pandas | 2.816 ± 0.100 | 2.693 | 2.997 | 1.67 ± 0.11 |
| petl | 2.294 ± 0.049 | 2.257 | 2.404 | 1.36 ± 0.08 |
| pyspark | 14.142 ± 0.643 | 13.850 | 15.927 | 8.37 ± 0.58 |
5 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 2.583 ± 0.204 | 2.349 | 3.086 | 1.00 |
| gota | 4.884 ± 0.316 | 4.462 | 5.374 | 1.89 ± 0.19 |
| pandas | 6.527 ± 0.349 | 6.209 | 7.235 | 2.53 ± 0.24 |
| petl | 8.117 ± 0.616 | 7.290 | 9.225 | 3.14 ± 0.34 |
| pyspark | 21.108 ± 2.345 | 18.462 | 26.281 | 8.17 ± 1.11 |
10 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 3.847 ± 0.253 | 3.585 | 4.280 | 1.00 |
| gota | 9.127 ± 0.231 | 8.828 | 9.450 | 2.37 ± 0.17 |
| pandas | 14.185 ± 1.546 | 12.161 | 15.737 | 3.69 ± 0.47 |
| petl | 24.496 ± 3.328 | 21.238 | 30.702 | 6.37 ± 0.96 |
| pyspark | 23.638 ± 0.915 | 22.122 | 24.999 | 6.14 ± 0.47 |
25 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 8.212 ± 0.559 | 7.521 | 9.105 | 1.00 |
| gota | 32.663 ± 1.640 | 30.131 | 35.154 | 3.98 ± 0.34 |
| pandas | 30.790 ± 0.572 | 30.060 | 31.684 | 3.75 ± 0.26 |
| petl | 107.274 ± 3.292 | 100.848 | 111.170 | 13.06 ± 0.98 |
| pyspark | 35.659 ± 0.719 | 34.524 | 37.100 | 4.34 ± 0.31 |
50 Files (5 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 16.669 ± 1.990 | 14.858 | 19.575 | 1.00 |
| gota | 91.149 ± 4.158 | 88.691 | 98.500 | 5.47 ± 0.70 |
| pandas | 60.217 ± 2.036 | 58.558 | 62.916 | 3.61 ± 0.45 |
| pyspark | 65.071 ± 1.078 | 64.086 | 66.612 | 3.90 ± 0.47 |
100 Files (5 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 26.075 ± 0.522 | 25.426 | 26.518 | 1.00 |
| pandas | 116.856 ± 0.334 | 116.315 | 117.220 | 4.48 ± 0.09 |
Memory Consumption Analysis
For the memory consumption analysis, we captured memory-usage graphs with CMDBench for each scenario:
2 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
5 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
10 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
25 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
50 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
100 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe and pandas]
Conclusion
From our analysis, we can draw several conclusions:
- For small datasets, petl outperforms all other frameworks in terms of CPU consumption, closely followed by pandas.
- As the dataset size increases, pyspark consistently shows high CPU consumption (and petl eventually overtakes it as the slowest option in the 10- and 25-file scenarios), making pyspark less suitable for large datasets in a single-node setup.
- Overall, go-dataframe outperforms all other frameworks in terms of both CPU and memory consumption.
Choosing the right framework depends on the specific requirements of your task, including dataset size, available resources, and the desired level of parallelization. We hope this analysis helps you make an informed decision when working on CSV concatenation and dataframe processing tasks. Happy data processing!