Introduction
Handling large datasets is a common challenge in data analysis and processing tasks. In this blog post, we compare the performance of five frameworks for concatenating CSV files into dataframes and analyze their CPU and memory consumption. The frameworks we evaluated are go-dataframe (Go), gota (Go), pandas (Python), petl (Python), and pyspark (Python).
Task Overview
The task at hand involves three primary steps (a short pandas sketch of the workflow follows the list):
- Reading multiple CSV files and converting them into dataframes
- Concatenating these dataframes into a single dataframe
- Writing the consolidated data back to an output file
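To make the workload concrete, here is a minimal sketch of those three steps using pandas, one of the evaluated frameworks. The input pattern and output path are illustrative only; each framework in the benchmark performed the same three steps with its own API.

```python
# Minimal pandas sketch of the benchmarked task.
# "data/*.csv" and "combined.csv" are placeholder paths, not taken from the benchmark.
import glob
import pandas as pd

# Step 1: read every CSV file into its own dataframe.
csv_files = sorted(glob.glob("data/*.csv"))
frames = [pd.read_csv(path) for path in csv_files]

# Step 2: concatenate the dataframes into a single dataframe.
combined = pd.concat(frames, ignore_index=True)

# Step 3: write the consolidated data back to an output file.
combined.to_csv("combined.csv", index=False)
```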
To compare the performance of these frameworks, we used two benchmarking tools: Hyperfine, which measured the execution time for each framework, and CMDBench, which provided insight into CPU and memory consumption during task execution.
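As an illustration, a single scenario could be benchmarked with a Hyperfine invocation like the one below. The script names are hypothetical placeholders (the post does not list the exact commands); `--export-markdown` produces tables with the same Mean/Min/Max/Relative columns shown in the results that follow.

```sh
# Hypothetical Hyperfine run: 10 timed runs per command, results exported as a Markdown table.
hyperfine --runs 10 --export-markdown results.md \
  'python pandas_concat.py' \
  'go run go_dataframe_concat.go'
```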
CSV File Description
Before diving into the comparison, let's take a brief look at the structure of the CSV files we used for our analysis. Each record contains the columns first_name, last_name, email, ssn, job, country, phone_number, user_name, zipcode, invalid_ssn, credit_card_number, credit_card_provider, credit_card_security_code, and bban.
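These column names line up with providers from the Faker library, so the test data was presumably synthetic. As a hedged illustration only (the post does not say how the files were generated), a CSV with this schema could be produced like this:

```python
# Hypothetical generator for the test CSVs, assuming the Faker library.
# Each column name happens to match a Faker provider method.
import csv
from faker import Faker

fake = Faker("en_US")
COLUMNS = [
    "first_name", "last_name", "email", "ssn", "job", "country",
    "phone_number", "user_name", "zipcode", "invalid_ssn",
    "credit_card_number", "credit_card_provider",
    "credit_card_security_code", "bban",
]

# "people_1.csv" and the record count are placeholders.
with open("people_1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    for _ in range(100_000):
        writer.writerow([getattr(fake, name)() for name in COLUMNS])
```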
CPU Consumption Analysis
We evaluated the CPU consumption of each framework (measured via total execution time with Hyperfine) across scenarios with a varying number of CSV files: a small-file scenario with 10 records per file, and larger scenarios with 100,000 records per file. Below are the summarized results:
2 Small Files (10 runs, 10 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 1.148 ± 0.081 | 1.070 | 1.288 | 7.74 ± 0.62 |
| gota | 0.752 ± 0.096 | 0.690 | 1.021 | 5.07 ± 0.68 |
| pandas | 0.485 ± 0.024 | 0.456 | 0.515 | 3.27 ± 0.20 |
| petl | 0.148 ± 0.006 | 0.137 | 0.156 | 1.00 |
| pyspark | 13.741 ± 2.424 | 11.671 | 20.077 | 92.61 ± 16.70 |
2 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 1.689 ± 0.089 | 1.579 | 1.874 | 1.00 |
| gota | 2.282 ± 0.130 | 2.124 | 2.507 | 1.35 ± 0.10 |
| pandas | 2.816 ± 0.100 | 2.693 | 2.997 | 1.67 ± 0.11 |
| petl | 2.294 ± 0.049 | 2.257 | 2.404 | 1.36 ± 0.08 |
| pyspark | 14.142 ± 0.643 | 13.850 | 15.927 | 8.37 ± 0.58 |
5 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 2.583 ± 0.204 | 2.349 | 3.086 | 1.00 |
| gota | 4.884 ± 0.316 | 4.462 | 5.374 | 1.89 ± 0.19 |
| pandas | 6.527 ± 0.349 | 6.209 | 7.235 | 2.53 ± 0.24 |
| petl | 8.117 ± 0.616 | 7.290 | 9.225 | 3.14 ± 0.34 |
| pyspark | 21.108 ± 2.345 | 18.462 | 26.281 | 8.17 ± 1.11 |
10 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 3.847 ± 0.253 | 3.585 | 4.280 | 1.00 |
| gota | 9.127 ± 0.231 | 8.828 | 9.450 | 2.37 ± 0.17 |
| pandas | 14.185 ± 1.546 | 12.161 | 15.737 | 3.69 ± 0.47 |
| petl | 24.496 ± 3.328 | 21.238 | 30.702 | 6.37 ± 0.96 |
| pyspark | 23.638 ± 0.915 | 22.122 | 24.999 | 6.14 ± 0.47 |
25 Files (10 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 8.212 ± 0.559 | 7.521 | 9.105 | 1.00 |
| gota | 32.663 ± 1.640 | 30.131 | 35.154 | 3.98 ± 0.34 |
| pandas | 30.790 ± 0.572 | 30.060 | 31.684 | 3.75 ± 0.26 |
| petl | 107.274 ± 3.292 | 100.848 | 111.170 | 13.06 ± 0.98 |
| pyspark | 35.659 ± 0.719 | 34.524 | 37.100 | 4.34 ± 0.31 |
50 Files (5 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 16.669 ± 1.990 | 14.858 | 19.575 | 1.00 |
| gota | 91.149 ± 4.158 | 88.691 | 98.500 | 5.47 ± 0.70 |
| pandas | 60.217 ± 2.036 | 58.558 | 62.916 | 3.61 ± 0.45 |
| pyspark | 65.071 ± 1.078 | 64.086 | 66.612 | 3.90 ± 0.47 |
100 Files (5 runs, 100,000 records each)

| Framework | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| go-dataframe | 26.075 ± 0.522 | 25.426 | 26.518 | 1.00 |
| pandas | 116.856 ± 0.334 | 116.315 | 117.220 | 4.48 ± 0.09 |
Memory Consumption Analysis
For the memory consumption analysis, we captured memory-usage graphs with CMDBench for each scenario:
2 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
5 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
10 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
25 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
50 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe, gota, pandas, petl, and pyspark]
100 Files (100,000 records each)

[CMDBench memory-usage graphs for go-dataframe and pandas]
Conclusion
From our analysis, we can draw several conclusions:
- For small datasets, petl outperforms all other frameworks in terms of CPU consumption, closely followed by pandas.
- As the dataset size increases, pyspark consistently shows high CPU consumption (and petl eventually overtakes it as the slowest option in the 10- and 25-file scenarios), making pyspark less suitable for large datasets in a single-node setup.
- Overall, go-dataframe outperforms all other frameworks in terms of both CPU and memory consumption.
Choosing the right framework depends on the specific requirements of your task, including dataset size, available resources, and the desired level of parallelization. We hope this analysis helps you make an informed decision when working on CSV concatenation and dataframe processing tasks. Happy data processing!