Merging CSV files in command line
Merging CSV files from the command line is a practical skill that can streamline your data processing tasks, especially when dealing with multiple datasets. Here’s a guide on how to do it using various command-line tools.
Using the cat
Command in Unix-based Systems
This command concatenates file1.csv
and file2.csv
into a new file called merged.csv
:
cat file1.csv file2.csv > merged.csv
If you have headers in the files and you want to remove repeated headers from subsequent files, use:
cat file1.csv > merged.csv && tail -n +2 file2.csv >> merged.csv
This command ensures that the header from file1.csv
is kept, but it skips the header from file2.csv
.
Using awk
for More Control
awk
is a powerful text-processing tool, which is handy for merging files while removing duplicate headers.
awk 'FNR==1 && NR!=1{next;}{print}' *.csv > merged.csv
This command tells awk
to skip the first row of each file except the first one processed (FNR
is the record number of the current file, NR
is the total record number). It merges all CSV files in the directory into merged.csv
.
Using csvkit
for Robust CSV Manipulations
csvkit
is a suite of utilities specifically designed for converting and manipulating CSV files.
Install csvkit
if you haven’t already:
pip install csvkit
Use csvstack
to merge files:
csvstack file1.csv file2.csv > merged.csv
csvstack
handles columns and headers intelligently, stacking rows from multiple files.
Using PowerShell in Windows
For those on Windows, PowerShell provides a method to merge CSV files similar to Unix commands.
Get-Content file1.csv, file2.csv | Select-Object -Unique > merged.csv
This command concatenates file1.csv
and file2.csv
, and redirects the output to merged.csv
.
Using command line tools to merge CSV files is efficient and can be adapted to various needs, from simple concatenations to handling large datasets with specific merging rules.