Merging CSV files in command line

Merging CSV files from the command line is a practical skill that can streamline your data processing tasks, especially when dealing with multiple datasets. Here’s a guide on how to do it using various command-line tools.

Using the cat Command in Unix-based Systems

This command concatenates file1.csv and file2.csv into a new file called merged.csv:

cat file1.csv file2.csv > merged.csv

If you have headers in the files and you want to remove repeated headers from subsequent files, use:

cat file1.csv > merged.csv && tail -n +2 file2.csv >> merged.csv

This command ensures that the header from file1.csv is kept, but it skips the header from file2.csv.

Using awk for More Control

awk is a powerful text-processing tool, which is handy for merging files while removing duplicate headers.

awk 'FNR==1 && NR!=1{next;}{print}' *.csv > merged.csv

This command tells awk to skip the first row of each file except the first one processed (FNR is the record number of the current file, NR is the total record number). It merges all CSV files in the directory into merged.csv.

Using csvkit for Robust CSV Manipulations

csvkit is a suite of utilities specifically designed for converting and manipulating CSV files.

Install csvkit if you haven’t already:

pip install csvkit

Use csvstack to merge files:

csvstack file1.csv file2.csv > merged.csv

csvstack handles columns and headers intelligently, stacking rows from multiple files.

Using PowerShell in Windows

For those on Windows, PowerShell provides a method to merge CSV files similar to Unix commands.

Get-Content file1.csv, file2.csv | Select-Object -Unique > merged.csv

This command concatenates file1.csv and file2.csv, and redirects the output to merged.csv.

Using command line tools to merge CSV files is efficient and can be adapted to various needs, from simple concatenations to handling large datasets with specific merging rules.