Bash (Scripting) - Remove duplicate lines in a file

by Jeremy Canfield | Updated: July 22 2024 | Bash (Scripting) articles

Let's say you have a file named foo.txt that contains the following.

Line 1
Hello
Line 2
Hello
Line 3

awk can be used to print only the lines that have already been "seen" earlier in a file. Here, seen is an awk array indexed by the entire line ($0) that counts how many times each line has occurred so far. In this example, awk prints nothing the first time it parses the line containing "Hello". However, awk prints "Hello" when it parses the second occurrence of that line, because an identical line has already been "seen".

~]$ awk 'seen[$0]++' foo.txt
Hello
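
To make the one-liner less cryptic, here is a longer equivalent. This is only a sketch: the temporary directory and the variable names workdir and duplicates are assumptions made for illustration, and the sample file is recreated from the lines shown above.

```shell
# Recreate the sample file in a throwaway directory (hypothetical location).
workdir=$(mktemp -d)
printf '%s\n' 'Line 1' 'Hello' 'Line 2' 'Hello' 'Line 3' > "$workdir/foo.txt"

# Long form of: awk 'seen[$0]++'
# seen[$0] is the number of times the current line ($0) has appeared so far.
# It is 0 on the first occurrence, so nothing is printed; on every repeat
# it is greater than 0, so the line is printed. The count is then bumped.
duplicates=$(awk '{ if (seen[$0] > 0) print $0; seen[$0]++ }' "$workdir/foo.txt")
echo "$duplicates"
```

Note that a line is printed once per repeat, so a line that occurs three times in the file would be printed twice.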

Conversely, adding an exclamation point (!) negates the test, so awk prints only the lines that have not yet been "seen". In this example, awk does not print the second occurrence of the line containing "Hello" because an identical line has already been "seen".

~]$ awk '!seen[$0]++' foo.txt 
Line 1
Hello
Line 2
Line 3
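
The negated form can be spelled out the same way. Again, this is a sketch under assumptions: workdir and unique are hypothetical names, and the sample file is rebuilt in a temporary directory.

```shell
# Recreate the sample file in a throwaway directory (hypothetical location).
workdir=$(mktemp -d)
printf '%s\n' 'Line 1' 'Hello' 'Line 2' 'Hello' 'Line 3' > "$workdir/foo.txt"

# Long form of: awk '!seen[$0]++'
# Print a line only while its count is still 0 (first occurrence),
# then bump the count so later copies of the same line are skipped.
unique=$(awk '{ if (seen[$0] == 0) print $0; seen[$0]++ }' "$workdir/foo.txt")
echo "$unique"
```

Because awk processes the file top to bottom and keeps the first copy of each line, the original order of the remaining lines is unchanged.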

Neither of the prior awk commands makes any changes to the original file. They only print to stdout the lines that have or have not been "seen".

You could redirect the output to a different file.

awk '!seen[$0]++' /tmp/foo.txt > /tmp/bar.txt
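
The output file needs to be different from the input file. The shell truncates the target of > before awk starts reading, so redirecting back onto the same file would leave it empty. One common workaround, sketched here with hypothetical paths and variable names, is to write to a temporary file and then move it over the original.

```shell
# Sample file in a throwaway directory (hypothetical location).
workdir=$(mktemp -d)
file="$workdir/foo.txt"
printf '%s\n' 'Line 1' 'Hello' 'Line 2' 'Hello' 'Line 3' > "$file"

# Write the de-duplicated lines to a temporary file next to the original.
tmp=$(mktemp "$workdir/foo.txt.XXXXXX")

# Replace the original only if awk succeeded; otherwise discard the temp file.
if awk '!seen[$0]++' "$file" > "$tmp"; then
    mv "$tmp" "$file"
else
    rm -f "$tmp"
fi
cat "$file"
```

Keep in mind that mktemp creates the file with restrictive permissions, so the replaced file may not have the same permissions as the original.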

Or, you could use gawk (version 4.1.0 or later), which has the -i inplace option to update the original file.

gawk -i inplace '!seen[$0]++' /tmp/foo.txt
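
Since gawk is not installed on every system, a script may want a fallback. The following is a rough sketch, not a definitive approach: it tries gawk first and, if gawk is missing or the command fails, falls back to plain awk with a temporary file. The paths and variable names are assumptions.

```shell
# Sample file in a throwaway directory (hypothetical location).
workdir=$(mktemp -d)
file="$workdir/foo.txt"
printf '%s\n' 'Line 1' 'Hello' 'Line 2' 'Hello' 'Line 3' > "$file"

if command -v gawk >/dev/null 2>&1 &&
   gawk -i inplace '!seen[$0]++' "$file" 2>/dev/null; then
    # gawk was found and the in-place edit succeeded.
    echo "de-duplicated in place with gawk"
else
    # gawk is missing or the in-place edit failed: use a temporary file.
    tmp=$(mktemp "$workdir/foo.txt.XXXXXX")
    awk '!seen[$0]++' "$file" > "$tmp" && mv "$tmp" "$file"
    echo "de-duplicated with awk and a temporary file"
fi
cat "$file"
```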

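I also had a coworker use the following variation. Like the prior awk and gawk commands, it preserves the order of the lines in the file, but it is more selective about what it removes. Blank lines are never treated as duplicates, because the NF condition is false for them and they are printed unchanged. Repeated non-blank lines are marked with a <REMOVE> prefix, the first sed command strips that prefix from lines beginning with # (such as comments) so they are kept, and the second sed command deletes whatever is still marked. I wanted to make note of this as something to try if the above commands remove more than you want them to.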

awk 'NF{x[$0]++; print (x[$0]>1?"<REMOVE>"$0:$0); next}1' /tmp/foo.txt | sed "s/^<REMOVE>#/#/" | sed "/^<REMOVE>/d" > /tmp/foo.txt.new; mv /tmp/foo.txt.new /tmp/foo.txt
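
To see how this pipeline differs from the simpler command, both can be run against a small file that contains repeated comments and blank lines. The file sample.conf and its contents are made up for this sketch, as are the variable names.

```shell
# Hypothetical sample file with a repeated comment, a repeated value,
# and two blank lines.
workdir=$(mktemp -d)
file="$workdir/sample.conf"
printf '%s\n' '# settings' 'alpha' '' '# settings' 'alpha' '' 'beta' > "$file"

# Simple form: every repeated line is dropped, including the second
# comment and the second blank line.
simple=$(awk '!seen[$0]++' "$file")

# Coworker's pipeline: only the repeated non-blank, non-comment line
# ("alpha") is dropped.
selective=$(awk 'NF{x[$0]++; print (x[$0]>1?"<REMOVE>"$0:$0); next}1' "$file" \
    | sed "s/^<REMOVE>#/#/" | sed "/^<REMOVE>/d")

echo "--- simple ---";    echo "$simple"
echo "--- selective ---"; echo "$selective"
```

With this input, the simple form leaves four lines, while the pipeline leaves six because it keeps the second "# settings" line and the second blank line.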
