(cache)EmacsWiki: Duplicate Lines

The following gives multiple approaches in EmacsLisp code for managing duplicate lines in a file. Try them out by EvaluatingExpressions or by adding to your InitFile. For duplicating lines see CopyFromAbove.

delete-duplicate-lines
Interactive search and replace won't work
Search and replace in Lisp that works
A Lisp command using a list containing each line
The uniq command with consecutive lines
The sort -u command for non-consecutive duplicates
Lisp commands removing consecutive duplicates
Lisp command to retrieve duplicates
Indexed lines

delete-duplicate-lines

See this blog post by Bozhidar Batsov on how to use ‘delete-duplicate-lines’:

http://emacsredux.com/blog/2014/03/01/a-peek-at-emacs-24-dot-4-delete-duplicate-lines/

Interactive search and replace won't work

You can try to remove duplicate lines with ReplaceRegexp or `C-M-%’ (‘query-replace-regexp’), by replacing \([^ C-q C-j ]+ C-q C-j \)\1+ with \1. A newline is entered with a ‘C-q C-j’. The Lisp version of this command would be

  (replace-regexp "\\([^\n]+\n\\)\\1+" "\\1")

Since it doesn’t backtrack, there’s no way remove a line with more than one duplicate.

Duplicate line 1
Unique line 1
Duplicate line 1
Duplicate line 1

It will leave duplicate lines.

Duplicate line 1
Unique line 1
Duplicate line 1

Search and replace in Lisp that works

See bug #13032 for delete-duplicate-lines in core emacs.

Here’s some Lisp to find duplicate lines and keep only the first occurrence by starting each search and replace at the start of the last duplicate – (goto-char start).

  (defun uniquify-all-lines-region (start end)
    "Find duplicate lines in region START to END keeping first occurrence."
    (interactive "*r")
    (save-excursion
      (let ((end (copy-marker end)))
        (while
            (progn
              (goto-char start)
              (re-search-forward "^\\(.*\\)\n\\(\\(.*\n\\)*\\)\\1\n" end t))
          (replace-match "\\1\n\\2")))))
  
  (defun uniquify-all-lines-buffer ()
    "Delete duplicate lines in buffer and keep first occurrence."
    (interactive "*")
    (uniquify-all-lines-region (point-min) (point-max)))

So for this buffer:

Duplicate line 1
Unique line 1


Duplicate line 1
Unique line 2
Unique line 3

Duplicate line 1


Duplicate line 2
Duplicate line 2
Unique line 4

running ‘M-x uniquify-all-lines-buffer’ produces:

Duplicate line 1
Unique line 1

Unique line 2
Unique line 3
Duplicate line 2
Unique line 4

A Lisp command using a list containing each line

Another implementation that adds distinct lines to a temporary list, and checks each line in the list with ‘assoc’ will give the same result as the previous.

  (defun uniquify-all-lines-region (start end)
    "Find duplicate lines in region START to END keeping first occurrence."
    (interactive "*r")
    (save-excursion
      (let ((lines) (end (copy-marker end)))
        (goto-char start)
        (while (and (< (point) (marker-position end))
                    (not (eobp)))
          (let ((line (buffer-substring-no-properties
                       (line-beginning-position) (line-end-position))))
            (if (member line lines)
                (delete-region (point) (progn (forward-line 1) (point)))
              (push line lines)
              (forward-line 1)))))))

The uniq command with consecutive lines

The unix utility called uniq removes duplicate consecutive lines, keeping only one instance. The output of uniq on the example above is:

Duplicate line 1
Unique line 1

Duplicate line 1
Unique line 2
Unique line 3

Duplicate line 1

Duplicate line 2
Unique line 4

Note that non-consecutive duplication of the first line are not removed.

The sort -u command for non-consecutive duplicates

The unix utility sort and its -u argument can give the same result of unique lines as ‘M-x uniquify-all-lines-buffer’ . However, the lines are sorted rather than kept in the order of first appearance.

The output of sort -u on the example above is:

Duplicate line 1
Duplicate line 2
Unique line 1
Unique line 2
Unique line 3
Unique line 4

Lisp commands removing consecutive duplicates

The command ‘M-x uniquify-buffer-lines’ will remove identical adjacent lines in the current buffer, similar to what is obtained with the unix uniq command.

  (defun uniquify-region-lines (beg end)
    "Remove duplicate adjacent lines in region."
    (interactive "*r")
    (save-excursion
      (goto-char beg)
      (while (re-search-forward "^\\(.*\n\\)\\1+" end t)
        (replace-match "\\1"))))
  
  (defun uniquify-buffer-lines ()
    "Remove duplicate adjacent lines in the current buffer."
    (interactive)
    (uniquify-region-lines (point-min) (point-max)))

It is important to note that functions which find duplicate lines don’t always sort lines before looking for dups as this may or may not be what one expects or desires of a particular function.

Lisp command to retrieve duplicates

Where the lines of a file are presorted it can be convenient to use something like this:

  (defun find-duplicate-lines (&optional insertp interp)
    (interactive "i\np")
    (let ((max-pon (line-number-at-pos (point-max)))
	  (gather-dups))
      (while (< (line-number-at-pos) max-pon) (= (forward-line) 0)
	     (let ((this-line (buffer-substring-no-properties (line-beginning-position 1) (line-end-position 1)))
		   (next-line (buffer-substring-no-properties (line-beginning-position 2) (line-end-position 2))))
	       (when  (equal this-line next-line)  (setq gather-dups (cons this-line gather-dups)))))
      (if (or insertp interp)
	  (save-excursion (new-line) (princ gather-dups (current-buffer)))
	gather-dups)))

This function, while inefficient (note cons in tail of while form) is quite handy for locating duplicates before removing them, i.e. situations of type: ‘uniquify-maybe’. Extend ‘find-duplicate-lines’ by comparing its result list with one or more of the list comparison procedures ‘set-difference’, ‘union’, ‘intersection’, etc. from the CL package (require ‘cl).

Indexed lines

I had a file that looks like

0001 line1 original
0001 line1 modified
0002 line2 
0003 line3 original
0003 line3 modified

Use this regexp in query-replace-regexp to find the duplicated lines : \(\([0-9]\{6,7\}\).*\)<newline, insert it with C-q C-j>\(\2.*\) then you can replace it with \1 to keep the original lines, or \3 to keep the modified ones

CategoryEditing

DuplicateLines