A question was posted on c-sharpcorner forums asking how you would remove duplicate lines from a text file. It got me thinking about how this problem could be addressed using LINQ. Before I get into the solution, here is the problem statement.

The problem

I am trying to remove duplicate lines from a text file. To make things difficult, the lines contain non-unique timestamps but a unique reference number. Some of the duplicates amount to 10 lines whereas others are only 2 lines.

Here are some examples of duplicate lines, in the format <timestamp>,<reference>,<error message>:

08:47:22,95847170050,Problem inputting data.
08:48:28,96672540040,More problems inputting data.
08:49:29,95847170050,Problem inputting data.
08:55:28,106622510040,Extra issues inputting data.
08:56:35,95847170050,Problem inputting data.
08:57:35,106622510040,Extra issues inputting data.
09:02:35,96672540040,More problems inputting data.
09:03:41,96672540040,More problems inputting data.
09:04:41,106622510040,Extra issues inputting data.

I want to delete all but KEEP the most recent duplicate line.

On the forum there is also some Java code which has been used as a solution. But I will ignore it so that I can approach it from a fresh perspective.

 

Solution: The LINQ Way

Here is my approach to removing duplicate lines from a text file. First, I read the contents of the file into a collection of type List<Error>, using an approach I described in an earlier post.

var query = from e in
  (from line in File.ReadAllLines(originalFile)
    let errorRecord = line.Split(',')
    let timestamp = errorRecord[0].Split(':')
    select new Error()
    {
      TimeStamp = new TimeSpan(
        Convert.ToInt32(timestamp[0]),
        Convert.ToInt32(timestamp[1]),
        Convert.ToInt32(timestamp[2])),
      Reference = errorRecord[1],
      Message = errorRecord[2]
    })
  select e;

 

This is what my Error class looks like. This class maps to a record in the file.

public class Error
{
  public TimeSpan TimeStamp { get; set; }
  public string Reference { get; set; }
  public string Message { get; set; }
}
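
The parsing query above assumes every line is well formed and that the message never contains a comma. As a minimal sketch of a more defensive parser, the helper below rejects malformed lines instead of throwing. ErrorParser and TryParseLine are hypothetical names, not part of the original code, and the Error class is repeated only so the block compiles on its own.

```csharp
using System;
using System.Globalization;

// Same shape as the Error class shown above; repeated so this sketch is self-contained.
public class Error
{
    public TimeSpan TimeStamp { get; set; }
    public string Reference { get; set; }
    public string Message { get; set; }
}

public static class ErrorParser
{
    // Hypothetical helper: returns null when the line does not match
    // <timestamp>,<reference>,<error message>.
    public static Error TryParseLine(string line)
    {
        if (string.IsNullOrWhiteSpace(line)) return null;

        // Limit the split to 3 parts so commas inside the message are preserved.
        string[] parts = line.Split(new[] { ',' }, 3);
        if (parts.Length != 3) return null;

        // Accept only the hh:mm:ss shape used in the sample file.
        TimeSpan timeStamp;
        if (!TimeSpan.TryParseExact(parts[0], @"hh\:mm\:ss",
                CultureInfo.InvariantCulture, out timeStamp))
        {
            return null;
        }

        return new Error
        {
            TimeStamp = timeStamp,
            Reference = parts[1],
            Message = parts[2]
        };
    }
}
```

A caller could then filter out the nulls, for example with File.ReadAllLines(originalFile).Select(ErrorParser.TryParseLine).Where(e => e != null).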

Once I have the collection loaded from the file, I use a LINQ query to select the results I want. This query uses a combination of group by and an aggregate (Max) to select the most recent record for each reference number, as per the original problem statement.

var query = from e in errorCollection
  group e by new { e.Reference, e.Message } into g
  let MaxDate = g.Max(x => x.TimeStamp)
  select new Error
  {
    TimeStamp = MaxDate,
    Reference = g.Key.Reference,
    Message = g.Key.Message
  };
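
To see what the grouping does with the sample data from the problem statement, here is a small sketch in method syntax. It is a variant rather than the exact query above: it groups on the reference number only and keeps the whole latest line instead of rebuilding a record from the group key. GroupingDemo and KeepLatest are hypothetical names used for illustration.

```csharp
using System;
using System.Linq;

public static class GroupingDemo
{
    // Hypothetical helper: for each reference number, keeps the line
    // with the latest timestamp. Assumes every line is well formed.
    public static string[] KeepLatest(string[] lines)
    {
        return lines
            // parts[0] = timestamp, parts[1] = reference, parts[2] = message
            .Select(line => line.Split(new[] { ',' }, 3))
            // GroupBy yields groups in order of each key's first appearance.
            .GroupBy(parts => parts[1])
            // Within each group, the most recent timestamp wins.
            .Select(g => g.OrderByDescending(parts => TimeSpan.Parse(parts[0])).First())
            .Select(parts => string.Join(",", parts))
            .ToArray();
    }
}
```

Given the nine sample lines, this keeps three: 08:56:35 for reference 95847170050, 09:03:41 for 96672540040, and 09:04:41 for 106622510040.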

 

Finally, I write the collection I get from the query above to another text file.

List<Error> newList = query.ToList();

// The using block disposes (and therefore closes) the writer, so no explicit Close() is needed.
using (StreamWriter sw = new StreamWriter(processedFile))
{
  newList.ForEach(x => sw.WriteLine(x.TimeStamp + "," + x.Reference + "," + x.Message));
}
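
The write step relies on TimeSpan's default string conversion to reproduce the hh:mm:ss shape. As a hedged alternative, the sketch below makes that format explicit and hands the lines to File.WriteAllLines, assuming .NET 4 or later. ErrorWriter, FormatLine and Write are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ErrorWriter
{
    // Hypothetical helper: formats one record as <timestamp>,<reference>,<error message>.
    public static string FormatLine(TimeSpan timeStamp, string reference, string message)
    {
        // The explicit format string keeps the hh:mm:ss shape of the input file.
        return timeStamp.ToString(@"hh\:mm\:ss") + "," + reference + "," + message;
    }

    // Writes all lines in one call; the file is opened and closed internally.
    public static void Write(string processedFile, IEnumerable<string> lines)
    {
        File.WriteAllLines(processedFile, lines);
    }
}
```

With the query result in hand, the call could look like ErrorWriter.Write(processedFile, newList.Select(x => ErrorWriter.FormatLine(x.TimeStamp, x.Reference, x.Message))).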

 

The code which loads the collection reads the entire file into memory with File.ReadAllLines, so it can be optimized for performance, but it does a good job when dealing with a reasonably small file. Below is the full code for my two methods.

// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
public void RemoveDuplicates(string originalFile, string processedFile)
{
  List<Error> errorCollection = ReadFileIntoCollection(originalFile);

  var query = from e in errorCollection
    group e by new { e.Reference, e.Message } into g
    let MaxDate = g.Max(x => x.TimeStamp)
    select new Error
    {
      TimeStamp = MaxDate,
      Reference = g.Key.Reference,
      Message = g.Key.Message
    };

  List<Error> newList = query.ToList();

  // The using block disposes (and therefore closes) the writer, so no explicit Close() is needed.
  using (StreamWriter sw = new StreamWriter(processedFile))
  {
    newList.ForEach(x => sw.WriteLine(x.TimeStamp + "," + x.Reference + "," + x.Message));
  }
}

public List<Error> ReadFileIntoCollection(string originalFile)
{
  List<Error> errorCollection = null;

  var query = from e in
    (from line in File.ReadAllLines(originalFile)
    let errorRecord = line.Split(',')
    let timestamp = errorRecord[0].Split(':')
    select new Error()
    {
      TimeStamp = new TimeSpan(
        Convert.ToInt32(timestamp[0]),
        Convert.ToInt32(timestamp[1]),
        Convert.ToInt32(timestamp[2])),
      Reference = errorRecord[1],
      Message = errorRecord[2]
    })
    select e;

  errorCollection = query.ToList<Error>();
  return errorCollection;
}
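
For larger files, one possible optimization is to stream the input and remember only the latest line per reference number. The sketch below is an assumption-laden alternative, not the approach used above: it relies on File.ReadLines (.NET 4 or later), silently skips malformed lines, and sorts the survivors by timestamp because a Dictionary does not guarantee ordering. StreamingDedup and KeepLatest are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StreamingDedup
{
    // Hypothetical single-pass variant of RemoveDuplicates.
    public static void RemoveDuplicates(string originalFile, string processedFile)
    {
        // File.ReadLines yields lines lazily instead of loading the whole file.
        File.WriteAllLines(processedFile, KeepLatest(File.ReadLines(originalFile)));
    }

    public static List<string> KeepLatest(IEnumerable<string> lines)
    {
        // reference -> (timestamp, original line) of the latest record seen so far
        var latest = new Dictionary<string, KeyValuePair<TimeSpan, string>>();

        foreach (string line in lines)
        {
            string[] parts = line.Split(new[] { ',' }, 3);
            if (parts.Length != 3) continue; // skip malformed lines

            TimeSpan timeStamp;
            if (!TimeSpan.TryParse(parts[0], out timeStamp)) continue;

            KeyValuePair<TimeSpan, string> seen;
            if (!latest.TryGetValue(parts[1], out seen) || timeStamp >= seen.Key)
            {
                latest[parts[1]] = new KeyValuePair<TimeSpan, string>(timeStamp, line);
            }
        }

        // Dictionary order is unspecified, so sort the survivors by timestamp.
        return latest.Values.OrderBy(p => p.Key).Select(p => p.Value).ToList();
    }
}
```

Memory use here grows with the number of distinct reference numbers rather than with the number of lines in the file.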

 

So this is my approach to removing duplicates from a text file using LINQ. It is just one example of how LINQ can be used to build creative solutions to common programming problems. The power of LINQ lies in its simplicity, which you can use to write beautiful code.


3 Responses to Remove Duplicate Lines From A Text File Using LINQ

  1. Liran says:

    In the method ReadFileIntoCollection, what's the point of doing
    var query = from e in (the LINQ query that retrieves the collection) select e ?

    Couldn't you just do it like this:
    var query = (the LINQ query that retrieves the collection)
    ?

    Thank you.

  2. Rob says:

    For simple duplicate lines, you could easily do:

    File.WriteAllLines(path, File.ReadAllLines(path).Distinct());

    For your problem, I’d suggest that a simpler method might be available, since you’re already creating another class:
    public class ErrorEqualityComparer : IEqualityComparer<string> {
    public bool Equals(string a, string b) {
    return a.Substring(a.LastIndexOf(',') + 1) == b.Substring(b.LastIndexOf(',') + 1);
    }
    public int GetHashCode(string a) { return a.GetHashCode(); }
    }

    Now I can call:
    File.WriteAllLines(path, File.ReadAllLines(path).Reverse().Distinct(new ErrorEqualityComparer()).Reverse());

    This is certainly less LINQ-y, but it makes convenient use of extension methods. :)

    Thanks for the blog – helped me find an answer I needed for master/detail data.

  3. Rob says:

    Bah, it occurs to me that GetHashCode should probably be implemented as a.Substring(a.LastIndexOf(',') + 1).GetHashCode() – but I’m sure you get the point. Cheers!
