Remove Duplicate Lines From A Text File Using LINQ
A question was posted on c-sharpcorner forums asking how you would remove duplicate lines from a text file. It got me thinking about how this problem could be addressed using LINQ. Before I get into the solution, here is the problem statement.
The problem
I am trying to remove duplicate lines from a text file. To make things difficult the lines contain non unique timestamps but a unique reference number. Some of the duplicates amount to 10 lines whereas others can only be 2 lines.
Here are some examples of duplicate lines, in the format <timestamp>,<reference>,<error message>:
08:47:22,95847170050,Problem inputting data.
08:48:28,96672540040,More problems inputting data.
08:49:29,95847170050,Problem inputting data.
08:55:28,106622510040,Extra issues inputting data.
08:56:35,95847170050,Problem inputting data.
08:57:35,106622510040,Extra issues inputting data.
09:02:35,96672540040,More problems inputting data.
09:03:41,96672540040,More problems inputting data.
09:04:41,106622510040,Extra issues inputting data.
I want to delete all but KEEP the most recent duplicate line.
On the forum there is also some Java code which has been used as a solution. But I will ignore it so that I can approach it from a fresh perspective.
Solution The LINQ Way
Here is my approach to remove duplicate lines from a text file. Firstly, I read the contents of the file into a collection of type List<Error> using an approach I described in an earlier post.
var query = from e in
                (from line in File.ReadAllLines(originalFile)
                 let errorRecord = line.Split(',')
                 let timestamp = errorRecord[0].Split(':')
                 select new Error()
                 {
                     TimeStamp = new TimeSpan(
                         Convert.ToInt32(timestamp[0]),
                         Convert.ToInt32(timestamp[1]),
                         Convert.ToInt32(timestamp[2])),
                     Reference = errorRecord[1],
                     Message = errorRecord[2]
                 })
            select e;
This is what my Error class looks like. This class maps to a record in the file.
public class Error
{
    public TimeSpan TimeStamp { get; set; }
    public string Reference { get; set; }
    public string Message { get; set; }
}
Once I have the collection loaded from the file, I use a LINQ query to select the results I want. This query uses a combination of group by and an aggregate (Max) to select the most recent record for each reference number, as per the original problem statement.
var query = from e in errorCollection
            group e by new { e.Reference, e.Message } into g
            let MaxDate = g.Max(x => x.TimeStamp)
            select new Error
            {
                TimeStamp = MaxDate,
                Reference = g.Key.Reference,
                Message = g.Key.Message
            };
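To see what the grouping query produces, here is a minimal in-memory sketch that runs the same parse and group steps over a sequence of lines rather than a file. The GroupingExample class and its Parse and LatestLines methods are hypothetical names introduced only for illustration, and the sketch assumes every line is well formed and that the message contains no commas.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Error
{
    public TimeSpan TimeStamp { get; set; }
    public string Reference { get; set; }
    public string Message { get; set; }
}

// Hypothetical helper class, not part of the original post.
public static class GroupingExample
{
    // Parses "hh:mm:ss,reference,message" lines into Error records.
    // Assumes well-formed lines with no commas inside the message.
    public static List<Error> Parse(IEnumerable<string> lines)
    {
        return (from line in lines
                let errorRecord = line.Split(',')
                let timestamp = errorRecord[0].Split(':')
                select new Error
                {
                    TimeStamp = new TimeSpan(
                        Convert.ToInt32(timestamp[0]),
                        Convert.ToInt32(timestamp[1]),
                        Convert.ToInt32(timestamp[2])),
                    Reference = errorRecord[1],
                    Message = errorRecord[2]
                }).ToList();
    }

    // Groups by reference and message, keeps the latest timestamp per group,
    // and formats each surviving record back into a line of text.
    public static List<string> LatestLines(IEnumerable<string> lines)
    {
        var query = from e in Parse(lines)
                    group e by new { e.Reference, e.Message } into g
                    let MaxDate = g.Max(x => x.TimeStamp)
                    select MaxDate + "," + g.Key.Reference + "," + g.Key.Message;

        return query.ToList();
    }
}
```

Passing the nine sample lines from the problem statement to GroupingExample.LatestLines should leave three lines, one per reference number, each carrying the latest timestamp seen for that reference (08:56:35, 09:03:41 and 09:04:41).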
Finally I write the collection I get from the query above to another text file.
List<Error> newList = query.ToList();

using (StreamWriter sw = new StreamWriter(processedFile))
{
    newList.ForEach(x => sw.WriteLine(x.TimeStamp+","+x.Reference+","+x.Message));
    sw.Close();
}
The code which loads the collection from the file can be optimized for performance but it does a good job when dealing with a reasonably small file. Below is the full code for my two methods.
public void RemoveDuplicates(string originalFile, string processedFile)
{
    List<Error> errorCollection = ReadFileIntoCollection(originalFile);

    var query = from e in errorCollection
                group e by new { e.Reference, e.Message } into g
                let MaxDate = g.Max(x => x.TimeStamp)
                select new Error
                {
                    TimeStamp = MaxDate,
                    Reference = g.Key.Reference,
                    Message = g.Key.Message
                };

    List<Error> newList = query.ToList();

    using (StreamWriter sw = new StreamWriter(processedFile))
    {
        newList.ForEach(x => sw.WriteLine(x.TimeStamp+","+x.Reference+","+x.Message));
        sw.Close();
    }
}

public List<Error> ReadFileIntoCollection(string originalFile)
{
    List<Error> errorCollection = null;

    var query = from e in
                    (from line in File.ReadAllLines(originalFile)
                     let errorRecord = line.Split(',')
                     let timestamp = errorRecord[0].Split(':')
                     select new Error()
                     {
                         TimeStamp = new TimeSpan(
                             Convert.ToInt32(timestamp[0]),
                             Convert.ToInt32(timestamp[1]),
                             Convert.ToInt32(timestamp[2])),
                         Reference = errorRecord[1],
                         Message = errorRecord[2]
                     })
                select e;

    errorCollection = query.ToList<Error>();
    return errorCollection;
}
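On the performance point, one possible direction is to stream the file instead of loading it whole. The sketch below is an assumption about how that could look, not the method used in this post: it relies on File.ReadLines (available from .NET 4), keeps only the latest line per reference number in a dictionary, and writes the survivors in timestamp order. StreamingDeduplicator and LatestPerReference are hypothetical names, and unlike the query above it keys on the reference number alone.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical alternative, not the approach used in the post.
public static class StreamingDeduplicator
{
    // Returns, for each reference number, the line with the latest timestamp.
    public static List<string> LatestPerReference(IEnumerable<string> lines)
    {
        var latest = new Dictionary<string, KeyValuePair<TimeSpan, string>>();

        foreach (string line in lines)
        {
            // Split into at most three parts so commas in the message survive.
            string[] parts = line.Split(new[] { ',' }, 3);
            if (parts.Length < 3)
            {
                continue; // assumption: silently skip malformed lines
            }

            TimeSpan stamp = TimeSpan.Parse(parts[0], CultureInfo.InvariantCulture);
            string reference = parts[1];

            KeyValuePair<TimeSpan, string> current;
            if (!latest.TryGetValue(reference, out current) || stamp > current.Key)
            {
                latest[reference] = new KeyValuePair<TimeSpan, string>(stamp, line);
            }
        }

        // Sort the survivors by timestamp so the output order is predictable.
        return latest.Values
                     .OrderBy(p => p.Key)
                     .Select(p => p.Value)
                     .ToList();
    }

    public static void RemoveDuplicates(string originalFile, string processedFile)
    {
        // File.ReadLines enumerates lazily, one line at a time.
        File.WriteAllLines(processedFile, LatestPerReference(File.ReadLines(originalFile)));
    }
}
```

Memory use then grows with the number of distinct reference numbers rather than with the size of the file, which only matters once the file is large.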
So this is my approach to removing duplicates from a text file using LINQ. It is just one example of how LINQ can be used to craft creative solutions to common programming problems. The power of LINQ lies in its simplicity, which lets you write beautiful code.
3 Responses to Remove Duplicate Lines From A Text File Using LINQ
In the method ReadFileIntoCollection, what's the point of doing
var query = from e in (the LINQ query that retrieves the collection) select e
?
Couldn't you just do it like this:
var query = (the LINQ query that retrieves the collection)
?
Thank you.
To simply remove exact duplicate lines, you could easily do:
File.WriteAllLines(path, File.ReadAllLines(path).Distinct());
For your problem, I’d suggest that a simpler method might be available, since you’re already creating another class:
public class ErrorEqualityComparer : IEqualityComparer<string> {
    public bool Equals(string a, string b) {
        // Compare only the text after the last comma (the message).
        return a.Substring(a.LastIndexOf(',') + 1) == b.Substring(b.LastIndexOf(',') + 1);
    }
    public int GetHashCode(string a) { return a.GetHashCode(); }
}
Now I can call:
File.WriteAllLines(path, File.ReadAllLines(path).Reverse().Distinct(new ErrorEqualityComparer()).Reverse());
This is certainly less LINQ-y, but it makes convenient use of extension methods. :)
Thanks for the blog – helped me find an answer I needed for master/detail data.
Bah, it occurs to me that GetHashCode should probably be implemented as a.Substring(a.LastIndexOf(',') + 1).GetHashCode() – but I'm sure you get the point. Cheers!
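Putting the comparer and the GetHashCode correction from these comments together, a minimal sketch might look like the following. MessageSuffixComparer and ComparerExample are hypothetical names. The sketch assumes the lines are already in chronological order and that the text after the last comma is enough to identify a duplicate, so two different reference numbers with the same message would be collapsed into one. It also relies on Distinct yielding items in first-seen order, which current implementations do but the documentation does not guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two lines are "equal" when the text after
// the last comma (the error message) matches.
public class MessageSuffixComparer : IEqualityComparer<string>
{
    private static string Key(string line)
    {
        // With no comma, LastIndexOf returns -1 and the whole line is the key.
        return line.Substring(line.LastIndexOf(',') + 1);
    }

    public bool Equals(string a, string b)
    {
        if (a == null || b == null)
        {
            return a == b;
        }
        return Key(a) == Key(b);
    }

    public int GetHashCode(string line)
    {
        // Hash the same key that Equals compares, so equal lines share a bucket.
        return line == null ? 0 : Key(line).GetHashCode();
    }
}

public static class ComparerExample
{
    // Reverse so the last occurrence is seen first, drop later repeats,
    // then reverse again to restore the original order.
    public static List<string> KeepLastOccurrences(IEnumerable<string> lines)
    {
        return lines.Reverse()
                    .Distinct(new MessageSuffixComparer())
                    .Reverse()
                    .ToList();
    }
}
```

Calling ComparerExample.KeepLastOccurrences(File.ReadAllLines(path)) and writing the result back with File.WriteAllLines mirrors the one-liner from the comment above.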