Remove Duplicate Lines From A Text File Using LINQ
A question was posted on c-sharpcorner forums asking how you would remove duplicate lines from a text file. It got me thinking about how this problem could be addressed using LINQ. Before I get into the solution, here is the problem statement.
The problem
I am trying to remove duplicate lines from a text file. To make things difficult the lines contain non unique timestamps but a unique reference number. Some of the duplicates amount to 10 lines whereas others can only be 2 lines.
Here are some examples of duplicate lines, in the format <timestamp>,<reference>,<error message>:
08:47:22,95847170050,Problem inputting data.
08:48:28,96672540040,More problems inputting data.
08:49:29,95847170050,Problem inputting data.
08:55:28,106622510040,Extra issues inputting data.
08:56:35,95847170050,Problem inputting data.
08:57:35,106622510040,Extra issues inputting data.
09:02:35,96672540040,More problems inputting data.
09:03:41,96672540040,More problems inputting data.
09:04:41,106622510040,Extra issues inputting data.
I want to delete all but KEEP the most recent duplicate line.
On the forum there is also some Java code which has been used as a solution. But I will ignore it so that I can approach it from a fresh perspective.
Solution The LINQ Way
Here is my approach to removing duplicate lines from a text file. First, I read the contents of the file into a collection of type List<Error>, using an approach I described in an earlier post.
    var query = from e in
                    (from line in File.ReadAllLines(originalFile)
                     let errorRecord = line.Split(',')
                     let timestamp = errorRecord[0].Split(':')
                     select new Error()
                     {
                         TimeStamp = new TimeSpan(
                             Convert.ToInt32(timestamp[0]),
                             Convert.ToInt32(timestamp[1]),
                             Convert.ToInt32(timestamp[2])),
                         Reference = errorRecord[1],
                         Message = errorRecord[2]
                     })
                select e;
This is what my Error class looks like. This class maps to a record in the file.
    public class Error
    {
        public TimeSpan TimeStamp { get; set; }
        public string Reference { get; set; }
        public string Message { get; set; }
    }
Once I have the collection loaded from the file, I use a LINQ query to select the results I want. This query uses a combination of group by and an aggregate (Max) to select the most recent record for each reference number, as per the original problem statement.
    var query = from e in errorCollection
                group e by new { e.Reference, e.Message } into g
                let MaxDate = g.Max(x => x.TimeStamp)
                select new Error
                {
                    TimeStamp = MaxDate,
                    Reference = g.Key.Reference,
                    Message = g.Key.Message
                };
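To see what the grouping step produces, here is a minimal, self-contained sketch that works on in-memory lines instead of a file. The class name LatestPerReference and the method name Deduplicate are hypothetical and not part of the original code. The sketch also assumes every line has the form hh:mm:ss,reference,message, and it uses TimeSpan.Parse in place of the manual Convert.ToInt32 calls.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class LatestPerReference
{
    // Keeps, for each (reference, message) pair, only the latest timestamp.
    // Assumes every line looks like "hh:mm:ss,reference,message".
    public static List<string> Deduplicate(IEnumerable<string> lines)
    {
        var records = from line in lines
                      // Split into at most 3 parts so commas inside the message survive.
                      let parts = line.Split(new[] { ',' }, 3)
                      select new
                      {
                          TimeStamp = TimeSpan.Parse(parts[0], CultureInfo.InvariantCulture),
                          Reference = parts[1],
                          Message = parts[2]
                      };

        var latest = from r in records
                     group r by new { r.Reference, r.Message } into g
                     select new
                     {
                         TimeStamp = g.Max(x => x.TimeStamp),
                         g.Key.Reference,
                         g.Key.Message
                     };

        return latest
            .Select(x => x.TimeStamp + "," + x.Reference + "," + x.Message)
            .ToList();
    }
}
```

Passing the nine sample lines from the problem statement to Deduplicate should return three lines: 08:56:35 for reference 95847170050, 09:03:41 for 96672540040 and 09:04:41 for 106622510040. With LINQ to Objects the groups come back in the order in which each key first appears in the input, not in timestamp order, so add an orderby clause if the output needs to be sorted by time.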
Finally, I write the collection returned by the query above to another text file.
    List<Error> newList = query.ToList();
    using (StreamWriter sw = new StreamWriter(processedFile))
    {
        newList.ForEach(x => sw.WriteLine(x.TimeStamp+","+x.Reference+","+x.Message));
        sw.Close();
    }
The code that loads the collection from the file could be optimized for performance, but it does a good job when dealing with a reasonably small file. Below is the full code for my two methods.
    // Namespaces the two methods rely on.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    // Reads the file, keeps the most recent record per (Reference, Message)
    // pair, and writes the result to processedFile.
    public void RemoveDuplicates(string originalFile, string processedFile)
    {
        List<Error> errorCollection = ReadFileIntoCollection(originalFile);

        var query = from e in errorCollection
                    group e by new { e.Reference, e.Message } into g
                    let MaxDate = g.Max(x => x.TimeStamp)
                    select new Error
                    {
                        TimeStamp = MaxDate,
                        Reference = g.Key.Reference,
                        Message = g.Key.Message
                    };

        List<Error> newList = query.ToList();
        using (StreamWriter sw = new StreamWriter(processedFile))
        {
            newList.ForEach(x => sw.WriteLine(x.TimeStamp+","+x.Reference+","+x.Message));
            sw.Close();
        }
    }

    // Parses each "hh:mm:ss,reference,message" line into an Error object.
    public List<Error> ReadFileIntoCollection(string originalFile)
    {
        List<Error> errorCollection = null;

        var query = from e in
                        (from line in File.ReadAllLines(originalFile)
                         let errorRecord = line.Split(',')
                         let timestamp = errorRecord[0].Split(':')
                         select new Error()
                         {
                             TimeStamp = new TimeSpan(
                                 Convert.ToInt32(timestamp[0]),
                                 Convert.ToInt32(timestamp[1]),
                                 Convert.ToInt32(timestamp[2])),
                             Reference = errorRecord[1],
                             Message = errorRecord[2]
                         })
                    select e;

        errorCollection = query.ToList<Error>();
        return errorCollection;
    }
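On the point that the loading code could be optimized: one possible direction, sketched below, is to stream the file with File.ReadLines rather than loading every line with File.ReadAllLines, and to remember only the latest timestamp seen for each reference/message pair. This is an alternative sketch, not the approach used in this post. It swaps the group by query for a plain dictionary, the class name StreamingDeduplicator is hypothetical, and it assumes .NET 4 or later and well-formed hh:mm:ss,reference,message lines.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical alternative, for illustration only.
public static class StreamingDeduplicator
{
    public static void RemoveDuplicates(string originalFile, string processedFile)
    {
        // Maps "reference,message" to the latest timestamp seen so far.
        var latest = new Dictionary<string, TimeSpan>();

        // File.ReadLines enumerates lazily, so the whole file is never held in memory.
        foreach (string line in File.ReadLines(originalFile))
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                continue; // skip blank lines
            }

            // Split once: the timestamp, then everything after the first comma.
            string[] parts = line.Split(new[] { ',' }, 2);
            TimeSpan timestamp = TimeSpan.Parse(parts[0], CultureInfo.InvariantCulture);
            string key = parts[1];

            TimeSpan seen;
            if (!latest.TryGetValue(key, out seen) || timestamp > seen)
            {
                latest[key] = timestamp;
            }
        }

        File.WriteAllLines(processedFile, latest.Select(kv => kv.Value + "," + kv.Key));
    }
}
```

Memory use here grows with the number of distinct reference/message pairs rather than with the number of lines. Two caveats: the order in which a Dictionary enumerates its entries is not guaranteed, so sort the output if order matters, and a line without a comma will throw, since the sketch does no validation.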
So this is my approach to removing duplicates from a text file using LINQ. It is just one example of how LINQ can be used to craft creative solutions to common programming problems. The power of LINQ lies in its simplicity, which you can use to write beautiful code.
One Response to “Remove Duplicate Lines From A Text File Using LINQ”
December 28th, 2008 at 7:24 am
In the method ReadFileIntoCollection, what's the point of writing
var query = from e in (the LINQ query that retrieves the collection) select e ?
Couldn't you just write it like this:
var query = (the LINQ query that retrieves the collection)
?
Thank you.