Data Extractor for .NET – Scalable Performance with IDataReader
Extracting Data with .NET
This post is a software architecture discussion for .NET developers who need to extract data from files. It applies to .NET Framework 2.0 or above.
The Problem
Say you have a file, such as a text file, that contains the data you are interested in, surrounded by content that is irrelevant to your needs.
A specific example of this cropped up for me recently in my company’s email checker service. We offer customers the ability to upload text files containing email addresses. From these files, we extract just the email addresses and then process them to check if they are valid. Conceptually, this is pretty simple: scan the file using regular expression pattern matches to pick out text that looks like email addresses. However, I found that whilst the approach and concept are good, there are some issues with scalability and performance.
The easiest thing to do is load the whole thing into memory and then parse over it with the .NET Regex library. If you’re thinking along these lines, think again! This approach is wholly dependent on system resources (i.e. server memory and paging file). Loading a file of indeterminate size completely into memory at run-time leaves your systems wide open to performance problems and possible system crashes as resources start to be gobbled up trying to load huge files into memory.
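For contrast with the load-everything approach, here is a minimal sketch of reading a file one line at a time and matching as you go. The class name `StreamingExtractor`, the method `ExtractEmails` and the deliberately loose email pattern are assumptions made for illustration only; they are not taken from the email checker service described above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class StreamingExtractor
{
    // Illustrative pattern only: loose on purpose, not a full address validator.
    private static readonly Regex EmailPattern =
        new Regex(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    // Reads one line at a time, so memory use is bounded by the longest line
    // rather than by the size of the whole file.
    public static IEnumerable<string> ExtractEmails(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (Match match in EmailPattern.Matches(line))
            {
                yield return match.Value;
            }
        }
    }
}
```

To run this over a file, wrap the path in a `StreamReader` inside a `using` block and iterate over the result with `foreach`. Note that a file with no line breaks at all would still be read as one huge line, so a production parser would probably need to read fixed-size chunks instead.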
The Solution – IDataReader
Microsoft recognised this problem years ago in its Windows and Office software and came up with solutions based around streaming or chunking files into manageable sections that wouldn’t then trash system resources (primarily memory). Fortunately for .NET developers, Microsoft’s approach to handling files in a scalable way is alive and well in the .NET framework. However, it’s good news / bad news time.
The good news is that the ability to stream and process files is built into the .NET framework and it’s available as an interface (IDataReader). The bad news is that, as it’s an interface, it’s up to you to provide your own implementation. I’ve done it and, trust me, at 700 lines of code for a custom text-to-SQL reader, it’s not simple, trivial or easy.
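To give a feel for the size of the interface, here is a heavily cut-down sketch of an IDataReader that exposes a sequence of strings as rows with a single column. It is not the 700-line component mentioned above: the class name `StringSequenceDataReader` is hypothetical, the typed getters simply throw, and it has not been verified against every consumer of IDataReader. It also assumes a C# 7 or later compiler for the expression-bodied members.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical single-column reader: each string in the sequence becomes one row.
public sealed class StringSequenceDataReader : IDataReader
{
    private readonly IEnumerator<string> _rows;
    private readonly string _columnName;
    private bool _closed;

    public StringSequenceDataReader(IEnumerable<string> rows, string columnName)
    {
        if (rows == null) throw new ArgumentNullException(nameof(rows));
        _rows = rows.GetEnumerator();
        _columnName = columnName;
    }

    // Row navigation: the next item is pulled from the sequence only on demand.
    public bool Read() => !_closed && _rows.MoveNext();
    public bool NextResult() => false;            // only one result set
    public void Close() { _closed = true; _rows.Dispose(); }
    public void Dispose() => Close();
    public bool IsClosed => _closed;
    public int Depth => 0;
    public int RecordsAffected => -1;             // nothing is modified by reading

    // Column metadata: exactly one string column, so ordinals are not range-checked.
    public int FieldCount => 1;
    public string GetName(int i) => _columnName;
    public Type GetFieldType(int i) => typeof(string);
    public string GetDataTypeName(int i) => typeof(string).Name;
    public int GetOrdinal(string name) =>
        string.Equals(name, _columnName, StringComparison.OrdinalIgnoreCase)
            ? 0
            : throw new IndexOutOfRangeException(name);

    // Values of the current row.
    public object GetValue(int i) => _rows.Current;
    public string GetString(int i) => _rows.Current;
    public bool IsDBNull(int i) => _rows.Current == null;
    public int GetValues(object[] values) { values[0] = _rows.Current; return 1; }
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));

    // Members this sketch does not support.
    public DataTable GetSchemaTable() => throw new NotSupportedException();
    public bool GetBoolean(int i) => throw new NotSupportedException();
    public byte GetByte(int i) => throw new NotSupportedException();
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferoffset, int length) =>
        throw new NotSupportedException();
    public char GetChar(int i) => throw new NotSupportedException();
    public long GetChars(int i, long fieldoffset, char[] buffer, int bufferoffset, int length) =>
        throw new NotSupportedException();
    public IDataReader GetData(int i) => throw new NotSupportedException();
    public DateTime GetDateTime(int i) => throw new NotSupportedException();
    public decimal GetDecimal(int i) => throw new NotSupportedException();
    public double GetDouble(int i) => throw new NotSupportedException();
    public float GetFloat(int i) => throw new NotSupportedException();
    public Guid GetGuid(int i) => throw new NotSupportedException();
    public short GetInt16(int i) => throw new NotSupportedException();
    public int GetInt32(int i) => throw new NotSupportedException();
    public long GetInt64(int i) => throw new NotSupportedException();
}
```

Because the constructor takes any `IEnumerable<string>`, a lazily evaluated sequence can be passed in, and rows are then produced only as the consumer calls `Read()`. Even this stripped-down version has to declare more than thirty members, which goes some way to explaining why a full implementation grows quickly.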
Benefits of implementing IDataReader
There are two benefits of going to the effort of implementing IDataReader in your custom data extraction software:
- Scalability – Memory use stays roughly constant whatever the size of the file, because the file is read and parsed incrementally (using the .NET StreamReader) rather than loaded whole.
- High performance SQL Server import – Once you’ve implemented IDataReader, you can plug your software directly into the .NET SqlBulkCopy class for fast, stream-based data import into SQL Server. For anyone not familiar with SqlBulkCopy, it’s the .NET equivalent of the bcp command-line utility, but wrapped in managed .NET code.
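As a rough illustration of the second benefit, the sketch below hands any IDataReader to SqlBulkCopy. The class `BulkLoader`, its `Load` method and the batch size are assumptions chosen for the example, and the code needs a reference to `System.Data.SqlClient` plus a real SQL Server table to do anything useful.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class BulkLoader
{
    // Streams every row from the reader into the destination table.
    public static void Load(IDataReader reader, string connectionString, string destinationTable)
    {
        if (reader == null) throw new ArgumentNullException(nameof(reader));

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = destinationTable;
            bulkCopy.BatchSize = 5000;      // illustrative value; tune for your workload
            bulkCopy.BulkCopyTimeout = 0;   // 0 means no time limit

            // With no explicit column mappings, source columns are matched
            // to destination columns by ordinal position.
            bulkCopy.WriteToServer(reader);
        }
    }
}
```

SqlBulkCopy pulls rows by calling `Read()` on the reader, so a reader that is itself backed by a stream never needs the whole file in memory. If the destination table has extra columns, such as an identity column, explicit entries in `ColumnMappings` would most likely be needed.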
Lightning fast performance, ultimate scalability
By implementing IDataReader, it’s possible to develop custom file parsers that can not only throw data into SQL Server as fast as possible but also do it in a scalable way that won’t crash your server with large files.
Your opinion – should I bother selling this?
Here is where I need some feedback.
I have a generic .NET component that implements IDataReader and can extract data from any text file using externally defined regular expressions. As well as extraction, it can be plugged directly into SqlBulkCopy to provide the fastest, most scalable way of getting data from a “random” (e.g. user input) text file into a SQL Server field.
The software is complete and we use it extensively, both internally and as part of the software service that we provide online. So, I know that it works.
This software can save you about a day’s worth of coding and testing.
My dilemma is this. It takes effort, time and money to take software from “internal use only” to commercially saleable. What I’m trying to understand is if the effort is worth it (i.e. is there enough demand?). This is where you come in. If you would like to see this software generally available, please indicate your support in the comments box on this page.
No comments, or only a few, mean that the software stays “internal” and you’re on your own to implement IDataReader ;)