This blog contains reflections and thoughts on my work as a software engineer

onsdag den 13. maj 2009

Excel can’t count?

I’m currently working on a data-extraction gig – basicly we have 4 different data providers to persist information about attendees to a summit in July with some 25.000 participants – and I’m currently in the process of creating a small, shortlived console app which generates an Excel-file with information about each participants so we are able to print out identitcal badges used to identify participants at the gates during the summit.

I encountered one of the scariest things I’ve seen in months while working today on the application. First things first: One of the datastores for participants is an Excel spreadsheet – one line per guest that is. Easy enough, I found some code using the OleDB Jet-engine:

string connectionString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=\"Excel 8.0;HDR=YES;\"", _xlsFile);
DbProviderFactory factory = DbProviderFactories.GetFactory("System.Data.OleDb");
using (DbConnection connection = factory.CreateConnection())
{
connection.ConnectionString = connectionString;

using (DbCommand command = connection.CreateCommand())
{
command.CommandText = "SELECT * FROM [Ark1$]";
connection.Open();

using (DbDataReader dr = command.ExecuteReader())
{
while (dr.Read())
{

//do stuff

}
}
}
}


During an ongoing testphase (yes, we actually DO this agile stuff here) I noticed that one of the columns in my output spreadsheed didn’t match the expected values. Quite an error actually because one of the things people at the summit do care a lot about is that the food they’d paid deerly for in advance ends up as a crossed checkbox in the right place on their ID-card. Otherwise they will not be allowed access to the dining area… You can imagine what people think of the IT-guy in charge if things like this doesn’t work like a charm.



So I started debugging. First I suspected that I simply counted wrong when extracting values – namely the output from these columns



image 



As far as I’m concerned M is the 12. letter and N is the 13. However – debugging this stuff showed me the following output for the first and second row:



image



and row #2:



image 



The birth date matches (as with names and other stuff) but notice the the 13. and 14. column which doesn’t match at all with the data in the spreadsheet… For one every cell has a value so how can row #2 suddenly be empty in the debugger???



I suspected I was working on another file than the one I thought – triplechecking that by inserting my own name and seeing it come up in the debugger proved that theory wrong.



I’ll personally buy the first person to explain this mystery a beer for an bulletproof explanation (other than my own which of course includes a major malfunction in Excel 2003 / .NET / OleDB / pick your own). What’s going on here? I’m not even close to being overworked (don’t tell my boss) and the code is simple… I tried converting the source spreadsheet to CSV to check the values in a plain textfile and they are what they should be according to the source spreadsheet so that’s the approach I’m taking now. Instead of parsing the Excel directly I’m converting it to CSV and using that as my datasource instead. Not very elegant but it makes me feel a lot safer.

Ingen kommentarer: