We (as in the company I work for) have a product which we've named Bounce. Bounce is, as one might guess from the name, a product which reboots servers. It's a solution-in-a-box type thing that comes in a 1U chassis for server racks.
This product is actually in production at a number of sites, including ours. Additionally, our site is the most complex of the various configurations out there, just because we have a couple of different production networks, our corporate networks, and a "development" network where those servers that us developers torture reside.
Regardless, there was a bug in Bounce, where it would randomly report a failure in one of the servers. Well, this little bug had been bothering the Network guys for a bit, and they finally brought it to my attention as well as the boss' attention--which resulted in some time scheduled to work on the issues, as well as add a few additional enhancements that had been hanging out on the drawing board for a while.
So, I started work, and tested the bounce group that the issue usually popped up in, and found, absolutely nothing. Everything worked perfectly.
So, I chalked it up to gremlins, and implemented the system additions.
Well, I was doing some final system tests, when, lo and behold, the issue cropped up. I opened up the event log, and saw it somewhat flooded with the same message over and over again. The only difference between those messages was that one was being generated for every server configured in the system.
That error message read: row could not be found or updated.
Talk about useful.
So, I went to Google, and found some discussions that blamed one of two things:
- Concurrency issues generated by the "no count" flag being set on the SQL Server's default connection options
- A difference between the DBML definition of the table, and the actual underlying table
Since I was certain that the DBML looked just like the underlying tables (having just dragged them over), I looked more closely into the concurrency issues that folks were reporting.
And I realized that this isn't a SQL concurrency issue. Basically, it's not a race condition where two requests are both trying to modify the table at the same time. This is one of those other definitions for the word concurrency, specifically things being in accordance or agreement.
The below is basically what was happening:
    Get List of Devices in BOUNCE QUEUE
    For Each Device in List
        get ComputerDetails as LINQ object
        Perform System Checks Function on ComputerObject
        Update properties of ComputerDetails
        Submit Changes on the DBML
    Next

What is important is the Perform System Checks line there. That Perform System Checks Function was updating the COMPUTER DETAILS table via an EXECUTECOMMAND call on the DBML object.
Basically, it was modifying the underlying data table, without updating the ComputerDetails LINQ object.
This was fine until I actually updated the ComputerDetails LINQ object and then submitted it back to the database. When I did that, the system performed a concurrency check (in the "in agreement" definition) against the actual row using those properties that were not being updated. Since they had been modified elsewhere, outside of the normal LINQ-to-SQL paradigm, LINQ was unable to find the row--or at least upon finding it decided that "no, this wasn't really the row I was looking for."
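To make that check a bit more concrete, here is a minimal sketch of the same idea in Python with sqlite3. This is not Bounce's code and not what LINQ-to-SQL literally runs; the table layout (computer_details with status and last_check columns) and the load and submit helpers are made up for illustration. The point is only that the UPDATE carries the values the row had when it was loaded, so an out-of-band change makes it match zero rows.

```python
import sqlite3

# Hypothetical stand-in for the COMPUTER DETAILS table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE computer_details "
    "(id INTEGER PRIMARY KEY, status TEXT, last_check TEXT)"
)
conn.execute("INSERT INTO computer_details VALUES (1, 'unknown', 'never')")


def load(conn, row_id):
    """Read a row and keep the values it had at load time."""
    row = conn.execute(
        "SELECT id, status, last_check FROM computer_details WHERE id = ?",
        (row_id,),
    ).fetchone()
    return {"id": row[0], "status": row[1], "last_check": row[2]}


def submit(conn, original, new_status):
    """Update only if the row still matches the values we loaded."""
    cur = conn.execute(
        "UPDATE computer_details SET status = ? "
        "WHERE id = ? AND status = ? AND last_check = ?",
        (new_status, original["id"], original["status"], original["last_check"]),
    )
    if cur.rowcount == 0:
        # Nothing matched: the row is gone or no longer "in agreement".
        raise LookupError("row not found or changed")


snapshot = load(conn, 1)

# Out-of-band write, playing the role of the system checks' direct SQL.
conn.execute("UPDATE computer_details SET last_check = 'today' WHERE id = 1")

try:
    submit(conn, snapshot, "ok")
    outcome = "updated"
except LookupError as exc:
    outcome = str(exc)

print(outcome)
```

The row is still sitting there with the same id. It just no longer matches the snapshot, which is why the error is so confusing when you first see it.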
This means that it happily spat out a "Row not found" error.
I immediately thought up two possible solutions for what was happening. The first was to re-work the code to use the LINQ object in all those places where in-line SQL was being used. The second was to re-work the one place I was submitting the LINQ object to use in-line SQL. Being a lazy programmer, I happily took the latter option.
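As a rough illustration of that second option, here is a small Python/sqlite3 sketch. Again, this is an assumption-laden stand-in rather than the real Bounce code: the computer_details table and the perform_system_checks and save_status functions are hypothetical. Both writes go straight to the table and are keyed only on the primary key, so there is no stale snapshot left to disagree with.

```python
import sqlite3

# Hypothetical stand-in for the COMPUTER DETAILS table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE computer_details "
    "(id INTEGER PRIMARY KEY, status TEXT, last_check TEXT)"
)
conn.execute("INSERT INTO computer_details VALUES (1, 'unknown', 'never')")


def perform_system_checks(conn, row_id):
    """Direct SQL write, like the existing in-line SQL in the checks."""
    conn.execute(
        "UPDATE computer_details SET last_check = 'today' WHERE id = ?",
        (row_id,),
    )


def save_status(conn, row_id, status):
    """Final write as direct SQL too, matched on the primary key only."""
    cur = conn.execute(
        "UPDATE computer_details SET status = ? WHERE id = ?",
        (status, row_id),
    )
    return cur.rowcount  # number of rows the UPDATE touched


perform_system_checks(conn, 1)
updated = save_status(conn, 1, "ok")
final_row = conn.execute(
    "SELECT status, last_check FROM computer_details WHERE id = 1"
).fetchone()

print(updated, final_row)
```

The trade-off with this approach is that nothing checks for conflicting changes any more: whichever write lands last wins. That is fine when each statement touches its own columns, as in this sketch, but it is worth keeping in mind.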
Lo and behold, my event logs are clean, the error has stopped presenting itself and all is happy and right with the world. At least until the next bug.