I have been sitting on my porch debugging a software problem in Asia.
This is always an interesting experience due to language, time and cultural issues.
We have a large, complex system running as part of a cell phone billing application. It takes about 100K small PDF cell phone bills and combines them into a single PDF file in about 8 minutes.
Every nine months or so something goes wrong and I get an email.
Now there are two ways to look at the issues here. There is the micro view and the macro view.
At the micro level there are specific things in the log files indicating problems of one form or another. These PDF files arrive on a network directory and wait until a JDML file arrives that tells the system how to put them together. So I can review the log files and detect errors: some program runs and exits with a bad status, fails to load a file, that sort of thing.
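To make the micro view concrete, here is a minimal sketch of that kind of log review. The log wording and the two patterns are my assumptions for illustration; the real system's log format is different and is not shown here.

```python
import re

# Hypothetical log patterns (assumed for illustration, not the real log format).
PATTERNS = {
    "bad_exit": re.compile(r"exit(?:ed)? (?:with )?status (\d+)", re.IGNORECASE),
    "load_failure": re.compile(r"failed to load (\S+)", re.IGNORECASE),
}

def scan_log(lines):
    """Return a list of (line_number, kind, detail) for suspicious log lines."""
    findings = []
    for number, line in enumerate(lines, start=1):
        match = PATTERNS["bad_exit"].search(line)
        # A zero exit status is normal, so only non-zero statuses are reported.
        if match and match.group(1) != "0":
            findings.append((number, "bad_exit", match.group(1)))
        match = PATTERNS["load_failure"].search(line)
        if match:
            findings.append((number, "load_failure", match.group(1)))
    return findings
```

Feeding it the lines of one log file gives a short list of places to look first, rather than reading the whole file top to bottom.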
While I am doing this I am also receiving interesting "out of band" data about the macro state of the system. This particular system has a client/server architecture. Both client and server have a UI that aggregates log data down to SUCCESS/FAIL. As part of this system the clients and server keep track of who is connected as a client and so forth.
So I find out that, in addition to the bad PDF results, there are other types of errors on the system, specifically I/O errors and client/server errors. At this point it is very hard to tell which stream of information is really telling me about the problems.
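The SUCCESS/FAIL roll-up amounts to something like the sketch below. The job names and the idea of a per-job error list are assumptions on my part; the actual UI works from its own log data.

```python
def summarize(errors_by_job):
    """Collapse each job's list of logged errors down to 'SUCCESS' or 'FAIL'."""
    return {
        job: ("FAIL" if errors else "SUCCESS")
        for job, errors in errors_by_job.items()
    }
```

For example, `summarize({"merge": [], "split": ["failed to load file"]})` marks `merge` as SUCCESS and `split` as FAIL. The roll-up is handy for a quick look, but it hides exactly the detail I need when something goes wrong.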
When I hear that programs that normally work are reporting I/O errors I become concerned that there are system-level issues causing problems: disk errors, bad sectors, network drive issues, etc. To me these are usually better harbingers of trouble than specific file issues. All of these tens of millions of PDFs are created the same way by an application that doesn't change (though that's been an issue as well). So I struggle with what to tell the customer.
Usually I have to carefully tease issues out of them - both because of language and technology. In this particular case my customer is very good at getting me reliable information about his customer (who is the ultimate user of the software) so it's reasonably easy.
At the same time I have no direct control over the environment or testing. An employee installed the system 18 months ago and we have not been on site since. It runs on a Windows machine in a foreign language that we cannot read.
So basically I have to read the "body language" (macro information) as carefully as the underlying micro issues to solve the problem. In this case the software seems to work here but specific files fail on-site. Today, however, I/O error reports began to arrive so I am now more suspicious that something in the environment has changed.
Particularly worrisome is that the main server is reporting additional client connections. Since this reporting can only happen if a client app attempts to make a TCP/IP connection to the server it's unlikely this is happening on its own. More than likely someone is doing this - so if they are doing this what else might they be doing?
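One simple check I can ask for is a comparison of the clients the server reports against the clients that are supposed to exist. This is only a sketch: the host names are made up, and it assumes someone on site can give me both lists.

```python
def unexpected_clients(connected, expected):
    """Return the connected client names that are not on the expected list."""
    return sorted(set(connected) - set(expected))

# Hypothetical names, for illustration only.
reported = ["billing-01", "billing-02", "unknown-17"]
known = ["billing-01", "billing-02"]
extras = unexpected_clients(reported, known)
```

Anything left in `extras` is a connection somebody made on purpose, which is a good place to start asking questions.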