Original in Russian: http://programmingmindstream.blogspot.ru/2014/09/blog-post_23.html
So, “third time unlucky”…
More precisely, the launch of an internal product has failed for the third time.
My fault.
Although it had “all the trimmings”: tests, factories, and so on and so forth.
Tests of all sorts pass.
The real software does not work under real conditions.
READ_ERROR and then WRITE_ERROR. Under distributed access.
Generally, I will not write to the blog until I puzzle this out, rethink it, and come up with “a treatment plan”.
For there is no point in it.
For “theory without practice is useless”.
There’s no point in writing about “cows in vacuum” when “your own cows do not work”.
For now – we’ve created a “load test”.
I left it working for a night.
I would look at it the next morning.
It is possible we’ll write “another load test”.
I have not seen such epic fails for quite some time.
I can tell you one thing “not about myself”: HRESULT and other “error codes” are, in a way, “not a great gimmick”. Of course, it depends on how you use them… I have not always used them properly…
I had:
SetFilePos(hFile, aPos);
FileWrite(hFile, @SomeValue, SizeOf(SomeValue));
SetFilePos(hFile, aPos);
FileRead(hFile, @SomeOtherValue, SizeOf(SomeOtherValue));
Assert(SomeValue = SomeOtherValue);
I have:
SetFilePos(hFile, aPos);
FileWrite(hFile, @SomeValue, SizeOf(SomeValue));
SetFilePos(hFile, aPos);
SysCheck(FileRead(hFile, @SomeOtherValue, SizeOf(SomeOtherValue)));
// - here it began to fail “sometimes”, despite the fact that SomeOtherValue = SomeValue
Assert(SomeValue = SomeOtherValue);
Surprisingly, without SysCheck the check passed.
That is, an error was returned, but the CHECK still passed.
It works in this way:
SetFilePos(hFile, aPos);
FileWrite(hFile, @SomeValue, SizeOf(SomeValue));
SetFilePos(hFile, aPos);
try
 SysCheck(FileRead(hFile, @SomeOtherValue, SizeOf(SomeOtherValue)));
 // - here it began to fail “sometimes”, despite the fact that SomeOtherValue = SomeValue
finally
 Assert(SomeValue = SomeOtherValue);
 // - here it does NOT fail, although it “fails” earlier, on SysCheck
end;
That is, “sometimes what is intended to be read does get read”, but with an error code.
WITHOUT error checking everything worked but “malfunctioned” now and then; with “error checking” it began to “fail” more often.
This happened in a distributed heterogeneous environment, on “dead” or “half-dead” computers, with different Windows versions down to “prehistoric” ones.
Why? Not clear yet.
Anyway, I blame my own “butter fingers” rather than the “Microsoft guys”.
For it is easy to “point fingers at others”.
Simple, but not constructive.
And, by the way, if we write:
SetFilePos(hFile, aPos);
FileWrite(hFile, @SomeValue, SizeOf(SomeValue));
l_TryCount := 0;
while (l_TryCount < 100) do
begin
 Inc(l_TryCount);
 SetFilePos(hFile, aPos);
 try
  SysCheck(FileRead(hFile, @SomeOtherValue, SizeOf(SomeOtherValue)));
  // - here it began to fail “sometimes”, despite the fact that SomeOtherValue = SomeValue
 except
  if (l_TryCount < 100) then
   continue
  else
   raise;
 end;//try..except
 break;
end;
Assert(SomeValue = SomeOtherValue);
Again, it “fails” less often.
Food for thought.
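The retry pattern above can be sketched generically. Here is a minimal Python sketch, where `read_at` and `TransientReadError` are hypothetical stand-ins for the SetFilePos/FileRead/SysCheck combination, not the actual product code:

```python
class TransientReadError(OSError):
    """Stand-in for the error SysCheck raises on a failed FileRead."""

def read_with_retry(read_at, pos, size, max_tries=100):
    """Retry a positioned read that sometimes fails transiently.

    read_at(pos, size) is a hypothetical callable that either returns
    the bytes read or raises TransientReadError, mirroring
    SetFilePos + SysCheck(FileRead(...)) in the pseudocode above.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return read_at(pos, size)
        except TransientReadError:
            if attempt == max_tries:
                raise  # all attempts failed: re-raise the last error
```

The point is the same as in the pseudocode: a read that fails with a transient error often succeeds on a later attempt, which is exactly the suspicious behaviour described above.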
Unfortunately, it does not repeat on “synthetic tests”.
To top it off, LockRegion/UnlockRegion are also involved.
Naturally, they are “correctly set up” and “properly” wrapped in SysCheck, etc.
BUT it looks like their very presence is the issue.
Without them it “seems to work”, but under “concurrent access” it works badly, which is to be expected.
We’ll move to a new testing level.
P.S. All the code given above is, of course, “pseudocode” used to illustrate the problems. Most likely, some “semicolons” are placed incorrectly. SysCheck around FileWrite is also skipped on purpose. Believe me, we have it.
In addition, WrittenSize and ReadSize are checked there.
However, these details are also skipped – on purpose.
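For illustration, here is what “checking WrittenSize and ReadSize” can look like with plain POSIX-style file I/O in Python. The helper and the temp-file handling are illustrative, not the product code:

```python
import os
import tempfile

def checked_roundtrip(payload: bytes, pos: int) -> bytes:
    """Write payload at pos, read it back, and verify both byte counts."""
    fd, path = tempfile.mkstemp()
    try:
        written = os.pwrite(fd, payload, pos)
        if written != len(payload):       # analogue of checking WrittenSize
            raise OSError(f"short write: {written} of {len(payload)}")
        data = os.pread(fd, len(payload), pos)
        if len(data) != len(payload):     # analogue of checking ReadSize
            raise OSError(f"short read: {len(data)} of {len(payload)}")
        return data
    finally:
        os.close(fd)
        os.unlink(path)
```

Checking the byte counts catches short reads and writes that a bare success/failure code would miss.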
P.P.S. The code without SysCheck and OleCheck has worked for 15 years already, but it malfunctions (now and then, I mean what I say, we get incorrect data; it’s unpleasant, but “possible to live with”). That is actually why I set about puzzling it out and writing SysCheck and OleCheck.
As a result, I have got “the whole nine yards”.
Nine yards of what? I don’t know yet.
Once again, it does not reproduce on all client stations. Maybe due to the “butter fingers”.
In short: “do not forget about error codes”, though how to “process” them is “not always clear”.
When I establish that my own “butter fingers” are to blame, I will surely write about it.
P.P.P.S. I’ve also added logging of problem operations. And… And… I’ve got a “Schrödinger's cat”. Logging began to influence “business logic”, at least in terms of “timing delays”, which is obvious.
Again – a “separate issue”.
P.P.P.P.S. One more thing. The code given above is for a single client. It is not a case of “one writes, the other one reads”. I could understand this for different clients, but not for one client. I hope I will understand it soon.
P.P.P.P.S. And another thing.
I have already checked the version of this kind:
function DoRead(...): LongBool;
begin
 FileRead(hFile ...);
 // - we “forgot” to return the result
end;
...
SysCheck(DoRead(...)); // - we check “garbage”
“At a rough guess” there are no “uninitialized variables”.
P.P.P.P.P.S. By the way, my tests are still running… No errors :-( It “inspires sadness”. I’ll see how it ends in the morning.
P.P.P.P.P.S. As it turned out, two things reliably cause problems:
1. Access through UNC paths, i.e. paths like \\server\resource\path\filename.
2. Mandatory use of LockFile.
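LockFile/UnlockFile are Win32 calls that lock a byte range rather than the whole file. As a cross-platform illustration of the same byte-range-locking idea, here is a minimal POSIX sketch using Python’s fcntl.lockf; it is an analogue, not the code from the post:

```python
import fcntl
import os
import tempfile

def with_region_lock(fd, offset, length, action):
    """Run action() while holding an exclusive lock on [offset, offset+length).

    fcntl.lockf is the POSIX analogue of Win32 LockFile/UnlockFile:
    it locks only the given byte range of the file.
    """
    fcntl.lockf(fd, fcntl.LOCK_EX, length, offset)    # lock the region
    try:
        return action()
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset)  # unlock the region
```

The try/finally guarantees the region is unlocked even if the action raises, mirroring the try..finally framing around Lock/Unlock in the pseudocode elsewhere in this post.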
P.P.P.P.P.P.S. Yesterday the tests ran without errors. About 3 GB of data was processed.
Today I launched the tests on two computers. Tomorrow I’ll see the results.
Then I will launch them on 3, 4, 5 computers and so on.
P.P.P.P.P.P.P.S. Today I also found two “bottlenecks”: AllocNewFATElement and AllocNewCluster. They, in turn, lead to LockFile. We lock the header that stores the information about the storage structure. All users “run into these locks” while writing.
I already know how to solve it.
We should proactively allocate a batch (five, ten, twenty) of FATElements and Clusters at once instead of one at a time. As one piece. Given that a file usually occupies more than one, two, or even ten clusters, this is effective. We should also keep a list of free ones locally, per client. Then, when a client shuts down, they should be returned to the shared list of free, “unused” ones, so that other clients can use them afterwards.
Sure, there is a possibility that we lose items if a client crashes.
But we hit the “bottleneck” less often, because we can write:
if AllocatedFatElements.Empty then
begin
 // - here is the inter-PROCESS “bottleneck”
 Lock;
 try
  Result := AllocNewFatElement;
  for l_Index := 0 to 10 do
   AllocatedFatElements.Add(AllocNewFatElement);
 finally
  Unlock;
 end;//try..finally
end
else
 // - here is ONLY an inter-THREAD “bottleneck” (because AllocatedFatElements is, naturally, protected against concurrent threads)
 Result := AllocatedFatElements.GetLastAndDeleteIt;

Instead of:

Lock; // - here there is ALWAYS an inter-PROCESS “bottleneck”
try
 Result := AllocNewFatElement;
finally
 Unlock;
end;//try..finally
It is the same for clusters.
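The batching-plus-local-free-list idea can be sketched like this. The names (BatchingAllocator, alloc_new_element) and the batch size are illustrative stand-ins for the FATElement/Cluster machinery in the post:

```python
class BatchingAllocator:
    """Hit the shared lock rarely by grabbing elements in batches.

    alloc_new_element is a hypothetical callable that performs the
    expensive, globally locked allocation (AllocNewFatElement above).
    """
    def __init__(self, alloc_new_element, batch_size=10):
        self._alloc = alloc_new_element
        self._batch_size = batch_size
        self._free = []          # per-client list of pre-allocated elements

    def allocate(self):
        if not self._free:
            # the inter-process "bottleneck": hit it once per batch
            self._free = [self._alloc() for _ in range(self._batch_size)]
        return self._free.pop()

    def release_unused(self):
        """On client shutdown: hand unused elements back to the shared pool."""
        unused, self._free = self._free, []
        return unused
```

The global allocation cost is paid once per batch instead of once per element, and release_unused is the “put them back to the list of free ones” step when a client closes.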
Even if some elements “hang”, it will not get worse and the storage will not be broken. It will just have “holes”.
Considering that the “nightly Update” runs at night (when possible), the storage gets repacked in any event. So the “holes” will disappear, in any case, by morning.
As a result, we’ll have a repacked persistent part with no holes and the “empty” variable part.
During the day clients will write their documents’ versions into the variable part.
The process repeats the next night.
This is, of course, only done when no active users are connected to the database.
There is a separate post on this here: http://programmingmindstream.blogspot.ru/2014/09/blog-post_25.html
Note that all of this is “about internal products”. I won’t talk about third-party products.
For those who’ve read to the end, the task looks as follows:
“Deficit” :-(