After I load all my datasets into the program, select websdm.xml as the configuration file, check both report settings, and click Start, it begins generating the report but seems to stop at "Generating Lookup Caches". What is this step, and does it have anything to do with the file size of the datasets? Is there a maximum file size that the program can handle?
Hello,
You're right that this is related to the size of your datasets. The Validator itself does not impose a maximum file size. However, many validation rules work with data from more than one dataset, and the temporary space needed to reference those other datasets (for those particular rules) can exceed the memory available to the program on your computer.
For instance, the DM domain, which is likely being processed first in your case, performs checks (IR4005, IR4506, IR4507, IR4508) that require data from every SDTM Findings domain, in addition to the EX, DS, and TA domains. This requires extra temporary memory, so if those datasets are very large, memory can become a problem.
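To give a rough idea of what the "Generating Lookup Caches" step means in practice, here is a simplified sketch in Java. It is purely illustrative, not the Validator's actual code, and the class and method names are made up: before the cross-dataset rules run, the Findings records are indexed in memory so they can be looked up by subject, and it is that index that grows with your dataset sizes.

import java.util.*;

// Illustrative sketch only -- not the Validator's actual code. It shows why the
// memory needed for a lookup cache grows with the number of Findings records.
public class LookupCacheSketch {
    public static void main(String[] args) {
        // Hypothetical input: one Map per record, pooled from every Findings domain (LB, VS, EG, ...).
        List<Map<String, String>> findingsRecords = loadAllFindingsRecords();

        // The "lookup cache": an index of Findings records by subject, so that while
        // validating DM (rules such as IR4005) the related records for each subject
        // can be found without re-reading every dataset. Its memory footprint is
        // roughly proportional to the total Findings record count.
        Map<String, List<Map<String, String>>> cacheBySubject = new HashMap<>();
        for (Map<String, String> record : findingsRecords) {
            cacheBySubject.computeIfAbsent(record.get("USUBJID"), k -> new ArrayList<>()).add(record);
        }
        System.out.println("Cached " + findingsRecords.size() + " records for " + cacheBySubject.size() + " subjects.");
    }

    // Placeholder so the sketch compiles; the real data comes from your XPT files.
    private static List<Map<String, String>> loadAllFindingsRecords() {
        return new ArrayList<>();
    }
}

So it is the cache itself that consumes the memory, not the reading of any single file.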
We're currently updating the code involved here to improve this, but since we're still experimenting with dataset sizes versus memory and hardware requirements, we'd appreciate it if you could help us.
If possible, could you please tell us how large your study's Findings datasets are (file size and number of records)? An estimate of the combined total is fine, if that's easier. We'd also like to know how much memory is available on your computer, because it may be possible to increase the maximum memory the program requests so that you can complete your validation successfully.
Thank you for your feedback, and we apologise for any inconvenience this issue may have caused. We hope to help you resolve it quickly.
Regards,
Tim
Hi Tim,
At first, we tried to load all of our XPT files into the program. This was a total of 1.39GB of data (including the supplemental datasets). Since this seemed to freeze the validation report, we then removed the supplemental domains and ran the main datasets through, a total of 216MB. This still froze the report at that step. This seemed a little odd, since we have in the past run the report on 117MB and it went rather quickly. However, since we upgraded to the new beta, reporting seems to take longer. For instance, this morning I ran the report on 24MB and it took 18 minutes.
The computers we are running reports on have 1.5GB of memory.
Thanks for the help!
Generally the supplemental qualifiers should be alright. However, if you have many Findings datasets with a substantial number of records (hundreds of thousands of records per dataset), there might be an issue with the memory demands.
If you aren't running too many other programs while running the Validator, you can increase the memory available to it and see if that helps. Right-click on the client.bat file and click Edit (if you get a security dialog, press Run). Then change both "512m" values to "1g" and save the file. This will double the maximum amount of memory available, which I hope will resolve the issues you're currently having.
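For reference, and assuming the two values are the standard Java heap options -Xms and -Xmx (the exact command in your copy of client.bat may differ), the launch line would go from something like
java -Xms512m -Xmx512m ...
to
java -Xms1g -Xmx1g ...
where -Xms and -Xmx are the standard Java settings for the initial and maximum heap size.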
As we work on Beta 2, we'll try to address this problem a little more effectively so that the demands for large files are not so substantial. Naturally there are always going to be resource limitations based on the host system, but we're striving to generate the smallest footprint possible.
Regarding performance, our own tests haven't revealed any noticeable decrease. Running the CDISC Pilot Study SDTM datasets against the provided WebSDM configuration took 14 minutes, 25 seconds for me with the included client.bat file. This is comparable to the approximately 10 minutes required by Alpha 2, with the extra few minutes coming from the addition of rules (such as IR4005) that were not performed in the previous release. If you find that running the Pilot takes significantly longer, please let us know so we can investigate why that might be.
Regards,
Tim
Hi Tim
I am the person having the issue mentioned above. I asked a colleague to post for me since I did not have a user account at the time.
Yesterday I updated client.bat with your suggested "1g" and ran the domains through. This excluded the supplemental datasets, since they make up the majority of the file size, so the total file size sent through the program was 216MB. This time the program got past the first domain, so I let it continue.
It ran to completion for a total of 34 datasets. However... it took 10 hours and 55 minutes!!
Also - I noticed that the screen that pops up after validation (the one that tells you how long the report took) showed only 55 minutes and XX seconds. Does it ignore hours? Is it assumed that a report will never take more than 1 hour? Mine certainly did. :)
Perhaps the new beta will be a bit quicker because 10+ hours is a bit much.
Here is some more background info about my datasets. Note that some of these are not standard domains and therefore should not get validated, right? I am assuming LB is what took the most time, given the lookup rules that involve checking across domains.
NOTE: Sorry about the format, this comment area strips all white space. :(
Domain/Filesize (KB)/# Records
AE/5,490/5,464
CM/7,092/8,621
CO/334/1,165
DA/5,727/11,058
DM/245/603
DS/313/667
DV/620/652
EG/977/1,983
EX/3,600/6,950
FA/2,370/3,215
IE/4/3
LB/145,477/170,244
MH/6,669/7,525
PE/5,842/7,550
SC/3,745/10,826
SE/589/1,206
SV/3,128/9,469
TA/4/3
TE/3/3
TI/9/26
TS/10/30
TV/17/32
VS/9,357/17,384
YO/2/0
YX/21/25
ZA/441/669
ZC/166/612
ZD/1,569/4,982
ZL/1,564/5,440
ZM/2,593/6,000
ZP/9,510/18,368
ZX/25/72
ZY/3,883/7,612
ZZ/92/244
Hi,
First of all, I apologise for you having to wait so incredibly long to see results. This is definitely not what's supposed to be happening, and we really do appreciate your patience.
What likely happened is that the increase in memory was enough for you to get past the first dataset, but the program then reached a saturation point where it could only free enough memory to process a few more records at a time. If you noticed the processed record count climbing very slowly, this would be why.
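(If you're curious, one way to confirm this would be to add the standard JVM option -verbose:gc to the same java line in client.bat; if memory really is the bottleneck, you would see the program spending most of its time in garbage collection around the point where it slows down. That's just a diagnostic suggestion, not something you need to do.)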
Your study seems similar to the CDISC Pilot, at least in total number of records and number of Findings observations (the most validation-intensive), so it's really surprising that there's such a discrepancy in performance. Just to rule out environment issues, would you be able to run the CDISC Pilot study on your PC so we can compare metrics?
It does indeed ignore hours. Clearly that's a mistake; I'll make a note of it. :)
If a domain is a custom domain (Findings, Events, Interventions, etc.), the Validator will recognize its metadata format and automatically generate validation rules corresponding to that domain class, so you get the most complete validation possible without having to create your own configuration.
Thank you for the information. Hopefully we'll be able to figure out what could possibly be going wrong here.
Regards,
Tim
Hi Tim
Sorry it has been a few days - I have been a little busy.
I ran the validator on the CDISC Pilot study data. I kept the memory at 1g so that it matched the previous run on my data. The Pilot SDTMs total 23 datasets, about 116MB.
FYI - the validator is running the checks from XPT files located on a networked drive, not from the local disk. This is also how the 10h 55m test was run. I asked a colleague to run the validator on the same set of data described above after copying the files to his local disk. It went a lot faster.
We had previously been running the validator (pre-beta) on network-drive data and it ran quickly, so this seems to be an issue with the latest beta.
The results:
My data [216MB] (on networked drive): 10h 55m
My data [216MB] (on local disk): 12m
CDISC Pilot Data [116MB] (on networked drive): CANCELLED
CDISC Pilot Data [116MB] (on local disk): 14m
I cancelled the CDISC Pilot run on the network drive because it was on pace to take as long as before, which is too long for me to wait. :P
So it seems like using this validator on networked drives is a bad way to go unless the data is small. Perhaps the modified algorithm you are working on will address this? Regardless, I will continue to use the validator because it is a very useful tool and helpful to the entire community. I will just copy the files to local disk before starting validation.
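In case it is useful to anyone else reading, a command along the lines of
xcopy "\\yourserver\studies\mystudy\*.xpt" "C:\validation\mystudy" /I /Y
(with the paths adjusted for your environment - the server and folder names here are just placeholders) copies the datasets to the local disk first, and the Validator can then be pointed at the local copy.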
I hope this helps. :)
Regards,
Trevor
Hi Trevor,
No problem. As always, we appreciate any time you're willing to spend providing feedback.
Ah, you were accessing data from a networked drive. That makes more sense, and it's something I should have considered before. Networked file operations are always going to be much slower, and most of that is due to operating system behaviour we have no control over, not anything to do with the Validator's engine itself.
We'll be sure to update the documentation to make it clear that there will likely be significant performance degradation over networked drives while using the desktop client.
In an environment where the Validator is used extensively, it would ideally be mounted on an application server in a way that would take care of inefficiencies like this. The desktop client is primarily intended for demonstration and evaluation purposes (and other light-to-moderate usage situations), so it's not equipped for every deployment scenario.
Regards,
Tim