Saturday, September 18, 2010

Regression runners: Whot! No Job Arrays?

I went from running regressions consisting of greater than 50K simulations nightly using Mentors QuestaSim, where I had written the regression framework and results preparation myself - to my first real Cadece vManager controlled regression project.

I should explain to my readers who usually see Python focused blog entries, that this is about the running of large regressions on compute farms for verifying ASICs (although my regression runner used Python extensively).

I did like the more minimal, HTML-based GUI on the tool - I've spent too much time creating work-arounds for flashy GUI's that don't work outside a narrow area, and so know they can be a double-edged sword. The tool needs a lot of configuration behind the scenes, and they have chosen to do this using their own vsif file format which seems to me to be a design fault. If they had gone for XML (gosh is that me advocating the use of XML), or YAML, then the format would be as readable and more easily used in other tools - no tool is an island in a design flow.

The main problem with vManager, is with its use of LSF. LSF is a separate tool used to harness a network of computers into a compute cluster. LSF runs jobs on the cluster, managing the cluster to give maximum job throughput. vManager can turn each test that needs to run on the design into a possible series of jobs for LSF, but it fails to use job arrays! Job arrays allow for many similar jobs to be submitted and controlled as one array of jobs where the indexing of the individual job in the array is used to introduce a variation in the test.

In my regression framework I created arrays of over 100K jobs, that took several weeks to run, and all the time that the job array at the heart of my regression was running, IT could change the parameters of my job array to control its use of resources according to changing priorities; scale back the maximum number of concurrent simulations during the 9-till-5; ramp up during weekends, (or stop then resume for a 'planned' license outage). IT had all the information they need to show the full extent of the regression and LSF will give the throughput of the array from which I can estimate completion times. Another engineer wishing to start a regression has immediate feedback of what else is running rather than a huge list of hundreds of jobs that it is difficult to make sense of. In short, ASIC verification regressions map naturally to Job Arrays.

The other problem I have with vManager is a missing feature: Randomization of the order that tests are run. If you have 100 tests to run on five areas of the design then it is natural to compose the vsif file that describes each test in some sort of orderly progression through the tests to run. When the regression is running and you are monitoring partial results then it is easier to get a sense of the overall result if the individual tests are run in a random order. In my regression framework I generated tests in an order, then randomized the set of tests based on Pythons excellent Mersenne Twister implementation. By monitoring results during previous runs I could judge how much of a regression had to run if we needed results in a specified time.

A bit of background (re?)search showed that Mentor have their own regression running framework, that also doesn't use job arrays; and that the Grid Engine cluster management tool supports job arrays.