Saturday, September 18, 2010

Regression runners: Whot! No Job Arrays?

I went from running nightly regressions of more than 50K simulations using Mentor's QuestaSim, where I had written the regression framework and results preparation myself, to my first real regression project controlled by Cadence vManager.

I should explain to my readers, who usually see Python-focused blog entries, that this is about running large regressions on compute farms to verify ASICs (although my regression runner used Python extensively).

I did like the tool's more minimal, HTML-based GUI. I've spent too much time creating workarounds for flashy GUIs that don't work outside a narrow area, so I know they can be a double-edged sword. The tool needs a lot of configuration behind the scenes, and they have chosen to do this using their own vsif file format, which seems to me to be a design fault. If they had gone for XML (gosh, is that me advocating the use of XML?) or YAML, then the format would be just as readable and more easily used in other tools. No tool is an island in a design flow.

The main problem with vManager is its use of LSF. LSF is a separate tool used to harness a network of computers into a compute cluster. LSF runs jobs on the cluster, managing the cluster to give maximum job throughput. vManager can turn each test that needs to run on the design into a possible series of jobs for LSF, but it fails to use job arrays! Job arrays allow many similar jobs to be submitted and controlled as one array of jobs, where the index of each individual job in the array is used to introduce a variation in the test.
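A minimal sketch of how the array index can introduce that variation, assuming a wrapper script that every element of the array runs. LSF exports the element's index in the LSB_JOBINDEX environment variable; the test names, the base seed and the function name here are made up for illustration and are not from any real flow.

```python
import os

# Hypothetical test names, for illustration only.
TESTS = ["smoke", "dma_burst", "irq_storm", "reset_seq", "pwr_cycle"]


def job_for_index(index, tests=TESTS, base_seed=1000):
    """Map a 1-based job array index to a (test, seed) pair.

    Consecutive indices cycle through the tests; each full pass through
    the list gets the next seed.
    """
    if index < 1:
        raise ValueError("job array indices start at 1")
    pass_number, position = divmod(index - 1, len(tests))
    return tests[position], base_seed + pass_number


if __name__ == "__main__":
    # LSF sets LSB_JOBINDEX for each element of a job array; default to 1
    # so the script can also be tried outside the cluster.
    index = int(os.environ.get("LSB_JOBINDEX", "1"))
    test, seed = job_for_index(index)
    print(f"index {index}: run test {test!r} with seed {seed}")
```

Such a script might be submitted once with something like `bsub -J "regress[1-50000]%200" python run_one.py`, where the script name is hypothetical and the `%200` caps how many elements run at the same time.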

In my regression framework I created arrays of over 100K jobs that took several weeks to run. All the time that the job array at the heart of my regression was running, IT could change the parameters of the array to control its use of resources according to changing priorities: scale back the maximum number of concurrent simulations during the 9-till-5, ramp up during weekends, or stop and then resume for a 'planned' license outage. IT had all the information they needed to see the full extent of the regression, and LSF gave the throughput of the array, from which I could estimate completion times. Another engineer wishing to start a regression had immediate feedback on what else was running, rather than a huge list of hundreds of jobs that is difficult to make sense of. In short, ASIC verification regressions map naturally to job arrays.
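The completion estimate is simple arithmetic once the throughput is known. A rough sketch, assuming the throughput stays steady and is expressed in jobs per hour (the function and parameter names are invented for this example):

```python
from datetime import datetime, timedelta


def estimate_completion(total_jobs, done_jobs, jobs_per_hour, now=None):
    """Return the estimated finish time, assuming a steady throughput."""
    if jobs_per_hour <= 0:
        raise ValueError("throughput must be positive")
    if not 0 <= done_jobs <= total_jobs:
        raise ValueError("done_jobs must be between 0 and total_jobs")
    now = now or datetime.now()
    remaining = total_jobs - done_jobs
    return now + timedelta(hours=remaining / jobs_per_hour)
```

For example, with 100,000 jobs in the array, 40,000 done and 500 jobs finishing per hour, the remaining 60,000 jobs need 120 hours, so the estimate is five days from now. If IT changes the concurrency limit, the throughput changes and the estimate has to be redone.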

The other problem I have with vManager is a missing feature: randomization of the order in which tests are run. If you have 100 tests to run on five areas of the design, then it is natural to compose the vsif file that describes each test as some sort of orderly progression through the tests. When the regression is running and you are monitoring partial results, it is easier to get a sense of the overall result if the individual tests are run in a random order. In my regression framework I generated the tests in order, then randomized the set of tests using Python's excellent Mersenne Twister implementation. By monitoring results during previous runs I could judge how much of a regression had to run if we needed results in a specified time.
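A minimal sketch of that kind of shuffle, not the original framework code. Python's `random` module uses the Mersenne Twister, and a fixed seed makes the order repeatable; the test names and the seed value are hypothetical.

```python
import random


def randomized_order(tests, seed):
    """Return a shuffled copy of tests; the same seed gives the same order."""
    rng = random.Random(seed)  # Mersenne Twister generator
    shuffled = list(tests)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Hypothetical layout: 5 design areas with 20 tests each, in generation order.
    ordered = [f"area{a}_test{t:02d}" for a in range(1, 6) for t in range(1, 21)]
    for name in randomized_order(ordered, seed=2010)[:10]:
        print(name)
```

Because the order is random, the first N results are a fair sample of the whole regression, which is what makes partial results meaningful.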

A bit of background (re?)search showed that Mentor have their own regression-running framework that also doesn't use job arrays, and that the Grid Engine cluster management tool supports job arrays.

3 comments:

  1. P.S. this made good background reading on what can be built on top of vManager's current capabilities

  2. Hi,

    I am very curious about your regression process.

    Do you use multiple job arrays per regression, or do you want the regression as a single job array?

    How do you track which tests run in a regression in order to tell what's left to run as the regression progresses?

    How do you track how many regressions have been run on the project and how do you relate the regressions back to your source code management?

    For concurrent simulations, I have tried job preemption with success, and I want to fully deploy it soon. Regression jobs will be preempted by higher priority jobs (people sitting at their desks waiting for a job are higher priority than long-running regressions).

    I don't plan to use job arrays. In my regression system, users specify the name of the test they wish to run, and the number of iterations for the test. Everything else is automatically set for them (mainly seed and output file location). They can query the regression to see the status of each test, and they can see which seed was used (in case they need to re-run a test with more verbosity or waveform dumping).

    To feed the Mersenne Twister with an initial value, I use /dev/random. I used to use time of day, but that was disastrous.

    Thanks for your post!
    Martin

  3. Hi Martin,
    In the regression flow I created, each job in the job array set up the run environment in a separate directory, ran a test, then did initial scanning of logs to produce extended info on the success of the test (outputting a .py file of a dict of the results). All but the raw simulation log and the .py file were then deleted, and the log gzipped.

    At any time I could run a Python script that walked the run directories, loading the .py files and aggregating the results. LSF has 'bjobs -a', which shows how the job array is progressing, and the bhist command can give stats on the job array's throughput.

    We use ClearCase for SCM, and the regression flow records the config spec, checked-out files, date and time in a regression log as part of the compilation and export flow prior to running a regression. I am made aware of anything left checked out and have enough details to rerun past regressions from ClearCase.

    I don't know if I'm spoilt, but we get good use of our compute cluster by tuning LSF and submitting all jobs via LSF. It's just that we could do better by using job arrays more: with job arrays I was able to work with IT to get increased throughput, as well as easier modifications to my regression.

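The aggregation step described in the last comment could be sketched roughly as follows. This is an illustration under assumptions, not the original script: it assumes each run directory holds a file named results.py containing a single dict literal, and that the dict has a 'status' key. Both names are hypothetical.

```python
import ast
import os
from collections import Counter


def load_result(path):
    """Read one results file, assumed to hold a single dict literal."""
    with open(path) as handle:
        result = ast.literal_eval(handle.read())
    if not isinstance(result, dict):
        raise ValueError(f"{path} does not contain a dict")
    return result


def aggregate(root, filename="results.py"):
    """Walk root and tally the hypothetical 'status' field of every result."""
    tally = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            result = load_result(os.path.join(dirpath, filename))
            tally[result.get("status", "unknown")] += 1
    return tally
```

Using `ast.literal_eval` rather than importing or executing the files means a results file can only ever be read as data, and the script can be run at any point while the job array is still in progress.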