Softpanorama

May the source be with you, but remember the KISS principle ;-)

Son of Grid Engine

News SGE implementations Recommended Links Usage of NFS in Grid Engine Installation of Son of Grid Engine 8.1.8 RPMs for Master Host Installation of the Son of Grid Engine 8.1.8 RPMs of Execution Host
Migrating to Son of Grid Engine 8.1.8 Sun SGE 6.2u5 SGE 6.2u7 (Oracle Grid engine)

UNIVA Grid Engine

Installation of Grid Engine Master Host Installation of the Grid Engine Execution Host
SGE Troubleshooting Gridengine diag tool Duke University Tools SGE History Humor Etc

Introduction

Son of Grid Engine (SoGE) is a community continuation of the open source version of Sun Grid Engine. Created by Dave Love, it is currently the most reliable and usable open source distribution of SGE.

SoGE is a superset of the apparently dead (and withdrawn from the Web) open source "base" Univa distribution. It also integrates useful existing patches, utilities, and several enhancements of the codebase, making it the most up-to-date open source implementation of SGE.

In fact, it is the only actively maintained open source implementation of SGE that I managed to find, so you do not have many options to choose from ;-). You might still be able to use SGE 6.2u5 with Debian if you wish, or compile it for your own distribution.

As its codebase originated from the open source version produced by Univa, it inherits some of the changes Univa introduced. Some of them are questionable, such as the defaults for the qhost command, which differ from those of the classic version 6.2u5.

The current version, 8.1.8, was released on 2014-11-03.

Here is the description from Dave Love's site:

Son of Grid Engine is a community project to continue Sun's old gridengine ​free software project that used to live at ​http://gridengine.sunsource.net after Oracle shut down the site and stopped contributing code. (Univa now own the copyright — see below.) It will maintain copies of as much as possible/useful from the old site.

The idea is to encourage sharing, in the spirit of the original project, informed by long experience of free software projects and scientific computing support. Please contribute, and share code or ideas for improvement, especially any ideas for encouraging contribution.

This effort precedes Univa taking over gridengine maintenance and subsequently apparently making it entirely proprietary, rather than the originally-promised ‘open core’. What's here was originally based on ​Univa's free code and was intended to be fed into that.

See also the ​gridengine.org site, in particular the ​mail lists hosted there. The gridengine.org users list is probably the best one to use for general gridengine discussions and questions which aren't specific to this project.

Currently most information you find for the gridengine v6.2u5 release will apply to this effort, but the non-free documentation that used to be available from Oracle has been expurgated and no-one has the time/interest to replace it. See also Other Resources, particularly ​extra information locally, and the ​download area.

This wiki isn't currently generally editable, but will be when spam protection is in place; yes it needs reorganizing and expanding. If you're a known past contributor to gridengine and would like to help, please get in touch for access or to make any other contributions.

The Trac site  hosts the source repository. There is a download area for source and binary releases including RPMs for RHEL 5 and 6.

There are two mailing lists, sge-announce and sge-discuss, both hosted at liv.ac.uk.

RPMs for RHEL 6.5

RPMs shipped with SGE are not real RPMs: an additional installation step using the SGE installer is still required. They are essentially a proxy for tar files with some built-in checking. After you install them, you can move the directory to the necessary location and proceed with installation; for example, the real installation of the master daemon should be performed using the special installation script install_qmaster, as described in Installation of SGE Master Host. The key value of using RPMs is their ability to check for the presence of the necessary libraries. For example, an attempt to install RPMs compiled for RHEL fails on SLES because of library problems.

The content of an RPM can be viewed with the rpm -qlp <file> command.
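For example (the file name below is the 8.1.8 EL6 package from the download area; adjust it to your version):

```shell
# List the files the gridengine RPM would install, without installing it
rpm -qlp gridengine-8.1.8-1.el6.x86_64.rpm | head -20
```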

If you don't like the default target directory, which is /opt/sge, you can try relocating the RPM installation to a different directory using the RPM option --prefix, for example:

rpm -iv --prefix=/apps/sge gridengine-8.1.7-1.el6.x86_64.rpm

But usually /opt/sge is good enough, and it corresponds to the standard location for a large, complex software package.

You can't install the RPMs directly, as on a stock RHEL 6.5 system there are unresolved dependencies. You need to add the EPEL repository and get one Perl package (XML::Simple) from the RHEL add-on channel:

[0]root@sandbox: # yum install gridengine-8.1.8-1.el6.x86_64.rpm
Loaded plugins: product-id, refresh-packagekit, rhnplugin, security, subscription-manager
This system is receiving updates from RHN Classic or RHN Satellite.
Setting up Install Process
Examining gridengine-8.1.8-1.el6.x86_64.rpm: gridengine-8.1.8-1.el6.x86_64
Marking gridengine-8.1.8-1.el6.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package gridengine.x86_64 0:8.1.8-1.el6 will be installed
--> Processing Dependency: perl(XML::Simple) for package: gridengine-8.1.8-1.el6.x86_64
--> Processing Dependency: libhwloc.so.5()(64bit) for package: gridengine-8.1.8-1.el6.x86_64
--> Processing Dependency: libjemalloc.so.1()(64bit) for package: gridengine-8.1.8-1.el6.x86_64
--> Running transaction check
---> Package gridengine.x86_64 0:8.1.8-1.el6 will be installed
--> Processing Dependency: perl(XML::Simple) for package: gridengine-8.1.8-1.el6.x86_64
--> Processing Dependency: libjemalloc.so.1()(64bit) for package: gridengine-8.1.8-1.el6.x86_64
---> Package hwloc.x86_64 0:1.5-3.el6_5 will be installed
--> Finished Dependency Resolution
Error: Package: gridengine-8.1.8-1.el6.x86_64 (/gridengine-8.1.8-1.el6.x86_64)
           Requires: perl(XML::Simple)
Error: Package: gridengine-8.1.8-1.el6.x86_64 (/gridengine-8.1.8-1.el6.x86_64)
           Requires: libjemalloc.so.1()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
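One possible way to resolve this is sketched below. It assumes EPEL supplies jemalloc and that perl-XML-Simple is available from the RHEL optional/add-on channel or EPEL; on RHEL proper you may need to fetch the epel-release RPM from the Fedora project site rather than via yum.

```shell
# Enable EPEL, install the missing dependencies, then retry the RPM.
# Package names are the usual EL6 ones -- verify against your channels.
yum install -y epel-release
yum install -y jemalloc perl-XML-Simple
yum install -y gridengine-8.1.8-1.el6.x86_64.rpm
```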

For details about installing the RPMs and resolving their dependencies, see:

Migration issues

See Migrating to Grid Engine 8.1.8



Old News ;-)

Installing and Setting Up Sun Grid Engine on a Single Multi-Core PC

To embark on the installation of the gridengine packages, run the following command on your terminal:
sudo apt-get install \

gridengine-master gridengine-exec gridengine-common gridengine-qmon gridengine-client

Instead, you can run the shorter, and perhaps more error-prone, command

sudo apt-get install gridengine-*

A pop-up window will appear within the terminal during installation, with title "Configuring gridengine-common". A series of questions show up sequentially in this window:

  1. Question: "Configure SGE automatically?" Answer: highlight "<Yes>" and press "Enter".
  2. Question: "SGE cell name:" Answer: type "default", then press "Tab" to highlight "<Ok>" and press "Enter".
    Note here that you are free to choose any name you want for your SGE cell instead of "default", such as "sge_cell" for example. If you alter the SGE cell name, you will have to subsequently set the SGE_CELL variable in your ~/.bashrc file accordingly (assuming that bash is your default shell). For instance, if you set the SGE cell name to be sge_cell, you will add the following line in your ~/.bashrc:

    export SGE_CELL="sge_cell"

    Furthermore, you will need to add the above line of code in your /root/.bashrc file so that the SGE cell is also known to the root. It is advised that you leave the SGE cell name as it is, holding the "default" value.

  3. Question: "SGE master hostname:" Answer: type "localhost", then press "Tab" to highlight "<Ok>" and press "Enter".
    Instead of "localhost", you can choose the hostname of your computer, which can be found by running the "hostname" command from the terminal:
    hostname

After answering these three questions, the pop-up window closes and the installation continues on the terminal. If for any reason you need to reconfigure the gridengine-master package, you can do so by invoking the following command:

sudo dpkg-reconfigure gridengine-master

The installation of gridengine is now complete, yet this does not mean that you are necessarily ready to use SGE. First of all, check whether sge_qmaster and sge_execd are running by using the command

ps aux | grep "sge"

The output I got verified that sge_qmaster and sge_execd are running:

sgeadmin 1310 0.0 0.1 135968 5376 ? Sl 13:41 0:00 /usr/lib/gridengine/sge_qmaster

sgeadmin 1336 0.0 0.0 54760 1544 ? Sl 13:41 0:00 /usr/lib/gridengine/sge_execd

1000 3171 0.0 0.0 7780 860 pts/0 S+ 13:54 0:00 grep --colour=auto sge

If this is not the case for you, then start up sge_qmaster and sge_execd by executing the following three commands:

sudo su

sge_qmaster

sge_execd

Once you ensure that sge_qmaster and sge_execd are running, try to start qmon, the graphical user interface (GUI) for the administration of SGE:

sudo qmon

It is likely that the qmon window will not load, but instead you will get an error message. This is what I got:

Warning: Cannot convert string "-adobe-courier-medium-r-*--14-*-*-*-m-*-*-*" to type FontStruct

Warning: Cannot convert string "-adobe-courier-bold-r-*--14-*-*-*-m-*-*-*" to type FontStruct

Warning: Cannot convert string "-adobe-courier-medium-r-*--12-*-*-*-m-*-*-*" to type FontStruct

X Error of failed request: BadName (named color or font does not exist)

Major opcode of failed request: 45 (X_OpenFont)

Serial number of failed request: 643

Current serial number in output stream: 654

The error message indicates that some fonts are missing. The package which contains the necessary fonts is called xfonts-75dpi. In my case, xfonts-75dpi was installed automatically alongside the installation of the gridengine packages. Nevertheless, I got the error message because the fonts were not loaded after their installation. So, I merely restarted my computer. After rebooting, the "sudo qmon" command loaded the qmon window. If xfonts-75dpi is not installed on your system, then install it using the following command and then reboot:

sudo apt-get install xfonts-75dpi

After having resolved any possible font-related issues, "sudo qmon" should load the SGE admin window. If you let the window remain idle or if you try to press any of its buttons, such as "Job Control", the most likely event will be the appearance of a pop-up window with the text "cannot reach qmaster". Click on the "Abort" button of the pop-up window to terminate qmon. Also try the qstat command, which in my case gave the following error message:

error: commlib error: access denied (client IP resolved to host name "localhost". This is not identical to clients host name "russell")

error: unable to contact qmaster using port 6444 on host "russell"

It is useful to delve into the error message in conjunction with the /etc/hosts file of my system:

127.0.0.1 localhost

127.0.1.1 russell

# The following lines are desirable for IPv6 capable hosts

::1 localhost ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

ff02::3 ip6-allhosts

The hostname of my computer is "russell". According to the error message, SGE set the client hostname to "russell", whose LAN IP address is 127.0.1.1, while it set the client IP to 127.0.0.1, which is the LAN IP designated to the hostname "localhost". To resolve this ambiguity, I changed the first two lines of my /etc/hosts so that both hostnames "localhost" and "russell" share the same LAN IP (as a word of warning, make a backup of your /etc/hosts file before making any changes to it). To be more specific, I deleted the second line and appended the "russell" hostname to the end of the first line. My /etc/hosts file thus became:

127.0.0.1 localhost russell

# The following lines are desirable for IPv6 capable hosts

::1 localhost ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

ff02::3 ip6-allhosts

Moreover, it is possible that your /etc/hosts file contains by default the string "localhost.localdomain" in the first line, for example as in

127.0.0.1 localhost localhost.localdomain russell

If that's the case, make sure you remove "localhost.localdomain" so that only "localhost" and your machine's hostname ("russell" is my hostname), are tied to the LAN IP 127.0.0.1:

127.0.0.1 localhost russell

You may restart sge_qmaster and sge_execd, although it is not advised given that you made a fundamental change to your system's state by reconfiguring the association between IPs and hostnames in the /etc/hosts file.

Instead, you are advised to restart your computer before you proceed any further. After rebooting, "qstat" and "sudo qmon" should run without returning any error messages.

Installing Debian Linux, DRBL and the Sun Grid Engine (SGE) - Woojay Jeon

sites.google.com/site/woojay
V. Installing the Sun Grid Engine

I basically followed the installation instructions on the Grid Engine website to install qmaster via "./inst_sge -m". I used Padraig's specifications, which I am going to quote here:

I will just add that if the installer asks if you want to enable a JMX MBean server, you can answer no.

After installation, I ran:

   source /opt/oge/default/common/settings.sh
to configure various environment variables. I also added this command to my .bashrc file.

Installation instructions for Son of Grid Engine 8.1.6

Useful installation instructions for Son of Grid Engine 8.1.6.
fsl.fmrib.ox.ac.uk

This is a quick walk through to get Grid Engine going on Linux for those who would like to use it for something like FSL. This documentation is a little old, being written when the Grid Engine software was owned by Sun and often referred to as SGE (Sun Grid Engine). However, this covers the basic requirements. A quick start guide for Ubuntu/Debian is available here, but more detailed setup can be found on this page.

Since the demise of the open source (Sun) Grid Engine, various ports have sprung up. Ubuntu/Debian package the last publicly available release (6.2u5), but users of Red Hat variants (CentOS, Scientific Linux) or Debian/Ubuntu users wishing to use a more modern release should look to installing Son of Grid Engine which makes available RPM and DEB packages and is still actively maintained (last update November 2013).

Grid Engine generally consists of one master (qmaster) and a number of execute (exec) hosts. Note that the qmaster machine can also be an exec host, which is fine for small deployments, but large clusters should look to keeping these functions separate.

This documentation was originally produced by A. Janke (a.janke@gmail.com) and is now maintained by the FSL team.

NFS

Although Grid Engine can be configured such that all machines are self contained, the instructions here assume that at least some of the Grid Engine folders are shared amongst the controller (qmaster) and clients (exec hosts). To achieve this you will typically need to set up one or more NFS shares, covering at least the configuration files (see http://arc.liv.ac.uk/SGE/howto/nfsreduce.html). Further, the FSL binaries and the datasets to be operated on should be made available to all exec hosts in the same filesystem location. In the case of the FSL software, you could install it to the same location on all execution hosts, or install it to one location and NFS mount it at the same location on all hosts. In the case of datasets, the instructions here assume you are using NFS mounts, but through prolog and epilog scripts it is possible to set up Grid Engine to copy data to/from exec hosts.

Setting up NFS shares is beyond the scope of this document.

Name services

Grid Engine needs to be able to locate exec hosts/qmasters by host name. Assuming all of your hosts are known to your DNS service, you will have to do no work to set this up. If you don't have a DNS zone, you may need to configure the local /etc/hosts file to resolve hostnames, or look into the host aliases configuration (man host_aliases).
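A quick sanity check of name resolution, using standard tools (output will differ per site):

```shell
# The qmaster and all exec hosts must agree on how names resolve.
# getent consults the same sources (DNS, /etc/hosts) that Grid Engine will.
getent hosts localhost
getent hosts "$(hostname)"
```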

User accounts

Grid Engine runs the scheduled job as the user who submitted it, using the textual name form (not numeric ID). Consequently, all exec hosts need to know about all users who are going to submit jobs. In a very small scale setup you may wish to add the required users directly to each exec host, but this quickly becomes unmanageable, so we would recommend setting up some kind of centralised user database, e.g. LDAP, Active Directory.

Setting up shared user accounts is beyond the scope of this document.

Admin account

The Grid Engine software has to run as a privileged user in order to be able to run jobs as the submitting user. However, as this is a potential security issue, the grid software that communicates with the network can be run under an admin account that doesn't have root access. This account needs to be available on all cluster hosts, so either set this up locally, or add it to your central LDAP/user account system.

If you decide to have a locally defined daemon account, set it up as follows, running as the root user (this is the Red Hat dialect; for Ubuntu/Debian use the interactive adduser command):

useradd --home /opt/sge --system sgeadmin

which will add a system account (i.e. no home folder creation, no ageing of the account, etc.). This should be run on the qmaster and all exec hosts.

Service ports

Grid Engine communicates over two statically configured ports. These ports have to be the same on all computers, and can be configured in the file /etc/services or by changing the Grid Engine configuration setup files that all users need to source to be able to use the software. The latter option is best where you need to have more than one cluster in a location, as each qmaster/exec host has to communicate with the different clusters on different ports. Modern Linux distributions are already set up with entries for Grid Engine (use grep sge_qmaster /etc/services to confirm). If your distribution does not include the entries, then you need to add the following to this file:

sge_qmaster     6444/tcp                # Grid Engine Qmaster Service
sge_qmaster     6444/udp                # Grid Engine Qmaster Service
sge_execd       6445/tcp                # Grid Engine Execution Service
sge_execd       6445/udp                # Grid Engine Execution Service

commenting out any prior definitions for the ports 6444 and 6445.

... ... ...

Installation

Where we refer to $SGE_ROOT, when using the Son Of Grid Engine packages, this will be /opt/sge.

QMaster

Red Hat Enterprise etc

Installation of the RPMs should be carried out using YUM, as any additional software dependencies will be automatically resolved. A Grid master can be installed using:

yum install gridengine-8.1.6-1.el6.x86_64.rpm gridengine-qmaster-8.1.6-1.el6.x86_64.rpm gridengine-execd-8.1.6-1.el6.x86_64.rpm gridengine-qmon-8.1.6-1.el6.x86_64.rpm gridengine-guiinst-8.1.6-1.el6.noarch.rpm

Set an environment variable and then install the qmaster as such:

export SGE_ROOT=/opt/sge
cd $SGE_ROOT
./install_qmaster

Now go through the interactive install process:

Now that we are back to a shell (finally) we need to add a few things to our root .bashrc so that we can access the SGE binaries. Add the following lines to /root/.bashrc

   # SGE settings
   export SGE_ROOT=/opt/sge
   export SGE_CELL=default
   if [ -e $SGE_ROOT/$SGE_CELL ]
   then
      . $SGE_ROOT/$SGE_CELL/common/settings.sh
   fi

And then be sure to re-source your .bashrc

. /root/.bashrc

Now we can add our own username as an admin so that we can manage the system without becoming root.

qconf -am <myusername>

e.g. qconf -am jbloggs if your username is jbloggs.

Exec Host

The process for installing exec hosts is as follows

  1. Add the exec host to the master host as an admin host. If your exec host is called client.foo.com then run this on your master host:
    • qconf -ah client.foo.com
  2. On the client (client.foo.com)
    1. Add the sgeadmin username as per above
    2. Add the lines to /etc/services if required
    3. Add the SGE bits to /root/.bashrc and re-source it (. /root/.bashrc)
    4. Ensure the binaries have been installed
  3. Set an environment variable and then install the exec host (this might be the same machine as the queue master, for example if you only have one computer)
    • export SGE_ROOT=/opt/sge
      cd $SGE_ROOT
      ./install_execd
  4. Now go through the interactive install process:
  5. The installer will ask that you check that this host has been added as an administrative host with the qconf -ah <hostname> command. Ensure this is the case (you can remove it as an admin host after the install if you wish), then press enter to continue
  6. Make sure the Grid Engine root matches that configured on the Qmaster (/opt/sge)
  7. Ensure the cell name matches that configured on the master (the default, "default", is usually fine)
  8. Accept the sge_execd port setting
  9. Accept the message about the host being known as an admin host
  10. Make a decision about the spool directory. For medium to large clusters, local spool directories are the best option; for small or stand-alone installs the default (which should be on an NFS mount) is fine. An appropriate local spool folder name might be /var/spool/sge. If you choose to have a local spool folder, you will now receive a warning that the change of 'execd_spool_dir' will not be effective before execd has been restarted - you will have to stop/start the execd after completing the install for this to take effect.
  11. Press "y" to install the startup scripts
  12. Confirm you have read the following messages
  13. When asked about adding a default queue instance for this host answer "n" - FSL requires specific queues, so it is better to define these rather than the default queue.
  14. Press enter to accept the next message, "n" to refuse to see the previous screen again, and then finally enter to exit the installer

Repeat this installation procedure on all of the execution hosts...
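Once the exec hosts are installed, you can verify from the qmaster (or any admin host) that they have registered. These are standard SGE status commands:

```shell
qconf -sel   # list the execution hosts known to the qmaster
qhost        # show per-host architecture, load and memory
qconf -sql   # list any cluster queues defined so far
```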

[Nov 04, 2014] Son of Grid engine 8.1.8 released

Oct 24, 2014 | arc.liv.ac.uk

This is Son of Grid Engine version v8.1.8.

See http://arc.liv.ac.uk/repos/darcs/sge-release/NEWS for information on recent changes. See https://arc.liv.ac.uk/trac/SGE for more information.

The .deb and .rpm packages and the source tarball are signed with PGP key B5AEEEA9.

More RPMs (unsigned, unfortunately) are available at http://copr.fedoraproject.org/coprs/loveshack/SGE/
Version 8.1.8
-------------

* Building

  * Fix build with recent GNU binutils
  * Fix rpm build on Fedora and maybe use packaged swing-layout
  * Fix dpkg build on recent Debian/Ubuntu
  * uid_t/gid_t are unsigned on NetBSD
  * Install "work" in utilbin
  * Move qsched back to main package

* Bug fixes

  * Man fixes
  * Fix buffer overflow in qmake [#1507]
  * Avoid possible link-time error (Debian bug # 749413)
  * Fix some global messages from qstat -j
  * Avoid truncating usage values in accounting records
  * Fix execd remembering existing core binding when restarted [#1511]

* Enhancements

  * Nagios and Ganglia monitoring scripts
  * Python JSV implementation
  * Upstart and systemd startup files
  * Define DRMAA_LIBRARY_PATH in setup scripts

[SGE-discuss] A year of cpusets-cgroups with Son of Grid Engine

Mark Dixon m.c.dixon at leeds.ac.uk
Fri Oct 10 11:37:14 BST 2014

Hi there,

At Leeds University, we've been using the new(ish) cpuset feature in SoGE to bind jobs to cores for over a year now. I figured that some feedback to the list would be useful :)

We enabled it on a machine available for use by all academics here; a 3040 core cluster with InfiniBand, Lustre, RHEL6 and SoGE 8.1.6 (plus patches - most or all have made it into 8.1.7).

Our experience has been very positive: it is a big improvement compared to the original core binding mechanism in gridengine 6.2u5, because processes can no longer break out of their assigned cores by reassigning themselves a different taskset.

This is important: recent versions of most (all?) popular MPI implementations bind to cores by default to improve performance. It is desirable for both gridengine and MPI to do some level of core binding but, in a pre-cpuset world, it is all too easy for them to fight - generally resulting in some cores in a box being used by more than one job, while others are left idle.

Using cpusets makes that pain simply vanish: all the MPIs I have used behave well when presented with a cpuset, without any configuration. That alone makes me very happy!

There is always room for improvement, though: at least in 8.1.6, it can be a bit fiddly to setup the cpuset controller correctly and SoGE doesn't fail very gracefully when you get it wrong. Core binding decisions are still being made by the execd and not the qmaster, which is probably the root cause of bugs like #1479 and #1511.

All in all, a big thumbs up and thanks from us :)

[Jun 3, 2014] [SGE-discuss] SGE 8.1.7 available

SoGE 8.1.7 released (2014-06-03)

Version 8.1.7 of the Son of Grid Engine distribution is available from <http://arc.liv.ac.uk/downloads/SGE/releases/8.1.7/>, with various bug fixes and enhancements. A notable fix is for the longstanding problem of occasional major space leaks in qmaster with schedd_job_info=true.

Please report bugs, patches and suggestions for enhancement https://arc.liv.ac.uk/trac/SGE#mail.

Release notes:

Building

Bug fixes

Enhancements

Potentially incompatible changes

README

This is Son of Grid Engine version v8.1.7.

See  for information on
recent changes.  See  for more
information.

The .deb and .rpm packages and the source tarball are signed with PGP
key B5AEEEA9.  For some reason the el5 signatures won't verify on
RHEL5, but they can be verified by transferring the rpms to an RHEL6
system.

* sge-8.1.7.tar.gz, sge-8.1.7.tar.gz.sig:  Source tarball and PGP signature

* sge-8.1.7a.tar.gz:  Source with a fix for building with recent GNU
  binutils (RHEL7 and recent Fedora, at least)

* RPMs for Red Hat-ish systems, installing into /opt/sge with GUI
  installer and Hadoop support:

  * gridengine-8.1.7-1.src.rpm:  Source RPM for Red Hat 5/6, Fedora

  * gridengine-*8.1.7-1.el5.x86_64.rpm:  RPMs for Red Hat 5 (and
    CentOS, SL)

  * gridengine-*8.1.7-1.el6.x86_64.rpm:  RPMs for Red Hat 6 (and
    CentOS, SL)

  See  for source and
  binary RPMs of hwloc-1.4.1 for building/installing the RPMs above,
  if necessary.

* Debian packages, installing into /opt/sge, not providing the GUI
  installer or Hadoop support:

  * sge_8.1.7.dsc, sge_8.1.7.tar.gz:  Source packaging.  See
    , and see
     if you need (a more
    recent) hwloc.

  * sge-common_8.1.7_all.deb, sge-doc_8.1.7_all.deb,
    sge_8.1.7_amd64.deb, sge-dbg_8.1.7_amd64.deb: Binary packages
    built on Debian Wheezy.

* sge-8.1.7-common.tar.gz:  Common files to accompany binary tarballs
  (including GUI installer and Hadoop integration)

* arco-8.1.6.tar.gz:  ARCo source (unchanged from last version)

* dbwriter-8.1.6.tar.gz:  compiled dbwriter component of ARCo
  (unchanged from last version)

More (S)RPMS may be available at http://jur-linux.org/rpms/el-updates/,
thanks to Florian La Roche.

[SGE-discuss] Ballooning qmaster memory usage followed by a crash.

Adam Tygart mozes at ksu.edu
Fri May 16 14:50:17 BST 2014
Hello all,

I'm running SoGE 8.1.6 compiled from source. This has been running
fine for a few months, however overnight the memory usage of the
qmaster ballooned from about 1GB to ~60GB before the qmaster dies. I
can restart it, but within 2-3 minutes it happens again.

I've placed debug output from the scheduler and a qstat output, and a
sample from perf top here:
https://people.beocat.cis.ksu.edu/~mozes/sge/

We routinely have more jobs in the queue than we currently do, so
I am not sure what could be causing the issue.

Anyone have any thoughts on what is happening? Is there any other
information you would need?

Thanks,
Adam

[SGE-discuss] Ballooning qmaster memory usage followed by a crash.

John Foley jfoley at motorola.com
Fri May 16 14:56:24 BST 2014


Hi Adam - had the same thing happen to me yesterday (on UGE v8.1.7, but I'm
guessing it might be the same thing) --

after you restart the qmaster, before it locks up, run a qconf -msconf --
and check the line "schedd_job_info" - if it's set to "true", change it to
"false" and restart qmaster.

Give that a try and see if it helps.

   John
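You can check the current value before editing, and flip it without an interactive editor. The scripted-EDITOR trick is a common pattern with qconf's -m* commands, not an official interface:

```shell
# Show the scheduler configuration entry in question
qconf -ssconf | grep schedd_job_info
# Change it non-interactively by supplying sed as the "editor"
EDITOR="sed -i 's/^schedd_job_info.*/schedd_job_info false/'" qconf -msconf
```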

[SGE-discuss] Advice sort on SGE Queue Configurations

Reuti reuti at staff.uni-marburg.de
Wed May 21 19:34:07 BST 2014
Hi,

Am 21.05.2014 um 11:27 schrieb Dr Andrew Smith:


> We are currently building a small cluster that is mainly for users to develop their code. What we aim to provide is a system that supports:
> a fast turn around for small time limited jobs - for development and debugging
> the ability to run longer jobs overnight and at the weekends
In principle, short jobs can also run in a queue without any time limit. There are two options:

a)

Having two queues would allow you to set up a resource quota set (RQS) with a fixed number of maximum jobs of each type. During submission, it's best to request the necessary wall clock time for the job, and SGE will select an appropriate queue for it. There is no need to specify a queue in the `qsub ...` command.

While there is a calendar to switch on/off/suspend queues, there is no such thing in RQS. But you could adjust the number specified there with a cron job issuing `qconf -mattr resource_quota ...` to have a different number for each type at the weekends.
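An RQS of the kind described might look like this (queue names and slot counts are hypothetical; load it with `qconf -arqs` or edit it with `qconf -mrqs`):

```
{
   name         devel_limits
   description  Cap concurrent short and long development jobs
   enabled      TRUE
   limit        queues short.q to slots=64
   limit        queues long.q to slots=32
}
```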

b)

In case you prefer one queue, you could set up two custom consumable complexes and the jobs have to request them like "short" and "long" which were attached to the global exechost configuration with an arbitrary high value. The consumption of each consumable could then be limited by an RQS again.

I would go for a), as requesting time is more natural and users can't bypass it to run more long jobs although they shouldn't. You could of course attach an h_rt setting to each job in JSV. The users would learn in case b) to request the correct consumable.


NB: short jobs can also start in the long queue. This is normal as there is no minimum request setting for any limit (unless you implement this by a JSV and route jobs to dedicated queues on your own).


Sorry if this sounds confusing, but there are often several ways to achieve a goal, and it might be personal taste what one prefers.

> the ability to prioritise jobs for groups who "own" a share of the nodes
You can set up a share-tree policy, so that over a certain time frame the group gets the computing time equivalent to their share of the complete cluster.

-- Reuti

> I am looking for best practice on how we can achieve this. Are separate queues to address the three aims the way to go, or it is possible to have only one queue configured for this?
> Any advice, or pointers to documentation on how to do this, would be much appreciated.
> Thank you,
> Andrew
> <andrew_smith.vcf>_______________________________________________
> SGE-discuss mailing list
> SGE-discuss at liv.ac.uk
> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

Recommended Links


Son of Grid Engine -- the official site by Dave Love.

Installation of the Son of Grid Engine 8.1.8 RPMs of Execution Host

Installation of Son of Grid Engine 8.1.8 RPMs for Master Host

SGE 6.2u5 documentation (freely available and mostly generally applicable)



Etc

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusively for research and educational purposes. If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner.

ABUSE: IPs or network segments from which we detect a stream of probes may be blocked for no less than 90 days. Multiple types of probes increase this period.

Copyright © 1996-2016 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.

The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.

Copyright for original materials belongs to their respective owners. Quotes are made for educational purposes only, in compliance with the fair use doctrine.


This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links, as it develops like a living tree...

You can use PayPal to make a contribution, supporting development of this site and speeding up access. In case softpanorama.org is down, you can use the mirror at softpanorama.info.

Disclaimer:

The statements, views, and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP, or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: May, 07, 2015