Details About Condor Stats

condor_stats, KEEP_POOL_HISTORY, CondorView, viewhist

Wisdom on the use and operation of condor_stats, based on e-mail by Alan De Smet during December 2003 and January 2004.

  • condor_stats requires that the collector that you're querying have KEEP_POOL_HISTORY turned on.

  • The first field in the output from resourcequery is a percentage through the requested data set. Thus the first entry will have a value close to 0.0, while the last will be close to 100.0.

  • If you want time stamps (instead of percentage of data sets), use "-orgformat" which will present the timestamps in seconds since the Unix epoch. The fields are: Timestamp in seconds, machine name, ":", idle time in seconds, load, and the machine state encoded as a number. The machine state coding is:

    1. unclaimed
    2. matched
    3. claimed
    4. pre-empting
    5. owner
    6. shutdown
    7. delete
    8. backfill

  • Note that the machine state coding is replicated in several locations. In addition to adding a new machine state in src/condor_includes/condor_state.h, the new machine state must also be added to the Collector View server in src/condor_collector.V6/view_server.h and src/condor_collector.V6/view_server.C, and to condor_stats in src/condor_tools/stats.C.

  • The -to and -from options measure time from the start of the date. So "-from 11 30 2003 -to 12 1 2003" will show data for the 30th of November.

  • Actually generating the results is done in the collector code, not in condor_stats. In most cases whatever condor_collector sends back is dumped directly to the output. The only exception is -resourcequery without -orgformat; in that one case the output is tweaked (to convert machine states from numbers to strings). condor_collector.V6/view_server.C is where most of the logic is.

  • The "CondorView server" shell scripts that generate the HTML pages on the View Server were rewritten in C a long time ago by a student hourly. I think the source is here: /p/condor/workspaces/jepsen/src_java/condor/condorview/viewNT

      (see the original email from Todd about it in /p/condor/workspaces/jepsen/src_java/condor/condorview/todd.inst)

  • "Query type" consists of one of the following options: -resourcelist, -resourcequery, -resgrouplist, -resgroupquery, -userlist, -userquery, -usergrouplist, -usergroupquery, -ckptlist, -ckptquery.
    • You must have one query type specified.
    • You can specify only one query type. If multiple queries are specified, only the last one takes effect. (In the future it is likely that condor_stats will exit with an error in this case.)

  • The non-"list" options require another argument specifying the query. There doesn't appear to be a way to default to the local machine or the current user. The argument must exactly match the second field in the record. See the -orgformat notes below, or this summary. The examples below are confirmed to work on our pool (which is why I chose them).

    • -userquery email_address/submit_machine
      • Example: adesmet@cs.wisc.edu/puffin.cs.wisc.edu

    • -resourcequery hostname
      • Example: p22.cs.wisc.edu

    • -resgroupquery Architecture/Operating System or "Total"
      • Example: INTEL/LINUX
      • Example: Total

    • -usergroupquery email_address or "Total"
      • Example: adesmet@cs.wisc.edu
      • Example: Total

    • -ckptquery hostname
      • Example: toucan.cs.wisc.edu

  • Things that will cause condor_stats to abort with the usage message:
    • Failure to specify a query type.
    • Failure to pass additional information to arguments that require it. (Most arguments demand this. For example, -resourcequery and -from require an additional argument.)
    • A start date prior to the Unix epoch (Midnight UTC, Jan 1, 1970). This would typically be set with -from.
    • A finish date that is in the future. This would typically be set with -to.
    • A finish date before the start date.

  • All queries have a time range. If not specified, the end time defaults to "now", the start time defaults to 1 day (86,400 seconds) ago. Thus, "-lastday" is effectively the default time range.
    • You can only specify the start time once, and likewise the end time. If multiple times are specified, only the last one takes effect. -from sets the start time. -to sets the end time. The following set both start and end time: -lastday, -lastweek, -lastmonth, -lasthours.
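The time range rules above (defaults, dates measured from the start of the day, and the three date-related abort conditions) can be sketched in a few lines of Python. This is only an illustration of the rules as described in these notes, not the actual condor_stats code. The names resolve_range and midnight are made up for the example, and it uses naive datetimes instead of worrying about time zones.

```python
from datetime import datetime, timedelta

# Naive stand-in for "Midnight UTC, Jan 1, 1970" (time zones are ignored here).
EPOCH = datetime(1970, 1, 1)


def midnight(mdy):
    """Turn a (month, day, year) tuple, as in "-from 11 30 2003", into
    a datetime at the start of that date."""
    month, day, year = mdy
    return datetime(year, month, day)


def resolve_range(now, from_date=None, to_date=None):
    """Hypothetical helper: return (start, end) for a query.

    With no dates given, the range is the last 86,400 seconds,
    which is what -lastday would give.
    """
    start = midnight(from_date) if from_date else now - timedelta(seconds=86400)
    end = midnight(to_date) if to_date else now
    if start < EPOCH:
        raise ValueError("start date is prior to the Unix epoch")
    if end > now:
        raise ValueError("finish date is in the future")
    if end < start:
        raise ValueError("finish date is before the start date")
    return start, end


if __name__ == "__main__":
    now = datetime(2004, 1, 15, 12, 0, 0)
    # Covers the 30th of November only: both dates count from midnight.
    print(resolve_range(now, (11, 30, 2003), (12, 1, 2003)))
```

Note that "-from 11 30 2003 -to 12 1 2003" comes out as exactly one day, because the end date is taken at its first second rather than its last.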

-orgformat

-orgformat only affects those query types which do not end with "list". The only difference between -orgformat and the default is the first column. To determine what is in the default, look at the orgformat, remove everything up to and including the first colon, and replace it with the percentage of time. So, for example, the -resourcequery -orgformat might include the line:
  1074095821      puffin.cs.wisc.edu      :       37590     1.000 3

That's time in seconds since the epoch, machine name, ":", idle time in seconds, load, and machine state as an integer. Going back to the default (removing the orgformat), we get:

  79.779999       37590   1.000000        CLAIMED

Everything up to and including the colon has been replaced with the percentage time. (You may also notice that the machine state has been converted from a number to a string. This is a special case in the condor_stats code and shouldn't happen for other queries.)
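As a rough sketch of that conversion for a -resourcequery record, the Python below drops everything up to and including the colon, puts a percentage in its place, and maps the state number to a string. Two things here are assumptions rather than facts from the collector code: the percentage is computed as the linear position of the timestamp between the start and end of the requested range, and the state strings other than CLAIMED are just the names from the state list upper-cased. The function name org_to_default is hypothetical.

```python
# State numbers from the machine state coding list. Only "CLAIMED" is
# confirmed by the sample output; the other spellings are assumed.
STATE_NAMES = {
    1: "UNCLAIMED",
    2: "MATCHED",
    3: "CLAIMED",
    4: "PRE-EMPTING",
    5: "OWNER",
    6: "SHUTDOWN",
    7: "DELETE",
    8: "BACKFILL",
}


def org_to_default(line, start, end):
    """Convert one -resourcequery -orgformat line to the default layout.

    start and end are the query range in seconds since the epoch.
    ASSUMPTION: the percentage is the linear position of the timestamp
    within that range.
    """
    fields = line.split()
    colon = fields.index(":")
    timestamp = int(fields[0])
    percent = 100.0 * (timestamp - start) / (end - start)
    idle, load, state = fields[colon + 1:]
    state_name = STATE_NAMES.get(int(state), state)
    return "%f\t%s\t%f\t%s" % (percent, idle, float(load), state_name)


if __name__ == "__main__":
    sample = "1074095821      puffin.cs.wisc.edu      :       37590     1.000 3"
    print(org_to_default(sample, 1074095821 - 50, 1074095821 + 50))
```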

The -orgformat output of each query type corresponds directly to log files in POOL_HISTORY_DIR on the view collector. You can effectively replicate the query by grepping through the appropriate file. The mappings are as follows:

Command           Data file
-userlist         viewhist.0.*
-userquery        viewhist.0.*
-resourcelist     viewhist.1.*
-resourcequery    viewhist.1.*
-resgrouplist     viewhist.2.*
-resgroupquery    viewhist.2.*
-usergrouplist    viewhist.3.*
-usergroupquery   viewhist.3.*
-ckptlist         viewhist.4.*
-ckptquery        viewhist.4.*
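A minimal sketch of "grepping through the appropriate file" in Python follows. It uses the file patterns exactly as written in the table above (the file names on your view collector may differ) and matches the query argument against the second field of each record. The function name grep_history and the directory path in the usage example are made up for illustration.

```python
import glob
import os

# Command-to-file mapping, copied from the table above.
HISTORY_FILES = {
    "-userlist": "viewhist.0.*",
    "-userquery": "viewhist.0.*",
    "-resourcelist": "viewhist.1.*",
    "-resourcequery": "viewhist.1.*",
    "-resgrouplist": "viewhist.2.*",
    "-resgroupquery": "viewhist.2.*",
    "-usergrouplist": "viewhist.3.*",
    "-usergroupquery": "viewhist.3.*",
    "-ckptlist": "viewhist.4.*",
    "-ckptquery": "viewhist.4.*",
}


def grep_history(history_dir, option, name):
    """Yield every record whose second field is exactly `name`.

    history_dir stands in for the value of POOL_HISTORY_DIR.
    """
    pattern = os.path.join(history_dir, HISTORY_FILES[option])
    for path in sorted(glob.glob(pattern)):
        with open(path) as history:
            for line in history:
                fields = line.split()
                if len(fields) > 1 and fields[1] == name:
                    yield line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical directory; substitute your own POOL_HISTORY_DIR.
    for record in grep_history("/path/to/pool_history", "-resourcequery", "p22.cs.wisc.edu"):
        print(record)
```

Unlike a real query, this ignores the time range, so it returns every matching record in every matching file.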

The second number in the file name is the granularity of the data. The *.0 file has the highest sampling frequency but covers the shortest period, while the *.2 file has the lowest sampling frequency but covers the longest period. The *.0 file contains samples every 4*POOL_HISTORY_SAMPLING_INTERVAL seconds. The *.1 files contain samples 1/4th as often as the *.0 files, while the *.2 files contain samples 1/4th as often as the *.1 files (or 1/16th as often as the *.0 files).

As a given written sample represents at least 4 samples and as many as 64, the sub samples (taken every POOL_HISTORY_SAMPLING_INTERVAL seconds) are averaged together. So a single entry in a *.0 file is the average of 4 samples, while a single entry in the *.2 file is the average of 64 samples.
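The arithmetic works out as below. This is only a sketch of the numbers described above, not the collector's implementation: the function names are made up, the 60-second interval in the usage example is just a sample value for POOL_HISTORY_SAMPLING_INTERVAL, and averaging consecutive groups of raw sub-samples is a simplification of whatever view_server.C really does.

```python
def written_sample_period(sampling_interval, level):
    """Seconds between written entries in a *.<level> file.

    Level 0 is 4 intervals, level 1 is 16, level 2 is 64.
    """
    return sampling_interval * 4 ** (level + 1)


def average_subsamples(subsamples, level):
    """Average consecutive groups of 4, 16 or 64 sub-samples.

    Any incomplete group at the end is dropped.
    """
    group = 4 ** (level + 1)
    averages = []
    for i in range(0, len(subsamples) - group + 1, group):
        chunk = subsamples[i:i + group]
        averages.append(sum(chunk) / len(chunk))
    return averages


if __name__ == "__main__":
    for level in range(3):
        print(level, written_sample_period(60, level))
    print(average_subsamples([1, 2, 3, 4, 5, 6, 7, 8], 0))
```

With a 60-second interval, the three levels would get a new entry every 240, 960 and 3840 seconds respectively.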

File format

This is the format of the various viewhist.*.* files. Because -orgformat returns the same information, this is also the format of -orgformat's output. In the actual output, fields are separated by spaces and records are separated by newlines.

viewhist.0.* / -userquery -orgformat

1071109949      adesmet@cs.wisc.edu/puffin.cs.wisc.edu      :       16      0
  • Timestamp measured in seconds since the Unix epoch
  • user_email_address/submit_machine
  • :
  • Average JobsRunning as integer
  • Average JobsIdle as integer

viewhist.2.* / -resgroupquery -orgformat

1055836559      Total   :       55.0    0.8     729.8   0.8     83.8
1055836559      INTEL/LINUX     :       43.8    0.8     578.8   0.8     20.0
  • Timestamp measured in seconds since the Unix epoch
  • machine type (Architecture/Operating System) or "Total" for all machines
  • :
  • Average Machines reporting unclaimed state as floating point number with one decimal place
  • Average Machines reporting matched state as floating point number with one decimal place
  • Average Machines reporting claimed state as floating point number with one decimal place
  • Average Machines reporting preempting state as floating point number with one decimal place
  • Average Machines reporting owner state as floating point number with one decimal place

viewhist.1.* / -resourcequery -orgformat

1074101829      p66.cs.wisc.edu :       30179368          0.130 3
  • Timestamp measured in seconds since the Unix epoch
  • startd machine name
  • :
  • Average Keyboard Idle in seconds as integer
  • Average Load Average as floating point number with 3 decimal places
  • Last Machine State as integer

viewhist.4.* / -ckptquery -orgformat

1057703428      toucan.cs.wisc.edu      :       45.379  136.138 1106.393        8196.154
  • Timestamp measured in seconds since the Unix epoch
  • checkpoint machine name
  • :
  • Average Bytes Received as floating point number with 3 decimal places
  • Average Bytes Sent as floating point number with 3 decimal places
  • Average Receive Bandwidth as floating point number with 3 decimal places
  • Average Send Bandwidth as floating point number with 3 decimal places

viewhist.3.* / -usergroupquery -orgformat

1072743565      matthew@cs.wisc.edu     :       3       22
  • Timestamp measured in seconds since the Unix epoch
  • user address
  • :
  • Average Jobs Running as integer
  • Average Jobs Idle as integer
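The five record layouts above can be parsed with one small table-driven function. The sketch below is illustrative only: parse_record and the field names are made up for the example (they are not identifiers from the Condor source), and the record type is selected by the first number in the viewhist file name.

```python
# Field names after the colon, keyed by the first number in the
# viewhist file name. The names are invented for this sketch.
FIELD_NAMES = {
    0: ("jobs_running", "jobs_idle"),
    1: ("keyboard_idle", "load_average", "machine_state"),
    2: ("unclaimed", "matched", "claimed", "preempting", "owner"),
    3: ("jobs_running", "jobs_idle"),
    4: ("bytes_received", "bytes_sent", "receive_bandwidth", "send_bandwidth"),
}

# Fields written as integers; everything else is floating point.
INTEGER_FIELDS = {"jobs_running", "jobs_idle", "keyboard_idle", "machine_state"}


def parse_record(file_number, line):
    """Parse one viewhist / -orgformat record into a dict."""
    fields = line.split()
    colon = fields.index(":")
    names = FIELD_NAMES[file_number]
    values = fields[colon + 1:]
    if len(values) != len(names):
        raise ValueError("expected %d values, got %d" % (len(names), len(values)))
    record = {
        "timestamp": int(fields[0]),
        "name": " ".join(fields[1:colon]),
    }
    for field_name, value in zip(names, values):
        if field_name in INTEGER_FIELDS:
            record[field_name] = int(value)
        else:
            record[field_name] = float(value)
    return record


if __name__ == "__main__":
    print(parse_record(1, "1074101829      p66.cs.wisc.edu :       30179368          0.130 3"))
    print(parse_record(2, "1055836559      INTEL/LINUX     :       43.8    0.8     578.8   0.8     20.0"))
```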

Query Types

Command Line      Query name (QUERY_HIST_*)   Data Name (*Data)   File (viewhist.#)
-userlist         SUBMITTOR_LIST              Submittor           0
-userquery        SUBMITTOR                   Submittor           0
-resourcelist     STARTD_LIST                 Startd              1
-resourcequery    STARTD                      Startd              1
-resgrouplist     GROUPS_LIST                 Groups              2
-resgroupquery    GROUPS                      Groups              2
-usergrouplist    SUBMITTORGROUPS_LIST        SubmittorGroups     3
-usergroupquery   SUBMITTORGROUPS             SubmittorGroups     3
-ckptlist         CKPTSRVR_LIST               Ckpt                4
-ckptquery        CKPTSRVR                    Ckpt                4

(The File entry is the first number in the viewhist file name. The second number is the archive number used when the logs roll over.)