This is a pretty wimpy description of Active Threads. It should be
enough to start using/testing. The current (May 1998) version is
1.2 Beta (matches the Sather compiler version).

Active Threads is a portable, very lightweight thread package. 
See doc for a detailed description.

Most code is written from scratch. Some assembly for Sparc register
window save/restore and valuable insights are due to David Keppel, UW.

See examples in ./tests for usage.




Basic Types
=============================================================================
at_thread_t
	Active Threads thread type
at_bundle_t
	Active Threads bundle type
at_mutex_t
	Blocking mutual exclusion lock
at_sema_t
	Blocking semaphore
at_spinlock_t
	Spinning mutual exclusion lock (the current implementation first 
	spins on a local copy to minimize bus traffic)
at_hybridlock_t
	A hybrid implementation of a mutual exclusion lock. This is 
	essentially a two-phase mutual exclusion lock that combines spinning 
	with blocking. The semantics is the same as at_mutex_t, but the 
	calling thread may spin for a while before blocking. The current 
	implementation uses an exponential back-off policy for the spinning 
	interval.
at_userf_x_t 
	where x is currently between 0 and 6. The type of a user function 
	passed to thread creation. Such a function takes x arguments of type 
	at_word_t.
at_word_t
	A type that maps to a word-sized entity of the underlying 
	architecture. There is no requirement for the number of bits.

Scheduler types
=============================================================================
typedef struct at_scheduler {
  void (*thread_created)(at_bundle_t *b, at_thread_t *t);
  void (*thread_terminated)(at_bundle_t *b, at_thread_t *t);
  void (*thread_started)(at_bundle_t *b, at_thread_t *t);
  void (*thread_blocked)(at_bundle_t *b, at_thread_t *t);
  void (*thread_unblocked)(at_bundle_t *b, at_thread_t *t);
  void (*bundle_created)(at_bundle_t *parent, at_bundle_t *b);
  void (*bundle_terminated)(at_bundle_t *parent, at_bundle_t *b);
  void (*processor_idle)(at_bundle_t *b, int proc);
} at_scheduler_t;

This is the only type that a user-supplied scheduler must implement to 
extend the Active Threads scheduling library. Any scheduler implementation 
must provide event handlers with the above interfaces for the eight events 
vectored by the Active Threads runtime: thread created, thread terminated, 
thread started, thread blocked, thread unblocked, bundle created, bundle 
terminated and processor idle. Active Threads imposes no restrictions on 
the internals of scheduler data structures.
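
As a rough illustration of what filling in this table looks like, here is
a minimal sketch of a scheduler that only counts events. The two opaque
type declarations are stand-ins so the fragment compiles by itself (the
real ones come from the Active Threads headers), and sched_stats, the
on_* handlers and counting_scheduler are hypothetical names. How a table
like this gets attached to a bundle is not covered in this README, so
that step is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles on its own. With the real package
   these types come from the Active Threads headers. */
typedef struct at_thread at_thread_t;
typedef struct at_bundle at_bundle_t;

/* Copied from the definition above. */
typedef struct at_scheduler {
  void (*thread_created)(at_bundle_t *b, at_thread_t *t);
  void (*thread_terminated)(at_bundle_t *b, at_thread_t *t);
  void (*thread_started)(at_bundle_t *b, at_thread_t *t);
  void (*thread_blocked)(at_bundle_t *b, at_thread_t *t);
  void (*thread_unblocked)(at_bundle_t *b, at_thread_t *t);
  void (*bundle_created)(at_bundle_t *parent, at_bundle_t *b);
  void (*bundle_terminated)(at_bundle_t *parent, at_bundle_t *b);
  void (*processor_idle)(at_bundle_t *b, int proc);
} at_scheduler_t;

/* Hypothetical scheduler state: plain event counters, no run queue. */
struct { int live; int blocked; int idle_polls; } sched_stats;

static void on_thread_created(at_bundle_t *b, at_thread_t *t)
{ (void)b; (void)t; sched_stats.live++; }

static void on_thread_terminated(at_bundle_t *b, at_thread_t *t)
{ (void)b; (void)t; sched_stats.live--; }

static void on_thread_blocked(at_bundle_t *b, at_thread_t *t)
{ (void)b; (void)t; sched_stats.blocked++; }

static void on_thread_unblocked(at_bundle_t *b, at_thread_t *t)
{
  (void)b; (void)t;
  assert(sched_stats.blocked > 0);  /* must pair with an earlier block */
  sched_stats.blocked--;
}

static void on_processor_idle(at_bundle_t *b, int proc)
{ (void)b; (void)proc; sched_stats.idle_polls++; }

/* Events this scheduler does not care about still need a handler. */
static void ignore_thread_event(at_bundle_t *b, at_thread_t *t)
{ (void)b; (void)t; }

static void ignore_bundle_event(at_bundle_t *parent, at_bundle_t *b)
{ (void)parent; (void)b; }

/* All eight slots are filled, in the order the struct declares them. */
at_scheduler_t counting_scheduler = {
  on_thread_created,
  on_thread_terminated,
  ignore_thread_event,   /* thread_started */
  on_thread_blocked,
  on_thread_unblocked,
  ignore_bundle_event,   /* bundle_created */
  ignore_bundle_event,   /* bundle_terminated */
  on_processor_idle
};
```

A real scheduler would keep a run queue (or several) where this one keeps
counters; the point here is only the shape of the handler table.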


Basic Thread Operations
=============================================================================
at_thread_t *at_create_x(at_bundle_t *b, int affinity, at_userf_x_t *func, 
	at_word_t arg0,...)

	Create a new thread of control that will execute a user supplied 
	function func with the supplied arguments. The current implementation 
	supports up to 6 arguments. The thread is added to a specified bundle 
	b. If b supports locality-based scheduling, affinity may be used as 
	a virtual processor affinity annotation for a thread. All schedulers 
	that implement some form of locality-based scheduling must accept a 
	full range of virtual processors. AT_UNBOUND can be used if the bundle
	does not support affinity scheduling, or if the thread does not intend
	to take advantage of it.
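
Since every thread argument travels as an at_word_t, pointers and small
integers have to be cast on the way in and back out. The sketch below
shows one way to do that for a two-argument thread function. It makes two
assumptions that this README does not confirm: at_word_t is modelled as
uintptr_t, and the user function returns void. sum_job_t and sum_worker
are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a word-sized integer wide enough to hold a pointer.
   The real typedef comes from the Active Threads headers. */
typedef uintptr_t at_word_t;

/* Hypothetical work item shared between creator and thread. */
typedef struct {
  const int *data;
  int len;
  long sum;      /* result, written by the worker */
} sum_job_t;

/* A thread body in the shape of a two-argument user function:
   every parameter arrives as an at_word_t and is cast back. */
void sum_worker(at_word_t job_word, at_word_t stride_word)
{
  sum_job_t *job = (sum_job_t *)job_word;
  int stride = (int)stride_word;

  assert(stride > 0);
  job->sum = 0;
  for (int i = 0; i < job->len; i += stride)
    job->sum += job->data[i];
}
```

With the real package, the creating thread would presumably hand this
function to the two-argument creation call, casting &job and the stride to
at_word_t and passing AT_UNBOUND as the affinity. The job structure must
outlive the thread that works on it.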

void at_yield()
	yield execution to another thread.

void at_exit()
	terminate the calling thread

at_thread_t *at_self()
	return a pointer to the thread structure of a calling thread

void at_setlocal(void* addr)
	set the local memory base address for the currently running thread 
	to a specified value
void *at_getlocal()
	get the local memory base address for the currently running thread
void at_stop()
	stop execution of any new threads until a call to at_continue(). 
	Threads in progress are not affected; they run until they block 
	or terminate.
void at_continue()
	resume thread execution
int at_get_affinity()
	return virtual processor affinity of a calling thread.
void at_set_affinity(int vproc)
	Changes the affinity of a calling thread to a specified virtual 
	processor. This is conceptually equivalent to thread blocking and 
	resuming on a physical processor to which a specified virtual 
	processor is mapped.
void at_create_local(at_thread_t *t)
	creates new local storage for thread t, of the size specified at 
	Active Threads initialization. Can be used, for instance, to 
	implement a lazy allocation policy.
void at_destroy_local(at_thread_t *t)
	returns thread's local storage to a pool maintained by Active Threads.
void at_create_stack(at_thread_t *t)
	creates a new thread stack for a specified thread.
at_destroy_stack(at_thread_t *t)
	returns a supplied thread's stack to a pool of stacks maintained 
	by the Active Threads runtime.



Bundle Operations
=============================================================================
at_bundle_t *at_bundle_create(at_bundle_t *parent, int type)
	create a new bundle of a specified type as a child of a parent bundle.
void *at_bundle_destroy(at_bundle_t *b)
	destroy the bundle which is no longer needed.
at_bundle_t* at_get_focus()
	obtain a bundle that has a current execution focus. A bundle with 
	a focus obtains all events that the Active Threads runtime vectors 
	to thread schedulers. It may handle them or pass them up or down 
	the bundle activation tree for handling.
void at_set_focus(at_bundle_t *b)
	set execution focus to a supplied bundle b.
void at_destroy_bundle(at_bundle_t *bundle)
	destroy the bundle (there better be no threads attached to it!)

Synchronization Objects
=============================================================================
Blocking Mutual Exclusion Locks
-------------------------------
at_mutex_t* at_mutex_create()
	create a new mutual exclusion lock
void at_mutex_init(at_mutex_t *m)
	initialize a (possibly statically allocated) mutex. The mutex is 
	initialized to the unlocked state.
at_mutex_destroy(at_mutex_t *mutex)
	destroy a mutex that is no longer needed
void at_mutex_lock(at_mutex_t *mutex)
	lock the mutex pointed to by "mutex". If the mutex is already 
	locked, the calling thread blocks until the mutex becomes available.
void at_mutex_unlock(at_mutex_t *mutex)
	unlock the mutex pointed to by "mutex"
at_mutex_trylock(at_mutex_t *mutex)
	attempt to lock the mutex. If successful, locks the mutex and 
	returns 1, otherwise returns 0.

Readers/Writer Locks
-------------------------------
at_rw_t* at_rw_create()
	create a multiple reader, single writer lock 
	(in the unlocked state).
void at_rw_init(at_rw_t *rw)
	initialize (possibly statically allocated) readers/writer lock. 
	The lock is initialized to the unlocked state.
void at_rw_destroy(at_rw_t *rw)
	destroy the readers/writer lock
void at_rw_rd_lock(at_rw_t *rw)
	Acquire a read lock on the readers/writer lock. If rw is 
	already locked for writing, the calling thread blocks until 
	the writer lock is released. Many threads can acquire the 
	reader lock of rw at the same time.
int at_rw_rd_trylock(at_rw_t *rw)
	Attempt to acquire a read lock on rw. Returns 1 if the lock is 
	acquired, 0 otherwise.
void at_rw_wr_lock(at_rw_t *rw)
	Acquire a write lock on the readers/writer lock. If rw is 
	already locked for either reading or writing, the calling 
	thread blocks until all readers or the writer release the 
	lock. Only a single thread can hold a write lock of rw at any time.
int at_rw_wr_trylock(at_rw_t *rw)
	Attempt to acquire a write lock on rw. Returns 1 if the lock 
	is acquired, 0 otherwise.
void at_rw_rd_unlock(at_rw_t *rw)
	unlock the read lock of the readers/writer lock pointed to by rw.
void at_rw_wr_unlock(at_rw_t *rw)
	unlock the write lock of the readers/writer lock pointed to by rw.


Blocking Semaphores
-------------------------------
at_sema_t *at_sema_create(int count)
	create a counting semaphore and set it to a specified value. count 
	must be non-negative.
void at_sema_init(at_sema_t *s, int count)
	initialize (possibly statically allocated) semaphore to 'count'
void at_sema_destroy(at_sema_t *sema)
	destroy a counting semaphore
void at_sema_wait(at_sema_t *sema)
	a calling thread may proceed only if the value of the semaphore is 
	currently greater than 0. If the semaphore value is positive, it 
	is decremented and the calling thread continues. Otherwise, the 
	calling thread blocks until the semaphore counter becomes positive.
int at_sema_trywait(at_sema_t *sema)
	a nonblocking version of the previous call. If the semaphore counter 
	is positive, its semantics is equivalent to that of at_sema_wait, but 
	it also returns 1. Otherwise, it returns 0 and does not change the 
	semaphore counter.
void at_sema_signal(at_sema_t *sema)
	increment the count of a semaphore. If prior to the call, the value 
	of sema was 0, and there were threads blocked on the semaphore, one 
	of them is unblocked and allowed to return from its call to 
	at_sema_wait().
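
The counting rules above can be checked with a small single-threaded
model. This is not the at_sema_t implementation and it never blocks
anything: it only tracks the counter and the number of threads that
would be asleep, so the effect of each call can be read off directly.
All sema_model_* names are hypothetical.

```c
#include <assert.h>

/* Hypothetical bookkeeping model of a counting semaphore. */
typedef struct {
  int count;    /* semaphore value, never negative */
  int blocked;  /* threads that would be sleeping in at_sema_wait() */
} sema_model_t;

void sema_model_init(sema_model_t *s, int count)
{
  assert(count >= 0);   /* count must be non-negative */
  s->count = count;
  s->blocked = 0;
}

/* Like at_sema_trywait(): take a unit if one is there, never block. */
int sema_model_trywait(sema_model_t *s)
{
  if (s->count > 0) {
    s->count--;
    return 1;
  }
  return 0;
}

/* Like at_sema_wait(): returns 1 if the caller proceeds at once,
   0 if it would block (the model just records one more sleeper). */
int sema_model_wait(sema_model_t *s)
{
  if (sema_model_trywait(s))
    return 1;
  s->blocked++;
  return 0;
}

/* Like at_sema_signal(): returns 1 if a sleeper was released.
   The increment and the woken thread's decrement cancel out,
   so the counter stays at 0 in that case. */
int sema_model_signal(sema_model_t *s)
{
  if (s->blocked > 0) {
    s->blocked--;
    return 1;
  }
  s->count++;
  return 0;
}
```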

Blocking Barrier
-------------------------------
at_barrier_t *at_barrier_create(int size)
	create a barrier object that becomes "open" when size threads 
	try to enter it. As long as the number of such threads is below 
	size, the threads are all blocked on the barrier. 
void at_barrier_init(at_barrier_t *barrier, int size)
	initialize a (possibly statically allocated) barrier to size.
void at_barrier_destroy(at_barrier_t *barrier)
	destroy a barrier object.
void at_barrier_enter(at_barrier_t *barrier)
	If the number of threads that have reached a barrier 
	(including the calling thread) is `size' (specified during 
	barrier initialization), all threads sleeping on a barrier are 
	unblocked. Otherwise, the calling thread blocks on the barrier.

Condition Variables
-------------------------------
Condition variables enable threads to block until an arbitrary condition 
is satisfied. The condition must always be tested under the protection of 
a mutex. When the condition is false, the thread blocks on the condition 
variable by calling at_cond_wait(), and the Active Threads runtime releases 
the mutex on the thread's behalf. Blocking on the condition variable and 
releasing the mutex happen atomically. Any thread that changes the 
condition can announce the change by calling at_cond_signal() or 
at_cond_broadcast() on the condition variable.

at_cond_t *at_cond_create()
	create a new condition variable
void at_cond_init(at_cond_t *c)
	initialize (possibly statically allocated) condition variable
void at_cond_destroy(at_cond_t *c)
	destroy a condition variable
void at_cond_wait(at_cond_t *c, at_mutex_t *mx)
	atomically releases the mutex pointed to by "mx" and causes 
	the calling thread to block on the condition variable pointed 
	to by "c". The blocked thread may subsequently be awakened by 
	at_cond_signal() or at_cond_broadcast(). The associated condition 
	must be re-evaluated after a signal unblocks the thread.
void at_cond_signal(at_cond_t *c)
	unblocks one thread that is blocked on the condition variable 
	pointed to by "c"
void at_cond_broadcast(at_cond_t *c)
	unblocks all threads that are blocked on the condition variable 
	pointed to by "c"
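
The usual way to honor the "test under the mutex, re-test after waking"
rule is a while loop around the wait. The sketch below shows that pattern
with POSIX threads rather than Active Threads, so that it compiles without
this package; the at_mutex_* and at_cond_* calls described above would be
used in the same places. The items counter and the produce/consume
functions are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static int items = 0;   /* the condition is "items > 0" */

/* Change the condition under the mutex, then signal one waiter. */
void produce(void)
{
  pthread_mutex_lock(&mx);
  items++;
  pthread_cond_signal(&nonempty);
  pthread_mutex_unlock(&mx);
}

/* Take one item and return how many are left. */
int consume(void)
{
  int left;

  pthread_mutex_lock(&mx);
  /* A loop, not an if: the condition is checked again after every
     wakeup, because another thread may have taken the item first. */
  while (items == 0)
    pthread_cond_wait(&nonempty, &mx);   /* releases mx while asleep */
  left = --items;
  pthread_mutex_unlock(&mx);
  return left;
}
```

If consume() is called while items is already positive, the loop body is
skipped and the thread never blocks.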

Spinning Mutual Exclusion Locks
-------------------------------
AT_SPINLOCK_DEC(s)
	declare a spinlock. Spinlocks don't need to be explicitly created 
	or deleted, but they do need to be explicitly initialized before use.
	The special type at_spinlock_t may be used for declarations or 
	typedefs, but AT_SPINLOCK_DEC is preferred when possible.
AT_SPINLOCK_INIT(s)
	initialize the spinlock
AT_SPINLOCK_LOCK(x)
	lock the spinlock. This may trigger busy-waiting if the spinlock 
	is already locked by another thread. The current implementation 
	first waits on a local cached copy of a spinlock to minimize the 
	bus traffic.
AT_SPINLOCK_UNLOCK(x)
	unlock the spinlock. If prior to the call there were threads busy 
	waiting on the spinlock, a single thread is allowed to acquire the 
	spinlock and return from a call to AT_SPINLOCK_LOCK()
AT_SPINLOCK_TRY(x)
	a non-blocking version. If successful, locks the spinlock and 
	returns 1, otherwise returns 0.
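
Waiting on a local cached copy before touching the bus is commonly called
test-and-test-and-set. The following is a sketch of that idea using C11
atomics. It is not the code behind the AT_SPINLOCK_* macros, which is
platform-specific; the ttas_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int held; } ttas_lock_t;   /* 0 = free, 1 = held */

void ttas_init(ttas_lock_t *l)
{
  atomic_init(&l->held, 0);
}

/* Non-blocking attempt: returns 1 if the lock was taken, 0 otherwise. */
int ttas_try(ttas_lock_t *l)
{
  /* Plain read first: if the lock looks held, skip the atomic
     exchange, which would generate bus traffic. */
  if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
    return 0;
  return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

void ttas_lock(ttas_lock_t *l)
{
  for (;;) {
    /* Spin on the (cached) value until the lock looks free ... */
    while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
      ;
    /* ... and only then compete for it with an atomic exchange. */
    if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
      return;
  }
}

void ttas_unlock(ttas_lock_t *l)
{
  atomic_store_explicit(&l->held, 0, memory_order_release);
}
```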

Hybrid Implementation of Blocking Mutual Exclusion Lock
--------------------------------------------------------
	The semantics of hybridlocks are equivalent to those of mutual 
	exclusion locks (but different from those of spinlocks!). A hybridlock 
	performs some busy-waiting if the lock is already held, in an attempt 
	to avoid a context switch. If the lock remains held, the calling 
	thread eventually blocks. The current implementation uses an 
	exponential back-off waiting policy for the spinning phase.
AT_HYBRIDLOCK_DEC(s)
AT_HYBRIDLOCK_INIT(s)
AT_HYBRIDLOCK_LOCK(x)
AT_HYBRIDLOCK_UNLOCK(x)
AT_HYBRIDLOCK_TRY(x)
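
To make the two-phase idea concrete, here is a sketch of the spinning
phase with exponential back-off, again with C11 atomics. It is not the
Active Threads implementation: the back-off constants are made up, the
two_phase_* names are hypothetical, and the blocking phase is left to the
caller because it needs the thread scheduler.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed tuning constants, not taken from Active Threads. */
enum { SPIN_FIRST = 4, SPIN_LAST = 64 };

typedef struct { atomic_int held; } two_phase_lock_t;  /* 0 = free */

void two_phase_init(two_phase_lock_t *l)
{
  atomic_init(&l->held, 0);
}

/* Phase one: try the lock, and after each failure pause for
   4, 8, 16, 32, 64 empty iterations before trying again.
   Returns 1 if the lock was acquired while spinning. Returns 0
   when the spin budget is used up, which tells the caller to go
   to phase two and block instead of burning more cycles. */
int two_phase_spin(two_phase_lock_t *l)
{
  for (int pause = SPIN_FIRST; pause <= SPIN_LAST; pause *= 2) {
    if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
      return 1;
    for (volatile int i = 0; i < pause; i++)
      ;  /* back off; the interval doubles each round */
  }
  return 0;
}

void two_phase_unlock(two_phase_lock_t *l)
{
  atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Doubling the pause keeps a short critical section cheap to wait for while
bounding the total time a thread can spin before it gives up the processor.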


Miscellaneous
--------------------------------------------------------
void at_init(unsigned int concurrency, unsigned int stack_size, 
	unsigned int local_size)
	initialize the Active Threads package. Use specified concurrency 
	level, stack size and local storage size.
	Concurrency is between 1 and the number of physical processors.

void at_do_when_idle(void (*func)())
	register a function to be called when a processor is out of work and 
	there are no threads to run. This could be used, for example, 
	to periodically service the network or perform incremental garbage 
	collection.

int at_ncpus() 
	returns the number of physical processors

int at_vproc()
	returns the current virtual processor number, or -1 if the
	thread is unbound. Could be thought of as an alias for
	at_get_affinity(). The virtual processor number has no relation
	to the physical processor number. When a thread is created,
	it can be assigned to a virtual processor. If the bundle
	supports memory-conscious scheduling (the default does not), it 
	will try to run threads bound to the same virtual cpu on the 
	same physical cpu. Virtual processor numbers can be arbitrarily 
	large.

int at_cpu()
	returns the physical processor on which the calling thread is 
	executing

at_join_all()
	blocks until the calling thread is the only *user* thread in the
	system (that is, all others have terminated). Note that calling this
	from two threads causes an instant deadlock, as they cannot
	join with each other.

at_thread_count()
	returns the *current* number of threads in the system



=============================================================================
Some numbers (Feb 24, '97), time in us:
SS10, 4CPU HyperSparc, 50MHz

(Ping-pong time should be divided by 2, since each iteration involves 
 two threads; see the Sun reference.)

                            Solaris 2.5 threads     Active Threads
-------------------------------------------------------------------
thread create                         1620                 4.3
null thread                           1715                13.5
thread context switch                 30.0                 4.3
uncontested mutex (lock/unlock)        1.5                 0.7
uncontested semaphore (signal/wait)    6.6                 0.8
mutex try                              0.6                0.45
semaphore try                         27.8                 0.5
mutex ping-pong                         90                  17
semaphore ping-pong                     90                  18


UltraSPARC-1, 167MHz
                            Solaris 2.5 threads     Active Threads
-------------------------------------------------------------------
thread create                          276                 1.4
null thread                            320                 5.2
thread context switch                 13.0                 1.7
uncontested mutex (lock/unlock)        1.0                 0.4
uncontested semaphore (signal/wait)    3.2                 0.4
mutex try                              0.3                 0.2
semaphore try                         14.5                 0.2
mutex ping-pong                         38                 6.0
semaphore ping-pong                     45                 7.0


Pentium Pro, 200MHz
                                    Posix (Sun)     Active Threads
-------------------------------------------------------------------
thread create                          420                 1.4
null thread                            460                 4.4
thread context switch                  4.8                 1.5
uncontested mutex (lock/unlock)        0.6                 0.5
uncontested semaphore (signal/wait)    2.3                 0.5
mutex try                              0.3                 0.2
semaphore try                         11.6                 0.2
mutex ping-pong                       16.5                 3.4
semaphore ping-pong                   22.4                 3.7



DEC Alpha 2100A, 250-MHz 
                                    OSF1 DCE         Active Threads
-------------------------------------------------------------------
thread create                        400.0                 1.0
null thread                          500.0                 2.9
thread context switch                  6.3                 1.1
uncontested mutex (lock/unlock)        0.6                 0.3
uncontested semaphore (signal/wait)    3.2                 0.3
mutex try                              0.3                 0.1
semaphore try                          2.5                 0.1
mutex ping-pong                       19.0                 2.9
semaphore ping-pong                  112.0                 2.9



HPPA 9000/755, 99 MHz, HP-UX 10.20 
                                    HP DCE         Active Threads
-------------------------------------------------------------------
thread create                        400.0                 2.0
null thread                          450.0                 7.2
thread context switch                 13.8                 3.0
uncontested mutex (lock/unlock)        1.6                 1.0
uncontested semaphore (signal/wait)    1.6                 1.0
mutex try                              0.8                 0.3
semaphore try                          0.8                 0.3
mutex ping-pong                       27.0                 7.9
semaphore ping-pong                   27.0                 8.5

* semaphores for HPUX are implemented with mutexes. HPUX does not
  seem to have "normal" POSIX semaphores as defined by the POSIX 
  runtime standard.

Latencies for uncontested semaphores and mutexes could be further
improved by inlining, but I haven't done so yet - acquire and release
functions do have some code.

This is a currently provided Active Threads interface:


