To ask a question about something that is unclear, insert the question and prefix each addition with %q% (q as in question). This is the "question-style". Like this:

* %q% What kind of problems could have decentralized nature?
  • What kind of problems could have decentralized nature?

If you want to answer a question or add a comment, please put a %a% in front. This is the "answer-style" (a as in answer). An example:

* %a% This is an addition to something that I consider important.
  • This is an addition to something that I consider important.

For citations or references to the slides of Prof. Suri, please add the lecture and slide number in parentheses: (<lecture>.<slide>).

Please make sure that you enter an author name, else your changes will not be saved!

Exercise 1


  1. Explain each of the following terms in a few short sentences:
    1. Dependability
      Dependability is the amount of trust placed in a system to deliver its services within a desired/reasonable time.
      Dependability can also be defined as "the measure in which reliance can justifiably be placed on the service delivered by a system" (Laprie).
    2. Fault Tolerance/Reliability
      A fault-tolerant system continues to provide its services (fully functional or with a certain level of quality degradation) even if errors occur in the system or some of its components.

      The reliability of a system can be defined as:
      (a) The probability of the system being operational at a given time instant
      (b) The probability of the system being operational over a desired time-interval

      To increase the reliability of a system, some sort of redundancy (extra resources) can be added to sustain the desired level of service.
    3. Fault Avoidance
      Fault avoidance tries to prevent faults from occurring in the first place, for example by using very high-quality hardware and software components.
      Fault avoidance is the combination of 2 concepts:
      * fault removal: detecting and removing faults during the design phase (implementation).
      * fault prevention: elimination of conditions for fault occurrence, e.g. use of high-quality components, rigorous design techniques, etc.
    4. Availability
      Availability is the quotient of MTTF over MTTF + MTTR (total up-time over total up- and down-time). Availability says nothing about when the downtime occurs or how it is distributed.

      If, for example, an airplane flight controller has an availability of 99.999%, it is not acceptable for it to take its 0.001% downtime while crossing the sea.

      In terms of availability, a system is up and running until it fails, gets repaired and is up again.
    5. Safety
      Something bad never takes place. (2.30)
      Degree to which a system failure is not catastrophic (6.9).
      Amount of time in safe condition / Total amount of time.
    6. Distributed System
      A distributed system is a collection of computers that are connected through a network and share a global state to fulfil some common task.
    7. Performability
      Combined performance and dependability analysis; quantifies how a system's performance decreases when errors occur (6.9).
      Is this a correct interpretation? Yes. In other words, it quantifies a system's capacity for graceful degradation.
    8. Maintainability
      The time needed to restore a system after a failure (MTTR).
  2. Explain the difference between the terms Fault, Error and Failure.
    • A fault is some kind of "stumbling block" that could possibly lead a system into an illegal state.
    • An error is the manifestation of a fault: the system actually reaches an incorrect state.
    • A failure is an unrecoverable error that causes a decrease of the system performance (worst case: crash).
  3. Explain some of the properties that an error (Fault, Failure) can have.
    • A fault can be:
      • permanent, transient, intermittent
      • a data fault, a timing fault
    • An error can be:
      • (un)detectable
      • (un)recoverable
    • A Failure can be:
      • a crash (produces some meaningful error)
      • silent (just stops)
      • (non-)critical ("No access rights for file" vs. "hard disk failed")
  4. If the maximum down time allowed for a system is 5 minutes per year, what is the Availability requirement for the system (in %)?
    Availability = MTTF / (MTTF + MTTR)
    MTTR = 5 min
    MTTF = (365 × 24 × 60 - 5) min = 525,595 min
    => Availability = 525,595 / (365 × 24 × 60) = 0.9999905 ≈ 99.99905%
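The arithmetic for the availability requirement can be checked with a few lines of Python. This is only a minimal sketch: the names `availability`, `MINUTES_PER_YEAR` and `required` are illustrative and not taken from the lecture, and a year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability(mttf: float, mttr: float) -> float:
    """Up-time divided by total (up + down) time."""
    return mttf / (mttf + mttr)


mttr = 5                        # allowed downtime per year, in minutes
mttf = MINUTES_PER_YEAR - mttr  # remaining up-time per year, in minutes
required = availability(mttf, mttr)

print(f"{required * 100:.5f} %")  # prints 99.99905 %
```

Rounded to significant "nines", 5 minutes of downtime per year corresponds roughly to a "five nines" (99.999%) requirement.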
  5. Dependable systems are based on the addition of resources. Explain the difference between the following types of redundancy techniques by the use of examples:
    1. Spatial redundancy
      Having multiple hardware components with the same function, or multiple copies of data holding the same information.
    2. Temporal redundancy
      Executing the same task twice
      • The same way or a different way
      • Parallel or sequential
        I think it must be the same way and thus sequential, right? If it were done in a different way, you'd have n-version SW for instance, not temporal redundancy. If it were parallel, you'd have spatial redundancy.
    3. Information redundancy
      Add additional information about data, e.g. the possible range for a value.
      Combination of spatial and temporal redundancy, e.g. error detection/correction codes in communication systems (such as Parity Check or CRC), where you send additional message fields (spatial) after the normal message fields (timing).
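The parity check and CRC mentioned for information redundancy can be illustrated with a small sketch. The message content and the helper `even_parity_bit` are made up for this example; `zlib.crc32` is the CRC-32 function from the Python standard library. The sketch only shows error detection, not correction.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the extra bit that makes the total number of 1-bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


message = b"sensor=42"             # hypothetical payload
parity = even_parity_bit(message)  # redundant information sent along
checksum = zlib.crc32(message)     # redundant information sent along

# Simulate a data fault: flip one bit of the first byte "in transit".
single = bytes([message[0] ^ 0x01]) + message[1:]
parity_detects_single = even_parity_bit(single) != parity
crc_detects_single = zlib.crc32(single) != checksum

# Flip two bits of the first byte: parity is blind to this, the CRC is not.
double = bytes([message[0] ^ 0x03]) + message[1:]
parity_misses_double = even_parity_bit(double) == parity
crc_detects_double = zlib.crc32(double) != checksum
```

A single parity bit costs almost nothing but misses every even number of bit flips, whereas the longer CRC field buys much stronger detection for a few extra bytes.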
  6. Using a TMR for increasing the dependability of a system seems appealing. Why isn't this technique used in more systems? Why is it not suitable for SW do you think?
    • It has low overall performance
    • The voter is a single point of failure (SPF), which is always undesirable
    • Why is it not suitable for SW? Good question.
      Without n-version programming, all three software copies are identical. They contain the same design faults and therefore always produce the same (possibly wrong) result, unless a hardware error occurs. The voter cannot mask a fault that all replicas share.
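The TMR argument can be made concrete with a small voter sketch. Everything here is hypothetical: `tmr_vote` is a simple majority voter, `square` stands for a correct module and `buggy_square` for a module with a design fault. The voter itself is assumed to be fault-free, which is exactly the single-point-of-failure concern listed above.

```python
from collections import Counter
from typing import Sequence


def tmr_vote(results: Sequence[int]) -> int:
    """Return the majority value of three replica outputs, or raise if none."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def square(x: int) -> int:
    return x * x


def buggy_square(x: int) -> int:
    # Hypothetical design fault: wrong sign for negative inputs.
    return x * abs(x)


# Hardware-style fault: one of three replicas delivers a corrupted value.
masked = tmr_vote([square(4), square(4), square(4) + 1])

# Software-style fault: three identical copies share the same bug,
# so the voter sees a unanimous but wrong answer.
identical = tmr_vote([buggy_square(-3)] * 3)
```

Here `masked` is the correct value 16, while `identical` is -9 instead of 9: the replicas agree, so voting adds cost without adding protection.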

Distributed Systems

  1. Give a few reasons why one would want to build a distributed system instead of a centralized one. Give a few examples where it might not be suitable.
    • A distributed system is of advantage if:
      • the problem itself is distributed
      • the system needs to be accessible from many different locations
      • the system's modularity helps to reflect the business structure in the system and thus keep up with changes and expansions
    • A distributed system would not be suitable for:
      • providing system access to "road warriors"
      • a centralized problem at a single location (no need to distribute)
  2. Explain briefly how the types of faults possible in a system change when it is distributed instead of centralized.
    • The system state needs to be shared among all participants and kept in sync
    • Communication is unreliable: messages can get lost (within a LAN, or within a single system, the message loss rate is usually very low)
    • The network can get partitioned
    • The system has to be prepared for the case that the "leader(s)" or current coordinator(s) die (it must re-organize itself)
    • Basically all types of faults for centralized systems (spec, design, interaction faults) are translated into communication faults (timing and data faults), because the comm channel is the only way nodes can "see" each other.
  3. What is the main difference between an Asynchronous and a Synchronous system?
    • I assume this does not mean clock synchrony?! (3.18)
    • In an asynchronous system the access to a resource is not coordinated, every client can access it at any time (non-blocking).
    • In a synchronous system there is the possibility to get an exclusive access to some resource. A client can request to access it, and all others are blocked from it while the client holds it (blocking).
    • I think it's just the timing constraint in message communication: in an asynchronous system there are no bounds on message delay, so timeouts are not meaningful; in a synchronous system there are known bounds, so timing limits must be respected.
  4. What is TDMA and CSMA/CD? What is CDMA?
    All three protocols coordinate access to a shared communication medium.
    • TDMA = Time Division Multiple Access
      • TDMA can be further divided into Synchronous Time Division (STD) and Asynchronous Time Division (ATD).
      • Every client gets fixed (synchronous) or variable (asynchronous) time slots in which it can send data on the medium. Clients are only allowed to send data during their assigned time slots.
      • This protocol is used for ISDN, DSL, ATM and with some variations for GSM.
      • See http://de.wikipedia.org/wiki/Multiplexverfahren
    • CSMA/CD = Carrier Sense Multiple Access / Collision Detection
      • To avoid collisions on the medium, a client that wants to send data first has to make sure that there is no transmission on the medium. If the medium is clear, it starts to transmit its data in packets or blocks. If another client has started a transmission at the same time, the collision is detected, the transmissions are stopped, a special jam signal is sent and both clients wait a random time before they try to re-transmit.
      • It's used in the Ethernet Protocol (IEEE 802.3).
      • See http://de.wikipedia.org/wiki/CSMA/CD
    • CDMA = Code Division Multiple Access
      • Every client is assigned a special code pattern. This code pattern represents a "1", the inverse of this pattern represents a "0". The patterns are chosen in a special way such that they don't correlate (influence each other). These patterns are called orthogonal. Under these conditions every client can send data at the same time. A receiver can filter out the signal of a specific client by computing the correlation of the overall signal with the client's code pattern.
      • CDMA is used in the UMTS standard.
      • See http://de.wikipedia.org/wiki/Multiplexverfahren
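The CDMA description above (code pattern for "1", inverse pattern for "0", decoding by correlation) can be tried out with a toy example. The two 4-chip codes below are assumptions chosen for illustration, and the channel is idealized: signals simply add up, with no noise and perfect synchronization.

```python
# Two orthogonal chip codes (hypothetical; their dot product is 0).
CODE_A = [1, -1, 1, -1]
CODE_B = [1, 1, -1, -1]


def encode(bits, code):
    """Send the code pattern for a 1 bit and its inverse for a 0 bit."""
    signal = []
    for bit in bits:
        sign = 1 if bit else -1
        signal.extend(sign * chip for chip in code)
    return signal


def decode(signal, code):
    """Correlate each code-length block of the channel signal with one code."""
    n = len(code)
    bits = []
    for start in range(0, len(signal), n):
        block = signal[start:start + n]
        correlation = sum(s * c for s, c in zip(block, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


bits_a = [1, 0, 1]
bits_b = [0, 0, 1]

# Both clients transmit at the same time; the medium adds their signals.
channel = [a + b for a, b in zip(encode(bits_a, CODE_A), encode(bits_b, CODE_B))]

received_a = decode(channel, CODE_A)  # [1, 0, 1]
received_b = decode(channel, CODE_B)  # [0, 0, 1]
```

Because the codes are orthogonal, the other client's contribution cancels out in the correlation sum, so each receiver recovers only the bits of the client whose code it uses.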
  5. In distributed systems the concept of failure semantics is important. Explain what it means and why it is important. Also give a few examples of failure semantics used in real systems.
    • The failure semantics describe how components fail. If the effect is known, it helps to develop countermeasures to detect the errors that lead to failures.
      Failure semantics is a categorization (organization) of the possible failures a system can have. Knowing which failure category a problem falls into makes recovery faster and more precise.
    • Examples:
      • - see here. Are these real-life semantics?
      • - fail-stop, fail-omission, fail-safe: these are generalized failure semantics.
  6. Explain briefly using a picture how the Two-Phase Commit protocol works.
    • see (1.33)
    1. Commit is performed completely (see picture)
      1. Coordinator (green) sends a request to commit (COMMIT-REQ) to all cohorts (gray).
      2. All cohorts are ready, send an AGREE and write some undo/redo logs.
      3. The coordinator receives an AGREE from all cohorts, logs the commit and sends a COMMIT to all cohorts.
      4. All cohorts acknowledge the commit by sending an ACK to the coordinator.
      5. Done.
    2. Commit is not performed completely (not in picture)
      1. Coordinator (green) sends a request to commit (COMMIT-REQ) to all cohorts (gray).
      2. Not all cohorts are ready, at least one sends an ABORT.
      3. The coordinator receives an ABORT and sends an ABORT to all cohorts.
      4. All cohorts perform an undo action according to their logs.
      5. Done.
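The two runs described above can be replayed with a small in-memory simulation. This is a sketch under strong assumptions: the class `Cohort` and the function `two_phase_commit` are invented for illustration, messages are plain method calls that never get lost, and neither timeouts nor a coordinator crash are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    ready: bool = True
    log: list = field(default_factory=list)
    state: str = "init"

    def on_commit_request(self) -> str:
        if not self.ready:
            return "ABORT"
        self.log.append("undo/redo")  # write logs before agreeing
        return "AGREE"

    def on_commit(self) -> str:
        self.state = "committed"
        return "ACK"

    def on_abort(self) -> None:
        if self.log:
            self.log.append("undo")   # roll back according to the log
        self.state = "aborted"


def two_phase_commit(cohorts) -> str:
    # Phase 1: coordinator sends COMMIT-REQ and collects the votes.
    votes = [cohort.on_commit_request() for cohort in cohorts]
    # Phase 2: COMMIT only if every cohort agreed, otherwise ABORT.
    if all(vote == "AGREE" for vote in votes):
        acks = [cohort.on_commit() for cohort in cohorts]
        return "COMMIT" if all(ack == "ACK" for ack in acks) else "ABORT"
    for cohort in cohorts:
        cohort.on_abort()
    return "ABORT"


good_cohorts = [Cohort("c1"), Cohort("c2")]
ok = two_phase_commit(good_cohorts)

bad_cohorts = [Cohort("c1"), Cohort("c2", ready=False)]
failed = two_phase_commit(bad_cohorts)
```

In the second run a single ABORT vote is enough to abort the whole transaction, and the cohort that had already written its log performs the undo.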

Last modified on 16 February 2006 at 11:39 by chrschn