Lord Julus'
Anti-Debugger & Anti-Emulator
Lair
                          D I S C L A I M E R                               
                                                                            
   The following document is a study. Its only purpose is to be used in     
   virus research. The author of this article is not responsible            
   for any misuse of the things written in this document. Most of the       
   things published here are already public and they represent what the     
   author gathered across the years.                                        
     The author is not responsible for the use of any of this information   
   in any kind of virus.                                                    
                               Lord Julus.                                  


 .--------------------------------.
 |   Before F0reW0rd - W0rd ;-)   |
 '--------------------------------'

        Due to the fact that I was very anxious to release this, and the fact
 that  while  writing  it my computer got burned, and that, anyway I was sick
 and  tired  of  looking at it anymore, I released it in a, let's say for now
 Version  1.0.  As  soon as I'll feel again ready to write, I shall come with
 more  ideas  and stuff. For now just read this and don't kick me if you find
 any  mistakes I didn't have time to correct... Anyway, during the writing of
 this  I  kinda  felt a little more on the encryption side, which actually is
 the  basis of a good fight with an AV. You got an unbeatable encryption, you
 rule!  So,  don't  be  frightened  by  the math involved here: everything is
 explained.  Secondly, also while writing this article I got involved in Win32
 programming.  This made me leave the mortals' world for a while ;-) and go in
 higher circles. So, just read along...


     .------------.
     |  Foreword  |
     '------------'


        Well,  my  dearest  friends  and  enemies (;-)), here I am again, not
 really  having much to do these days but work and code... Alongside these, I
 would  name,  "high"  things  to do, I still have time to study, analyze and
 check  various stuff around. Since a little while back, I started a campaign
 of  writing  anti-anti-viral  programs.  These  would  be  like memory TSR's
 bypassers  and  memory patchers and searchers. Well, I looked deep and now I
 decided it's time I put it down in words, black on white (or more like white
 on blue, as I see it now ;-)).

        Anyway,  for  those  of  you  who know jam about what's debugging and
 emulating,  I  will  try to make a short description here in the foreword on
 the debug process, emulating process and some other stuff.

        Here  come  the  descriptions  of  the  terms  about  to be used here
 (definitions taken out of the Webster, and additional explanations by me):

       DEBUG = "To detect and remove defects or errors from smth."
       DEBUGGER (in comp. sense) = A tool to debug code

               In common  language,  the  term  'debug'  has  broadened  in
               scope: it no longer means only detecting and removing errors,
               but  also  simply looking over the code.  We'll take a look
               later at the most common debuggers.

       EMULATE = "Try to equal  or  excel;  imitate with effort to equal or
                  surpass"
       EMULATION = "Effort or desire to equal or excel with others"
       EMULATOR (in comp. sense) = A tool to emulate code

                This  term  also  has a different connotation in the computer
                business.  It  doesn't  really mean making a program in order
                for  it  to  be  able to imitate someone else's program. This
                would  be,  of  course, stupid. If you want to copy, the best
                tool  is 'copy`n`paste' ;-)). Anyway, the emulator is a piece
                of   code,  usually  very  complicated,  actually  much  more
                complicated  than  the  code to be emulated itself, which has
                the  capability  to take a program instruction by instruction
                and  imitate  what that program would do if it were run. BUT,
                the  emulator will never allow a program to really do what it
                should  do.  It  only  tries to work out that program's goal
                and guess it. This leads to the next definition:

       HEURISTIC   =   "stimulating   interest   as   means   of  furthering
                        investigation"

               Actually, the best definition for this term can be  found  in
               the  polymorphic  tutorial  written  by  The Black Baron.  It
               reads:  'heuristic = A set of  well defined rules to apply to
               a problem in the hope of achieving a known result'.  Hope you
               got it...

        Anyway, since the beginning of the viral activity, somewhere in 1987,
 the  anti-virus  writers  have had to use some powerful tools. We all know
 that it's much harder to build, restore or repair something than to destroy
 or  damage it (take a life example: hit your TV screen with a hammer... it
 only  takes  a  second...  then try to repair it ;-))), and also it's always
 much  easier to prevent something bad than to repair its damage afterwards.
 Just  like  Confucius  said:  "Those  who do not see the danger coming shall
 surely suffer from the urge approaching". Therefore, the antiviral community
 has  started  to  build certain types of tools in order to enter a not so
 fair fight (thousands of virus writers and a couple of AV guys ;-)).

       Mainly, the developed tools are these:

       1) TSR blockers/checkers
       2) String scanners for memory/files/places on disks
       3) Heuristic scanners / code emulators

       Let's define them quickly and start the real thing:


======| TSR Blockers/Checkers |===============================================

        This  category of AV utilities is largely used and was made famous by
 VSAFE  (one  of  the  best  known  TSR  utilities).  Their  main  purpose is
 prevention.  They  do not clean viruses, but stay there in memory and before
 any  operation  is  done  they check... If something strange pops up (like a
 write-to-disk,  an executable change) they have the nasty habit of flashing a
 (usually)  red  window  on  the  screen warning about the danger. Others are
 blockers for different viruses. Usually a virus checks whether it's resident
 or  not  by  calling  an  interrupt  with  certain  values  and  waiting for
 something  in return. A TSR blocker will simulate the virus by returning the
 'already  resident'  values.  In  this way, even if your files are infected,
 the virus will never go resident.

======| String scanners |=====================================================

        Any  virus,  like  any  piece of code actually is made of those tiny,
 little  0's and 1's called bits, which form those pretty nice 8 bit thingies
 called  bytes,  which  put in pairs form those really nifty words (and no, I
 don't  think  you're  stupid  ;-)).  Anyway,  a code has this thing called a
 signature.  A  string  scanner will search in a file/memory or anywhere else
 for  a  set  of  bytes  and  will  decide whether it's a virus or not. Smart
 scanners allow smart wildcards, like scan x bytes, then jump over the next 3
 bytes,  scan  other 2, and so on... Anyway, even with the growing popularity
 of  the polymorphic viruses, most of the viruses around can be detected with
 a signature.

======| Heuristic Scanners / Code Emulators |=================================

        Let's imagine you write a new virus... Of course, there is no one who
 knows  any  signature  for  your  virus for the simple reason that it's new.
 This is where the heuristic analyzers come in. These 'look' into your code and
 set  some  levels  of  danger  for  that  particular  code. For example if a
 heuristic  scanner  finds  a  check  for  'load  & execute' command, it will
 probably  warn  the user. The code emulator does more. It simulates the code
 execution by putting 'by hand' values into the registers and trying to 'see'
 what  the  code  does.  This  method  is  essential  for new viruses and for
 polymorphic viruses.


     .-----------------------.
     |    First approach     |
     '-----------------------'

        Ok,  now  we  have defined our 'environment', so to speak, so let's
 start talking a little deeper about each of the above AV-types.

        The  TSR-blocker...  Yeah... This one is the easiest type of AV to go
 around.  There are a lot of TSR-blockers out there... If you feel threatened
 by  any  one  of  them,  simply  disassemble the darn thing and check out the
 method  it uses in order to check. There are several ways they use. The most
 common  is  to  monitor interrupts 21h, 13h, 76h, 25h, 26h. All these can be
 overridden  by  simple  tunneling/tracing routines. But, many TSR shits have
 some  anti-tunneling  routines  that  might warn the user about tunneling in
 progress. But, this is not a matter to speak about in this article.

        Another  largely used method is the monitoring of the INT 03/01. This
 means  that  the TSR checks every command your program does and decides if a
 couple  of  commands  are  dangerous or not. These are usually crap, because
 they  slow down everything. However, this type of TSR blocker gets killed by
 the Prefetching Queue.

        Another method is the monitoring of the INT 08 (the clock interrupt).
 Using this interrupt the AV checks various things. A simple CS:IP check type
 tunneling and you override its checks. However, be careful to leave it looking
 like  it  still works. You may need to disassemble it and find particular bytes
 to  patch  (some  of them use INT 08 also to trigger keyboard events, like the
 pop-up  of  the options menu; and if the options menu doesn't work, the user
 may notice).

        Anyway,  the  TSR  blockers can be easily killed as you saw. However,
 the  AV guy might create a specific TSR-blocker for your specific virus (for
 example  by  making  a  TSR  that  returns  your  virus'  `already resident`
 signature,  but this means your virus is really good or you really pissed off
 some guy...)

        In  order  to kill the string scanner all you need to do is to put in
 your  virus a well random-oriented polymorphic decryptor. In this way you're
 safe from any kind of string scanning. Right after the polymorphic decryptor
 has  finished  its  job,  it's  time for another decryptor. You have got to
 create a well balanced set of decryptors, kinda like this:

        -the  more complicated the poly decryptor is, the less polymorphic it
 is
        -the  longer  the poly decryptor is, the harder it is for the emulator
 to  go  thru  (don't  forget that there exist code emulators + scan strings;
 they can go thru your poly decryptor and scan string the second decryptor)
        -the  more  complicated  the  second decryptor is, the more difficult
 it is for the emulator to break

        So,  you  need a balance. A well scan-string based AV will not have a
 very  good  anti-poly  routine. This is because loading the scan strings
 and  searching for them takes a long time. The same for the emulator. In order
 to  create  a  good  poly  decryptor,  check out the article I wrote on poly
 decryptors at http://members.tripod.com/~lordjulus.

        So, the scan strings go down the drain too...

        And finally I reached where I wanted to... The heuristic scanners and
 the code emulators. These are the most dangerous AV ever and they seem to be
 written  by  some  smart  guys  (some  of  them  ;-))...  Anyway,  the  main
 disadvantage  for  the code emulator (called CE from here) is its speed. As
 it  must  'emulate'  each instruction, it has to kinda do what the CPU would
 do,  still  however  using  the  CPU...  Therefore it's slow. Also, a lot of
 instructions  are  not  emulated  by  them  or  some  of  them  are emulated
 incorrectly.  Further  I will try to put up a set of methods I recommend you
 to  use  inside  the  second  decryptor.  In case the CE goes past your poly
 decryptor,  it *must* hang in the second decryptor, otherwise, your virus is
 disclosed.


     .------------------------.
     |  Anti - emulator code  |
     '------------------------'

        One  of the best methods in the fight with the AV code emulator is to
 find  out  pieces  of  code  that  generate a certain known result. The same
 result must be retrieved through another method, which also can be or can be
 not emulated by the AV. Having the two results one should do some operations
 with them over the most important registers. The idea around this is that if
 the  AV is not capable of emulating one of the ways you retrieve a result, it
 will  for  sure  use  its  own result and will end up with a fault. However,
 beware  of comparisons and conditional jumps. What I'm referring to is this:
 say you made two routines and one of them is for sure impossible to emulate.
 The  two  routines  return the same parameter. You put them in the AX and BX
 registers. If you do something like:

        CMP AX, BX
        JNE I_AM_BEING_EMULATED

        The code emulator for sure will jump to the conditional jump address,
 as  it  cannot  compute one of the arguments correctly. One could think that
 this  solves the problem. No way ! As we think of devious methods, so do the
 AV  writers.  Anytime a conditional jump is encountered a good emulator will
 save  its  place. If the condition is met it will jump there. The above jump
 would  probably be one to an infinite loop, or program terminate, or halt or
 stuff  like  that.  The  good emulator will not stop, but will return to the
 prior  conditional  jump  and  will try to continue emulating the code as if
 the condition was not met. This gives the emulator tremendous powers.

        However,  we  can  solve that too. Instead of comparing AX with BX we
 can  add  both  of  them  let's say to a register that holds the key for the
 encryption.  Or we can subtract one from the other and increment the DS data
 segment with the amount. Normally, if the code is executed correctly, in the
 first  case  the  key  would  be  incremented with a known number and in the
 second  case  the  DS  will  remain  unchanged. BUT if the emulator fails in
 computing  one of the two values, or, even worse, both of them, the emulating
 process will fail altogether.

        Now let's see some ways we can fool the emulator.


 GETTING 2 VALUES USING INTERRUPTS AND PRIVILEGED INSTRUCTIONS

        As  we  all  know,  there  are  certain  interrupts  that  respond by
 returning a known value. I'm saying 'known' meaning you can get it somewhere
 else too. Let's see:

        INT 11 - This interrupt returns the 'Equipment List'

        The  equipment  list  is  a  word  that holds specific and very useful
 information  about  your  computer.  However, this word can also be found in
 BIOS at address 0000:0410h. Here goes the code:

        XOR AX, AX
        MOV DS, AX
        MOV BX, WORD PTR [0410]
        INT 11H
        SUB AX, BX
        ADD <KEY>, AX

        If  the  emulator  skips INT 11, AX will be different from BX, so the
 <key> value will be corrupted.

        INT  12  -  This  interrupt  returns  'The  Total Available Memory in
                    Kilobytes'

        This  word can also be found in BIOS at 0000:0413h. The anti-emu code
 is the same as above, except for the interrupt number and the BIOS address.

        INT 2Fh - The Multiplex interrupt

        Now  here  we can do a lot of things. The very first and very good is
 this one:

        MOV AX, 1686H
        INT 2FH

        This  one  returns  0  in  AX  if  the  CPU  is in protected mode and
 something else if it's in real mode. But, we also have this instruction:

        SMSW BX

        This instruction (Store Machine Status Word) will put MSW in  BX. The
 MSW  (Machine  Status  Word)  is  a  word  with a lot of info on it. For our
 specific  example  we are only interested in the first bit. This bit is 0 if
 the  CPU is in real mode and 1 if the CPU is in protected mode. Do you start
 to  see  a  pattern  here?  First  of  all, int 2fh is not emulated by many
 emulators around, and neither is the SMSW instruction...


<------------------------------------------------- THE  FPU  ATTACK


  USING THE FPU INSTRUCTIONS

        Ok,  nowadays  whoever still owns a 286 or less (duh!!) is considered
 to  be  owning  a pocket calculator. Whoever has a 386 kinda like enters the
 human  kind  (;-)),  BUT.  There's always a but. If he does not possess an FPU
 (Floating  Point Unit) he is also considered obsolete as human ;-). In other
 words  who doesn't have a FPU on his computer could just skip all this stuff
 and go watch a movie or something ;-)))

        Anyway, the FPU is a very powerful thing that wanders around your CPU
 helping  with the math calculation speed. Plus, its 'floating' prefix gives
 you  an idea about its main purpose: making floating point calculations. No
 more  only  integer  numbers,  now you can calculate using decimals also. Is
 this  gonna  help us ? Well, I tell you: A LOT ! Why ? The first argument is
 this: no code analyzer / emulator I know about (except probably Dr.Web which
 emulates a couple of instructions) is able to emulate FPU instructions. Some
 of  them,  like  TBAV  hang while emulating the code. Some of them just jump
 over  the  FPU  instructions, hoping they are only junk or the program is no
 virus  at  all.  Actually there are very few viruses out there using the FPU
 instructions  and  the explanation for this is that people want to see their
 viruses  spreading.  The  FPU  instructions  pose a threat: on computers not
 equipped  with  a  FPU the code will hang ! In the same idea, the anti-virus
 products  writers  didn't  attempt to emulate FPU instructions as 99% of the
 viruses  in  the  wild don't use them. Also, as you read above about how the
 instructions  are  emulated,  emulating  the FPU instructions would probably
 triple the time the emulator needs to go through the code and, as I said the
 slower  the  emulator  goes,  the  worse  the  AV  product is. Combine a FPU
 oriented  decryptor  with  a huge polymorph generated decryptor and the code
 emulator will be lost in it.


        |  SMALL TUTORIAL ON THE FPU INSTRUCTIONS
        '------------------------------------------

        First  of  all,  in order not to crash the program currently running,
 one  may want to check whether a coprocessor unit is installed. This is done
 easily  by taking the MSW using the instruction SMSW AX, and checking the __
 bit. If it's set we have a coprocessor. If it's not set and your virus uses
 FPU  in  decryptors,  then  it's  a  lost  cause:  get  out with an error or
 something.  If  you  just  use  FPU to fool the emulators that stop over FPU
 instructions, just skip the part.

        We  shall  assume  that  we  have  a  computer  that has an installed
 coprocessor (387, 487, etc...).

        First, let's talk about IEEE standard 754. This is the standard Intel
 uses in order to make the coprocessor 'understand' floating point numbers.
 Basically, these numbers are coded like this:

        S, E, F, where:

        S = sign
        E = exponent
        F = fraction part

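        The length of the S is one bit (0 if the number is positive, and 1 if
 it's  negative).  The  length  of the E depends on the format: 8 bits for a
 32-bit real, 11 bits for a 64-bit real and 15 bits for an 80-bit real.

        F  has a length equal to All_bits - E_length - 1 (for the 32-bit and
 64-bit formats).

        Let's  see  for  example  how do we code a Double Word floating point
 number:

        S - 1  bit
        E - 8  bits
        F - 23 bits
        -----------
          = 32 bits

        So,  usually  a  (normalized)  floating  point number is expressed like
 this, where the bias is 127 for the 32-bit format:

        (-1)^S * 2^(E - bias) * 1.F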
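
        To make the S, E, F split concrete, here is a minimal Python sketch
 (not part of the original toolset) that takes a number stored as a 32-bit
 IEEE 754 single and pulls the three fields apart, assuming the usual layout
 of 1 sign bit, 8 exponent bits (bias 127) and 23 fraction bits. The function
 names decode_single and rebuild_single are made up for this illustration.

```python
import struct

def decode_single(x):
    """Split x, stored as a 32-bit IEEE 754 single, into (S, E, F)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction

def rebuild_single(sign, exponent, fraction):
    """Value of a normalized single: (-1)^S * 2^(E-127) * 1.F"""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)

s, e, f = decode_single(-6.25)
print(s, e, f)
print(rebuild_single(s, e, f))
```

        The rebuild step only covers normalized numbers; zeros, denormals,
 infinities and NaNs use reserved exponent values and are left out here.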

        What  is  very  nice  about  the coprocessor unit is that you really
 don't  need  to  put  up  with  this  crap way of storing the floating point
 numbers.  The  FPU provides its "stack" in order to help you out. The stack
 looks like this:

        ST(0), ST(1), ... , ST(7)

        The  ST's  are  holders  for  the  floating  point  numbers  (let  us
 understand each other: a floating point register can also hold an integer;
 as you will see this is very important in our business).

        So,  basically  you have the loading instructions. These instructions
 allow  you  to load from a certain place a number. The number will be placed
 at  the head of the stack and all other numbers will be pushed up. This goes
 like this:

        load m : ST(0) = m ; ST(1) = empty ; ST(2) = empty ...
        load n : ST(0) = n ; ST(1) = m     ; ST(2) = empty ...
        load p : ST(0) = p ; ST(1) = n     ; ST(2) = m ...
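
        The push behaviour of the loads above can be mimicked with a toy
 model. This is only a sketch in Python: the class FpuStack and its methods
 are invented for the illustration, and it ignores tags, exceptions and
 everything else a real FPU does.

```python
class FpuStack:
    """Toy model of the x87 register stack, ST(0)..ST(7)."""

    def __init__(self):
        self.regs = []                      # regs[0] plays the role of ST(0)

    def load(self, value):
        """Push a value: it becomes ST(0), older values move one slot deeper."""
        if len(self.regs) == 8:
            raise OverflowError("only ST(0)..ST(7) exist")
        self.regs.insert(0, float(value))

    def st(self, i):
        """Read ST(i)."""
        return self.regs[i]

stack = FpuStack()
for value in (10, 20, 30):                  # load m, load n, load p
    stack.load(value)
print(stack.regs)                           # ST(0) first
```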

        I  hope  this  clears  it. The most used stack register is the ST(0).
 This  is because we have special instructions that use other stack registers
 to  compute  as a second operand. First take a look at the FPU instructions
 in a very nice table I ripped off from TechHelp and then I shall explain more
 with some examples:

    .-----------------------------.
    | Data Transfer and Constants |
    '-----------------------------'

 FLD src              Load real: st(0) <- src (mem32/mem64/mem80)
 FILD src          Load integer: st(0) <- src (mem16/mem32/mem64)
 FBLD src              Load BCD: st(0) <- src (mem80)

 FLDZ                 Load zero: st(0) <- 0.0
 FLD1                    Load 1: st(0) <- 1.0
 FLDPI                  Load pi: st(0) <- pi
 FLDL2T           Load log2(10): st(0) <- log2(10)
 FLDL2E            Load log2(e): st(0) <- log2(e)
 FLDLG2           Load log10(2): st(0) <- log10(2)
 FLDLN2            Load loge(2): st(0) <- loge(2)

 FST dest            Store real: dest <- st(0) (mem32/mem64)
 FSTP dest                       dest <- st(0) (mem32/mem64/mem80); pop stack
 FIST dest        Store integer: dest <- st(0) (mem16/mem32)
 FISTP dest                      dest <- st(0) (mem16/mem32/mem64); pop stack
 FBST dest            Store BCD: dest <- st(0) (mem80)
 FBSTP dest                      dest <- st(0) (mem80); pop stack

                       .---------.
                       | Compare |
                       '---------'

 FCOM              Compare real: Set flags as for st(0) - st(1)
 FCOM op                         Set flags as for st(0) - op (mem32/mem64)
 FCOMP op                        Compare st(0) to op (reg/mem); pop stack
 FCOMPP                          Compare st(0) to st(1); pop stack twice

 FICOM op          Compare integer: Set flags as for st(0) - op (mem16/mem32)
 FICOMP op                          Compare st(0) to op (mem16/mem32); pop
                                    stack

 FTST              Test for zero: Compare st(0) to 0.0

 FUCOM st(i)     Unordered Compare: st(0) to st(i)                     [387]
 FUCOMP st(i)                     Compare st(0) to st(i) and pop stack
 FUCOMPP                          Compare st(0) to st(1) and pop stack twice

 FXAM                      Examine: Eyeball st(0) (set condition codes)

                     .------------.
                     | Arithmetic |
                     '------------'

 FADD                     Add real: st(1) <- st(1) + st(0); pop stack
 FADD src                           st(0) <- st(0) + src (mem32/mem64)
 FADD st(i),st                      st(i) <- st(i) + st(0)
 FADDP st(i),st                     st(i) <- st(i) + st(0); pop stack
 FIADD src             Add integer: st(0) <- st(0) + src (mem16/mem32)


 FSUB                Subtract real: st(1) <- st(1) - st(0); pop stack
 FSUB src                           st(0) <- st(0) - src (reg/mem)
 FSUB st(i),st                      st(i) <- st(i) - st(0)
 FSUBP st(i),st                     st(i) <- st(i) - st(0); pop stack
 FSUBR st(i),st  Subtract Reversed: st(i) <- st(0) - st(i)
 FSUBRP st(i),st                    st(i) <- st(0) - st(i); pop stack
 FISUB src        Subtract integer: st(0) <- st(0) - src (mem16/mem32)
 FISUBR src     Subtract Rvrsd int: st(0) <- src - st(0) (mem16/mem32)

 FMUL                Multiply real: st(1) <- st(1) * st(0); pop stack
 FMUL st(i)                         st(0) <- st(0) * st(i)
 FMUL st(i),st                      st(i) <- st(0) * st(i)
 FMULP st(i),st                     st(i) <- st(0) * st(i); pop stack
 FIMUL src        Multiply integer: st(0) <- st(0) * src (mem16/mem32)

 FDIV                  Divide real: st(1) <- st(1) ÷ st(0); pop stack
 FDIV st(i)                         st(0) <- st(0) ÷ st(i)
 FDIV st(i),st                      st(i) <- st(i) ÷ st(0)
 FDIVP st(i),st                     st(i) <- st(i) ÷ st(0); pop stack
 FIDIV src          Divide integer: st(0) <- st(0) ÷ src (mem16/mem32)
 FDIVR st(i),st  Divide Rvrsd real: st(i) <- st(0) ÷ st(i)
 FDIVRP st(i),st                    st(i) <- st(0) ÷ st(i); pop stack
 FIDIVR src       Divide Rvrsd int: st(0) <- src ÷ st(0) (mem16/mem32)

 FSQRT                 Square Root: st(0) <- sqrt st(0)

 FSCALE        Scale by power of 2: st(0) <- st(0) * 2 ^ INT( st(1) )

 FXTRACT       Extract exponent: st(0) <- exponent of st(0); and gets pushed
                                 st(0) <- significand of st(0)

 FPREM           Partial remainder: st(0) <- st(0) MOD st(1)
 FPREM1   Partial Remainder (IEEE): same as FPREM, but in IEEE standard [387]

 FRNDINT      Round to nearest int: st(0) <- INT( st(0) ); depends on RC flag

 FABS           Get absolute value: st(0) <- ABS( st(0) ); removes sign
 FCHS                  Change sign: st(0) <- -st(0)

                 .----------------.
                 | Transcendental |
                 '----------------'

 FCOS                     Cosine: st(0) <- COS( st(0) )
 FPTAN           Partial tangent: st(0) <- TAN( st(0) ); then 1.0 is pushed
 FPATAN       Partial Arctangent: st(1) <- ATAN( st(1) / st(0) ); pop stack
 FSIN                       Sine: st(0) <- SIN( st(0) )
 FSINCOS         Sine and Cosine: st(0) <- SIN( st(0) ) and is pushed to st(1)
                                  st(0) <- COS( st(0) )
 F2XM1       Calculate (2 ^ x)-1: st(0) <- (2 ^ st(0)) - 1
 FYL2X     Calculate Y * log2(X): st(0) is X; st(1) is Y; this replaces st(0)
                                  and st(1) with: st(1) * log2( st(0) )
 FYL2XP1 Calculate Y * log2(X+1): st(0) is X; st(1) is Y; this replaces st(0)
                                  and st(1) with: st(1) * log2( st(0)+1 )

              .-------------------.
              | Processor Control |
              '-------------------'

 FINIT              Initialize FPU
 FSTSW AX        store Status word: AX <- FPU SW
 FSTSW dest                         dest <- FPU SW (mem16)

 FLDCW src       Load control word: FPU CW <- src (mem16)
 FSTCW dest     Store control word: dest <- FPU CW

 FCLEX            Clear exceptions

 FSTENV dest     Store environment: store status, control and tag words and
                                    exception pointers into memory at dest
 FLDENV src       Load environment: load environment from memory at src
 FSAVE dest        Store FPU state: store FPU state into 94-bytes at dest
 FRSTOR src         Load FPU state: restore FPU state as saved by FSAVE

 FINCSTP   Increment FPU stack ptr: st(0)<-st(1); st(1)<-st(2),...,st(7)<-st(0)
 FDECSTP   Decrement FPU stack ptr: st(1)<-st(0); st(2)<-st(1),...,st(0)<-st(7)

 FFREE st(i)    Mark reg st(i) as unused

 FNOP           No operation: st(0) <- st(0)

 WAIT/FWAIT  Synchronize FPU & CPU:
             Halt CPU until FPU finishes current opcode.


        Alongside these instructions I can add here the

        FXCH - exchange instruction      st(0) <- st(1)
                                         st(1) <- st(0)

         which is very useful sometimes.

        So,  as  you  saw,  mainly all you should use are registers ST(0) and
 ST(1) because you can use the shorter form of the instruction. Let's imagine
 we want to compute something like this:

        cos(((a+b)*(c+d))/f)

        I  will  give  you a table with the instructions and the state of the
 stack at the same time so you can understand:

        fild word ptr [a]       ; ST(0) = a
        fild word ptr [b]       ; ST(0) = b ; ST(1) = a
        fadd                    ; ST(0) = a + b
        fistp word ptr [temp]   ; save result and pop it (stack is empty)
        fild word ptr [c]       ; ST(0) = c
        fild word ptr [d]       ; ST(0) = d ; ST(1) = c
        fadd                    ; ST(0) = c + d
        fild word ptr [temp]    ; ST(0) = a + b ; ST(1) = c + d
        fmul                    ; ST(0) = (a+b)*(c+d)
        fild word ptr [f]       ; ST(0) = f ; ST(1) = (a+b)*(c+d)
        fdiv                    ; ST(0) = (a+b)*(c+d)/f
        fcos                    ; ST(0) = cos((a+b)*(c+d)/f)
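
        To double check the stack bookkeeping, the same sequence can be
 replayed on a plain list. This is just a Python sketch under a few
 assumptions: the store to temp pops the value (as FISTP does), the
 no-operand FADD/FMUL/FDIV combine ST(1) with ST(0) and pop, and the names
 evaluate, push and pop are invented for the illustration.

```python
import math

def evaluate(a, b, c, d, f):
    """Replay the FPU sequence for cos(((a+b)*(c+d))/f) on a list-based stack."""
    st = []                                  # st[0] plays the role of ST(0)

    def push(value):
        st.insert(0, float(value))

    def pop():
        return st.pop(0)

    push(a)                                  # fild a
    push(b)                                  # fild b
    push(pop() + pop())                      # fadd  -> a + b
    temp = pop()                             # fistp temp (store and pop)
    push(c)                                  # fild c
    push(d)                                  # fild d
    push(pop() + pop())                      # fadd  -> c + d
    push(temp)                               # fild temp
    push(pop() * pop())                      # fmul  -> (a+b)*(c+d)
    push(f)                                  # fild f
    divisor = pop()                          # fdiv computes ST(1) / ST(0)
    push(pop() / divisor)
    push(math.cos(pop()))                    # fcos
    assert len(st) == 1                      # nothing stale left on the stack
    return st[0]

print(evaluate(1, 2, 3, 4, 7))               # cos(3.0)
```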

        See,  it's  much  easier  to  make calculations using the FPU. And
 the conversion from normal registers is done like this:

        mov word ptr [x], ax
        fild word ptr [x]

        You  will  ask  me  why  did I use FILD (load integer) instead of FLD
 (load  float)  ? Easy, that's because I didn't get the time to fully explain
 the  IEEE 754 standard, and FLD expects the source to contain a number in the
 IEEE  754  format.  E.g.,  if  you have at address [this_Address] the number
 1111h,  if  you  do a FILD you will have 1111h into the ST(0), but if you do
 FLD  on  the  dword  00001111h  you will have about 6.12e-42... Kinda nasty.
 Plus,  you  must  use the form FLD DWORD PTR [x] to load a floating number.
 Anyway,  as I said, these insides are of less importance. The most important
 thing is to know how to use them and have a good algorithm to use them.
        Anyway, for those of you who want to study more on the floating point
 storage  way,  I  have included alongside this article a program in ZIP form
 copyright by Borland International. Use it wisely... May the FPU be with you
 ;-)
        Look there at the ways you have in order to retrieve the mantissa, the
 exponent and so on...
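
        The difference between loading the same bytes as an integer and as a
 float is easy to see outside the FPU as well. A small Python sketch,
 assuming the dword in memory is 00001111h (upper word zero) and stored
 little-endian as on the x86:

```python
import struct

raw = struct.pack("<I", 0x00001111)          # the four bytes in memory

as_integer = struct.unpack("<i", raw)[0]     # what FILD DWORD PTR would see
as_float = struct.unpack("<f", raw)[0]       # what FLD DWORD PTR would see

print(as_integer)                            # the integer 1111h
print(as_float)                              # a tiny denormal number
```

        The exponent field of those bytes is all zeros, so the float reading
 is a denormal: just the fraction bits scaled by 2^-149.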

        Oh,  one  more  thing.  I forgot to tell you, those who don't like to
 read  all  those  .DOC  files ;-) that in order to use the FPU with the TASM
 assembler, you need to use this kind of header for your files (actually this
 is the header I always use):

        .386
        .387
        .model TPascal
        .code
        org 0

        In this way you can safely use 32 bit registers and FPU instructions.
 I can say it's the best way to compile an ASM file...


        |  CREATING FOOLING CODE
        '-------------------------

        Now,  let's go down to business. We'll come back to the same good old
 method.  We'll  try  to  create  a  set  of  instructions that when normally
 executed  will lead to a known result, but when emulated by a code emulator,
 they will generate a completely different result.

        Of  course,  talking about math coprocessor instructions, we're gonna
 have  to  use a lot of math, but not high math, just common,  ok,  don't get
 scared.

        Ok,  let's  take a peek at common math. As we all know, an odd number
 added to an even number always gives as a result an odd number (e.g. 3 + 4 =
 7).  What  do  we  obtain  if  we  divide  an odd number by 4 ? Let's make a
 table:

                      .==============================.
                      |  X    |    X/4   |   (X+1)/2 |
                      |==============================|
                      |  1    |   0.25   |     1     |
                      |  3    |   0.75   |     2     |
                      |  5    |   1.25   |     3     |
                      |  7    |   1.75   |     4     |
                      |  9    |   2.25   |     5     |
                      |  11   |   2.75   |     6     |
                      |  13   |   3.25   |     7     |
                      |  15   |   3.75   |     8     |
                      |  17   |   4.25   |     9     |
                      |  19   |   4.75   |     10    |
                      |  21   |   5.25   |     11    |
                      |  23   |   5.75   |     12    |
                      |  25   |   6.25   |     13    |
                      |  27   |   6.75   |     14    |
                      |  29   |   7.25   |     15    |
                      '=============================='

        In  the  first  column we have a series of odd numbers, in the second
 column  we have that number divided by 4. As you can see, the decimals vary:
 .25,  .75, .25,  etc. The last column is calculated like this (X+1)/2, where
 X  is  the  number  in the first column on the same line (e.g. (7+1)/2 = 4).
 What  do  we notice ? We notice that every time a .25 appears, next to it in
 the third column we have an odd number and every time a .75 appears, next to
 it  in  the third column we have an even number. Ok, now that we established
 some rules let's go specific:
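
        The  rule  read off the table can be checked for many more odd numbers
 than  the  fifteen  listed.  The sketch below is plain Python, used only as a
 calculator; the function names are made up for this illustration.

```python
def fraction_and_half(x):
    """Return (decimals of x/4, (x+1)/2) for an odd integer x."""
    assert x % 2 == 1, "x must be odd"
    return (x % 4) / 4, (x + 1) // 2

def rule_holds(x):
    """True if .25 pairs with an odd (x+1)/2 and .75 with an even one."""
    frac, half = fraction_and_half(x)
    if frac == 0.25:
        return half % 2 == 1
    return frac == 0.75 and half % 2 == 0

# Print the same three columns as the table, plus the check.
for x in range(1, 31, 2):
    frac, half = fraction_and_half(x)
    print(x, x // 4 + frac, half, rule_holds(x))
```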

        First  you  must  generate 2 random words. After that, be sure one of
 them is odd and one of them is even. To do that, for the odd number set its
 lowest bit:

                OR <reg1>, 0001h

        And for the even number reset its lowest bit:

                AND <reg2>, 0FFFEh

        Now  add  the two numbers into <reg3>. Ok, now we know that we have a
 random odd number in the register <reg3>.

        Next  step:  do a floating point division of this odd number by 4,
 take  the two decimals of the result (25 or 75) and save them somewhere (reg4
 for  ex.).  Then  take  the  number  again,  increase  it  by  1 and do a
 floating  point  division  by 2. Now, as we divided an even number, the result
 will  be  an integer. This integer might be odd or even. To check which, test
 its  lowest  bit  (a MOV alone sets no flags, and the CPU's parity flag counts
 the set bits of the low byte, so JP won't do here):

                mov ax, number
                test ax, 1
                jnz odd_number
        Even_number:
                ...
        Odd_number:
                ...

        If  the  number  is  even  we  are  sure  that  the reg4 contains 75,
 otherwise,  if  it's  odd  we  know that the reg4 contains 25. Somewhere you
 should  have an address that holds a double word that equals 25. If reg4 has
 75,  then negate the number at that address (a good way to do it in order to
 use  more  FPU  instructions  would  be to make a floating point subtract by
 subtracting 25 from 0, obtaining -25). Now that we have this, simply add the
 double word to the number in reg4. The two possibilities are:

        25 + 25 = 50
        75 - 25 = 50

        So,  starting  from two absolutely random numbers (which, BTW, can be
 DWORDS  or  QWORDS  or  whatever  so  you can use more FPU instructions), we
 obtained  a  fixed number, i.e. 50. Of course, this 50 number will be placed
 either in a ST(?) register or on a double word address. The only thing to do
 is crop its upper part and just keep the 50 in the CL register.

        Now,  simply  add  a  6 to CL. In this way we shall have 56 in the CL
 register. And here comes the nice part:

                ROL <key>, CL

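        I  hope  you  got  it. As CL is 56 and 56 is divisible by 8, it means
 that an 8 bit key register will roll around 7 times, but still remain the
 same...  (Keep  the  key in an 8 bit register for this: a 286 or later CPU
 masks  the  count  to  5  bits,  so a 16 bit key would really be rotated by
 24, that is by 8 bits, and come out changed.)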
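
        The whole recipe, from the two random words down to the rotate count,
 can  be followed step by step in a few lines. This is only a Python model of
 the  arithmetic,  not  the  FPU  code itself; rol() and derive_count() are
 names  invented  here,  and the 5 bit masking of the count is how 286+ CPUs
 are assumed to behave.

```python
import random

def rol(value, count, width):
    """Rotate `value` left inside a `width`-bit register.

    The count is first masked to 5 bits, as 286+ CPUs do.
    """
    count = (count & 31) % width
    mask = (1 << width) - 1
    return ((value << count) | (value >> (width - count))) & mask

def derive_count(word1, word2):
    """Follow the recipe: two random words in, rotate count out."""
    odd = word1 | 0x0001               # force the lowest bit on
    even = word2 & 0xFFFE              # force the lowest bit off
    x = odd + even                     # odd + even is always odd
    decimals = (x % 4) * 25            # decimals of x/4: 25 or 75
    half = (x + 1) // 2                # (x+1)/2 is an integer
    correction = 25 if half % 2 else -25
    return decimals + correction + 6   # 50 + 6

count = derive_count(random.getrandbits(16), random.getrandbits(16))
print(count)                        # always 56
print(hex(rol(0xA7, count, 8)))     # 8 bit key survives: 0xa7
print(hex(rol(0x1234, count, 16)))  # 16 bit key does not: 0x3412
```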

        Now,  what  do  you think a code emulator will do ? Before you do all
 the above stuff, put a random number at the address where the 50 will be. Be
 sure  that  number  is odd. If a code emulator simply jumps over the FPU
 instructions,  as  most  of  them  do,  at the end it will pick up in the CL
 register  the  odd  random number, which means that the ROL instruction will
 permanently  damage  the  key,  making  it  impossible  for  the emulator to
 correctly decrypt the encrypted code.

        This is just an idea. You can think of more. For example try dividing
 by  2. Any odd number divided by 2 gives a .5 decimal. Also you could obtain
 the 6 in the same devious manner. Let's take an example:

        FLDPI  is  a  FPU  instruction  that  takes no operand and pushes the
 PI  number  (3.141592654)  onto  the FPU stack, so it ends up in ST(0). Now,
 we  all  know  that  PI*2  =  6.283185307.  Which  only leaves us to take the
 integer part of this and we have the 6 !!

        This is a simple method. We can think of more complicated ones, like:

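        Compute   ArcTangent(1).  This  gives  us  a  result  equal  to  PI/4
 (0.785398163).  FMUL  it  by  4 and we have PI. Then square the number (FMUL
 it  by itself) and we have PI*PI (9.869604401), and then FSUB a PI from it
 and  we  have 6.728011748. Now just keep the integer part and you'll have 6 !
 (Be  sure  the FPU is set to truncate when you store it: the default rounding
 to nearest would turn 6.728 into 7.)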
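
        Both  ways  of  getting  the  6  are easy to check on any calculator.
 Here  is  the  same  check in Python; it also shows why the rounding mode
 matters for the second value. The variable names are just for this sketch.

```python
import math

two_pi = 2 * math.pi                                  # 6.283185307...
pi_sq_minus_pi = (4 * math.atan(1.0)) ** 2 - math.pi  # 6.728011748...

# Keeping only the integer part (truncation) gives 6 both times.
print(math.trunc(two_pi), math.trunc(pi_sq_minus_pi))  # 6 6

# Rounding to nearest gives 6 for the first value but 7 for the second.
print(round(two_pi), round(pi_sq_minus_pi))            # 6 7
```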

        I tell you, there are hundreds of methods to do that.

        Of  course,  all  the  above  sequences  will  lead  to  some  easily
 recognizable signatures. Therefore, these sequences should only be used in a
 second  level  decryptor. That means that you have your virus protected by a
 poly  generated decryptor which kills any string scan possibility, but still
 it  cannot  have  enough  auto-generated  anti  emulating and anti-debugging
 routines.  After  the  poly  decryptor  finishes  its  work,  it should give
 control  to  a  second decryptor. Here you can insert all the anti-debugging
 and anti-emulating stuff you like.


        | CREATING VERY COMPLICATED DECRYPTORS
        '--------------------------------------

        Well,  now  I'm gonna go deeper. Do you remember Taylor's formula ?
 This formula hides a lot of things we can play with. Let's see. If we have a
 function  and we want to compute the value of that function for a particular
 value, sometimes it's impossible to do it without Taylor's formula. However,
 I will use it on a not so difficult function and that is EXP(X), or e to the
 power of x, where e is 2.718281828...

        The general formula for the Taylor series is:

 .--------------------------------------------------------------------------.
 |                (x-a)^1           (x-a)^2                  (x-a)^n    n   |
 |  f(x) = f(a) + ------- * f'(a) + ------- * f''(a) + ... + ------- * f(a) |
 |                   1!                2!                       n!          |
 '--------------------------------------------------------------------------'
        ,where a is a chosen constant.

        A less  difficult  approach to this is the MacLaurin series, which is
 almost  the  same as Taylor's, with the difference that the constant a is 0.
 So we have:

 .--------------------------------------------------------------------------.
 |                 x^1            x^2                   x^n    n            |
 |  f(x) = f(0) + ---- * f'(0) + ---- * f''(0) + ... + ---- * f(0)          |
 |                 1!             2!                    n!                  |
 '--------------------------------------------------------------------------'

        And  we  know  that  every  derivative  of EXP(x) is EXP(x) again and
 that  EXP(0)  =  1,  which means that f(0), f'(0), f''(0), ... are all 1 and
 disappear. So the formula remains like this:

               .--------------------------------------------.
               |                   x^2   x^3         x^n    |
               |  EXP(x) = 1 + x + --- + --- + ... + ---    |
               |                   2!    3!          n!     |
               '--------------------------------------------'

        The  problem  is, how deep are we gonna go in the search for the real
 result   of  the  calculation  ?  I  mean,  which should be the value of the
 n number ? Let's look at this example table for x=3.

   .================================================================.
   |            X^n   |   n  |                n!   |   X^n/n!       |
   |================================================================|
   |            1.00  |   0  |              1.00   |   1.00         |
   |            3.00  |   1  |              1.00   |   3.00         |
   |            9.00  |   2  |              2.00   |   4.50         |
   |           27.00  |   3  |              6.00   |   4.50         |
   |           81.00  |   4  |             24.00   |   3.38         |
   |          243.00  |   5  |            120.00   |   2.03         |
   |          729.00  |   6  |            720.00   |   1.01         |
   |        2,187.00  |   7  |          5,040.00   |   0.43         |
   |        6,561.00  |   8  |         40,320.00   |   0.16         |
   |       19,683.00  |   9  |        362,880.00   |   0.05         |
   |       59,049.00  |  10  |      3,628,800.00   |   0.02         |
   |      177,147.00  |  11  |     39,916,800.00   |   0.00         |
   |      531,441.00  |  12  |    479,001,600.00   |   0.00         |
   |    1,594,323.00  |  13  |  6,227,020,800.00   |   0.00         |
   |================================================================|
   |    TOTAL                                      |  20.08         |
   '================================================================'

        As you can see, starting from n=11 we have only 0 on the last column.
 This  means we can safely compute only 10 steps. Now, using Taylor's formula
 we  have  computed that EXP(3) = 20.08. Now, take a calculator and calculate
 it. Yes, I know, it's 20.09, or 20.086, depending on the calculator. Anyway,
 what  we  are  interested in is the integer part. But first, let's look at a
 way to compute all this:
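
        Before  going  to the FPU routines, the table can be reproduced with a
 few  lines  of  Python,  used  here  only  as  a  calculator. The function
 exp_maclaurin()  is  a  name  made up for this sketch; it adds the terms
 x^n/n! one after the other, exactly as the table does.

```python
import math

def exp_maclaurin(x, terms):
    """Sum x**n / n! for n = 0 .. terms-1."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= x / (n + 1)      # next term: x^(n+1) / (n+1)!
    return total

approx = exp_maclaurin(3, 11)    # n = 0 .. 10, the "10 steps"
print(round(approx, 4))          # 20.0797
print(int(approx))               # integer part: 20
print(round(math.exp(3), 4))     # library value: 20.0855
```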

        We need:

        1) A factorial routine
        2) A power routine
        3) A divide function
        4) An add function

        1) Factorial routine:

        This  is  a  somehow  optimized factorial routine (it doesn't take into
 account N=0)

 ;we enter with CX filled with the N number and we exit with AX filled
 ;with N! (the result is stored as a 16 bit word, so N can be 7 at most)

         Factorial proc near

               fld1                         ; load 1
               fld1                         ; three times
               fld1

         fact_loop:
               fmul st(1), st               ; multiply by the base
               fadd st, st(2)               ; increase the base
               loop fact_loop               ; and repeat
               fstp st(0)                   ; drop the base, N! is now ST(0)
               fistp word ptr [fact_m]      ; store the result
               fstp st(0)                   ; drop the 1, the stack is clean
               mov ax, word ptr [fact_m]    ; and get it into AX
               ret

         fact_m dw 1
         Factorial endp

        2) The power routine

        We'll use the simple method of consecutive multiplication, as we only
 have 10 steps to go and the power we raise to is an integer number.
        The procedure will raise AX to the power CX:

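         Power Proc Near
               mov word ptr [pow_m], ax
               fild word ptr [pow_m]        ; ST(1) = the base
               fild word ptr [pow_m]        ; ST(0) = the running result
               dec cx
               jz pow_done                  ; power 1: nothing to multiply

         pow_loop:
               fmul st, st(1)
               loop pow_loop
         pow_done:
               fistp word ptr [pow_m]       ; 16 bit result: 32767 at most
               fstp st(0)                   ; drop the base, stack is clean
               mov ax, word ptr [pow_m]
               ret

         pow_m dw 1
         Power Endp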

        Of course, the above procedures  do  not handle the exceptions (like
 0!, or x^0).  For the complete program, look at the TAYLOR.ASM file included
 in this tutorial.
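
        A  word  on  ranges. Both routines hand back their result through a 16
 bit  integer  store,  so  it  is worth checking how far they can go before
 the  value  no  longer  fits.  The  Python model below mirrors the two loops
 (multiply,  then  step  the  base);  the  names are invented for this sketch
 and  the  signed  16  bit  limit  is  the assumption being tested.

```python
def factorial_loop(n):
    """Mirror of the factorial loop: product *= base, base += 1, n times."""
    product, base = 1, 1
    for _ in range(n):
        product *= base
        base += 1
    return product

def power_loop(x, n):
    """Mirror of the power loop: start from x, multiply n-1 more times."""
    result = x
    for _ in range(n - 1):
        result *= x
    return result

def fits_word(value):
    """True if the value fits a signed 16 bit integer store."""
    return -32768 <= value <= 32767

for n in range(1, 11):
    f, p = factorial_loop(n), power_loop(3, n)
    print(n, f, fits_word(f), p, fits_word(p))
```

        With  these  limits,  the  later terms of the x=3 table (up to 3^10 and
 10!) need dword stores or have to stay on the FPU stack.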

        And here comes the fun part :

                ADD <KEY register>, AX

        So,  AX  holds 20 if the CPU/FPU executed all the instructions as said
 above,  and  the  register  that  holds the key should increase by 20. If the
 debugger  or  code emulator didn't compute correctly one of the instructions
 then  a  random number is added to the key, killing the decryption process
 completely  (of  course  don't  forget to set AX with some big random number
 before running the Taylor procedure).

        Here  is,  however, one of my favorite decryptors I have ever thought
 of.  Its  main  background  is  the  property  of  three numbers, known as
 Pythagorean numbers. These numbers (a, b, c) verify the following formula:

        a^2 = b^2 + c^2

        Now,  all  you  have to do is find 3 numbers that meet this property,
 like  for  example  a=5,  b=4, c=3. In order to do that, you must choose two
 different,  non-zero  random  numbers  (let's  call  them m and n) and apply
 the following formulas:

        a = m*m + n*n
        b = 2*m*n
        c = |m*m - n*n|

        The  main  property  of  the  Pythagorean numbers is that if they are
 used  as  a triangle's sides, then the angle opposite the a side will always
 be 90°:

        |\
     b  |   \  a
        |      \
        |_________\
            c

        Therefore,  given  the  fact that a triangle's angles summed give a
 total  of  180°,  we can say that angles B and C summed give 90° (where B is
 the  angle opposite side b, i.e. the one made by a and c, and C is the angle
 opposite side c, i.e. the one made by a and b).

        We also know how to compute these angles, as:

        cos(B) = c / a ==> B = arccos(c/a)        (1)
        cos(C) = b / a ==> C = arccos(b/a)        (2)
        sin(B) = b / a ==> B = arcsin(b/a)        (3)
        sin(C) = c / a ==> C = arcsin(c/a)        (4)

        and B + C = 90°, which leads us to our main formula:

        sin(B + C) = sin(90°) = 1

        and for sin(B+C) we have the following formula:

        sin(B+C) = sin(B) * cos(C) + cos(B) * sin(C) = 1, so,

        using (1), (2), (3) and (4), we have that:

 sin(B+C)=sin(arcsin(b/a))*cos(arccos(b/a))+cos(arccos(c/a))*sin(arcsin(c/a))

         = (b/a)*(b/a) + (c/a)*(c/a) = (b^2 + c^2) / a^2 = 1 (aprox.) ==>

         the FPU result carries small rounding errors, so we must round
         sin(B+C) in order to have exactly 1.

        Mind  which  formula  you  use:  cos(B)*cos(C) + sin(B)*sin(C) is
 cos(B-C),  which  works  out  to  2*b*c/a^2  and  not  to  1, and cos(90°)
 itself is 0.

        So,  we  chose  2  random  numbers  (which,  BTW,  don't  have to be
 integers,  they  also may be floating point numbers) which led us to 3 other
 numbers  that  meet  the  Pythagorean  property and using the last formula we
 are sure we'll obtain a result that equals 1.

        As you can see, we are forced to use the ArcSin and ArcCos functions.
 Unfortunately,  the  FPU  doesn't  have these functions. However, it has the
 FPATAN  instruction,  which computes the ArcTangent of ST(1)/ST(0). In order
 to obtain the arcsin and arccos we can use the following formulas (for x
 between 0 and 1):


              ArcSin(x) = ArcTan(x/sqrt(1-sqr(x)));
              ArcCos(x) = ArcTan(sqrt(1-sqr(x))/x);


        Let me take a brief example:

        m = 1
        n = 2

        a = 1*1 + 2*2 = 1 + 4            = 5
        b = 2*1*2                        = 4
        c = |1*1 - 2*2| = |1 - 4| = |-3| = 3

        verification: a^2     = 5*5                = 25
                      b^2+c^2 = 4*4 + 3*3 = 16 + 9 = 25

        Now let's compute angles (in radians, which is what FSIN and FCOS
 expect):

        B = arccos(c/a) = arccos(3/5) = 0.927295218
        C = arccos(b/a) = arccos(4/5) = 0.643501109
        B = arcsin(b/a) = arcsin(4/5) = 0.927295218
        C = arcsin(c/a) = arcsin(3/5) = 0.643501109

        These have been computed using the ArcTan formula presented above.

        sin(arcsin(4/5)) = sin(0.927295218) = 0.8
        cos(arccos(4/5)) = cos(0.643501109) = 0.8
        cos(arccos(3/5)) = cos(0.927295218) = 0.6
        sin(arcsin(3/5)) = sin(0.643501109) = 0.6

        so:

        sin(B+C) = 0.8 * 0.8 +
                   0.6 * 0.6 =
                                 .------.
                 = 0.64 + 0.36 = | 1.00 |
                                 '------'
        ==> round(sin(B+C)) = 1 (bingo ! ;-))

        As  I  said,  the m and n numbers may be floating point numbers which
 will lead to floating point a, b, c's... Much nicer to handle them.
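
        The  triangle  trick  can  be  checked  numerically  for any m and n,
 integers  or  not.  In  a right triangle the two acute angles add up to 90
 degrees,  so  sin(B)*cos(C) + cos(B)*sin(C) = sin(B+C) must come out as 1.
 The  Python  sketch  below  uses the two ArcTan formulas from the text; the
 function names are invented for this illustration.

```python
import math

def triple(m, n):
    """Build (a, b, c) with a^2 = b^2 + c^2 from two numbers m != n."""
    return m * m + n * n, 2 * m * n, abs(m * m - n * n)

def arcsin_via_atan(x):
    return math.atan(x / math.sqrt(1 - x * x))

def arccos_via_atan(x):
    return math.atan(math.sqrt(1 - x * x) / x)

def sin_of_sum(m, n):
    """sin(B)*cos(C) + cos(B)*sin(C), which should be 1."""
    a, b, c = triple(m, n)
    B, C = arcsin_via_atan(b / a), arcsin_via_atan(c / a)
    return math.sin(B) * math.cos(C) + math.cos(B) * math.sin(C)

def cos_of_difference(m, n):
    """cos(B)*cos(C) + sin(B)*sin(C) = cos(B-C) = 2bc/a^2, NOT 1."""
    a, b, c = triple(m, n)
    B, C = arccos_via_atan(c / a), arccos_via_atan(b / a)
    return math.cos(B) * math.cos(C) + math.sin(B) * math.sin(C)

print(triple(1, 2))                      # (5, 4, 3)
print(round(sin_of_sum(1, 2)))           # 1
print(round(sin_of_sum(1.5, 0.25)))      # 1, floating point m and n
print(round(cos_of_difference(1, 2), 2)) # 0.96
```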

        Let's see which FPU instructions we need:

        - FMUL (also used for squaring, by multiplying a number by itself)
        - FSIN
        - FCOS
        - FPATAN
        - FDIV
        - FADD
        - FRNDINT
        - FSQRT
        - FSUB

        I  would  say  rather  plenty  (not  counting the loading and storing
 instructions...).  Taking  into  consideration the quickness of the FPU, the
 above formula is completed very quickly. I want to see an emulator emulating
 it !

        What  do we do with the 1 we obtained ? We can use it to increase the
 pointer  in  the  code  to  be  decrypted,  we  can  use  it to increase the
 encryption key, or anything we can think of.

        Included alongside this article you have a demonstration of the above
 calculations in the PYT.ASM file.

        Also,  both  methods  are  used in the CTAYLOR.ASM and CPYT.ASM files
 which  have  the  purpose  to  demonstrate  the  way  to use the two methods
 presented,  which  I  called  the FPU.Taylor.Crypt and FPU.Pythagoras.Crypt.
 Basically  the  programs  will  display a text on the screen, then display
 it  scrambled  and  then  unscrambled again. You can see the speed of the
 procedures  there. I doubt that there exists any code emulator written yet
 to emulate that code !


        | CREATING SELF MODIFYING CODE
        '------------------------------

        Another  nice way to use FPU instructions is to create self modifying
 code. Basically this is done like this:

        1) make a FPU calculus with a known result
        2) store the result on the following dword

        For example, we know how to obtain the number 00001234h.  That's
 17185 - 12525, for example.

        We'll make this:

                mov ax, 0013h          ; AL = 13h, AH = 0
                mov si, 0
                lea bx, patch
                add bx, 4              ; BX+SI = patch+4
                finit
                fild word ptr [b]      ; ST(0) = 17185
                fild word ptr [a]      ; ST(0) = 12525, ST(1) = 17185
                fsub                   ; ST(0) = 17185 - 12525 = 1234h
                fist dword ptr [patch] ; stores 34h 12h 00h 00h
        patch:
                sub ax, 14h
                nop
                nop
                js patch
        ...
        a dw 12525
        b dw 17185
        In  the moment the integer number is stored over the 'patch' address,
 the  three  bytes  of  sub ax, 14h and the first NOP are overwritten and the
 code changes to:

                xor al, 12h
                add [bx+si], al

                This means that after the XOR Al will turn to 1.

                [bx+si] points to patch+4, where the second NOP (90h) sits.
 By  doing  the  Add  [bx+si],  that  NOP  will change into 91h, which is
 Xchg  ax,  cx.  This  instruction will put 1 into CX. Further, you can use
 the  number 1 in CX in your code. Watch the flags, though: the Add leaves
 91h  in  memory and so sets the sign flag, which means that as listed the
 JS  is  taken  on  the  patched  path  too,  so the condition needs adjusting
 before  you  use  this  as  it  is.  If a code emulator skips the FPU
 instructions,  the  whole  code  goes  to  hell...  This is because the sub
 instruction  will  get  executed  and  AX  will  be  signed  a  reeeeeeally
 long  time,  which  leads  us  into  a  very  long  loop  with  a
 conditional  jump.  This  particular  kind of jump kills many code emulators
 which  pretend to return to the place where the condition happened and go on
 with the code... But what do you do when the code goes infinite ?
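
        The  byte  arithmetic  behind  the  patch  can  be double checked by
 hand  or  with  the  small  Python  calculation below. It only reproduces
 the  numbers  (the  stored  dword,  the  XOR  and  the  ADD);  the opcode
 values  in  the  comments  are  the  standard 8086 encodings and the variable
 names are made up here.

```python
import struct

# What a 32 bit integer store of 17185 - 12525 puts in memory
# (little endian): 34h 12h 00h 00h.
patch_bytes = struct.pack('<i', 17185 - 12525)
print(patch_bytes.hex())          # 34120000

# 34 12 = XOR AL, 12h  ;  00 00 = ADD [BX+SI], AL
al_after = 0x13 ^ 0x12            # AL started as 13h
print(al_after)                   # 1

# The byte at patch+4 is a NOP (90h); adding AL gives 91h = XCHG AX, CX.
nop_after = (0x90 + al_after) & 0xFF
print(hex(nop_after))             # 0x91

# Bit 7 of the ADD result is set, so the sign flag is set afterwards.
sign_flag = bool(nop_after & 0x80)
print(sign_flag)                  # True
```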


<-------------------------------------------------------------- TIPS & TRICKS

        Ok,  everybody  sometimes  thinks  that  he  discovered  something
 marvelous. He is so happy... until he finds out that someone else discovered
 the  same  thing  like  a  few  years  ago... ;-) It doesn't mean you are an
 illiterate,  but,  you  just  didn't read that particular book... Well, this
 happened  to  me  too.  I  thought I found out something really neat, but it
 seems  that  another  guy, a great coder named .... made this up way before I
 even thought about FPU's. It's called 'moving memory using FPU'. The message
 about  this  showed  up  on  my virus mailing-list and I give full credit to
 its  author,  but  still  I  will  present  it  here  as  I think it's a great
 idea.

        So,  the  basic  idea  behind this is that we have a load function in
 the FPU and a store function too. So:

        ; make DS:ESI point to the source code
        ; make ES:EDI point to the destination code
        ; ECX = length of code to be moved, in bytes
        ; the code is moved in 16 byte chunks, so the length is rounded up

                mov_loop:
                        fild qword ptr [esi]        ; first 8 bytes
                        fild qword ptr [esi+8]      ; next 8 bytes
                        fxch                        ; first 8 back on top
                        fistp qword ptr es:[edi]
                        fistp qword ptr es:[edi+8]
                        add edi, 16
                        add esi, 16
                        sub ecx, 16
                        ja mov_loop                 ; go on while bytes are
                                                    ; left (jns would move one
                                                    ; chunk too many when ECX
                                                    ; is a multiple of 16)


        So,  this procedure moves memory very quickly and is undetectable for
 now by any AV or code-emulator. Hope you will use it smartly...
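
        The  only  delicate  spot in such a loop is the exit condition. The
 Python  model  below  counts  how many 16 byte chunks get moved for a given
 length  under  two  possible conditions after the SUB: JNS (go on while the
 counter  is  not  negative)  and  JA  (go  on  while it is strictly above
 zero). The helper names are invented for this sketch.

```python
def chunks_moved(length, keep_going):
    """Count loop passes: move 16 bytes, subtract 16, test the counter."""
    ecx, count = length, 0
    while True:
        count += 1               # one 16 byte chunk moved
        ecx -= 16                # sub ecx, 16
        if not keep_going(ecx):
            return count

jns = lambda ecx: ecx >= 0       # continue while not negative
ja = lambda ecx: ecx > 0         # continue while strictly above zero

for length in (15, 16, 17, 32):
    print(length, chunks_moved(length, jns), chunks_moved(length, ja))
```

        For  a  length  that  is an exact multiple of 16 the JNS form moves one
 chunk more than needed, while the JA form moves exactly length/16 chunks.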


        These  would  be  some thoughts and ideas about how you can play with
 the  FPU  instructions,  but  I repeat, there are thousands and thousands of
 ways to do it. And, as I said, almost no emulator or real debugger can break
 it. If you can, you should use more than one method just to be sure, because
 some AV's have nevertheless started emulating a couple of these instructions.


        | THE ENVELOPE OF THE MATRIX METHOD
        '-----------------------------------

        I called this method in this way, exactly because we are about to use
 a  matrix  in order to obtain our encryption. Ok, so the usual algorithm for
 encryption  just  takes  one by one bytes or words or dwords or whatever and
 applies  a  math  operation over them and then stores the result. This is a
 linear  encryption  which  can  be  broken  very  easily by a good programmer.
 However,  if  we are creating a devious encryption method that is hard to
 understand  once  coded, we got big chances. So, let's start. Let's say we
 have to encrypt a part of a file that looks like this:

        a1, a2, a3, ... , an

        where the ak are the encryption units (byte, word, dword, qword,
        tbyte,..)

        We  will  then  take  the  first 25 units and arrange them in a square
 matrix like this:

        a11 a12 a13 a14 a15
        a21 a22 a23 a24 a25
        a31 a32 a33 a34 a35
        a41 a42 a43 a44 a45
        a51 a52 a53 a54 a55

        Ok,  now  let's define what is 'giving a roll to the matrix'. Imagine
 that the above matrix is a piece of paper. A square. And you want to fold it
 over the first diagonal. You would obtain this result:

       a _________              _________
        |        /|            |        /
        |      /  |   ------>  |      /
        |    /    |            |    /
        |  /      |            |  /      (we took corner b over corner a)
        |/________|            |/
                   b

        Now,  the  same  thing  is what we shall do with our matrix above. We
 shall  take each value from beneath the first diagonal and bring it over the
 opposite value. We'll do this by applying a math formula. First we are gonna
 apply  an  'ADD-ROLL',  which  means  that  each  element  beneath the first
 diagonal  will be added to its pair above the diagonal. Let's see what we
 get:

        a11+a55  a12+a45  a13+a35  a14+a25  a15
        a21+a54  a22+a44  a23+a34    a24    a25    (FD - ADD-ROLL)
        a31+a53  a32+a43    a33      a34    a35    (First Diagonal Add Roll)
        a41+a52    a42      a43      a44    a45
          a51      a52      a53      a54    a55

        So,  I think it's clear enough. All elements above the first diagonal
 had the elements beneath the first diagonal added to them.
        In  the second step we shall apply a SD-SUB-ROLL, which means that we
 are  going  to take the left-down corner and put it over the right-up corner
 and  the  math  operation  between  the  elements  will be subtract. I'm not
 drawing another matrix because I hope it's clear.
        Then   we   are  going  to apply  a H-XOR-ROLL (horizontal xor roll),
 which  means  that  we are taking all elements beneath the horizontal middle
 line  of  the  matrix  and  xor  them over their opposite elements above the
 horizontal line.
        Finally  we  apply  a V-ADD-ROLL  (vertical add roll), which means we
 add every element from the left side of the vertical center of the matrix to
 their opposite elements on the right side.

        After  all  these  are  done,  we  can say that our initial matrix is
 pretty  messed  up.  Let's  call the final scrambled matrix A, and define it
 like this:

        A11 A12 A13 A14 A15
         :               :
        A51 ........... A55

        So, the final formulas after applying the above rollings are:

        Encryption  formulas  (I noted the XOR operation with '|'; the
 SD-SUB-ROLL  is  taken  as 'element above minus element beneath', and all
 additions and subtractions wrap around at the size of the unit):

        A11 = (a11+a55)|a51
        A12 = (a12+a45-a21-a54)|a52
        A13 = (a13+a35-a31-a53)|a53
        A14 = (a14+a25-a41-a52)|a54 + (a12+a45-a21-a54)|a52
        A15 = (a15-a51)|a55 + (a11+a55)|a51
        A21 = (a21+a54)|(a41+a52)
        A22 = (a22+a44)|a42
        A23 = (a23+a34-a32-a43)|a43
        A24 = (a24-a42)|a44 + (a22+a44)|a42
        A25 = (a25-a52)|(a45-a54) + (a21+a54)|(a41+a52)
        A31 = a31+a53
        A32 = a32+a43
        A33 = a33
        A34 = a34-a43+a32+a43
        A35 = a35-a53+a31+a53
        A41 = a41+a52
        A42 = a42
        A43 = a43
        A44 = a44+a42
        A45 = a45-a54+a41+a52
        A51 = a51
        A52 = a52
        A53 = a53
        A54 = a54+a52
        A55 = a55+a51

        So, now we have our scrambled matrix. Of course, as you can see there
 still  are  a  couple  of codes that didn't get encrypted. No problem !
 So, we have 25 elements. Let's see:

    if the a's are bytes we have 8*25  = 200 bits
    if the a's are words we have 16*25 = 400 bits

    Anyway, the total number of bits is divisible by ten.  Now  here  is  the
 thing.   You  should  create a 10 bit long key.  Why ?  Because this is most
 unusual.  Put the first 8 bits in the  register Al, for ex., and the other 2
 bits in register Bl, like this:

        aaaaaaaabb000000
        |  al  || bl   |

        Now, we put our scrambled matrix like this:

        A11, A12, ... , A21, A22, ... , A55

        And we look at it at the bit level.
        First  apply  a  XOR over the beginning of the elements. Then shift
 the key like this:

        000000aaaaaaaabb
        |  al  ||  bl  |

        This  is easily done using the shifting with carry. Then increase the
 pointer with one byte and apply again. It will be like this:

 bits to scramble: xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx ...
 key:            : aaaaaaaa bb000000 aaaaaaaa bb000000
                            000000aa aaaaaabb 000000aa aaaaaabb ...

        I  hope  you  get  it.  You scramble at bit level with a ten bit key,
 with  an  interleaved  algorithm,  leaving  four  unscrambled bits for every
 12  scrambled  ones.  I would say it's a rather peculiar cryption. The
 decryption  is  very  easy.  The  only nifty thing is that if someone sees
 the  code  that  does  the  above  thing disassembled he will have to work
 his butt off a few days to figure out the encryption.

        OK  !  Now  we have our matrix completely encrypted and placed in the
 encrypted  code place. We return to the unencrypted code. We firstly took 25
 units.  Now  we  must take another 25 units and continue with the algorithm.
 And  so  we  do,  until  fewer than 25 units remain, so we create a matrix
 with  the  last  elements  padded with zero. And there you have the Envelope
 of the matrix cryption method.

        You  will ask, well how the hell do I decrypt it ? Thought I'd  leave
 you here ? ;-)

        So,  first,  when  decrypting you must start again by retrieving 25
 units  from  the crypted code. Then you decrypt the 10 bit key encryption and
 you  have  the  original  A11...A55 matrix. Here are the formulas to decrypt
 the  matrix  in  order to obtain the original a11, a12,..., a55 elements. And
 no,  they  are  not  in  a random order. They are in the exact order in which
 you  CAN  decrypt them, each line using only elements already found ! Here
 we go:

        a33 = A33
        a42 = A42
        a43 = A43
        a51 = A51
        a52 = A52
        a53 = A53
        a54 = A54 - a52
        a55 = A55 - a51
        a44 = A44 - a42
        a41 = A41 - a52
        a45 = A45 + a54-a41-a52
        a31 = A31 - a53
        a35 = A35 + a53-a31-a53
        a32 = A32 - a43
        a34 = A34 + a43-a32-a43
        a11 = (A11|a51) - a55
        a21 = (A21|(a41+a52)) - a54
        a22 = (A22|a42) - a44
        a23 = (A23|a43) - a34+a32+a43
        a24 = ((A24 - (a22+a44)|a42)|a44) + a42
        a12 = (A12|a52) - a45+a21+a54
        a13 = (A13|a53) - a35+a31+a53
        a15 = ((A15 - (a11+a55)|a51)|a55) + a51
        a25 = ((A25 - (a21+a54)|(a41+a52))|(a45-a54)) + a52
        a14 = ((A14 - (a12+a45-a21-a54)|a52)|a54) - a25+a41+a52

        That's  it  !  Applying these formulas you have the original state of
 the matrix which you put in its decrypted string place unit after unit.
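
        Since  the  formulas  are  long  and easy to get wrong, here is a small
 Python  model  of  the  four  rolls  and their inverses on one 5x5 block of
 bytes.  It  is  only  a  sketch  under  stated assumptions: the rolls are
 implemented  literally  as  described,  the  subtract  roll  is  taken as
 'element  above  minus element beneath', arithmetic wraps at 256, and all
 function names are invented here.

```python
import random

N, MOD = 5, 256

def fd_add(m, sign=1):
    """Fold over the first diagonal: add the mirrored lower element."""
    for i in range(N):
        for j in range(N):
            if i + j < N - 1:
                m[i][j] = (m[i][j] + sign * m[N-1-j][N-1-i]) % MOD

def sd_sub(m, sign=1):
    """Fold left-down corner over right-up corner: upper minus lower."""
    for i in range(N):
        for j in range(i + 1, N):
            m[i][j] = (m[i][j] - sign * m[j][i]) % MOD

def h_xor(m):
    """XOR the rows beneath the middle line over the rows above it."""
    for i in range(N // 2):
        for j in range(N):
            m[i][j] ^= m[N-1-i][j]

def v_add(m, sign=1):
    """Add the left columns to their opposite right columns."""
    for i in range(N):
        for j in range(N // 2):
            m[i][N-1-j] = (m[i][N-1-j] + sign * m[i][j]) % MOD

def to_matrix(block):
    return [list(block[r*N:(r+1)*N]) for r in range(N)]

def encrypt(block):
    m = to_matrix(block)
    fd_add(m); sd_sub(m); h_xor(m); v_add(m)
    return [v for row in m for v in row]

def decrypt(block):
    m = to_matrix(block)                 # undo the rolls in reverse order
    v_add(m, -1); h_xor(m); sd_sub(m, -1); fd_add(m, -1)
    return [v for row in m for v in row]

plain = [random.randrange(MOD) for _ in range(N * N)]
print(decrypt(encrypt(plain)) == plain)  # True
```

        Undoing  the  rolls  in reverse order is equivalent to the ordered list
 of decryption formulas, just easier to keep free of typing mistakes.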

        And  as  we  spoke so much about FPU, I think I don't have to mention
 that   all  the  above  calculations  may  be  done  using  the  faster  fpu
 instructions.  I doubt any code emulator will be able to 'violate your mail'
 by ripping your envelopes ;-)))

        I sure hope I'll get the time to write an example on this.
        Unfortunately, I didn't...

<-------------------------------------------------------------- P-MODE ATTACK

        What I am going to tell  you  now  is  something I am not quite sure
 will work on all systems, but if you read it, you  may  try  it  at
 least.  The main idea is this: switch to P-Mode in a strange manner, do
 something and then switch back to real-mode...  A code emulator will be dead
 by then...

        Here is the main thing we are interested in:

        Some multiplex interrupts.
        First,  the  interrupt  that tells us if we have DPMI present and if
 so, what is the address we need in order to switch to it. It goes like this:

 Expects: AX =  1687h
 ------------------------------------------------------------------
 Returns: AX    0000h = successful
                else  = no DPMI host present
          BX    flags:     bit 0: 0=32-bit programs are not supported
                                  1=32-bit programs are supported
                       bits 1-15: not used
          CL    processor type:  02H = 80286
                                 03H = 80386
                                 04H = 80486
                                 05H = Pentium
                                 >5  = reserved for future Intel CPUs
          DX    DPMI major + minor version number  (e.g., 010aH=1.10)
          SI    number of 16-byte paragraphs needed for DPMI host private data
          ES:DI entry address to call to enter Protected Mode
          ------------------------------------------------------------------

       SI on return,  this  is  an  amount  of  real-mode memory, in 16-byte
 paragraphs, that you must supply when you process the switch  (see  below).
 It might be 0000H, indicating no memory needed.
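
        To  make  the  register  layout  above  concrete,  here  is  a small
 Python  helper  that decodes the returned values the way the table describes
 them.  It  is  only an illustration of the table; decode_dpmi_info() and the
 sample register values are invented for this sketch.

```python
def decode_dpmi_info(bx, cl, dx, si):
    """Decode the values returned by INT 2Fh, AX=1687h (per the table)."""
    cpus = {2: "80286", 3: "80386", 4: "80486", 5: "Pentium"}
    return {
        "supports_32bit": bool(bx & 1),              # BX bit 0
        "cpu": cpus.get(cl, "unknown/reserved"),     # CL
        "version": f"{dx >> 8}.{dx & 0xFF:02d}",     # DH = major, DL = minor
        "private_bytes": si * 16,                    # SI is in paragraphs
    }

# Hypothetical sample: 32 bit capable host, 80486, version 1.10, 8 paragraphs.
info = decode_dpmi_info(bx=0x0001, cl=0x04, dx=0x010A, si=0x0008)
print(info["version"], info["cpu"], info["private_bytes"])  # 1.10 80486 128
```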

    ES:DI  on  return,  this  is  the Entry Address you must call (via a FAR
 CALL) in order to switch to protected mode.  The calling parameters are:

          Entry:
            AX= 0000H = you'll be running as a 16-bit application
                0001H = you'll be running as a 32-bit application
            ES= the segment of the memory you'll be supplying to DPMI host.
                If SI was 0 after INT 2fH 1687H, then ES is ignored.
          Return:
            CF set (CY) if switch to protected mode failed
               (and AX is a DPMI Error Code)
            CS = selector for your code segment (64K limit)
            SS = selector for your stack segment (64K limit)
            DS = selector for your data segment (64K limit)
            ES = selector for your program's PSP (256-byte limit)
            FS = 0 (on 80386+ CPUs)

       There's no need to list  here  the  DPMI error codes...  Either
 you can or you cannot enter PMODE.   Let's  check  a  little  program  that
 should (or at least I hope) go into PMODE and back:

 start:
    mov ax, 1687h                 ; We call the multiplex int
    int 2fh                       ;
    cmp ax, 0                     ; do we have DPMI ?
    jne no_dpmi                   ; no...
    mov switchcall, di            ; if so, save the switch call address
    mov switchcall+2, es          ; offset and segment
    cmp si, 0                     ; does the host want memory (16 byte chunks) ?
    je no_mem                     ; no... (AX is 0 here and ES gets ignored)
    mov bx, si                    ; otherwise allocate SI paragraphs
    mov ah, 48h                   ; using DOS
    int 21h                       ;
    jc error                      ; if this occurs you have no memory...
; so you might need to shrink mem using 4Ah first...
no_mem:
    mov es, ax                    ; put the new segment in ES
    mov ax, 0                     ; choose 16 bit application
    call dword ptr switchcall     ; far call to the switch address -> PMODE
    jc cantswitch                 ; error ?
    mov ax, 0400h                 ; we're in PMODE now: try to use
    int 31h                       ; interrupt 31h (query DPMI version)
                                  ;
    mov ax, 4c00h                 ; terminate: the DPMI host cleans up
    int 21h                       ; and goes back to REAL mode
                                  ;
error:                            ;
no_dpmi:                          ;
cantswitch:                       ;
    mov ax, 4c00h                 ; and quit
    int 21h                       ;
                                  ;
switchcall dw 0, 0                ; call switch address (offset, segment)
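
       The comment above about shrinking memory deserves a few lines.  A COM
 program normally owns all the free memory when it starts, so function 48h
 has nothing left to give.  Here is a minimal sketch of how the shrink with
 4Ah could look, assuming TASM/MASM syntax and a plain COM file.  The labels
 stack_top and shrink_failed are made up for this example:

```asm
; Sketch only: give back unused memory from a COM program so that a later
; INT 21h/48h can succeed. stack_top and shrink_failed are hypothetical names.
.model tiny
.code
org 100h
start:
    mov sp, offset stack_top      ; move the stack inside the part we keep
    mov bx, offset stack_top + 15 ; bytes used from the PSP on, rounded up
    mov cl, 4                     ;
    shr bx, cl                    ; bytes -> 16 byte paragraphs
    mov ah, 4ah                   ; resize the block at ES
    int 21h                       ; (ES = PSP when a COM program starts)
    jc shrink_failed              ; CF set = DOS refused the resize
                                  ;
    ; ... from here on INT 21h/48h has free memory to hand out ...
                                  ;
shrink_failed:                    ;
    mov ax, 4c00h                 ; quit
    int 21h                       ;
                                  ;
    dw 128 dup (0)                ; 256 bytes of private stack
stack_top:
end start
```

 The stack is moved first because the default COM stack sits at the top of
 the 64K segment, which is exactly the area being released.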


       So,  as you could see,  once in PMODE we have interrupt 31h to use. In
 order to use it,  you must have a real grip on what a selector, a descriptor,
 etc. is, so better check some DOS32 documentation. The useful functions are:

             AX    Function Use
             ---------------------------------------
             0000H (allocate LDT descriptors)
             0001H (free LDT descriptor)
             0002H (segment to descriptor)
             0003H (query selector increment value)
             0006H (query segment base address)
             0007H (set segment base address)
             0008H (set segment limit)
             0009H (set descriptor access rights)
             000aH (create alias descriptor)
             000bH (query descriptor)
             000cH (set descriptor)
             000dH (allocate specific descriptor)
             000eH (query multiple descriptors)
             000fH (set multiple descriptors)
             0100H (allocate DOS memory block)
             0101H (free DOS memory block)
             0102H (resize DOS memory block)
             0200H (query real-mode interrupt vector)
             0201H (set real-mode interrupt vector)
             0202H (query processor exception handler vector)
             0203H (set processor exception handler vector)
             0204H (query protected-mode interrupt vector)
             0205H (set protected-mode interrupt vector)
             0300H (simulate real-mode interrupt)
             0301H (call real-mode for FAR RET return)
             0302H (call real-mode for IRET return)
             0303H (allocate real-mode callback address)
             0304H (free real-mode callback address)
             0305H (query state save/restore addresses)
             0306H (query raw mode switch address)
             0400H (query DPMI version)
             0401H (query DPMI capabilities)
             0500H (query free memory information)
             0501H (allocate memory block)
             0502H (free memory block)
             0503H (resize memory block)
             0504H (allocate linear memory block)
             0506H (query page attributes)
             0507H (set page attributes)
             0508H (map device in memory block)
             0509H (map conventional memory in memory block)
             050aH (query memory block size and base)
             050bH (query memory information)
             0600H (lock linear region)
             0601H (unlock linear region)
             0602H (mark real-mode region as pageable)
             0603H (relock real-mode region)
             0604H (get page size)
             0700H (mark page as demand paging candidate)
             0701H (discard page contents)
             0800H (physical address mapping)
             0801H (free physical address mapping)
             0900H (disable virtual interrupt state)
             0901H (enable virtual interrupt state)
             0a00H (query vendor-specific API entry address)
             0b00H (set debug watchpoint)
             0b01H (clear debug watchpoint)
             0b02H (query state of debug watchpoint)
             0b03H (reset debug watchpoint)
             0c00H (setup DPMI TSR callback)
             0c01H (protected-mode terminate and stay resident)
             0d00H (allocate shared memory)
             0d01H (free shared memory)
             0d02H (serialize on shared memory)
             0d03H (free serialization on shared memory)
             0e00H (query coprocessor status)
             0e01H (set coprocessor emulation)
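
       As a small illustration of the table, here is a sketch of what comes
 back from function 0400H, the one called in the program above.  It assumes
 you are already in PMODE and uses TASM/MASM syntax.  The variable names and
 the host_is_16bit label are made up for this example:

```asm
; Sketch only: read the values returned by INT 31h, AX=0400h.
; ver_major, ver_minor, cpu_type and host_is_16bit are hypothetical names.
    mov ax, 0400h                 ; query DPMI version
    int 31h                       ;
    mov ver_major, ah             ; AH = major version
    mov ver_minor, al             ; AL = minor version
    mov cpu_type, cl              ; CL = 02h (286), 03h (386), 04h (486)
    test bx, 1                    ; BX bit 0 = host is a 32 bit implementation
    jz host_is_16bit              ;
    ; ... 32 bit host: function 0501h and friends are worth a try ...
host_is_16bit:                    ;
    ; DH / DL = base interrupts of the master / slave PIC
                                  ;
ver_major db 0                    ; keep these out of the execution path
ver_minor db 0                    ;
cpu_type  db 0                    ;
```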


        As you can read in the descriptions, there are quite  a  few  inter-
 esting things out there. But, as I said, I don't have time to write on this
 anymore, so just go ahead and try using some of the above functions.  I think
 it would be really neat to allocate DOS memory from protected mode and then
 return to real mode and use it... although I didn't try it ;-)).
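
       For whoever wants to try,  the  allocation  half  of  that idea could
 look like the sketch below.  It assumes a 16 bit client already in PMODE
 and TASM/MASM syntax.  The names dos_seg, dos_sel and alloc_failed are made
 up for this example, and it is untested just like the idea itself:

```asm
; Sketch only: allocate a DOS memory block from PMODE with INT 31h, AX=0100h.
; dos_seg, dos_sel and alloc_failed are hypothetical names.
    mov ax, 0100h                 ; allocate DOS memory block
    mov bx, 10h                   ; 10h paragraphs = 256 bytes
    int 31h                       ;
    jc alloc_failed               ; CF set: AX = DOS error code,
                                  ;         BX = largest block available
    mov dos_seg, ax               ; AX = real mode segment of the block
    mov dos_sel, dx               ; DX = selector to reach it from PMODE
                                  ;
    push es                       ; keep our old ES
    mov es, dx                    ; address the block through the selector
    xor di, di                    ;
    mov byte ptr es:[di], 'J'     ; write something into it
    pop es                        ; don't leave a soon-to-die selector in ES
                                  ;
    mov ax, 0101h                 ; free DOS memory block
    mov dx, dos_sel               ; by its selector
    int 31h                       ;
alloc_failed:                     ;
    ; ... carry on or quit ...
                                  ;
dos_seg dw 0                      ; keep these out of the execution path
dos_sel dw 0                      ;
```

 The segment saved in dos_seg is the value real mode code would use, for
 example when passed along through function 0300H.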


         .-----------------.
         |   Final Word    |
         '-----------------'

         So,  as  I  said,  I left this behind me now and I am going towards
 Win32  programming.  In  order  to  do  that  I spent a lot of time studying,
 I had to read my butt off and gather utilities and tutorials and tools and
 everything,  so  I  kinda  left  this  aside...  So, I guess my next article
 will be on Win95/98...

        Write me anytime with suggestions, ideas or anything at:

              lordjulus@geocities.com

       From time to time check my page at:

              http://members.tripod.com/~lordjulus

       If you are interested in virii news and info, you may try to  join  my
 virus list by sending a blank e-mail to:

              virus-list-subscribe@makelist.com

 All the Best !

                                               .-------------------------.
                                               |  Lord Julus - 1998 (c)  |
                                               '========================='

 I would like to thank the following: Qark, Quantum, RockSteady, DarkAngel, 
 Hellraiser, MrSandman, Darkman, VirtualDaemon, JackyQwerty, Azrael,  B0z0, 
 Neurobasher, NowhereMan, TheUnforgiven, LiquidJesus, a.s.o...              
                                                      Lord Julus