
Defeating JavaScript Obfuscation

TL;DR

Now this is a story all about how
My code gets flipped-turned upside down
And I’d like to take a minute
Just sit right there
I’ll tell you how I came to build my deobfuscator - REstringer.

 

To make a long story short, I’m releasing a JavaScript deobfuscation tool called REstringer, both as code and as an online tool.

To make a short story long - I want to share my incentive for creating the tool, some design decisions, and the process through which I’m adding new capabilities to it - so you can join in on the fun! It’s time to put the FUN into deobFUNsca- – Ok this might need some work 😅.

REsolve, REplace, REpeat

The Name Is Carter. Magecarter.

A couple of years ago, I started working on the Code Defender research team at PerimeterX.

Part of my workflow includes investigating Magecart attacks, which are now almost synonymous with skimming and client-side supply chain attacks. These attacks usually take the form of JavaScript files either injected directly into compromised sites or delivered via a compromised third-party script. I collect these files and analyze them. I will share more about my investigation process in a future post.

There are many approaches to code analysis, though we can basically categorize them as:

  • Static - reviewing the code without running it. Finding patterns, and following the code’s execution flow.
  • Dynamic - running the code in whatever research environment works for you, stopping mid-execution and observing values and stack order in context.

Unless the script is really simple to understand or you’ve already dealt with similar code, you’d use a combination of the two approaches to get the complete picture of what the script does and how.

For example, when analyzing a script like the anti-vm skimmer mentioned in my previous blog posts (The Far Point of a Static Encounter and Automating Skimmer Deobfuscation), I would use a combination of static review of the code’s structure and only run parts of the code (in the Chrome devtools, on the about:blank page):

  1. Identify an unpacking / decoding function by observing the code structure. How this is identified is mentioned in the previous blogs.

    The UDS function matches the description:

        function UDS(b) {
            // ...
            return f.join('');
        }
  2. Identify an instance where UDS is called with a string argument.

       UDS('laessrrutnwmbnpoyhokgixzdoutrjtqccvcf').substr(0, 11);

    I replaced the MXQ variable with its value, 759 - 748, which is 11.

  3. Unpack / decode relevant strings dynamically by running as little code as possible. I copy the UDS function and append the call to it afterwards to get the value:

[Image: the copied UDS function followed by the call, evaluated in the devtools console]

Now, I take that value and replace the function call with it.
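The manual steps above can be sketched as a small script. This is only an illustration: resolveCall is a hypothetical helper name, and since the body of UDS is elided above, a trivial stand-in decoder called rev (it just reverses a string) takes its place. new Function stands in for pasting code into the devtools console.

```javascript
// Hypothetical helper: evaluate a single call against only the source of the function it needs.
function resolveCall(funcSrc, callSrc) {
  // new Function keeps the evaluated declarations out of the surrounding scope.
  return new Function(`${funcSrc}\nreturn ${callSrc};`)();
}

// Stand-in decoder for illustration only (NOT the real UDS): it reverses its input.
const decoderSrc = `function rev(s) { return s.split('').reverse().join(''); }`;
const obfuscated = `el[rev('tnetnoCtxet')] = rev('olleh').substr(0, 759 - 748);`;

// Step 2: find calls with a string literal argument,
// optionally followed by a substr call containing simple arithmetic.
const callRegex = /rev\('[^']*'\)(?:\.substr\([\d\s,+-]+\))?/g;

// Step 3: run as little code as possible and replace each call with its value.
const resolved = obfuscated.replace(callRegex, call =>
  JSON.stringify(resolveCall(decoderSrc, call)));

console.log(resolved);
```

Only the decoder and the matched call are ever executed; the rest of the script is treated as text.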

That’s not so bad, is it? Doing this once or twice is fine and dandy, but it can be pretty time consuming and tiresome when faced with more and more instances of the same obfuscation.

In the words of Raymond Hettinger - “There must be a better way”! And of course there is! If you can do it manually, you can almost definitely write a script to do it.

The Better Way (Take 1) - An Array of References

Let’s start with something simple and work our way up.

One of the more basic obfuscation types actively used by Magecart attackers is the Array Replacements method. It is characterized by having an array initialized with strings, and references throughout the code to indices inside the array.

How does it look?

    var arr = ['log', 'hello', ' ', 'world'];
    console[arr[0]](arr[1] + arr[2] + arr[3]);

I’ll follow the example of Guy Bary, in his Analyzing Magecart Malware – From Zero to Hero blog post:

  1. Load the array into memory.
  2. Find the array references using regex.
  3. Get their value by running them with eval.
  4. Replace the references with the actual value.

    const code = `var arr = ['log', 'hello', ' ', 'world'];
    console[arr[0]](arr[1] + arr[2] + arr[3]);`;

    eval(code.split('\n')[0]); // Load only the array into memory.
    let deob = code;
    const arrName = 'arr';
    const referencesRegex = new RegExp(`${arrName}\\[\\d+\\]`);
    let match = deob.match(referencesRegex);
    while (match) {
      const original = match[0];
      const replacementValue = JSON.stringify(eval(original));
      deob = deob.replace(original, replacementValue);
      match = deob.match(referencesRegex);
    }
    console.log(deob);

While this does work, it requires manual intervention - providing the name of the array. Of course, we could and should also cache the results or use replaceAll to improve performance, but that would take away from the simplicity of the example.
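For the curious, here is roughly what the cached, single-pass variant could look like. It is a sketch under the same assumptions as the example above (the array sits alone on the first line and its name is known); the cache Map and the use of new Function instead of a bare eval are my own choices for the illustration.

```javascript
const code = `var arr = ['log', 'hello', ' ', 'world'];
console[arr[0]](arr[1] + arr[2] + arr[3]);`;

const arrName = 'arr';
// Load only the array (first line) and hand its value back, without leaking it into this scope.
const arrValues = new Function(`${code.split('\n')[0]}\nreturn ${arrName};`)();

const cache = new Map(); // reference source text -> replacement literal
const referencesRegex = new RegExp(`${arrName}\\[(\\d+)\\]`, 'g');

// A global regex with a replacer callback rewrites every reference in one pass.
const deob = code.replace(referencesRegex, (ref, index) => {
  if (!cache.has(ref)) cache.set(ref, JSON.stringify(arrValues[Number(index)]));
  return cache.get(ref);
});

console.log(deob);
```

The cache hardly matters for four references, but it saves repeated work when the same index shows up hundreds of times.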

Notice I’m initially running only the first line of the code, using code.split('\n')[0], to avoid any side effects which might arise due to running the entire code, namely a ‘hello world’ message being printed out during execution. Although I managed to avoid that here, the source script could have placed the array pretty much anywhere, or be devoid of new lines completely (as is often the case with injected code). The side effect could also be a lot worse than just a benign message.
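When there are no newlines to split on, one option is to pull the array declaration out with a regex and evaluate only the literal. The sketch below assumes single-quoted strings with no escaped quotes or brackets inside them, which real-world samples will not always respect.

```javascript
// Minified input: no newlines, and the array is not the first statement.
const code = `var x=1;var arr=['log','hello',' ','world'];console[arr[0]](arr[1]+arr[2]+arr[3]);`;

// Match "var <name> = [ 'str', 'str', ... ]" anywhere in the source.
const arrayDeclRegex = /var\s+(\w+)\s*=\s*(\[\s*'[^']*'(?:\s*,\s*'[^']*')*\s*\])/;
const [, arrName, arrLiteral] = code.match(arrayDeclRegex);

// Evaluate only the array literal, so nothing else in the script gets a chance to run.
const values = new Function(`return ${arrLiteral};`)();

console.log(arrName, values);
```

Since only the literal is evaluated, the console call in the script never runs, wherever the array happens to be placed.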

Increasing The Complexity

Let’s mix it up a bit. Consider an evolution of this obfuscation technique, the Augmented Array Replacements method. It has the same characteristics as its un-augmented counterpart, but with an additional Immediately Invoked Function Expression (or IIFE) with a reference to the array as one of its arguments, which changes (augments) the values in the array in some way.

How does it look?

    var arr = ['world', ' ', 'hello', 'log'];
    (function(a, b) {
      a = a.reverse();
    })(arr, 7);
    console[arr[0]](arr[1] + arr[2] + arr[3]);

After arr is declared, the IIFE reverses its order so that each index now points to the correct value. If we try to run the deobfuscation script on it as-is, we’d get this broken code:

    console["world"](" " + "hello" + "log");

The easy solution is to run the script as-is to get the context just right, and then search-and-replace. To do that, we remove the line splitting from eval(code.split('\n')[0]) and are left with eval(code).
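Running the whole script brings the side effects back, of course. One way to soften that, sketched here, is to run the code inside a function whose console parameter shadows the real one. fakeConsole is a hypothetical stub that only covers log; any other side effect (network, DOM, and so on) would still happen.

```javascript
const code = `var arr = ['world', ' ', 'hello', 'log'];
(function(a, b) {
  a = a.reverse();
})(arr, 7);
console[arr[0]](arr[1] + arr[2] + arr[3]);`;

// A stand-in console that records calls instead of printing them.
const logged = [];
const fakeConsole = {log: (...args) => logged.push(args.join(' '))};

// Run the entire script, then hand back the array in its augmented state.
const arr = new Function('console', `${code}\nreturn arr;`)(fakeConsole);

// Search-and-replace against the augmented array.
const deob = code.replace(/arr\[(\d+)\]/g, (ref, i) => JSON.stringify(arr[Number(i)]));

console.log(deob.split('\n').pop());
```

The last line comes out as a working console["log"] call, and the message the script tried to print ends up in logged instead of on screen.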

Traps That Spring Break (Execution)

Guy’s blog post deals with a different obfuscation type, though - the Augmented Array Function Replacements (I’m skipping the un-augmented version). This obfuscation type also contains an array of strings and an IIFE augmenting it, but instead of the array being directly referenced throughout the code, a function call with string/number parameters is used. This function retrieves the value from the array corresponding to the provided parameters.

How does it look?

    var arr = ['d29ybGQ=', 'IA==', 'aGVsbG8=', 'bG9n'];
    (function(a, b) {a = a.reverse();})(arr, 7);
    function dec(a, b) {
      debugger;
      return atob(arr[b - 57]);
    }
    console[dec('vxs', 57)](dec('ap3', 58) + dec('3j;', 59) + dec('aaa', 60));

This toy example shows the added value of the function and why it’s preferred to the simpler array replacements: you can do anything you want inside the function, namely, distance the parameter value from the actual array index to avoid simple search and replace tactics, and lay traps to ensnare investigators and to poke in curious eyes 👉👀. The trap I’ve included hangs execution (but doesn’t break it) if the devtools are open or if it runs inside an IDE with debugging abilities. Due to the script being very short with only 4 calls to the dec function, it doesn’t really do much more than annoy, but imagine a script with hundreds of calls - it’d get pretty old pretty fast. You can of course counter this simple trap in a number of ways such as disabling breakpoints, removing that line manually, etc.
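Removing the trap can also be scripted. The sketch below strips debugger statements with a regex before evaluating the function. It assumes the word never appears inside a string or regex literal, and it uses an array that is already in its reversed (augmented) order to keep the example short.

```javascript
const trapped = `function dec(a, b) {
  debugger;
  return atob(arr[b - 57]);
}`;

// Naive removal: drop every debugger statement, with or without a trailing semicolon.
const defused = trapped.replace(/\bdebugger\b\s*;?/g, '');

// The array as it looks after the IIFE has reversed it.
const context = `var arr = ['bG9n', 'aGVsbG8=', 'IA==', 'd29ybGQ='];\n${defused}\n`;
const value = new Function(`${context}return dec('vxs', 57);`)();

console.log(value);
```

An AST-based approach would remove DebuggerStatement nodes instead, which avoids touching strings that merely contain the word.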

Here are a few examples of conditions under which traps might spring and break execution:

  • The devtools are opened.
  • The script is beautified.
  • There are indicators of the NodeJS environment.
  • There are indicators of a VM environment.
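To make two of these conditions a bit more concrete, here is a rough idea of what such checks can look like. These are illustrations I wrote for this post, not code lifted from a specific sample; real traps vary widely and tend to be obfuscated themselves. The names looksLikeNode and looksBeautified are hypothetical.

```javascript
// NodeJS indicator: browsers do not expose a global process object with a node version.
const looksLikeNode = () =>
  typeof process !== 'undefined' && Boolean(process.versions && process.versions.node);

// Beautification indicator: a function shipped on a single line gains newlines once pretty-printed.
const looksBeautified = fn => fn.toString().includes('\n');

// Indirect eval keeps the exact source text, which is what toString() reports back.
const minified = (0, eval)('(function(a){return a+1})');
const beautified = (0, eval)('(function (a) {\n  return a + 1;\n})');

console.log(looksBeautified(minified), looksBeautified(beautified));
console.log(looksLikeNode());
```

A trap would typically act on the result, for example by looping forever or by silently returning wrong values.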

Another useful trick for throwing the investigation of this obfuscated script off track is supplying the dec function with a throwaway argument, which isn’t used at all (like the a argument in the example). Each call contains a different, random first parameter, which complicates search-and-replace patterns. Adding insult to injury, you can even overload the function calls with additional unnecessary parameters to completely change the call signature.

As those of you who like playing with the code while reading might have noticed, this example can still be deobfuscated with our simple search and replace technique, like so:

    const code = `var arr = ['d29ybGQ=', 'IA==', 'aGVsbG8=', 'bG9n'];
    (function(a, b) {a = a.reverse();})(arr, 7);
    function dec(a, b) {
      debugger;
      return atob(arr[b - 57]);
    }
    console[dec('vxs', 57)](dec('ap3', 58) + dec('3j;', 59) + dec('aaa', 60));`;
    
    eval(code);
    let deob = code;
    const funcName = 'dec';
    const referencesRegex = new RegExp(`${funcName}\\('[^)]+?\\)`);
    let match = deob.match(referencesRegex);
    while (match) {
    	const original = match[0];
    	const replacementValue = JSON.stringify(eval(original));
    	deob = deob.replace(original, replacementValue);
    	match = deob.match(referencesRegex);
    }
    console.log(deob);

So just to complicate matters (because why not?), I’ve moved the dec function to the bottom of the code, and overwritten the array and dec function after execution:

    var arr = ['d29ybGQ=', 'IA==', 'aGVsbG8=', 'bG9n'];
    (function(a, b) {a = a.reverse();})(arr, 7);
    console[dec('vxs', 57)](dec('ap3', 58) + dec('3j;', 59) + dec('aaa', 60));
    arr.length = 0;
    dec = () => {};
    function dec(a, b) {
    	debugger;
    	return atob(arr[b - 57]);
    }
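Why does this break the naive approach? Once the whole script has run, the decoder and its array are no longer usable. A quick way to see it (with the debugger line omitted and console silenced, both my own simplifications):

```javascript
const code = `var arr = ['d29ybGQ=', 'IA==', 'aGVsbG8=', 'bG9n'];
(function(a, b) {a = a.reverse();})(arr, 7);
console[dec('vxs', 57)](dec('ap3', 58) + dec('3j;', 59) + dec('aaa', 60));
arr.length = 0;
dec = () => {};
function dec(a, b) {
  return atob(arr[b - 57]);
}`;

// Run the whole script with a silent console, then hand back what is left of dec and arr.
const silentConsole = {log: () => {}};
const {dec, arr} = new Function('console', `${code}\nreturn {dec, arr};`)(silentConsole);

console.log(dec('vxs', 57)); // undefined: dec has been replaced by a no-op
console.log(arr.length);     // 0: the array has been emptied
```

The function declaration is hoisted, so the calls on the third line work during execution, but anything evaluated afterwards only sees the overwritten versions.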

I can write a more complicated search and replace script that extracts just the array and function and then deobfuscates the strings, but that isn’t scalable. So now what?

The Better Way (Take 2) - Generalizing the Solution: Going Off Script

Let me jump straight into the deep end of the pool by walking you through building a deobfuscator for this type of obfuscation. It’s going to be a basic solution, and I’m going to mention the shortcomings of the specific implementation at the end of this part, so write down whatever it is you think you’d do differently, and let me know if I missed anything.

Just a quick reminder - I’m going to be using the flAST package to flatten the AST to make it easier to find code structures that are referencing each other, and replace them with the deobfuscated strings.

Let’s start with the array. In this short script we’re deobfuscating, there’s only one array, so it’s simple, but let’s pretend there might be more than one. How would we find the specific array we’re looking for? We’ll first describe it and then write it in code.

We’re looking for:

  • A variable declarator initialized with an array expression.
  • The array expression is not empty, and contains only strings (literals).
  • The array is referenced in a function (i.e. not the global scope).

Using flAST we can look at an identifier’s references and their scope:

    const relevantArrays = ast.filter(n =>
      n.type === 'VariableDeclarator' &&
      n?.init?.type === 'ArrayExpression' &&
      n.init.elements.length &&   // Is not empty.
      // Contains only literals.
      !n.init.elements.filter(e => e.type !== 'Literal').length &&
      // Used in another scope other than global.
      n.id?.references?.filter(r => r.scope.scopeId > 0).length);

This description is quite generic. We can test how well it describes the structure we’re looking for by appending our obfuscated code to an unobfuscated version of a large script (like this arbitrary jQuery script) and counting how many arrays match the description.

In this case, three arrays matched:

    re = ["Top","Right","Bottom","Left"];		    // 1
    Ue = ["Webkit","Moz","ms"];				        // 2
    arr = ['d29ybGQ=', 'IA==', 'aGVsbG8=', 'bG9n'];	// 3

So, we are generic, but not too generic. We can use the relationship between the array and the dec function to home in on the correct one. What we’re looking for is:

  • A function declaration with exactly one reference to the relevant array.
  • Has at least as many references to it as there are items in the array (the code below approximates this by comparing against the number of references to the array).

In code, this description looks like this:

    const decodingFunction = ast.find(n =>
      n.type === 'FunctionDeclaration' &&
      n.scope.references.filter(r => r?.identifier?.declNode?.nodeId === arrId.nodeId).length === 1 &&
      n.id?.references?.length >= arrId.references.length);

So, for each of the arrays we found, we can look for a decoding function that matches the description until we find a match.

Now, all that’s left is finding the augmenting function by looking for a call expression where the callee is a function expression (i.e. a function declared in-place, rather than an identifier or an object) and our array as one of its arguments. We don’t even need to search the entire code for that description, just the references to our array:

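    arr.id.references.find(n =>
      n.parentNode.type === 'CallExpression' &&
      n.parentNode.callee?.type === 'FunctionExpression' &&   // The callee is a function declared in-place.
      n.parentNode?.arguments?.filter(a => a.nodeId === n.nodeId)?.length)?.parentNode;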

Putting it all together, it makes more sense to me to place each description in its own function, and all functions in a single class and share the parsed AST rather than pass it every time.

    const {generateFlatAST, generateCode, Arborist} = require('flast');
    
    class Deob {
     constructor(code) {
       this.code = code;
       this.ast = generateFlatAST(code);
       this.arborist = null;
       this.modified = false;
     }
    
     _getRelevantArrays() {
       // Get all arrays which contain only strings (literals) which have at least 1 reference used inside
       // an inner scope - i.e. a scope other than the global scope.
       return this.ast.filter(n =>
         n.type === 'VariableDeclarator' &&
         n?.init?.type === 'ArrayExpression' &&
         n.init.elements.length &&   // Is not empty.
         !n.init.elements.filter(e => e.type !== 'Literal').length &&  // Contains only literals.
         n.id?.references?.filter(r => r.scope.scopeId > 0).length);   // Used in another scope other than global.
     }
    
     _getDecodingFunction(arrId) {
       // Find a function declaration which references the array,
       // and at least as many references as there are literals in the array.
       return this.ast.find(n =>
         n.type === 'FunctionDeclaration' &&
         n.scope.references.filter(r => r?.identifier?.declNode?.nodeId === arrId.nodeId).length === 1 &&
         n.id?.references?.length >= arrId.references.length);
     }
    
     _getAugmentingFunction(arrId) {
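       return arrId.references.find(n =>
         n.parentNode.type === 'CallExpression' &&
         n.parentNode.callee?.type === 'FunctionExpression' &&   // The callee is a function declared in-place.
         n.parentNode?.arguments?.filter(a => a.nodeId === n.nodeId)?.length)?.parentNode;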
     }
    
     deobfuscate() {
       // The main logic
       const arrayCandidates = this._getRelevantArrays();
       let arr, decFunc;
       for (const arrCandidate of arrayCandidates) {
         decFunc = this._getDecodingFunction(arrCandidate.id);
         if (decFunc) {
           arr = arrCandidate;
           break;
         }
       }
       if (decFunc) {
         let context = arr.parentNode.src + ';\n';
         const augmentingIife = this._getAugmentingFunction(arr.id);
         if (augmentingIife) context += augmentingIife.src + ';\n';
         context += decFunc.src + ';\n';
         this.arborist = new Arborist(this.ast);
         decFunc.id.references.map(n => n.parentNode).forEach(n => {
           if (n.type === 'CallExpression') {
             const newValue = eval(context + n.src);
             this.arborist.markNode(n, {
               type: 'Literal',
               value: newValue,
               raw: JSON.stringify(newValue),   // Escapes quotes and backslashes in the value.
             });
           }
         });
         this.modified = this.arborist.applyChanges() > 0;
       }
     }
    }
    
    const code = `var arr = ['d29ybGQ=', 'IA==', 'aGVsbG8=', 'bG9n'];
    (function(a, b) {a = a.reverse();})(arr, 7);
    console[dec('vxs', 57)](dec('ap3', 58) + dec('3j;', 59) + dec('aaa', 60));
    arr = [];
    dec = () => {};
    function dec(a, b) {
     debugger;
     return atob(arr[b - 57]);
    }`;
    const deob = new Deob(code);
    deob.deobfuscate();
    if (deob.modified) console.log(generateCode(deob.arborist.ast[0]));
    else console.log('Unable to deobfuscate ¯\\_(ツ)_/¯');

You can replace the inline code with the injected jQuery code and test it out.

All in all, it wasn’t that much coding. This solution is still lacking, though, since it does not cover all possible scenarios. Here’s a partial list of scenarios that are not covered:

  • There might be other arrays in the code that match the same description.
  • There might be more than one augmenting function.
  • The decoding function could use a browser API (like checking the location.href), which wouldn’t work in our NodeJS environment out of the box.
  • There might be more than one obfuscated section in the code.
  • The array might be defined in one line and populated in another.
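As an example of handling one of these gaps, the browser API case can sometimes be worked around by handing the evaluated code a stand-in for the missing global. This is a sketch with a made-up decoding function and a hypothetical evalWithStubs helper; a stub only helps when you know which properties the decoder reads.

```javascript
// Made-up decoding function that depends on a browser-only global.
const context = `var arr = ['bG9n', 'aGVsbG8='];
function dec(a, b) {
  if (!location.href.includes('checkout')) return '';
  return atob(arr[b - 57]);
}`;

// Minimal stand-in for the browser global: only the property the decoder reads.
const fakeLocation = {href: 'https://shop.example/checkout'};

// Hypothetical helper: each stub becomes a parameter that shadows the (missing) global.
function evalWithStubs(contextSrc, callSrc, stubs) {
  const names = Object.keys(stubs);
  const fn = new Function(...names, `${contextSrc}\nreturn ${callSrc};`);
  return fn(...names.map(name => stubs[name]));
}

const value = evalWithStubs(context, `dec('ap3', 58)`, {location: fakeLocation});
console.log(value);
```

For anything more involved than a property or two, a simulated browser environment is the more realistic route.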

What I hope I managed to convey is how to use flAST to create concise and effective code that speaks in node structures and relationships.

The Journey So Far

What I’ve shared until now is how my deobfuscation journey started. What does it look like now?

You can check it out for yourself!

I’ve narrowed down some of the obfuscation descriptions in my Obfuscation Detector, and for some of those obfuscation types, REstringer provides specific pre-processors to remove traps or simplify the code before moving to more generic deobfuscation attempts.

For example, the Caesar+ obfuscation, which I broke down in Deobfuscating Caesar+, has an outer layer which acts as a sort of packer that concatenates the packed code string using HTML elements. This method is used to prevent running the code outside of a browser environment. In order to automate the unpacking in a NodeJS environment, a simulated browser environment is used via jsdom.

The main deobfuscation logic is laid out in the main deobfuscate method:

    deobfuscate(clean = false) {
      this.determineObfuscationType();
      this._runProcessors(this._preprocessors);
      this._loopSafeAndUnsafeDeobfuscationMethods();
      this._runProcessors(this._postprocessors);
      if (this.normalize) this._normalizeScript();
      if (clean) this.runLoop([this.removeDeadNodes]);
      return this.modified;
    }

The first thing is to identify the obfuscation type, run the appropriate preprocessors and set up any relevant post-processors. Then, the main deobfuscation methods run in a loop until they are no longer effective. Post-processors are then applied to the now deobfuscated code.

Finally, the code is normalized (e.g. turning bracket['notation'] into dot.notation) and stripped of any dead code if that option is selected.
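To give a feel for the bracket-to-dot normalization, here is a regex-based approximation. It is not how REstringer does it (an AST-based tool would inspect member expression nodes rather than text), and bracketToDot is a hypothetical name. It assumes property names are plain string literals without escapes.

```javascript
function bracketToDot(src) {
  // The lookbehind requires something that can own a property right before the bracket,
  // so standalone array literals like ['keep'] are left alone.
  // Only names that are valid identifiers are converted.
  return src.replace(/(?<=[\w$)\]])\[(['"])([A-Za-z_$][\w$]*)\1\]/g, '.$2');
}

const input = `console['log'](window["location"]['href'], data['not-valid'], arr[0], ['keep']);`;
console.log(bracketToDot(input));
```

Properties such as 'not-valid' stay in bracket notation because they can't be written after a dot.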

The terms safe and unsafe indicate whether or not a method uses eval to resolve a code’s output. Using eval is considered unsafe, even when running it in a sandbox environment like vm2.

    _loopSafeAndUnsafeDeobfuscationMethods() {
      let modified;
      do {
          this.modified = false;
          this.runLoop(this._safeDeobfuscationMethods());
          this.runLoop(this._unsafeDeobfuscationMethods(), true);
          if (this.modified) modified = true;
      } while (this.modified);
      this.modified = modified;
    }

The way this loop is constructed, the safe methods run until they exhaust themselves, and then the unsafe methods run once. The cycle is then repeated. This is done in order to avoid running overly eager evals on code which can be deobfuscated using other methods, which produce more accurate results.
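To illustrate the difference between the two kinds of methods, here is the same expression (759 - 748) resolved both ways. This is a toy written for this post, not REstringer’s implementation: the node is a hand-written ESTree-shaped object, and resolveSafely and resolveUnsafely are hypothetical names.

```javascript
// ESTree-shaped node for the expression 759 - 748, written by hand for illustration.
const node = {
  type: 'BinaryExpression',
  operator: '-',
  left: {type: 'Literal', value: 759},
  right: {type: 'Literal', value: 748},
};

// "Safe": compute the result from the literal operands, without executing any source text.
const operators = new Map([
  ['+', (a, b) => a + b],
  ['-', (a, b) => a - b],
  ['*', (a, b) => a * b],
]);

function resolveSafely(n) {
  if (n.type !== 'BinaryExpression') return undefined;
  if (n.left.type !== 'Literal' || n.right.type !== 'Literal') return undefined;
  const apply = operators.get(n.operator);
  return apply ? apply(n.left.value, n.right.value) : undefined;
}

// "Unsafe": hand source text to eval and trust whatever comes back.
function resolveUnsafely(src) {
  return eval(src);
}

console.log(resolveSafely(node), resolveUnsafely('759 - 748'));
```

The safe version can only ever do what its operator table allows, while the unsafe version runs whatever it is given, which is why it’s worth exhausting the safe options first.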

The runLoop method takes an array of deobfuscation methods and… well, runs them in a loop (surprised, right?). It takes a “snapshot” of the code before running and compares it after each loop. If the snapshot matches the current state of the code, it means none of the methods have made any changes, and the loop can end. A maximum number of cycles is also observed to avoid an endless loop scenario. Here’s a cleaned up version of this method:

    runLoop(targetMethods, runOnce = false) {
      let scriptSnapshot = '';
      try {
        this._ast = generateFlatAST(this.script);
        while (scriptSnapshot !== this.script && this.cyclesCounter < this.maxCycles) {
          this._arborist = new Arborist(this._ast);
          scriptSnapshot = this.script;
          for (const func of targetMethods) {
            try {
              func.bind(this)();
            } catch (e) {
              // Error logging...
            }
          }
          const changesMade = this._arborist.applyChanges() || 0;
          if (changesMade) {
            this._ast = this._arborist.ast;
            this.script = generateCode(this._ast[0]);
            if (this.script !== scriptSnapshot) this.modified = true;
          }
          this.cyclesCounter++;
          if (runOnce) break;
        }
      } catch (e) {
        // Error logging...
      }
    }
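Stripped of the AST handling, the snapshot idea boils down to a fixed-point loop. Here is a toy version that works on plain strings; runUntilStable and the two transforms are hypothetical and exist only to show the stopping conditions.

```javascript
function runUntilStable(script, transforms, maxCycles = 10) {
  let snapshot = '';
  let cycles = 0;
  // Stop when a full pass changes nothing, or when the cycle budget runs out.
  while (snapshot !== script && cycles < maxCycles) {
    snapshot = script;
    for (const transform of transforms) {
      try {
        script = transform(script);
      } catch (e) {
        // A failing transform is skipped; the others still get their turn.
      }
    }
    cycles++;
  }
  return {script, cycles};
}

// Toy transforms: fold one addition of two numbers per pass, and unwrap a parenthesized number.
const foldAddition = s => s.replace(/\b(\d+) \+ (\d+)\b/, (m, a, b) => String(Number(a) + Number(b)));
const unwrapParens = s => s.replace(/\((\d+)\)/g, '$1');

const result = runUntilStable('x = ((1 + 2) + 3) + 4;', [foldAddition, unwrapParens]);
console.log(result);
```

Each pass enables the next one, and the final pass exists only to confirm that nothing changes anymore.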

This description wouldn’t be complete without a list of currently available safe and unsafe methods:

    _safeDeobfuscationMethods() {
      return [
        this._normalizeEmptyStatements,
        this._removeNestedBlockStatements,
        this._removeRedundantLogicalExpressions,
        this._resolveMemberExpressionReferencesToArrayIndex,
        this._resolveMemberExpressionsWithDirectAssignment,
        this._resolveDefiniteBinaryExpressions,
        this._parseTemplateLiteralsIntoStringLiterals,
        this._resolveDeterministicIfStatements,
        this._unwrapFunctionShells,
        this._replaceFunctionShellsWithWrappedValue,
        this._replaceCallExpressionsWithUnwrappedIdentifier,
        this._replaceEvalCallsWithLiteralContent,
        this._replaceIdentifierWithFixedAssignedValue,
        this._replaceIdentifierWithFixedValueNotAssignedAtDeclaration,
        this._replaceReferencedProxy,
      ];
    }
    
    _unsafeDeobfuscationMethods() {
      return [
        this._resolveAugmentedFunctionWrappedArrayReplacements,
        this._resolveMemberExpressionsLocalReferences,
        this._resolveDefiniteMemberExpressions,
        this._resolveLocalCalls,
        this._resolveBuiltinCalls,
        this._resolveDeterministicConditionalExpressions,
        this._resolveInjectedPrototypeMethodCalls,
        this._resolveEvalCallsOnNonLiterals,
        this._resolveFunctionConstructorCalls,
      ];
    }

I’d love to go into more details in the future about how I implemented the eval mechanism, how the caching works, choices I made regarding what can and cannot be deobfuscated at any given time, or how I collect context recursively.

There is a lot more documentation and comments in the repo.

Where Do We Go? Where Do We Go Now?

This project has been with me since my starting days at PerimeterX, and I’m really grateful for being given the opportunity and time to work on such an interesting long term project.

REstringer has been coded, refactored, re-written, shredded, glued back together, translated, renamed, lost, found, subjected to public inquiry, queried, lost again, and finally approved to be released as open source. What a journey!

REstringer is great, but it’s far from being perfect or complete. I’m going to keep at it, and I hope it’ll be useful to other researchers and obfuscation enthusiasts, all of whom are welcome to use and contribute as they see fit.

Thanks for reading! I hope we all meet again in REstringer 2: The Search for More Obfuscation, and may the Schwartz be with you!