How to validate a numeric expression with a regular expression?

In my PHP 7 application I need to validate a numeric expression with a regular expression. I almost managed it, but I ran into a problem with non-significant (leading) zeros in the expression.

I can validate the numeric expression below:

10 + ( 10 * 10 ) - 20

Using the following regular expression:

$cRegex  = '/^' ;
//          '|------|----|-----|---|-----|----|---|
$cRegex .= '([-+\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?)?' ;
//          '|---------|----|----|-----|---|-----|----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?[\)]?)?' ; 
//          '|---------|----|----|-----|---|-----|----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?[\)]?)?' ; 
//          '|---------|----|----|-----|---|-----|----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?[0-9]+[.]?[0-9]+[\)]?[\)]?[\)]?)?' ; 

$cRegex .= '$/' ;

My problem is with a numeric expression such as 690/09: the regex returns true, but it should return false, because the correct form would be 690/9. In PHP 7, the "09" in the numeric expression causes a problem.

So I'm asking for help to improve my regular expression so I can detect it.


I managed to solve it as follows:

$cRegex  = '/^' ;
//          '|---------|----|---------------|---|-----|----|---|
$cRegex .= '([-+*\/\(]?[\(]?([1-9]{1}[0-9]+)[.]?[0-9]+[\)]?[\)]?)' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
//          '|---------|----|--------------|--------|---|--------|----|---|
$cRegex .= '([-+*\/\(]?[\(]?[\(]?([1-9]{1})([0-9]+)?[.]?([0-9]+)?[\)]?[\)]?)?' ; 
$cRegex .= '$/' ;

I tested with the following numeric expressions:

  • +(690.91)+1 - validated
  • +(690/2)+20.01+(10*3)-010 - correctly rejected
  • +(690/2)+20.01+(10*3)-10 - validated
  • 690 - validated
  • 690/09 - correctly rejected
  • 690/9 - validated

Note: I removed the blanks from the expressions.

Author: hkotsubo, 2019-05-16

1 answer

Since you said that the expression can be "anything", I am assuming that it can have more than one pair of nested parentheses, for example:

690.91+(1.3*(4-7/(3+6.4)))

Your regex can't detect this, because of the nested parentheses. In fact, it has other problems as well. For example, you worried so much about validating the more complex cases (like +(690.91)) that you ended up missing the simpler ones, like 1+1.

Some details about your regex:

  • [\(] is the same as \( (which, by the way, is the same as [(], since inside brackets many characters do not need to be escaped with \). Anyway, if you want to match only one character, you don't need the brackets, so to check for an opening parenthesis, just use \(.
    • the same goes for [.], which can simply be written as \.
    • the brackets are useful when there is more than one possible character (e.g. [ab] means "the letter a or the letter b"), but when you only want to match one specific character, they are unnecessary
  • the quantifier {1} means "exactly one occurrence", but by default anything you put in a regex already matches exactly one occurrence. So [1-9]{1} is the same as [1-9].
  • in the first part of the regex you used ([1-9]{1}[0-9]+)[.]?[0-9]+ (a digit from 1 to 9, followed by one or more digits from 0 to 9, followed by an optional dot, followed by one or more digits from 0 to 9). That is, this part only validates numbers that have at least 3 digits (if there is no dot), or at least 2 digits before the dot. That's why the regex doesn't validate 1+1.
  • you made both the opening and the closing parentheses optional. This means that your regex accepts expressions that have an opening parenthesis without the corresponding closing one, or a closing parenthesis without the corresponding opening one.
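
The points above can be checked with a short script. This is only a sketch: the helper name matchesFirstGroup is made up for illustration, and the pattern inside it is the numeric part of the first group from the question with the redundant brackets and {1} removed.

```php
<?php
// Three equivalent ways to match a single literal "(": each line prints 1.
foreach (['/^[\(]$/', '/^\($/', '/^[(]$/'] as $re) {
    echo $re, ' => ', preg_match($re, '('), PHP_EOL;
}

// Hypothetical helper: the numeric part of the question's first group,
// ([1-9]{1}[0-9]+)[.]?[0-9]+, written without the redundant brackets and {1}.
function matchesFirstGroup(string $s): bool
{
    return preg_match('/^[1-9][0-9]+\.?[0-9]+$/', $s) === 1;
}

// Short numbers are rejected: 3 digits are needed without a dot,
// and 2 digits are needed before the dot.
foreach (['1', '12', '123', '1.5', '12.5'] as $n) {
    echo $n, ' => ', (matchesFirstGroup($n) ? 'ok' : 'nok'), PHP_EOL;
}
```

Only 123 and 12.5 pass, which is why an operand as simple as the 1 in 1+1 can never be accepted by that first group.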

Correcting / improving...

For numbers, you could use something like -?(?=[1-9]|0(?!\d))\d+(?:\.\d+)?.

It starts with an optional minus sign (-?). Then we have a lookahead, the part inside (?=...), which checks whether what comes next is a digit from 1 to 9 or a zero that is not followed by another digit (the | means "or", and 0(?!\d) ensures that there is no digit right after the zero). So the expression can have the number zero alone (0), but it cannot have 09, for example.

Then we have \d+ (one or more digits), optionally followed by a dot and more digits (so we can have 10 and 10.1).

If you want the regex to also accept numbers like .123 (which is another way of writing 0.123), just switch to -?(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+). This accepts numbers in the form already explained above, or a dot followed by one or more digits.
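
As a quick check of the two number patterns, here is a minimal sketch. The function name isValidNumber and its $allowBareFraction parameter are assumptions made for this example; the ^ and $ anchors are added so that each string is tested as a whole.

```php
<?php
// Hypothetical helper: true when $s is a number without non-significant
// leading zeros. With $allowBareFraction, forms like ".123" are accepted too.
function isValidNumber(string $s, bool $allowBareFraction = false): bool
{
    $strict = '/^-?(?=[1-9]|0(?!\d))\d+(?:\.\d+)?$/';
    $loose  = '/^-?(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)$/';

    return preg_match($allowBareFraction ? $loose : $strict, $s) === 1;
}

foreach (['0', '9', '09', '10', '10.1', '0.5', '-5', '.123'] as $n) {
    echo $n, ' => ', (isValidNumber($n) ? 'ok' : 'nok'),
        ' / ', (isValidNumber($n, true) ? 'ok' : 'nok'), PHP_EOL;
}
```

The first column uses the stricter pattern and the second the variant that also accepts a bare fraction, so .123 is the only input where the two differ.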


Then, for the arithmetic expression itself, it is not enough to chain several sub-expressions and make them optional. It is necessary to check, among other things, whether the parentheses are balanced (for each opening parenthesis there is a corresponding closing one).

The other problem is that the expression can have several levels of nested parentheses, so it would not be enough to do as you did (several different options in a row), since the possibilities are many: expressions without parentheses, with 1 pair of parentheses in each operand, with several pairs nested in each operand, etc.

Not to mention that your regex limits the expression to only 6 operands (e.g. +690.91+1-2-1-3-4). If we add one more (like +690.91+1-2-1-3-4-1), it is no longer validated. (This particular case could be solved by replacing the ? at the end of each part with * (zero or more occurrences), but that still would not solve the other problems already mentioned.)

The solution in this case is to use recursive patterns and subroutines (the regex below was deliberately copied from another source and adapted to this case):

$regex = '{
    (?(DEFINE)
       (?<number>    (?: -? (?: (?= [1-9]|0(?!\d) ) \d+ (?:\.\d+)? ) | \.\d+ ) )
       (?<sign>      (?: [-+] ))
       (?<expr>      (?: (?&term) (?: \s* [-+] \s* (?&expr))? ))
       (?<term>      (?: (?&factor) (?: \s* [/*] \s* (?&term))? ))
       (?<factor>    (?: (?&number) |
                         (?&sign) \s* (?&factor) |
                        \( \s* (?&expr) \s* \)))
    )
    ^ \s* (?&expr) \s* $
}x';

This regex is quite complex. The first part (inside the (?(DEFINE) ... ) block) defines subroutines. Basically, named "sub-expressions" are created. The syntax (?<name> ... ) defines a subroutine, and the syntax (?&name) is replaced by the corresponding regex.

For example, the first subroutine is called "number" (its definition is the part delimited by (?<number> ... )), and it corresponds to the regex that checks a number (the same one already mentioned above). Then in the other subroutines we see the use of (?&number), which is replaced by the corresponding regex.
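
To see the DEFINE mechanism in isolation, here is a much smaller sketch with a single subroutine. The function name isSimpleSum and the simplified "number" rule are assumptions made for this example; they are not part of the regex above.

```php
<?php
// Hypothetical helper: accepts exactly "number+number", where the rule for
// "number" is written once in DEFINE and then called twice with (?&number).
function isSimpleSum(string $s): bool
{
    $regex = '{
        (?(DEFINE)
            (?<number> \d+ (?:\.\d+)? )   # defined here, matches nothing by itself
        )
        ^ (?&number) \+ (?&number) $      # the subroutine is used on both sides
    }x';

    return preg_match($regex, $s) === 1;
}

foreach (['1+2.5', '1.5+2', '1+', 'a+1'] as $s) {
    echo $s, ' => ', (isSimpleSum($s) ? 'ok' : 'nok'), PHP_EOL;
}
```

The DEFINE block on its own consumes no input; only the last line of the pattern is actually matched against the string.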

Then we have the subroutine "sign", which matches the sign ([-+], a minus or plus sign). Next we define the subroutines "factor", "term" and "expr":

  • an "expr" can be a "term" alone, or a "term" added to/subtracted from another "expr"
  • a "term" can be a "factor" alone, or a "factor" multiplied/divided by another "term"
  • a "factor" can be a "number", or a "factor" with a "sign" before it, or an "expr" in parentheses

Notice that the structure is recursive (so regex can check for multiple nested parentheses and expressions of any size). And in several places I use \s* (zero or more spaces), so the regex allows there to be spaces in the expression.
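
The recursion itself can be illustrated with a smaller pattern that only checks whether the parentheses are balanced. This is a sketch under that narrower goal: the names hasBalancedParens and "group" are made up for the example, and it says nothing about whether the arithmetic is valid.

```php
<?php
// Hypothetical helper: true when every "(" in $s has a matching ")".
// The subroutine "group" calls itself, so any nesting depth is handled.
function hasBalancedParens(string $s): bool
{
    $regex = '{
        (?(DEFINE)
            (?<group> \( (?: [^()]++ | (?&group) )* \) )
        )
        ^ (?: [^()]++ | (?&group) )* $
    }x';

    return preg_match($regex, $s) === 1;
}

foreach (['1+1', '(1+(2*3))', '+(690.91+1', '1+2)'] as $s) {
    echo $s, ' => ', (hasBalancedParens($s) ? 'ok' : 'nok'), PHP_EOL;
}
```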

After the DEFINE block comes the regex proper: ^ \s* (?&expr) \s* $. The markers ^ and $ are, respectively, the beginning and end of the string. Then we have optional spaces at the beginning and end, and between them we have the expression.

Another important point is that I use the modifier x (at the end of the string), as this causes the regex to ignore line breaks and whitespace. This allows you to write it the way above (with multiple spaces and line breaks, making it more organized and a little easier to read). If I didn't use x, all of the above regex would have to be written in a single line, and without spaces - which would make it even harder to read and understand.

Another detail is that instead of delimiting the regex with /, I used braces ({}). This way the slashes inside the regex do not need to be written as \/ (admittedly there is only one slash in this regex, but personally I prefer to minimize the number of \ whenever possible).

Testing the regex:
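
The effect of the delimiter choice and of the x modifier can be seen in a small comparison. Both function names below are hypothetical, and the pattern (digits, a slash, digits) is only a toy example.

```php
<?php
// Hypothetical helper: "/" is the delimiter, so the literal slash
// inside the pattern has to be escaped as \/
function isDivisionWithSlashDelimiter(string $s): bool
{
    return preg_match('/^\d+\/\d+$/', $s) === 1;
}

// Hypothetical helper: "{" and "}" are the delimiters, so "/" needs no
// escape; the x modifier makes the spaces in the pattern insignificant.
function isDivisionWithBraceDelimiter(string $s): bool
{
    return preg_match('{ ^ \d+ / \d+ $ }x', $s) === 1;
}

foreach (['690/9', '690 / 9', '690/'] as $s) {
    echo $s, ' => ', (isDivisionWithSlashDelimiter($s) ? 'ok' : 'nok'),
        ' / ', (isDivisionWithBraceDelimiter($s) ? 'ok' : 'nok'), PHP_EOL;
}
```

Both versions behave the same. Note that with x the spaces written in the pattern do not match spaces in the input, which is why the larger regex uses \s* explicitly.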

$list = array('+(690.91)+1', '+(690/2)+20.01+(10*3)-010', '+(690/2)+20.01+(10*3)-10', '690',
  '690/09', '690/9', '1+1', '690 + 1', '10+(10*10)-20', '690.91+(1.3*(4-7/(3+6.4)))',
  '690.91+(01.3*(4-7/(3+6.4)))', '.24+3', '+(690.91+1', '+690.91+1-2-1-3-4-1',
  '690.91+(1.3*(4-7/(3+6.4)))/(-1.3*4/(3.2-(1/7.5)))');
foreach ($list as $exp) {
    echo $exp. '='. (preg_match($regex, $exp) ? 'ok' : 'nok'), PHP_EOL;
}

The results agree with your examples, with the bonus of handling the cases that your regex cannot (nested parentheses, 1+1, expressions with spaces, etc.):

+(690.91)+1=ok
+(690/2)+20.01+(10*3)-010=nok
+(690/2)+20.01+(10*3)-10=ok
690=ok
690/09=nok
690/9=ok
1+1=ok
690 + 1=ok
10+(10*10)-20=ok
690.91+(1.3*(4-7/(3+6.4)))=ok
690.91+(01.3*(4-7/(3+6.4)))=nok
.24+3=ok
+(690.91+1=nok
+690.91+1-2-1-3-4-1=ok
690.91+(1.3*(4-7/(3+6.4)))/(-1.3*4/(3.2-(1/7.5)))=ok

But perhaps regex is not the best solution for your case. Have you looked at a parser specific to arithmetic expressions? Despite being cool, regex is not always the best solution. It is also worth remembering that the regex only validates the expression, but does not calculate its value (and for that, a better solution would be to use specific functions/APIs).
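
To give an idea of what a dedicated parser could look like, here is a minimal recursive-descent sketch that follows the same expr/term/factor structure and computes the value while validating. It is an illustration, not a specific library: the class name ArithmeticParser and all its methods are made up for this example, and it assumes that rejecting division by zero is the desired behaviour.

```php
<?php
final class ArithmeticParser
{
    private $src;
    private $pos = 0;

    private function __construct(string $src)
    {
        $this->src = $src;
    }

    // Validates and evaluates $expression; throws InvalidArgumentException if it is malformed.
    public static function evaluate(string $expression): float
    {
        $parser = new self($expression);
        $value = $parser->expr();
        $parser->skipSpaces();
        if ($parser->pos < strlen($parser->src)) {
            throw new InvalidArgumentException('Unexpected input at offset ' . $parser->pos);
        }
        return $value;
    }

    private function skipSpaces()
    {
        $this->pos += strspn($this->src, " \t\r\n", $this->pos);
    }

    // Consumes one of the characters in $chars (after optional spaces), or returns null.
    private function accept(string $chars)
    {
        $this->skipSpaces();
        $c = $this->src[$this->pos] ?? '';
        if ($c !== '' && strpos($chars, $c) !== false) {
            $this->pos++;
            return $c;
        }
        return null;
    }

    // expr := term (('+' | '-') term)*
    private function expr(): float
    {
        $value = $this->term();
        while (($op = $this->accept('+-')) !== null) {
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): float
    {
        $value = $this->factor();
        while (($op = $this->accept('*/')) !== null) {
            $rhs = $this->factor();
            if ($op === '/' && $rhs == 0) {
                throw new InvalidArgumentException('Division by zero');
            }
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    // factor := number | ('+' | '-') factor | '(' expr ')'
    private function factor(): float
    {
        if (($sign = $this->accept('+-')) !== null) {
            $value = $this->factor();
            return $sign === '-' ? -$value : $value;
        }
        if ($this->accept('(') !== null) {
            $value = $this->expr();
            if ($this->accept(')') === null) {
                throw new InvalidArgumentException('Missing ")" at offset ' . $this->pos);
            }
            return $value;
        }
        // Same number rule as the regex: no leading zeros such as 09; .24 is allowed.
        $this->skipSpaces();
        if (!preg_match('/\G(?:(?=[1-9]|0(?!\d))\d+(?:\.\d+)?|\.\d+)/', $this->src, $m, 0, $this->pos)) {
            throw new InvalidArgumentException('Invalid number at offset ' . $this->pos);
        }
        $this->pos += strlen($m[0]);
        return (float) $m[0];
    }
}
```

With this sketch, ArithmeticParser::evaluate('10+(10*10)-20') returns 90.0, while ArithmeticParser::evaluate('690/09') throws an InvalidArgumentException. Unlike the regex grammar, the loops in expr() and term() group operators from left to right, which matters once the value is computed (8-3-2 gives 3, not 7).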

Author: hkotsubo, 2019-05-18 19:30:17