Make literal and regex more flexible wrt whitespace #25

Open
@inkytonik

At the moment the decision of whether or not to skip whitespace in literal and regex (RegexParsers.scala) is made on the basis of whether the skipWhitespace method returns true or not. Thus, the same decision is used for all occurrences of literal and regex in a parsing module.

Sometimes it is convenient to skip whitespace on most occasions, but not in one or two places. The current design of literal and regex makes this impossible, I think, without duplicating some internal details of RegexParsers in the client code.

E.g., suppose we are parsing names that consist of an alphabetic part, followed by a numeric part, and we want to use the two parts separately after parsing. We could parse a whole name with one regex parser ("[A-Z]+[0-9]+".r) and then post-process the whole name to extract the two parts.
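
For concreteness, a minimal sketch of that single-regex approach, assuming the scala-parser-combinators library is on the classpath. The object name WholeNameParsers and the use of span for the post-processing step are illustrative choices, not something taken from the library.

```scala
import scala.util.parsing.combinator.RegexParsers

object WholeNameParsers extends RegexParsers {
  // Match the whole name with one regex, then post-process the matched
  // string by splitting it at the first character that is not a letter.
  val name: Parser[(String, String)] =
    "[A-Z]+[0-9]+".r ^^ { s => s.span(_.isLetter) }
}
```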

However, it is conceptually cleaner to recognise the two pieces with separate regexes so the two parts are delivered separately in one step without post-processing being required. Unfortunately, the obvious parser

"[A-Z]+".r ~ "[0-9]+".r

does not work because the second parser will skip whitespace (assuming that we haven't altered the default behaviour), so it also accepts names with whitespace between the two parts. We can't just turn off whitespace processing in the module since we want the first parser here (and probably many others) to skip whitespace.
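
A small sketch of the problem, again assuming scala-parser-combinators is available and that skipWhitespace is left at its default. The object name NameParsers is made up for the example.

```scala
import scala.util.parsing.combinator.RegexParsers

object NameParsers extends RegexParsers {
  // skipWhitespace is not overridden, so each regex parser skips any
  // leading whitespace before it tries to match.
  val name: Parser[String ~ String] = "[A-Z]+".r ~ "[0-9]+".r

  def main(args: Array[String]): Unit = {
    // The intended input parses.
    println(parseAll(name, "ABC123").successful)
    // So does this one, because the second parser skips the space.
    println(parseAll(name, "ABC 123").successful)
  }
}
```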

One solution is to create a version of regex (and similarly, literal) that does not perform whitespace handling and use that for the second parser above. However, this cannot be done in user code easily since we would have to duplicate elements of the RegexParsers implementation such as most of the regex method and the (private) class SubSequence.

A better approach would seem to be to extend the existing regex and literal methods to have an optional Boolean argument that specifies for a particular call whether the module-wide whitespace handling should be performed or not. Then the above spec would be something like

"[A-Z]+".r ~ regex ("[0-9]+".r, handleWhiteSpace = false)

which is verbose but does the trick. The verbosity should not be a problem since it can be hidden behind another name, and this situation is rare anyway.
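
To make the idea more concrete, here is a rough user-code sketch of what such a method might look like. It is not the library's implementation: the names FlexibleRegexParsers, regexWith, StrictNameParsers and digits are hypothetical, and the sketch uses CharSequence.subSequence in place of the private SubSequence class, which may copy the remaining input on each call. It does show the kind of duplication of the regex method mentioned above.

```scala
import scala.util.matching.Regex
import scala.util.parsing.combinator.RegexParsers

trait FlexibleRegexParsers extends RegexParsers {
  // Hypothetical helper, not part of the library: like regex, but the
  // caller decides per call whether the module-wide whitespace handling
  // is applied before matching.
  def regexWith(r: Regex, handleWhiteSpace: Boolean = true): Parser[String] =
    Parser { in =>
      val source = in.source
      val offset = in.offset
      // Skip leading whitespace only if this call asks for it.
      val start =
        if (handleWhiteSpace) this.handleWhiteSpace(source, offset) else offset
      r.findPrefixOf(source.subSequence(start, source.length)) match {
        case Some(matched) =>
          // Succeed with the matched text and advance past it.
          success(matched)(in.drop(start + matched.length - offset))
        case None =>
          failure(s"string matching regex '$r' expected")(in.drop(start - offset))
      }
    }
}

object StrictNameParsers extends FlexibleRegexParsers {
  // Hide the verbose call behind a short name.
  val digits: Parser[String] = regexWith("[0-9]+".r, handleWhiteSpace = false)

  // The first parser still skips leading whitespace; the second does not.
  val name: Parser[String ~ String] = "[A-Z]+".r ~ digits
}
```

With this sketch, StrictNameParsers.name should accept "ABC123" (and "  ABC123", since the first parser still skips whitespace) but reject "ABC 123".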

I'm interested in feedback on whether something like this would be supported in the library, or if there is another approach to handling this issue that I've missed. I can submit a pull request for the actual change if there is support.
