
Key: IDEADEV-11213
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Normal
Assignee: Alexey Kudravtsev
Reporter: Sascha Weinreuter
Votes: 0
Watchers: 1
IDEA: Development

ASTNode.getText() returns escaped text for Injected Language

Created: 29 Jul 06 22:04   Updated: 06 Dec 07 18:52
Component/s: Plugin Support. API
Fix Version/s: Selena 7.0.2

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown

Build: 5,581
Fixed in build: 7,565
Severity: High


Description
ASTNode.getText() for elements of a language that has been injected into a string literal returns the escaped text as it appears in the source file, even though the lexer has been passed the unescaped text. This is inconsistent and dangerous when relying on the text of an element for its further processing.

To work around that, I'd need to know whether and where the element is injected and unescape the text myself before processing it further. It's great that the text is parsed in its unescaped form, but that must also be consistent with the AST/PSI.

Example:

String re = "\b" -> The lexer for the injected language gets a single character, '\b', but ASTNode.getText() returns the two-character escape sequence, i.e. "\\b".
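
The mismatch in the example can be reproduced outside the IDE. The following is a minimal sketch, not IDEA code: the class LiteralUnescapeSketch and its unescape method are hypothetical names, and only the simple Java escapes are handled (octal and \uXXXX escapes are left out).

```java
public class LiteralUnescapeSketch {

    /**
     * Decodes the simple escape sequences in the body of a Java string
     * literal (the text between the quotes). Hypothetical helper, not part
     * of any IDEA API; octal and unicode escapes are not handled.
     */
    public static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 == body.length()) {
                out.append(c);          // ordinary character or trailing backslash
                continue;
            }
            char next = body.charAt(++i);
            switch (next) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                default:  out.append(next); // covers \", \' and \\
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fileText = "\\b";              // what the file contains: backslash, 'b'
        String lexerText = unescape(fileText); // what the injected lexer is given
        System.out.println(fileText.length());  // 2
        System.out.println(lexerText.length()); // 1
    }
}
```

The two lengths differ, which is exactly why offsets computed on the lexer's input do not line up with the text returned by getText().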


Comments
Sascha Weinreuter - 29 Jul 06 22:08
formatting

Sascha Weinreuter - 12 Aug 06 17:01
While not a showstopper, I think it's a rather odd behavior, at least from the API-user's point of view. Is this by design or is there a chance that this will change for the final 6.0?

It would not be the end of the world if it stays like it is, but then the behavior should be documented.


Alexey Kudravtsev - 08 Sep 06 12:40
This behaviour is by design.
There is a contract stating that text obtained from PSI should be the same as the file text.
I.e., the following should be true:
document.getText().equals(psiFile.getText())

And, moreover, this should also hold for any PSI element, i.e.
document.getText().substring(element.getTextRange().getStartOffset(), element.getTextRange().getEndOffset()).equals(element.getText())

must be true for any PSI element.
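
To illustrate why decoded text cannot satisfy this contract, here is a small self-contained sketch. It does not use the PSI classes: Element is a hypothetical stand-in that only carries a text range and the text the element reports.

```java
public class TextRangeContractSketch {

    /** Hypothetical stand-in for a PSI element: a document range plus the reported text. */
    public static final class Element {
        final int start;
        final int end;
        final String text;

        public Element(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Checks the contract: the reported text equals the document substring at the element's range. */
    public static boolean honorsContract(String documentText, Element e) {
        return documentText.substring(e.start, e.end).equals(e.text);
    }

    public static void main(String[] args) {
        // File content: String re = "\b";
        String documentText = "String re = \"\\b\";";
        int start = documentText.indexOf('\\');

        Element escaped = new Element(start, start + 2, "\\b"); // reports the file text
        Element decoded = new Element(start, start + 2, "\b");  // reports the decoded text

        System.out.println(honorsContract(documentText, escaped)); // true
        System.out.println(honorsContract(documentText, decoded)); // false
    }
}
```

An element that reported the decoded character would cover two characters of the document but return only one, so the substring check fails.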


Sascha Weinreuter - 08 Sep 06 13:01
I agree that this applies to PsiElements, but not necessarily to ASTNodes (even if, from the internal implementation's point of view, they are the same). The end result is that a language needs to know which context it is injected into (if it is injected at all):

Suppose I have a Token INTEGER_LITERAL: If injected into an XML attribute, the text can be "1234" or "1234": In both cases, the text passed to my lexer is the same - which is good. But how am I supposed to deal with that when I want to calculate the literal's value? There doesn't even seem to be any utility function that could help me to decode that myself.

Suggestion: Add a method getDecodedText() (or similar) to ASTNode and/or ASTWrapperPsiElement that at least provides a convenient solution for the case when I need to process the text myself.


Sascha Weinreuter - 08 Sep 06 16:31
Stupid JIRA formatting: Of course I meant "& #x31;& #x32;& #x33;& #x34;" (without the spaces)
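
For the XML attribute case, decoding by hand might look roughly like the sketch below. It is an illustration only: CharRefDecodeSketch and decodeCharRefs are hypothetical names, and only numeric character references are handled (named entities such as &amp;amp; are ignored).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharRefDecodeSketch {

    // Matches decimal (&#49;) and hexadecimal (&#x31;) character references.
    private static final Pattern CHAR_REF = Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    /**
     * Replaces numeric character references with the characters they denote.
     * Hypothetical helper; out-of-range values make parseInt or
     * appendCodePoint throw an exception.
     */
    public static String decodeCharRefs(String raw) {
        Matcher m = CHAR_REF.matcher(raw);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(raw, last, m.start());   // copy text before the reference
            String body = m.group(1);
            int codePoint = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body, 10);
            out.appendCodePoint(codePoint);
            last = m.end();
        }
        out.append(raw, last, raw.length());    // copy the remainder
        return out.toString();
    }

    public static void main(String[] args) {
        String attributeText = "&#x31;&#x32;&#x33;&#x34;"; // as returned for the token
        int value = Integer.parseInt(decodeCharRefs(attributeText));
        System.out.println(value); // 1234
    }
}
```

Calling Integer.parseInt directly on the undecoded attribute text would throw a NumberFormatException, which is the practical consequence of the behavior described above.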

Sascha Weinreuter - 17 Sep 06 20:29
Ok, here's another problem:

There's a difference whether an element is part of the prefix/suffix of an injected fragment or not. While elements that are part of e.g. a String literal return the escaped text, elements from the prefix/suffix return the unescaped text.

Even though this appears logical at first glance, it is kind of a showstopper because it makes it impossible to tell whether the text must be decoded manually or not (e.g. through getContainingFile().getContext() instanceof PsiLiteralExpression).

I see the following possibilities to address this (in order of preference):

  • fix this in a way that any language can be transparently injected
    or
  • add a method getDecodedText() (see comment above) that handles text of prefix/suffix correctly
    or
  • apply escaping rules of injection context to getText() of nodes in prefix/suffix as well
    or
  • provide some way to determine whether an ASTNode/PsiElement is part of the prefix/suffix and doesn't need to be unescaped

Please respond ASAP. Thanks.
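
The prefix/suffix problem can be pictured with a toy model. This is only a sketch of the idea behind the last option in the list, assuming each piece of injected text knows whether it came from the host literal; Segment, decode and the hostDecoder parameter are hypothetical and do not correspond to IDEA classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class PrefixSuffixSketch {

    /** Hypothetical piece of an injected fragment: its text and whether it comes from the host literal. */
    public static final class Segment {
        final String text;
        final boolean fromHost;

        public Segment(String text, boolean fromHost) {
            this.text = text;
            this.fromHost = fromHost;
        }
    }

    /** Decodes host segments only; prefix and suffix text is already unescaped and is copied as is. */
    public static String decode(List<Segment> segments, UnaryOperator<String> hostDecoder) {
        StringBuilder out = new StringBuilder();
        for (Segment s : segments) {
            out.append(s.fromHost ? hostDecoder.apply(s.text) : s.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Segment> fragment = Arrays.asList(
                new Segment("var x = ", false),    // prefix, unescaped
                new Segment("\\\"a\\\"", true),    // host literal body: \"a\"
                new Segment(";", false));          // suffix, unescaped

        // Toy decoder that only handles the \" escape.
        UnaryOperator<String> hostDecoder = s -> s.replace("\\\"", "\"");

        System.out.println(decode(fragment, hostDecoder)); // var x = "a";
    }
}
```

Without the fromHost flag the decoder would have to guess, and applying it to prefix or suffix text could corrupt characters that were never escaped.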


Alexey Kudravtsev - 18 Sep 06 13:20
All I can do in the meantime is refer you to the highly obscure and implementation-tied method

com.intellij.psi.impl.source.tree.injected.InjectedLanguageUtil#isInInjectedLanguagePrefixSuffix

which of course will change in the future, and so on.

Overall, things like prefix/suffix handling should be reviewed: currently, a single quote character (' or ") typed into injected JavaScript breaks all prefix/suffix handling, because it turns all text after the quote into one long string literal that spans the rest of the injected text, including the suffix.
All suggestions about transparent injection possibilities are very much welcome.


Sascha Weinreuter - 18 Sep 06 14:24
Well, that's good enough for me for the moment. Thanks a lot for the hint, this at least helps me to deal with the issue in a new language that is explicitly meant to be injected into strings.

Alexey Kudravtsev - 21 Nov 07 15:18
InjectedLanguageManager.getUnescapedText

Sascha Weinreuter - 21 Nov 07 15:53
Hmm, that still requires the language to be aware that it is potentially injected. I was looking for a more transparent solution, but I guess this would be too hard because it violates certain assumptions about PSI & text. But the new method is a good start anyway. Thanks.