this post was submitted on 06 Dec 2023

Problems parsing a string with pyparsing (lemmy.world)

submitted 11 months ago by [email protected] to c/[email protected]

I was trying to parse a string with pyparsing so that all the words are separated from the punctuation marks. I was using this expression to do it:

OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))

But when I parse the following string with this expression:

return abc(1, ULLONG_MAX)

All the words inside the parentheses get split:

['return', 'abc', '(', '1', ',', 'U', 'L', 'L', 'O', 'N', 'G', '_', 'M', 'A', 'X', ')', ';']

But if I use this expression:

OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))

Only a part of the string gets parsed:

['return', 'abc', '(']

What is wrong with those expressions?

[–] [email protected] 4 points 11 months ago* (last edited 11 months ago)

Personally, I would recommend using a regex for this instead, which would also make it easier to test your expressions. You could then get the list as

import re
yourstring = "return abc(1, ULLONG_MAX);"
# \w already matches the underscore, so ULLONG_MAX stays a single word if that's what you want
result = re.findall(r'\w+|\S', yourstring)
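To make the regex suggestion concrete, here is a small sketch with the outputs spelled out. The input string (including the trailing semicolon) is an assumption based on the output shown in the question, and the second pattern is only an illustrative variant for the case where the underscore should be split off as punctuation.

```python
import re

# Assumed input: the question's string plus the trailing semicolon seen in its output.
text = "return abc(1, ULLONG_MAX);"

# \w covers letters, digits and the underscore, so identifiers stay whole.
keep_underscore = re.findall(r"\w+|\S", text)
print(keep_underscore)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']

# [^\W_] is "a word character that is not an underscore",
# so the underscore falls through to \S and becomes its own token.
split_underscore = re.findall(r"[^\W_]+|\S", text)
print(split_underscore)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ')', ';']
```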

As for what's wrong with your expressions:

First expression: once you hit (, OneOrMore(Char(printables)) takes over and keeps matching every printable char, one at a time. Instead, you should use OR (|) with the alphanumeric word first so that it takes priority: OneOrMore(Word(alphanums) | Char(printables))
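A minimal sketch of the difference, assuming pyparsing is installed and that the input ended with a semicolon (as the output in the question suggests). The names text, original and fixed are just for illustration, and parseString is used because it is the spelling used in this thread (pyparsing 3 also offers parse_string).

```python
from pyparsing import Char, OneOrMore, Word, alphanums, printables

# Assumed input, including the trailing semicolon.
text = "return abc(1, ULLONG_MAX);"

# Sequence (+): words first, then single characters for the rest of the input.
original = OneOrMore(Word(alphanums)) + OneOrMore(Char(printables))
# Alternation (|): at every position try a whole word first, then a single character.
fixed = OneOrMore(Word(alphanums) | Char(printables))

original_tokens = list(original.parseString(text))
fixed_tokens = list(fixed.parseString(text))

print(original_tokens)
# ['return', 'abc', '(', '1', ',', 'U', 'L', 'L', 'O', 'N', 'G', '_', 'M', 'A', 'X', ')', ';']
print(fixed_tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ')', ';']
```

The underscore is not in alphanums, so ULLONG_MAX still comes out as three tokens here.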

Second expression: you're running into the same issue with your use of +. Once string.punctuation takes over, it keeps matching until it encounters a char that is not punctuation, and then the matching stops. Instead you can write:

import string
from pyparsing import OneOrMore, Word, alphanums
parser = OneOrMore(Word(alphanums) | Word(string.punctuation))
result = parser.parseString(yourstring)

Do note that the underscore counts as punctuation, so ULLONG_MAX will be split; not sure if that's what you want or not.
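If the underscore should stay inside identifiers, one possible variant (a sketch, not the only way to do it) is to add "_" to the word characters. The sketch also uses Char instead of Word for the punctuation, since Word(string.punctuation) groups adjacent punctuation characters into a single token. The input string with its trailing semicolon and the names identifier, punct, tokens and grouped are assumptions for illustration.

```python
import string
from pyparsing import Char, OneOrMore, Word, alphanums

# Assumed input, including the trailing semicolon.
text = "return abc(1, ULLONG_MAX);"

# Treat the underscore as a word character so identifiers stay whole.
identifier = Word(alphanums + "_")
# Char matches exactly one character, so every punctuation mark is its own token.
punct = Char(string.punctuation)

tokens = list(OneOrMore(identifier | punct).parseString(text))
print(tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG_MAX', ')', ';']

# For comparison: Word(string.punctuation) merges neighbouring punctuation like ");".
grouped = list(OneOrMore(Word(alphanums) | Word(string.punctuation)).parseString(text))
print(grouped)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ');']
```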

[–] [email protected] 3 points 11 months ago

Haven't used that particular library, but I have written libraries that do similar sorts of things and have played with a few other similar libraries in C++ and Haskell. I've taken a quick glance at the documentation here, but since I don't know this library specifically, apologies in advance if I make a mistake.

For OneOrMore(Word(alphanums)) + OneOrMore(Char(printables)), it looks like it matches as many alphanumeric Words as it can (whitespace sequences being an acceptable separator between tokens by default), and when it hits ( it cannot continue with that, so it tries to match the next expression in the sequence, i.e. OneOrMore(Char(printables)).

The documentation says:

Char - a convenience form of Word that will match just a single character from a string of matching characters

Presumably, that means it will not group the characters together, which is why you get individual character matches after that point for all the remaining non-whitespace characters. (Your result also seems to imply there was a semicolon at the end of your input?)

For OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation)), it looks like it cannot match further than ( since 1 is not a punctuation character, so you got the tokens for the parts of the string that matched. (If you chained the parser expression with something like + Word(alphanums), I'd expect you'd get another token [i.e. "1"] added onto the end of your result.) You may eventually want StringEnd/LineEnd or something like that. I'd expect they'd fail the parser expression if there's unconsumed input (for error detection), but again, I haven't used this specific library, so it may work differently than I expect.
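A small sketch of the StringEnd idea, assuming pyparsing is installed and that the input ended with a semicolon. The names partial, lenient_tokens and strict_failed are made up for illustration. Without StringEnd, the parser quietly returns whatever prefix it could match; with it, leftover input raises a ParseException.

```python
import string
from pyparsing import Char, OneOrMore, ParseException, StringEnd, Word, alphanums

# Assumed input, including the trailing semicolon.
text = "return abc(1, ULLONG_MAX);"

partial = OneOrMore(Word(alphanums)) + OneOrMore(Char(string.punctuation))

# Without an end-of-input check, the matched prefix is returned and the rest is ignored.
lenient_tokens = list(partial.parseString(text))
print(lenient_tokens)
# ['return', 'abc', '(']

# Appending StringEnd() turns the unconsumed "1, ULLONG_MAX);" into an error.
try:
    (partial + StringEnd()).parseString(text)
    strict_failed = False
except ParseException as exc:
    strict_failed = True
    print("parse failed:", exc)

# The alternation form consumes everything, so StringEnd() is satisfied.
full = OneOrMore(Word(alphanums) | Char(string.punctuation)) + StringEnd()
full_tokens = list(full.parseString(text))
print(full_tokens)
# ['return', 'abc', '(', '1', ',', 'ULLONG', '_', 'MAX', ')', ';']
```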

There appears to be a Combine class you can use to join string results together; that might be useful for future reference.
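As a rough illustration of what Combine does (a sketch, assuming pyparsing is installed; the variable names are hypothetical): it joins the tokens matched by the wrapped expression into a single string, and by default it requires them to be adjacent in the input.

```python
from pyparsing import Combine, Word, alphanums

# Hypothetical pattern: two alphanumeric runs joined by one underscore.
pieces = Word(alphanums) + "_" + Word(alphanums)

# Without Combine, each part is reported as a separate token.
split_tokens = list(pieces.parseString("ULLONG_MAX"))
print(split_tokens)
# ['ULLONG', '_', 'MAX']

# With Combine, the matched parts are concatenated into one token.
joined_tokens = list(Combine(pieces).parseString("ULLONG_MAX"))
print(joined_tokens)
# ['ULLONG_MAX']
```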

i was trying to parse a string with pyparsing so all the words were separated from the punctuation signs

Have not tested it (since I don't have a copy of the library installed anywhere and can't set up an environment for it easily right now) but perhaps something like OneOrMore(Word(alphanums)|Char(string.punctuation)) would be more like what you are looking for?