How I made a dataclass remover

Trey Hunner

12 min. read • Python 3.10—3.12 • June 30, 2022

This is a follow-up to Appreciating Python's match-case by parsing Python code. In that first post I discussed my adventures with Python's structural pattern matching.

I try not to work late at night, but I had an insight a few weeks ago that kept me on my computer long past my bedtime. I thought "could I make an automatic dataclass to regular class converter?" Spoilers: the answer was yes, by writing code that made sloppy-but-good-enough assumptions.

Pause. I want to note that I don't dislike dataclasses. I think dataclasses are great and I teach them often in my Python training sessions. In fact, it may seem counter-intuitive, but I'm hoping a dataclass remover will help me teach dataclasses more effectively.

When I teach Python's dataclasses I often show a "before" and "after", like an infomercial for a cleaning product.

Seeing the equivalent code for a dataclass helps us appreciate what dataclasses do for us. This process really drives home the point that dataclasses make friendly-to-use classes with less boilerplate code.

So I was up late writing a dataclass to regular class converter, which started as a script and eventually turned into a WebAssembly-powered web app. During this process I used some interesting Python features that I'd like to share.

How does this undataclass.py script work?

Essentially the undataclass.py script:

Parses the contents of a Python file into an abstract syntax tree (using the ast module)
Identifies dataclass-related AST nodes representing dataclasses, dataclass fields, and __post_init__ methods
Builds up strings representing Python code for the various equivalent non-dataclass methods
Parses those strings into AST nodes and injects them into the was-a-dataclass node
Converts all the AST nodes back into Python code
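
The steps above can be sketched in a few lines. This is a heavily simplified stand-in, not the real script: it strips every decorator from every class without checking for dataclasses, and the injected __repr__ is a hard-coded placeholder.

```python
import ast

source = """
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
"""

# 1. Parse the source text into an abstract syntax tree.
tree = ast.parse(source)

for node in ast.walk(tree):
    # 2. Find class definitions (the real script inspects the decorator).
    if isinstance(node, ast.ClassDef):
        # Drop every decorator; a simplification for this sketch.
        node.decorator_list = []
        # 3. Build replacement code as a string...
        method_code = "def __repr__(self):\n    return 'Point(...)'"
        # 4. ...then parse it and inject the resulting nodes.
        node.body += ast.parse(method_code).body

# 5. Convert the modified tree back into Python code.
result = ast.unparse(tree)
print(result)
```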

I used some tricks I don't usually get to use in Python. I used:

Many very hairy match-case blocks which replaced even hairier if-elif blocks
A sentinel object to keep track of a location that needed replacing
Python's textwrap.dedent utility, which I feel should be more widely known & used
slice assignment to inject one list into another
The ast module's unparse function to convert an abstract syntax tree into Python code

Let's take a look at some of the code.

Structural pattern matching

Python's structural pattern matching can match sequences (like lists and tuples) based on their length and content. With structural pattern matching (which uses the match-case syntax added in Python 3.10) we can turn this code (from Django):

def do_get_available_languages(parser, token):
    args = token.contents.split()
    if len(args) != 3 or args[1] != "as":
        raise TemplateSyntaxError(
            f"'get_available_languages' requires 'as variable' (got {args})"
        )
    return GetAvailableLanguagesNode(args[2])

Into this:

def do_get_available_languages(parser, token):
    match token.contents.split():
        case [name, "as", variable]:
            return GetAvailableLanguagesNode(variable)
        case [name, *rest]:
            raise TemplateSyntaxError(
                f"'{name}' requires 'as variable' (got {rest!r})"
            )

Python's match-case also allows for deep type checking and attribute content assertions, which allows turning this:

if isinstance(node, ast.Call):
    if (isinstance(node.func, ast.Attribute)
            and node.func.value.id == "dataclasses"
            and node.func.attr == "dataclass"):
        return True
    elif node.func.id == "dataclass":
        return True
elif (isinstance(node, ast.Attribute)
        and node.value.id == "dataclasses"
        and node.attr == "dataclass"):
    return True
elif isinstance(node, ast.Name) and node.id == "dataclass":
    return True
else:
    return False

Into this:

match node:
    case ast.Call(
        func=ast.Attribute(
            value=ast.Name(id="dataclasses"),
            attr="dataclass",
        ),
    ):
        return True
    case ast.Call(func=ast.Name(id="dataclass")):
        return True
    case ast.Attribute(
        value=ast.Name(id="dataclasses"),
        attr="dataclass"
    ):
        return True
    case ast.Name(id="dataclass"):
        return True
    case _:
        return False

Python's match-case statements tend to be very complex, but also much less visually dense than an equivalent if statement. I ultimately ended up using 7 match-case statements throughout the undataclass.py script, each of which replaced an even more complex if-elif statement.

If you're interested in how I used match-case during this adventure, see part one of this two part post on how I used match-case to parse Python code while writing undataclass.py.

The sentinel object

While looping through the AST nodes in a dataclass, I needed to keep track of where the new methods (__init__, __repr__, __eq__, etc.) should be inserted. It seemed most appropriate that these would be the first function definitions in our class, which means we'd insert these methods just before the first function definition we discovered.

Once I decided on my location-to-insert-methods, I needed a placeholder value to keep track of that location because I wouldn't actually have the methods-to-be-inserted until later on. But which value to use?

Objects that act as placeholders are often called "sentinel values". A sentinel value is useful for indicating something that isn't real data. In Python, the most common sentinel value is None. But you can also invent your own sentinel values in Python.

None didn't feel like an appropriate placeholder value to represent "the place dataclass-equivalent methods should go", so instead I made my own sentinel value. I called object to make a completely unique placeholder and then pointed the DATACLASS_STUFF_HERE variable to it (and yes that variable name isn't great):

DATACLASS_STUFF_HERE = object()

Then I stuck that unique placeholder object in a new_body list which I used to store all the new nodes that would overwrite the original nodes from the old dataclass:

match node:
    case ast.FunctionDef():
        if DATACLASS_STUFF_HERE not in new_body:
            new_body.append(DATACLASS_STUFF_HERE)
        if node.name == "__post_init__":
            post_init = node.body
        else:
            new_body.append(node)
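
The same placeholder idea can be tried without any AST nodes. In this sketch plain strings stand in for the nodes, and the names PLACEHOLDER and mark_insertion_point are made up for illustration.

```python
PLACEHOLDER = object()  # a unique object: it is equal only to itself


def mark_insertion_point(items, is_function):
    """Copy items, adding PLACEHOLDER just before the first function."""
    new_items = []
    for item in items:
        if is_function(item) and PLACEHOLDER not in new_items:
            new_items.append(PLACEHOLDER)
        new_items.append(item)
    return new_items


body = ["x: int", "y: int", "def area(self): ...", "def scale(self): ..."]
marked = mark_insertion_point(body, lambda line: line.startswith("def "))

print(marked.index(PLACEHOLDER))  # 2
```

Because object() compares equal to nothing but itself, the placeholder can never be confused with a real item in the list.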

But where did I replace this placeholder object with something useful? That's where slice assignment came in.

We'll get to slice assignment later. First let's talk about generating the AST nodes for those new methods.

Generating AST nodes for Python code

My make_dataclass_methods function accepted the class name, the options provided to the dataclass decorator, the dataclass fields found, and a list of the AST nodes found in the __post_init__ method (if any). This function then returned a list of AST nodes that represented the new methods we needed (__init__, __repr__, etc.).

dataclass_extras = make_dataclass_methods(
    dataclass_node.name,
    options,
    fields,
    post_init,
)

This make_dataclass_methods function is essentially a big chain of if statements that check for certain scenarios related to our dataclass options:

def make_dataclass_methods(class_name, options, fields, post_init):
    """Return AST nodes for all new dataclass attributes and methods."""
    nodes = []
    kw_only_fields = process_kw_only_fields(options, fields)
    init_fields, init_vars = process_init_vars(fields)
    if options.get("slots", False):
        nodes += ast.parse(make_slots(fields)).body
    if options.get("match_args", True):
        nodes += ast.parse(make_match_args(fields)).body
    if options.get("init", True):
        nodes += ast.parse(make_init(
            init_fields,
            post_init,
            init_vars,
            options.get("frozen", False),
            kw_only_fields,
        )).body
    if options.get("repr", True):
        nodes += ast.parse(make_repr(fields)).body
    if options.get("eq", True):
        nodes += ast.parse(make_order("==", class_name, fields)).body
    if options.get("order", False):
        nodes += ast.parse(make_order("<", class_name, fields)).body
    if (options.get("frozen", False) and options.get("eq", True)
            or options.get("unsafe_hash", False)):
        nodes += ast.parse(make_hash(fields)).body
    if options.get("frozen", False):
        nodes += ast.parse(make_setattr_and_delattr()).body
        if options.get("slots", False):
            nodes += ast.parse(make_setstate_and_getstate(fields)).body
    return nodes

This acts like a restaurant menu: it figures out which features we want and then gives us the AST nodes representing those features. It asks questions like this:

Is slots=True set? Great, add nodes for __slots__.
Is repr not set to False? Great, add nodes for __repr__.
Is order set to True? Great, add nodes for __lt__.

Note that in each of these if statements we have a line that looks like this:

nodes += ast.parse(make_SOMETHING_OR_OTHER()).body

That make_SOMETHING_OR_OTHER function returns a string representing Python code. Once we get that string, we use ast.parse to parse it and then grab the body attribute from the resulting node to get its subnodes. We then use += to extend our nodes list with these new subnodes.
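
Here is a small sketch of that parse-then-extend pattern on its own. The code strings are hard-coded stand-ins for what the make_SOMETHING_OR_OTHER functions would return.

```python
import ast

methods_code = (
    "def __repr__(self):\n"
    "    return 'Point(...)'\n"
    "\n"
    "def __eq__(self, other):\n"
    "    return NotImplemented\n"
)

nodes = []
# ast.parse returns a Module node; its body is a list of statement nodes.
nodes += ast.parse("__slots__ = ('x', 'y')").body
nodes += ast.parse(methods_code).body

print([type(n).__name__ for n in nodes])  # ['Assign', 'FunctionDef', 'FunctionDef']
print([n.name for n in nodes[1:]])        # ['__repr__', '__eq__']
```

One string can hold several statements, so a single ast.parse call may contribute more than one node to the list.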

If we used pdb to inspect the nodes list just before we return from this function, we might see something like this:

> undataclass.py(307)make_dataclass_methods()
-> return nodes
(Pdb) pp nodes
[<ast.Assign object at 0x7f2c03307070>,
 <ast.FunctionDef object at 0x7f2c03307bb0>,
 <ast.FunctionDef object at 0x7f2c03307eb0>,
 <ast.FunctionDef object at 0x7f2c03307fa0>]
(Pdb) nodes[1].name
'__init__'
(Pdb) pp nodes[1].body
[<ast.Assign object at 0x7f2c03307b20>,
 <ast.Assign object at 0x7f2c03307970>]
(Pdb) ast.unparse(nodes[1].body[0])
'self.x = x'

That second node (nodes[1]) represents an __init__ function.

So each of our make_SOMETHING_OR_OTHER functions needs to generate Python code. But how do they do that?

Messily.

Building strings that represent Python code

Each of the make_SOMETHING_OR_OTHER functions essentially made strings that represent bits of code and then glued those strings together with f-strings and the string join method.

Have you ever written strings that represent Python code from within Python? No? That's probably a good thing! This part was unavoidably very messy.

For example here's the code that generates __slots__ (if slots=True is set):

def attr_name_tuple(fields):
    """Return code for a tuple of all field names (as strings)."""
    joined_names = ", ".join([
        repr(f.name)
        for f in fields
    ])
    if len(fields) == 1:  # Single item tuples need a trailing comma
        return f"({joined_names},)"
    else:
        return f"({joined_names})"


def make_slots(fields):
    """Return code of __slots__."""
    return f"__slots__ = {attr_name_tuple(fields)}"

This built up and returned a single line of Python code (that __slots__ string below):

>>> from types import SimpleNamespace as field
>>> from undataclass import make_slots
>>> make_slots([field(name="x"), field(name="y")])
"__slots__ = ('x', 'y')"

Here's the code that builds up our __repr__ method:

def make_repr(fields):
    """Return code for the __repr__ method."""
    repr_args = ", ".join([
        f"{f.name}={{self.{f.name}!r}}"
        for f in fields
        if f.repr
    ])
    return dedent("""
        def __repr__(self):
            cls = type(self).__name__
            return f"{{cls}}({repr_args})"
    """).format(repr_args=repr_args)

That make_repr function returns a string that represents the Python code needed for a friendly __repr__ method:

>>> from types import SimpleNamespace as field
>>> from undataclass import make_repr
>>> print(make_repr([field(name="x", repr=True), field(name="y", repr=True)]))

def __repr__(self):
    cls = type(self).__name__
    return f"{cls}(x={self.x!r}, y={self.y!r})"

Note that the returned code isn't indented, even though the multi-line string we wrote to generate this code is indented. The magic here is in the textwrap.dedent utility. Python's textwrap.dedent was super helpful for generating all the needed Python code.

Without that dedent call above, the output would look like this:

        def __repr__(self):
            cls = type(self).__name__
            return f"{cls}(x={self.x!r}, y={self.y!r})"

Instead of like this:

def __repr__(self):
    cls = type(self).__name__
    return f"{cls}(x={self.x!r}, y={self.y!r})"

I use dedent in lots of my own code that involves multi-line strings and many Python Morsels exercises include solutions that use dedent. You can see a demo of textwrap.dedent in action here. If you ever need to remove indentation from a multi-line string in Python, I highly recommend taking a look at textwrap.dedent.
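
For a quick feel for dedent, here is a minimal sketch. The make_greet_code function is invented for this example; it returns code as a string and then runs that code to show the dedented result is valid at the top level of a module.

```python
from textwrap import dedent


def make_greet_code():
    """Return code for a greet function, written as an indented string."""
    return dedent("""
        def greet():
            return "hello"
    """)


code = make_greet_code()
print(code)

# The common leading whitespace is gone, so this runs as top-level code.
namespace = {}
exec(code, namespace)
print(namespace["greet"]())  # hello
```

dedent only strips the whitespace that every non-blank line shares, so the indentation of the function body relative to the def line is preserved.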

Even with dedent, we can't be saved from messy code here. Code that generates Python code isn't pretty by its very nature. But keep in mind that the alternative would have been creating lots of AST nodes manually. Writing Python code within strings and then using ast.parse to parse those strings made for much more readable code.
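
To make that comparison concrete, here is a sketch that builds the same one-line __slots__ assignment both ways. The manual version is my own illustration of the alternative, not code from undataclass.py.

```python
import ast

# Option 1: write the code as a string and let ast.parse build the nodes.
parsed = ast.parse("__slots__ = ('x', 'y')")

# Option 2: build the same assignment by hand, node by node.
manual = ast.Module(
    body=[
        ast.Assign(
            targets=[ast.Name(id="__slots__", ctx=ast.Store())],
            value=ast.Tuple(
                elts=[ast.Constant(value="x"), ast.Constant(value="y")],
                ctx=ast.Load(),
            ),
        )
    ],
    type_ignores=[],
)
# Hand-built nodes have no line numbers; fill in defaults before using them.
ast.fix_missing_locations(manual)

same_code = ast.unparse(manual) == ast.unparse(parsed)
print(same_code)  # True
```

A dozen lines of node construction for a single assignment gives a sense of how unwieldy a whole __init__ method would be.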

Slice assignment

After we call that make_dataclass_methods function, we'll have a list of AST nodes (pointed to by a dataclass_extras variable):

dataclass_extras = make_dataclass_methods(
    dataclass_node.name,
    options,
    fields,
    post_init,
)

What do we do with that list?

Remember that DATACLASS_STUFF_HERE sentinel value we used as a placeholder? We need to replace it with all the nodes in our dataclass_extras list now.

We can use slice assignment to do that:

if DATACLASS_STUFF_HERE in new_body:
    index = new_body.index(DATACLASS_STUFF_HERE)
    new_body[index:index+1] = dataclass_extras
else:
    new_body += dataclass_extras
dataclass_node.body = new_body

If DATACLASS_STUFF_HERE was not in our new_body list, then we add all the nodes to the end of our list. But if DATACLASS_STUFF_HERE was in our new_body list then we find its position and then replace it with all those new AST nodes we made. We're doing it through slice assignment.

Did you know you can assign to a slice in Python? It's a somewhat strange thing to see, but it's super helpful during those rare times that it's useful:

>>> numbers = [2, 1, 11, 18]
>>> numbers[1:1] = [3, 4, 7]
>>> numbers
[2, 3, 4, 7, 1, 11, 18]
>>> numbers[2:] = [29, 47]
>>> numbers
[2, 3, 29, 47]
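
Replacing a single placeholder with several items uses a one-item slice. Here is a sketch with strings standing in for AST nodes; PLACEHOLDER and the list contents are invented for the example.

```python
PLACEHOLDER = object()
body = ["x: int", PLACEHOLDER, "def area(self): ..."]
alias = body  # a second reference to the same list

index = body.index(PLACEHOLDER)
# A one-item slice: swap that single item for as many items as we like.
body[index:index + 1] = ["def __init__(self): ...", "def __repr__(self): ..."]

print(body)
print(alias is body)  # True: the list was changed in place
```

Slice assignment mutates the existing list, so any other variable pointing to that list sees the change too.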

Note that I could have instead made a new list using slicing:

if DATACLASS_STUFF_HERE in new_body:
    index = new_body.index(DATACLASS_STUFF_HERE)
    new_body = new_body[:index] + dataclass_extras + new_body[index+1:]
else:
    new_body += dataclass_extras
dataclass_node.body = new_body

But that's not quite as fun, is it? 😜

Yes this adventure resulted in a useful tool for teaching dataclasses, but my primary motivation was to have fun doing something I don't normally get to do.

Unparsing abstract syntax trees

Now that we've modified each of our dataclass AST nodes to un-dataclass them, how do we generate the Python code that our abstract syntax tree represents?

We can use ast.unparse for that!

    return ast.unparse(new_nodes)

When we inspected the nodes list generated by the make_dataclass_methods function earlier, if we'd called ast.unparse on that nodes list, we might have seen something like this:

(Pdb) print(ast.unparse(nodes))
__match_args__ = ('x', 'y')

def __init__(self, x: float, y: float) -> None:
    self.x = x
    self.y = y

def __repr__(self):
    cls = type(self).__name__
    return f'{cls}(x={self.x!r}, y={self.y!r})'

def __eq__(self, other):
    if not isinstance(other, Point):
        return NotImplemented
    return (self.x, self.y) == (other.x, other.y)

The unparse function accepts a tree of AST nodes and returns the Python code that those nodes represent. Neat, huh?

The big downside to using ast.unparse is that we lose the original formatting of our code. How many blank lines did we use? How did we wrap our code? And were there code comments? We lose all of that!
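
A small round trip shows the loss. The source string below is made up for this sketch.

```python
import ast

source = '''\
# Coordinates are in meters.
point = {"x": 1,
         "y": 2}  # trailing comment


total   =   point["x"]+point["y"]
'''

result = ast.unparse(ast.parse(source))
print(result)
# point = {'x': 1, 'y': 2}
# total = point['x'] + point['y']
```

The comments, the blank lines, the line wrapping, and even the choice of quote character are all gone, because none of them are stored in the abstract syntax tree.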

But this tool isn't meant to generate exactly the replacement code we need. The generated code is meant to be an example of what a non-dataclass version would look like. For that purpose, ast.unparse is certainly good enough.

This code makes sloppy assumptions

The code I ended up writing was very sloppy in the assumptions it made.

This dataclass converter isn't intended to automatically turn every possible dataclass into a fully functional regular class. That task simply isn't possible: some limits need to be set.

I decided that I would make fairly reasonable assumptions about the ways dataclasses are typically written and run with those assumptions. If later on I need to refactor a section that assumed a bit too much, so be it! That's either a problem for my future self or (more likely) a problem I'll never need to worry about.

That was fun, now go appreciate dataclasses

I think Python's dataclasses are great. Dataclasses encourage Python programmers to make classes that have a friendly __init__ method, a helpful string representation, and allow for sensible equality checks. I made this tool to demonstrate what dataclasses do for us.

You could also use this code to actually replace dataclasses, and that's sometimes helpful. All programming abstractions are a trade-off, and sometimes dataclasses become slightly more hassle than they're worth. At that point, you could consider diving deeper into another abstraction (like attrs, for example) or creating your class manually by converting your dataclass to a regular class.

Regardless of whether, how, and when you use dataclasses I hope you learned something from my adventures parsing Python code to turn dataclasses into regular classes (including the first part of this journey on appreciating Python's match-case by parsing Python code). And I hope this journey inspires you to write your own code to sloppily perform silly tasks.
