argparse or argh … pass

As part of our data analytics infrastructure migration, we’re using the argparse package much more heavily to standardise how we run scripts. Along the way I ran into several issues and spent some time debugging them. Here’s a summary of what I’ve learnt in the past week, shared in the hope of saving you some time if you encounter the same!

 

Jump to:

  1. $ sign in arguments

  2. Boolean arguments

  3. Arguments that are not always required

 

Background

The previous iteration of our data analytics infrastructure was our first experience with Python and with running things on a cloud instance. Needless to say, many things were not done well and were hard to maintain. With that in mind, we recently made a push to rewrite our code using updated packages and in a standardised way. Scheduling tasks via DAGs on Airflow also meant that we had to keep the code more disciplined and standardised.

Argparse helps us assign the right parameters to the right functions. Having used it more extensively now, I’ve found some quirks which will be basic to some, but are hopefully helpful to point out to others. The solutions below are the ones I’ve adopted to resolve the issues. With lots of help from Stack Overflow, of course.

 

$ sign in arguments

One of the string variables we had to enter had a $ sign in it, and everything from the $ sign onwards was not passed through. The $ sign arose from us having to send the table_id of a BigQuery partitioned table together with its partition assignment (denoted by the date string after the $ sign).

The solution to it can be found here.

The shell interprets $ as the start of a variable and expands it before argparse ever sees the argument. This expansion occurs even if double quotes are applied to the string. To resolve this, either enclose your entire string in single quotes, or escape the $ sign with \$.
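To make the point concrete, here is a minimal sketch showing that argparse itself leaves the $ sign alone, and that the damage happens in the shell. The table ID and the --table_id argument name are made up for illustration. When the command is built from Python (for example, as a string handed to a scheduler), shlex.quote from the standard library wraps the value in single quotes for you.

```python
import argparse
import shlex

# Hypothetical table ID with a partition suffix after the $ sign.
TABLE_ID = "my_dataset.my_table$20200101"

parser = argparse.ArgumentParser()
parser.add_argument("--table_id")  # assumed argument name

# Passing the argument list directly bypasses the shell, so nothing is
# expanded and argparse receives the full string, $ sign included.
args = parser.parse_args(["--table_id", TABLE_ID])
print(args.table_id)

# shlex.quote wraps the value in single quotes, which makes it safe to
# place in a shell command line.
quoted = shlex.quote(TABLE_ID)
print("python test.py --table_id " + quoted)
```

If the full string prints in the first case but not when running from the terminal, the shell quoting is the thing to fix, not the parser.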

 

Boolean arguments

Intending to set up a boolean argument, I set the argument type to bool thinking that it would correctly interpret an incoming True or False. Turns out, either input evaluated to True.

And the solution to that, here.

Passing False as the argument value does not produce the bool False, but the string “False”. With the type set to bool, argparse calls bool() on that string, and every string evaluates to True other than the empty string. To be able to pass the strings True and False in the argument and have them read as expected, a separate function has to be created that first asserts that the input is the string “True” or “False”, then returns whether or not the input is “True”. This function is then assigned to the type parameter of the argument.
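A minimal sketch of such a function is below. The name str_to_bool and the argument names --naive and --strict are my own choices for illustration, and instead of a bare assert it raises argparse.ArgumentTypeError, so that argparse reports a normal usage error on bad input.

```python
import argparse


def str_to_bool(value):
    """Convert the strings 'True' / 'False' into real booleans."""
    if value not in ("True", "False"):
        # argparse turns this exception into a regular usage error.
        raise argparse.ArgumentTypeError(
            f"expected 'True' or 'False', got {value!r}"
        )
    return value == "True"


parser = argparse.ArgumentParser()
parser.add_argument("--naive", type=bool)          # bool("False") is True
parser.add_argument("--strict", type=str_to_bool)  # parsed as intended

args = parser.parse_args(["--naive", "False", "--strict", "False"])
print(args.naive, args.strict)  # prints: True False
```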

 

Arguments that are not always required

In the current migration we are also trying to group similarly used functions in the same script/task. For example, a task that creates dataset A and a task that creates dataset B may be placed in one script. The possible functions are imported via another script, utils. The arguments received via argparse are passed along to the different functions in just a few lines:

args = vars(cli_args)
task_name = args.pop('task')
task_func = getattr(utils, task_name)
task_func(**args)

 

However, task_a may require inputs x and y, while task_b requires inputs x, y and z. Setting up argparse to accept inputs x, y and z and using the above to hand the inputs over to the functions will result in task_a failing with the following message:

task_a() got an unexpected keyword argument 'z'

 

This happens even if z is not specified in the shell command. A closer look, printing out args while not adding any of the arguments to the command, e.g.:

pipenv run python -u test.py

reveals:

{'y': None, 'x': None, 'z': None}

In other words, once x, y and z are configured as arguments in the parser, argparse hands them over with the value None whenever they are not provided.
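This is easy to reproduce in isolation. The sketch below assumes x, y and z are plain optional arguments (--x, --y, --z) with no default set. Parsing an empty argument list stands in for running the script with no arguments.

```python
import argparse

parser = argparse.ArgumentParser()
for name in ("x", "y", "z"):
    # No default is given, so argparse falls back to None.
    parser.add_argument("--" + name)

# An empty list simulates calling the script without any arguments.
args = vars(parser.parse_args([]))
print(args)
```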

 

Thus what is required is to remove the None arguments before passing them to the function:

import utils  # the script holding the task functions

def main(cli_args):
    args = vars(cli_args)
    task_name = args.pop('task')
    # Keep only the arguments that were actually provided;
    # argparse sets the rest to None.
    args = {name: value for name, value in args.items() if value is not None}
    task_func = getattr(utils, task_name)
    task_func(**args)

And with that, arguments that are not used will not be erroneously handed over to the function. If a function may or may not take certain arguments, this handling can instead be done in the function itself.
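For anyone who wants to try the whole flow end to end, here is a self-contained sketch. The bodies of task_a and task_b, the --task/--x/--y/--z argument names and the SimpleNamespace standing in for the utils script are all placeholders, since the real ones live in our own code.

```python
import argparse
from types import SimpleNamespace


def task_a(x, y):
    # Placeholder task that only needs x and y.
    return f"task_a: x={x}, y={y}"


def task_b(x, y, z):
    # Placeholder task that needs x, y and z.
    return f"task_b: x={x}, y={y}, z={z}"


# Stand-in for the real utils script, so the example runs on its own.
utils = SimpleNamespace(task_a=task_a, task_b=task_b)


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", required=True)
    for name in ("x", "y", "z"):
        parser.add_argument("--" + name)
    return parser


def main(cli_args):
    # Drop arguments that were not provided (argparse sets them to None).
    args = {k: v for k, v in vars(cli_args).items() if v is not None}
    task_name = args.pop("task")
    task_func = getattr(utils, task_name)
    return task_func(**args)


result = main(build_parser().parse_args(["--task", "task_a", "--x", "1", "--y", "2"]))
print(result)  # prints: task_a: x=1, y=2
```

One caveat of filtering on None: an argument that is missing but genuinely required by the task now surfaces as a missing-argument TypeError from the function rather than as a parser error.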

 

No more passing

And those are just a few of the things I’ve learnt about argparse over the past week. Looking forward to learning more this week! Any suggestions on how the above can be tackled better? Any issues that you’ve faced using argparse? Let me know in the comments below!
