Why so many Python projects lack dependencies management in git, and what you can do about it
Tabnine Team /
3 minutes /
February 15, 2022

Modern software almost always depends on external libraries, and a project rarely starts from scratch. JavaScript apps are built on top of popular platforms and frameworks such as Node.js, Express, React, or Angular. The Apache Software Foundation has contributed countless supporting libraries for Java programming. Python is no different in that sense. For many Python programmers, the first step in a new project is to grab Flask, boto3, or pandas from PyPI, not to mention the massive deep learning frameworks.

Still, we observed a strange phenomenon. We scanned the source code and project files of 549,007 Python repositories from GitHub. At least 410,370 of them had import clauses in their source files for modules from a PyPI library. Of these, 321,482 (78.3%) did not have a dependencies management file, e.g. requirements.txt, in the top folder. In contrast, of more than 1M JavaScript repositories, only about half did not include npm files. Why is storing dependencies management files in version control more common in JavaScript than in Python, by nearly 30 percentage points?

We can only speculate about the cause. One thing, however, stands out: in the list of popular external modules from repositories lacking dependencies management, numpy and matplotlib outrank Django and Flask. This leads us to believe that the gap stems from Python’s role as an on-demand number cruncher.

The costs of Python

Python is known, and loved, for its low administrative overhead. To get some results from Python, just fire up the interpreter from your preferred environment and start hacking away. No initialization steps, no need for a well-structured project folder. Interactive tools, such as IPython and Jupyter, build upon this trait. You can have many ad-hoc sessions and notebooks instead of a redundant mini-project for each one.

This loose attitude comes at a cost. As in many scripting languages, code execution in Python is decoupled from dependencies management. You configure an environment, with its installed libraries, per interpreter, and then run your code with whichever interpreter you choose. Unlike Maven and npm, Python tools such as pip and conda do not keep the list of dependencies in a file per project. The installed packages are stashed away in the environment directory.
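
To see this coupling between interpreter and libraries for yourself, the following minimal sketch prints where the running interpreter keeps its packages. The function name environment_summary is ours, chosen for illustration; it uses only the standard sys and sysconfig modules.

```python
import sys
import sysconfig


def environment_summary():
    """Describe where the running interpreter keeps its installed libraries."""
    return {
        "interpreter": sys.executable,
        "environment": sys.prefix,
        # "purelib" is the site-packages directory for pure-Python packages.
        "site_packages": sysconfig.get_paths()["purelib"],
    }


for key, value in environment_summary().items():
    print(f"{key}: {value}")
```

Run it with two different interpreters or virtual environments and the paths will differ, while your project folder stays exactly the same.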

Keeping track of installed libraries can be redundant for short-lived hacks. However, if your code is important enough to put under version control, tracking external dependencies is essential. It makes sure that your work remains functional even when checked out freshly, by someone else or by your future self. How can one keep an up-to-date requirements file with minimal fuss?

Tracking Python dependencies easily

The key to keeping track of your Python installations is to first write the dependency to a designated file and then have it installed, instead of remembering to do the opposite. We collected several handy tricks for doing so:

Pip install from file

Using pip and a requirements.txt file is the most common method for dependencies management in Python. Updating the requirements file manually can be tedious and error-prone. Instead, let’s combine adding a dependency to the file and installing it into a single shell command: echo some-lib >> requirements.txt; pip install -r requirements.txt

In Bash, you can use this function:

REQ_FILE=path/to/requirements.txt
# Record a dependency in the requirements file, then install from that file.
pip_add() {
  echo "$1" >> "$REQ_FILE"
  pip install -r "$REQ_FILE"
}

Some IDEs, such as JetBrains PyCharm, support this method natively, so modifying requirements.txt will prompt an option to install the new dependency.
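
If you prefer to stay inside Python, the same write-first, install-second idea can be sketched as below. This is only an illustration under our own assumptions: add_requirement and pip_add are hypothetical helper names, and unlike the shell version this sketch skips a dependency that is already listed verbatim.

```python
import subprocess
import sys
from pathlib import Path


def add_requirement(req_file, requirement):
    """Append `requirement` to `req_file` unless an identical line exists.

    Returns True if the file was changed.
    """
    path = Path(req_file)
    text = path.read_text() if path.exists() else ""
    if requirement in (line.strip() for line in text.splitlines()):
        return False
    with path.open("a") as handle:
        # Make sure the new entry starts on its own line.
        if text and not text.endswith("\n"):
            handle.write("\n")
        handle.write(requirement + "\n")
    return True


def pip_add(req_file, requirement):
    """Record the dependency first, then install everything in the file."""
    add_requirement(req_file, requirement)
    # Running pip through sys.executable targets the current environment.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", str(req_file)],
        check=True,
    )
```

Calling pip_add("requirements.txt", "some-lib") updates the file and then installs from it, so the file can never lag behind the environment.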

Use an advanced dependencies manager

Pip, the ubiquitous Python dependencies manager, is designed to be lightweight and simple. It’s a good design for a baseline tool. However, alternatives do exist. Both Poetry and Pipenv offer similar basic commands (poetry add, pipenv install), but they record each installation in a project file while doing so: pyproject.toml for Poetry, Pipfile for Pipenv. If you want to keep a pip-like workflow while putting your dependencies under version control, consider using one of these alternative managers.

Extract dependencies from an existing project

The above tips would work great for a new project, but what can you do if you have an existing code base that you want to commit to version control?

The out-of-the-box option is to use the pip freeze command. It prints all the currently installed packages, and you can redirect that output to a file, e.g. pip freeze > requirements.txt. Similarly, use conda list if you use Anaconda.

There are, however, two drawbacks to this approach:

  • Each will detect only libraries installed using its native command, i.e. pip or conda. If you have a mixture, or external installations, it will not work.
  • It will output all the packages, not just the ones you installed explicitly. The list will be lengthy, and hard to read and edit.
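
One way to soften the second drawback is to keep only the packages that no other installed package depends on. The sketch below is a rough approximation, not a pip feature: the helper names are ours, requirement strings are parsed with a simple regular expression, and optional extras are ignored. It reads the environment through the standard importlib.metadata module.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a project name the way PyPI compares names (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name from a requirement string like 'numpy>=1.20'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    return normalize(match.group(1)) if match else None


def top_level_packages(installed):
    """Return names that no other installed package lists as a requirement.

    `installed` maps a project name to its list of requirement strings.
    """
    required = set()
    for requirements in installed.values():
        for requirement in requirements:
            # Skip optional extras: they are not installed by default.
            if "extra ==" in requirement or "extra==" in requirement:
                continue
            name = requirement_name(requirement)
            if name:
                required.add(name)
    return sorted(name for name in installed if normalize(name) not in required)


def installed_packages():
    """Collect name -> requirement strings for the current environment."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            found[name] = dist.requires or []
    return found
```

Calling top_level_packages(installed_packages()) gives a much shorter list to review by hand. It can still hide a package you installed explicitly if another package happens to require it, so treat the result as a starting point.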

pipreqs is an intriguing project. It aims to extract a list of dependencies and versions directly from the source code. It makes use of the PyPI API and Python modules’ __version__ metadata field. Furthermore, it keeps a list of stdlib packages, and a mapping from non-standard module names to their originating projects. Mapping an import clause from source code to a package name is not a trivial task, so the output may not be perfect. Of course, it will also miss modules that are dynamically imported, e.g. using importlib. That said, it can be a real life-saver when dealing with large legacy projects.
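
To give a feel for the source-scanning idea, here is a minimal sketch that collects imported module names with the standard ast module. It is not how pipreqs is implemented; third_party_imports is a hypothetical helper, and the stdlib filter assumes Python 3.10+, where sys.stdlib_module_names is available.

```python
import ast
import sys

# sys.stdlib_module_names exists on Python 3.10+; fall back to an empty set.
STDLIB = frozenset(getattr(sys, "stdlib_module_names", ()))


def third_party_imports(source, stdlib=STDLIB):
    """Return top-level module names imported by `source`, minus the stdlib."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import of the project's own code.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return sorted(modules - set(stdlib))
```

The result is a list of import names, not package names: yaml is installed as PyYAML and sklearn as scikit-learn, which is exactly the mapping problem described above. Modules loaded through importlib.import_module are also invisible to this kind of static scan, and the project's own top-level modules would still need to be filtered out.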

Happy hacking!