Warm tip: This article is reproduced from serverfault.com, please click

Using Natural Language Tool Kit with Django on Heroku

发布于 2020-11-21 15:56:02

I’ve got a basic Django project. One feature I am working on counts the number of most commonly occurring words in a .txt file, such as a large public domain book. I’ve used the Python Natural Language Tool Kit to filter out “stopwords” (in SEO language, that means redundant words such as ‘the’, ‘you’, etc. ).

Anyways, I’m getting this debug traceback when Django serves the template:

Resource [93mstopwords[0m not found. Please use the NLTK Downloader to obtain the resource: [31m <<< import nltk nltk.download('stopwords') [0m For more information see: https://www.nltk.org/data.html

So I need to download the library of stopwords. To resolve the issue, I simply open a Python REPL on my remote server and invoke these two straightforward lines:

<<< import nltk
<<< nltk.download('stopwords')

That's covered at length elsewhere on SO. That resolves the issue, but only temporarily. As soon as the REPL session is terminated on my remote server, the error returns because the stopwords file just evaporates.

I noticed something strange when I use git to push my changes up to my remote server on Heroku. Check this:

remote: -----> Python app detected
remote: -----> No change in requirements detected, installing from cache
remote: -----> Installing pip 20.1.1, setuptools 47.1.1 and wheel 0.34.2
remote: -----> Installing SQLite3
remote: -----> Installing requirements with pip
remote: -----> Downloading NLTK corpora…
remote:  !     'nltk.txt' not found, not downloading any corpora
remote:  !     Learn more: https://devcenter.heroku.com/articles/python-nltk 
remote: -----> $ python manage.py collectstatic --noinput
remote:        122 static files copied to '/tmp/build_f2f9d10f/staticfiles', 388 post-processed.

That devcenter link is kind of like a stub, meaning that it’s not very detailed. It’s sparse at best. The article says that to use Python nltk, you need to add an nltk.txt file to the project directory which specifies the list of objects for Heroku to download. So I went ahead and created an nltk text file which contained:

corpora

Here is this active nltk.txt currently located in my project directory. In addition to coprora, I also tried adding various combinations of the following three entries to nltk.txt:

corpus

stoplist

english

I tried adding all four, just two and just one. For example, here is an alternate nltk.txt that I tried verbatim. My feeling is that the main one I really need is just corpora, so that is the only entry in the nltk.txt that I am working with right now. With corpora there, when I push the change and Heroku builds the environment, I see this error and trace-back:

remote: -----> Downloading NLTK corpora…
remote: -----> Downloading NLTK packages: corpora english stopwords corpus
remote: /app/.heroku/python/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote:   warn(RuntimeWarning(msg))
remote: [nltk_data] Error loading corpora: Package 'corpora' not found in
remote: [nltk_data]     index
remote: Error installing package. Retry? [n/y/e]
remote: Traceback (most recent call last):
remote:   File "/app/.heroku/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
remote:     "__main__", mod_spec)
remote:   File "/app/.heroku/python/lib/python3.6/runpy.py", line 85, in _run_code
remote:     exec(code, run_globals)
remote:   File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 2538, in <module>
remote:     halt_on_error=options.halt_on_error,
remote:   File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 790, in download

I am clearly not using nltk.txt properly because it isn’t finding the corpora package. I can install nltk and have it run without issue in my local dev server but my remaining question is this: how do I make Heroku handle nltk properly remotely in this situation?

User Michael Godshall provides the same answer to more than one Stack Overflow question explaining that you can create a bin directory within the project root and add both a post_compile bash script and a install_nltk_data script. However this is no longer necessary because heroku-buildpack-python upstream maintainer Kenneth Reitz implemented an easy solution. All that is required now is to add an nltk.txt which contains the library you need. But I did that and I am still getting the error above.

The official nltk website documents how to use the library in general and how to install it which isn’t helpful in the case of Heroku because Heroku seems to handle nltk differently.

Questioner
Angeles89
Viewed
0
Angeles89 2020-11-29 22:24:05

Eureka! I got it working. My problem was with the name of the nltk library download. I tried stoplist when the actual name is stopwords. Ha! The contents of my nltk.txt is now simply: stopwords. When I pushed to Heroku, the build succeeded and my website is now deployed and accessible on the web.

Special thanks goes out to @Darkknight for his patience and insight in the comment section of his answer.