With Scrapy you can crawl websites and extract their content, mainly text and images. Since it isn't possible to install Scrapy with sudo apt-get install scrapy, a recommended way is to install it with Anaconda, a Python distribution that lets you keep Scrapy inside its own virtual environment.
With these commands you can get Scrapy up and running on Ubuntu 16.04:
Download the latest Anaconda installer to your /tmp directory and start the installation. You can check whether there is a newer release here: https://repo.continuum.io/archive/
cd /tmp
curl -O https://repo.continuum.io/archive/Anaconda3-5.3.1-Linux-x86_64.sh
bash Anaconda3-5.3.1-Linux-x86_64.sh
Press Enter and type "yes" when required. When you are prompted for the directory to install Anaconda in, I personally change it to a directory name with a leading dot, which keeps it hidden, like this:
/home/YOUR_USER_NAME/.anaconda3
Make sure this line is added in ~/.bashrc:
# added by Anaconda3 4.4.0 installer
export PATH="/home/YOUR_USER_NAME/.anaconda3/bin:$PATH"
Make the added line take effect
source ~/.bashrc
Check Anaconda works
conda list
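If conda list reports "command not found", the PATH line has probably not taken effect. Here is a minimal sketch for checking that, assuming the install directory used above. The helper name path_contains is made up for this example.

```shell
# Hypothetical helper: succeed if the directory $1 is an entry in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumed install location from the step above; adjust to your own.
anaconda_bin="$HOME/.anaconda3/bin"

if path_contains "$anaconda_bin"; then
    echo "Anaconda bin directory is on PATH"
else
    echo "Anaconda bin directory is missing from PATH; check ~/.bashrc"
fi
```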
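Install Scrapy. The example output below comes from an activated environment named scrapy_june_2017, so create and activate the environment first if you want the same layout (see "Virtual environments" further down)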
conda install -c conda-forge scrapy
Check that scrapy is installed
scrapy --version
RESULT:
Scrapy 1.4.0 - no active project
See scrapy location
whereis scrapy
RESULT:
scrapy: /home/YOUR_USER_NAME/.anaconda3/envs/scrapy_june_2017/bin/scrapy
After a test run of Scrapy I got this error:
# Error: PIL missing
File "/home/YOUR_USER_NAME/.anaconda3/envs/scrapy_june_2017/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 15, in <module>
from PIL import Image
ModuleNotFoundError: No module named 'PIL'
Fix the error by installing pillow with conda, still inside the virtual environment
conda install pillow
Scrape with Scrapy
cd /home/project/scrapy/projectname/ && /home/YOUR_USER_NAME/.anaconda3/envs/scrapy_june_2017/bin/scrapy crawl my_spider -o /home/YOUR_USER_NAME/my_spider.csv -t csv --set=CLOSESPIDER_ITEMCOUNT=10 --set=CLOSESPIDER_TIMEOUT=500
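The one-line crawl command above is easier to read and adjust when it is split up. The following is only a sketch: the paths and spider name are the placeholders from the command above, and the run_crawl helper and its DRY_RUN switch are assumptions added here so the command can be inspected before it is actually run.

```shell
# Placeholders taken from the command above; replace with your own values.
project_dir="/home/project/scrapy/projectname"
scrapy_bin="/home/YOUR_USER_NAME/.anaconda3/envs/scrapy_june_2017/bin/scrapy"
spider="my_spider"
output="/home/YOUR_USER_NAME/my_spider.csv"

# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise run it from inside the project directory.
run_crawl() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        (cd "$project_dir" && "$@")
    fi
}

# -o/-t write the items to a CSV file; the two CLOSESPIDER settings stop
# the crawl after 10 scraped items or after 500 seconds.
run_crawl "$scrapy_bin" crawl "$spider" \
    -o "$output" -t csv \
    --set=CLOSESPIDER_ITEMCOUNT=10 \
    --set=CLOSESPIDER_TIMEOUT=500
```

Run it with DRY_RUN=0 set in the environment once the printed command looks right.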
Open a page in the Scrapy shell
scrapy shell https://example.com/test-page
Get the title of the page, that is, the text inside its H1 tags
response.xpath('//h1/text()').extract()
Exit Scrapy console with Ctrl+D
NOTE: When you install Anaconda it might install its own version of GLib, which takes over gsettings. This is quite annoying, since it can result in the following message when you try to use gsettings: "GLib-GIO-Message: Using the 'memory' GSettings backend. Your settings will not be saved or shared with other applications." A workaround is to call "/usr/bin/gsettings" directly to reach the original gsettings.
https://askubuntu.com/questions/916334/ubuntu-16-04-glib-gio-message-using-the-memory-gsettings-backend-your-settin/959346#959346
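One way to make this workaround stick is a small shell function in ~/.bashrc that always calls the system binary. This is a sketch that assumes the original gsettings lives at /usr/bin/gsettings, as in the note above.

```shell
# Route every gsettings call to the system binary instead of the copy
# that Anaconda placed earlier on PATH.
gsettings() {
    /usr/bin/gsettings "$@"
}
```

After sourcing ~/.bashrc, typing gsettings in that shell goes through the function rather than the Anaconda copy.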
Update Anaconda and packages
Run these two commands:
conda update conda
conda update anaconda
From: https://medium.com/@mauridb/how-to-check-your-anaconda-version-c092400c9978
Virtual environments
You can create a separate virtual environment for each project or set of packages
conda create --name scrapy_june_2017 python=3
Activate the virtual environment
source activate scrapy_june_2017
Deactivate the Anaconda virtual environment (no environment name is needed)
source deactivate
Remove (base) from your command line
If you want to remove (base) from your terminal, update your .bashrc file to use the old simpler format. Replace my_user_name with your own user name:
# Anaconda 4.4.0 config style
# added by Anaconda3 4.4.0 installer
export PATH="/home/my_user_name/.anaconda3/bin:$PATH"
Delete this bit:
# added by Anaconda3 5.3.1 installer
# >>> conda init >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$(CONDA_REPORT_ERRORS=false '/home/my_user_name/.anaconda3/bin/conda' shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "/home/my_user_name/.anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/my_user_name/.anaconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="/home/my_user_name/.anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda init <<<
https://stackoverflow.com/questions/51526503/why-does-base-appear-in-my-anaconda-command-prompt
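Deleting the managed block by hand is error-prone, so it can also be done with sed. The sketch below assumes the block is delimited by the "# >>> conda init >>>" and "# <<< conda init <<<" marker lines shown above. The function name strip_conda_init is made up for this example, and it keeps a backup copy next to the file.

```shell
# Hypothetical helper: delete the conda init block from the file given as $1.
strip_conda_init() {
    file="$1"
    cp "$file" "$file.bak" || return 1   # keep a backup first
    # Delete everything from the opening marker through the closing marker.
    sed '/^# >>> conda init >>>/,/^# <<< conda init <<</d' "$file.bak" > "$file"
}

# Usage (not run automatically):
#   strip_conda_init ~/.bashrc
```

The "# added by Anaconda3 5.3.1 installer" comment sits above the opening marker, so it is left in place and can be removed by hand.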
Uninstalling Anaconda
Uninstalling Anaconda is as easy as deleting the install folder:
rm -rf ~/.anaconda3
Remember to remove the Anaconda lines from your ~/.bashrc file, and source it again.
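To see what is left to clean up, you can list the lines that still mention the install directory. A small sketch, assuming the directory name .anaconda3 used throughout this post; find_anaconda_lines is a hypothetical helper name.

```shell
# Hypothetical helper: print every line of the file $1 that mentions
# anaconda3 (case-insensitive), prefixed with its line number.
find_anaconda_lines() {
    grep -n -i 'anaconda3' "$1" || true   # no matches is not an error
}

# Usage (not run automatically):
#   find_anaconda_lines ~/.bashrc
```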