## I am starting on Ubuntu 19.10 ##
sudo mkdir stretch-armhf
sudo debootstrap --arch=armhf stretch stretch-armhf/ http://deb.debian.org/debian
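## note: on an x86_64 host, the armhf debootstrap/chroot needs QEMU user-mode emulation ##
## a hedged sketch: install these (and re-run debootstrap) if it fails with 'Exec format error' ##
sudo apt-get install qemu-user-static binfmt-support
## older binfmt setups also need the emulator copied into the chroot ##
sudo cp /usr/bin/qemu-arm-static stretch-armhf/usr/bin/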
sudo chroot stretch-armhf
mount -t proc proc /proc
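## optionally mount a couple more pseudo-filesystems; some package installs and builds expect them (a hedged aside) ##
mount -t sysfs sys /sys
mount -t devpts devpts /dev/pts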
## I would try to install vim here ##
apt-get install vim
## find out num of processors ##
nproc
## disable daemons ##
## /usr/sbin/policy-rc.d ##
vim /usr/sbin/policy-rc.d
## add the following two lines to the file, then save and exit ##
#!/bin/sh
exit 101
## make sure the file has the right permissions ##
sudo chmod 755 /usr/sbin/policy-rc.d
## install some programs here ##
apt-get install libblas3 libatlas-base-dev
apt-get install python3-dev python3-yaml python3-pillow python3-setuptools python3-numpy python3-cffi python3-wheel
## build python 3.7.3 ##
apt-get install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev wget
wget https://www.python.org/ftp/python/3.7.3/Python-3.7.3.tar.xz
tar xvf Python-3.7.3.tar.xz
cd Python-3.7.3
./configure --enable-optimizations
make
make altinstall
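## quick sanity check that the new interpreter and its pip are in place (a hedged aside) ##
python3.7 --version
python3.7 -m pip --version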
cd ..
## make git ##
wget https://github.com/git/git/archive/v2.25.0.tar.gz
## the doc/info targets below also need asciidoc, xmlto and docbook2x ##
apt-get install dh-autoreconf libcurl4-gnutls-dev libexpat1-dev gettext libz-dev libssl-dev asciidoc xmlto docbook2x
tar xvzf v2.25.0.tar.gz
cd git-2.25.0
make configure
./configure --prefix=/usr
make all doc info
make install install-doc install-html install-info
cd ..
alias python=python3.7
alias python3=python3.7
python3.7 -m pip install wheel pyyaml setuptools
git clone https://github.com/pytorch/pytorch --recursive
cd pytorch
git checkout v1.4.0
git submodule update --init --recursive
export NO_CUDA=1
export NO_DISTRIBUTED=1
export NO_MKLDNN=1
export BUILD_TEST=0 # for faster builds
export MAX_JOBS=8 # see nproc above
python3.7 setup.py bdist_wheel
## wait 5 to 10 hours ##
## this is where you find the wheel... ##
ls dist/
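## the wheel can then be copied to the Raspberry Pi and installed with pip ##
## a sketch; the exact wheel filename and the Pi's address are placeholders ##
scp dist/torch-*.whl pi@raspberrypi.local:/home/pi/
## then, on the Pi ##
python3.7 -m pip install /home/pi/torch-*.whl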
Friday, April 13, 2018
readme.md
This is the readme.md file from my latest project. https://github.com/radiodee1/awesome-chatbot
awesome-chatbot
A Keras or PyTorch implementation of a chatbot. The basic idea is to start by setting up your training environment as described below and then train with or without autoencoding. The inspiration for this project is the TensorFlow NMT project found at the following link: https://github.com/tensorflow/nmt Also inspiring was this tutorial: https://pythonprogramming.net/chatbot-deep-learning-python-tensorflow/ Finally, there was a great deep learning YouTube series from Siraj Raval. A link for that is here
Organization
The folders and files in the project are organized in the following manner. The root directory of the project is called
awesome-chatbot. In that folder are subfolders named data, model, bot, raw and saved. There are several script files in the main folder alongside the folders mentioned above. These scripts all have names that start with the word do_ so that when the files are listed by the computer the scripts all appear together. Below is a folder-by-folder breakdown of the project.
- data: This folder holds the training data that the model uses during the fit and predict operations. The contents of this folder are generally processed to some degree by the project scripts. This pre-processing is described below. This folder also holds the vocab files that the program uses for training and inference. The modified word embeddings are also located here.
- model: This folder holds the python code for the project. Though some of the setup scripts are also written in python, this folder holds the special python code that maintains the keras model. This model is the framework for the neural network that is at the center of this project. There are also two setup scripts in this folder.
- bot: This folder is the home of programs that are meant to help the chatbot run. This includes speech-to-text code and speech-recognition code. Ultimately this directory will be the home of a loop of code that monitors audio input from a microphone and decides what to do with it.
- raw: This folder holds the raw downloads that are manipulated by the setup scripts. These include the GloVe vectors and the Reddit Comments download.
- saved: This folder holds the saved values from the training process.
Description of the individual setup scripts is included below.
Suggested Reading - Acknowledgements
- Some basic material on sequence to sequence NMT models came from these sources. The first link is to Jason Brownlee's masterful blog series. The second is to Francois Chollet's Keras blog.
- Material specifically regarding attention decoders, along with a special hand-written Keras layer designed just for that purpose. The author of the layer is Zafarali Ahmed. The code was designed for an earlier version of Keras and TensorFlow. Zafarali's software is provided under the 'GNU Affero General Public License v3.0'.
- Pytorch code was originally written by Sean Robertson for the Pytorch demo and example site. He uses the MIT license.
- Additional Pytorch code was written by Austin Jacobson. A link to his NMT project is included here. He uses the MIT license.
GloVe and W2V Word Embeddings Download
- This link brings you to a page where you can download W2V embeddings that google makes available. At the time of this writing this project does not use w2v embeddings, but uses GloVe instead.
- This link starts a download of the GloVe vectors in the glove.6B collection. The download takes a while and uses 823M.
REDDIT Download
- This link starts a download that takes several hours for the Reddit Comments file from November of 2017. The file is several gigabytes.
Scripts For Setup
Here is a list of the setup scripts, with a description of each and, where relevant, its location. You must execute them in order (a combined example run is sketched after the list). It is recommended that you install all the packages in the requirements.txt file. You can do this with the command pip3 install -r requirements.txt
- do_make_glove_download.sh: This script is located in the root folder of the repository. It takes no arguments. Execute this command and the GloVe word embeddings will be downloaded to your computer. This download could take several minutes. The file is found in the raw folder. In order to continue to later steps you must unpack the file: in the raw directory, execute the command unzip glove.6B.zip.
- do_make_reddit_download.sh: This script is located in the root folder of the repository. It takes no arguments. Execute this command and the Reddit Comments JSON file will be downloaded to your computer. This download could take several hours and requires several gigabytes of space. The file is found in the raw folder. In order to continue to later steps you must unpack the file: in the raw directory, execute the command bunzip2 RC_2017-11.bz2. Unzipping this file takes hours and consumes 30 to 50 gigabytes of space on your hard drive.
- do_make_db_from_reddit.py: This script is located in the root folder of the repository. It takes one argument, the location of the unpacked Reddit Comments JSON file. Typically you would execute the command as ./do_make_db_from_reddit.py raw/RC_2017-11. Executing this file takes several hours and outputs a sqlite database called input.db in the root directory of your repository. There should be 5.9 million paired rows of comments in the final db file. You can move the file or rename it for convenience. I typically put it in the raw folder. This python script uses sqlite3.
- do_make_train_test_from_db.py: This file is not located in the root folder of the repository. It is in the subfolder that the model.py file is found in. Execute this file with one argument, the location of the input.db file. The script takes several hours and creates many files in the data folder that the model.py file will later use for training. These data files are also used to create the vocabulary files that are essential for the model.
- do_make_vocab.py: This file is located in the same directory as do_make_train_test_from_db.py. It takes no arguments. It finds the most popular words in the training files and makes them into a vocabulary list of the size specified by the settings.py file. It also adds a token for unknown words and for the start and end of each sentence. If word embeddings are enabled, it will prepare the word embeddings from the GloVe download. The GloVe download does not include contractions, so if it is used no contractions will appear in the vocab.big.txt file. The embeddings can be disabled by specifying 'None' for embed_size in the model/settings.py file. Embeddings can be enabled with some versions of the keras model. The pytorch model is to be used without pre-set embeddings. This script could take hours to run. It puts its vocabulary list in the data folder, along with a modified GloVe word embeddings file.
- do_make_rename_train.sh: This file should be called once after the data folder is set up to create some important symbolic links that will allow the model.py file to find the training data. If your computer has limited resources this script can be called with a single integer, n, as the first argument. This sets up the symbolic links to point the model.py file at the nth training file. It should be noted that there are about 80 training files from the RC_2017-11 download, but these training files are simply copies of the larger training files, called train.big.from and train.big.to, split up into smaller pieces. When strung together they are identical to the bigger files. If your computer can use the bigger files it is recommended that you do so. If you are going to use the larger files, call the script without any arguments. If you are going to use the smaller files, call the script with the number associated with the file you are interested in. That call would look like this: ./do_make_rename_train.sh 1
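Taken together, a full data-preparation pass might look like the following sketch. It assumes the repository root as the working directory, the default file names from above, and that the two database/vocab scripts live in the model folder; adjust paths to match your checkout.
./do_make_glove_download.sh
./do_make_reddit_download.sh
( cd raw && unzip glove.6B.zip && bunzip2 RC_2017-11.bz2 )
./do_make_db_from_reddit.py raw/RC_2017-11
mv input.db raw/
./model/do_make_train_test_from_db.py raw/input.db
./model/do_make_vocab.py
./do_make_rename_train.sh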
Scripts For Train - do_launch_model.sh
This is a script for running the model.py python file located in the model folder. There are several command-line options available for the script. Type ./do_launch_model.sh --help to see them all. Some options are listed below (an example invocation follows the list). There is also a do_launch_pytorch.sh file. It works with similar command-line options.
- --help: This prints the help text for the program.
- --mode=MODENAME: This sets the mode for the program. It can be one of the following:
  - train: This is for training the model for one pass of the selected training file.
  - long: This is for training the model for several epochs on the selected training files. It is the preferred method for doing extended training.
  - infer: This just runs the program's infer method once so that the state of the model's training might be determined from observation.
  - review: This loads all the saved model files and performs an infer on each of them in order. This way if you have several training files you can choose the best.
  - interactive: This allows for interactive input with the predict part of the program.
- --printable=STRING: This parameter allows you to set a string that is printed on the screen with every call of the fit function. It allows the do_launch_series_model.py script to inform the user what stage training is at, if for example the user looks at the screen between the switching of input files. (See the description of do_launch_series_model.py below.)
- --basename=NAME: This allows you to specify what filename to use when the program loads a saved model file. This is useful if you want to load a filename that is different from the filename specified in the settings.py file. This parameter only sets the basename.
- --autoencode=FLOAT: This option turns on autoencoding during training. It overrides the model/settings.py hyperparameter. 0.0 is no autoencoding and 1.0 is total autoencoding.
- --train-all: This option overrides the settings.py option that dictates when the embeddings layer is modified during training. It can be used on a saved model that was created with embedding training disabled.
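For example, an extended training run with partial autoencoding, followed by an interactive session, might look like this (the values are illustrative, not recommendations):
./do_launch_model.sh --mode=long --autoencode=0.5 --printable='long run'
./do_launch_model.sh --mode=interactive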
If you are running the pytorch model, the model will save your last position in the training corpus file whenever it saves the weights. If you want to erase this position and start over in the training file you can erase the 'saved/basename.best.pth.tar' file. (You can also set the 'zero_start' option to True.) This removes the old weights also. To get them back rename the highest saved weights file (or any one of your choosing) to 'basename.best.pth.tar'.
Scripts For Train - do_launch_series_model.py
This script is not needed if your computer will run the --mode=long parameter mentioned above for the do_launch_model.sh script. If your computer has limited memory or you need to train the models in smaller batches you can use this script. It takes no arguments initially. It goes through the training files in the data folder and runs the training program on them one at a time. There are two optional parameters for this script that allow you to specify the number of training files that are saved, and also the number of epochs you want the program to perform.
Hyper-parameters - model/settings.py
This file is for additional parameters that can be set using a text editor before the do_launch_model.sh file is run.
- save_dir: This is the relative path to the directory where model files are saved.
- data_dir: This is the relative path to the directory where training and testing data are saved.
- embed_name: This is the name of the embed file that is found in the data folder.
- vocab_name: This is the name of the primary vocabulary list file. It is found in the data folder.
- test_name: This is the name of the test file. It is not used presently.
- test_size: This is the size of the test file in lines. It is not used.
- train_name: This is the name of the train file. It is the 'base' name so it doesn't include the file ending.
- src_ending: This is the filename ending for the source test and training files.
- tgt_ending: This is the filename ending for the target test and training files.
- base_filename: This is the base filename used when the program saves the network weights and biases.
- base_file_num: This is a number that is part of the final filename for the saved weights from the network.
- num_vocab_total: This number is the size of the vocabulary. It is also read by the do_make_vocab.py file. It can only be changed when the vocabulary is being created, before training.
- batch_size: Training batch size. May be replaced by batch_constant.
- steps_to_stats: Number representing how many times the fit method is called before the stats are printed to the screen.
- epochs: Number of training epochs.
- embed_size: Dimensionality of the basic word vector length. Each word is represented by a vector of numbers and this vector is as long as embed_size. This can only take certain values. The GloVe download, mentioned above, has word embeddings in only certain sizes. These sizes are: None, 50, 100, 200, and 300. If 'None' is specified then the GloVe vectors are not used. Note: GloVe vectors do not contain contractions, so contractions do not appear in the generated vocabulary files if embed_size is not None.
- embed_train: This is a True/False parameter that determines whether the model will allow the loaded word vector values to be modified at the time of training.
- autoencode: This is a True/False parameter that determines whether the model is set up for regular encoding or autoencoding during the training phase.
- infer_repeat: This parameter is a number higher than zero that determines how many times the program will run the infer method when stats are being printed.
- embed_mode: This is a string. Accepted values are 'mod' and 'normal' and only the keras model is affected. This originally allowed the development of code that used different testing scenarios. 'mod' is not supported at the time of this writing. Use 'normal' at all times.
- dense_activation: There is a dense layer in the model and this parameter tells that layer how to perform its activations. If the value None or 'none' is passed to the program the dense layer is skipped entirely. The value 'softmax' was used initially but produced poor results. The value 'tanh' produces some reasonable results.
- sol: This is the symbol used for the 'start of line' token.
- eol: This is the symbol used for the 'end of line' token.
- unk: This is the symbol used for the 'unknown word' token.
- units: This is the initial value for hidden units in the first LSTM cell in the keras model. In the pytorch model this is the hidden units value used by both the encoder and the decoder. For the pytorch model GRU cells are used.
- layers: This is the number of layers for both the encoder and decoder in the pytorch model.
- learning_rate: This is the learning rate for the 'adam' optimizer. In the pytorch model SGD is used.
- tokens_per_sentence: This is the number of tokens per sentence.
- batch_constant: This number serves as a batch size parameter.
- teacher_forcing_ratio: This number tells the pytorch version of the model exactly how often to use teacher forcing during training.
- dropout: This number tells the pytorch version of the model how much dropout to use.
- pytorch_embed_size: This number tells the pytorch model how big to make the embedding vector.
- zero_start: True/False variable that tells the pytorch model to start at the beginning of the training corpus files every time the program is restarted. Overrides the saved line number that allows the pytorch model to start training where it left off after each restart.
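Since these values live in model/settings.py, a quick way to glance at a few of them before a run is simply to search the file (a hedged sketch, assuming the names above appear verbatim in the file):
grep -E 'embed_size|units|layers|learning_rate|batch_constant' model/settings.py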
Raspberry Pi and Speech Recognition
The goal of this part of the project is to provide comprehensive speech-to-text and text-to-speech for the chatbot when it is installed on a Raspberry Pi. For this purpose we use the excellent Google API. The Google 'Cloud Speech API' costs money to operate. If you want to use it you must sign up for Google Cloud services and enable the Speech API for the project. This document will attempt to direct a developer in how to set up the account, but may not go into intimate detail. Use this document as a guide, but not necessarily the last word. After everything is set up, the project will require internet access to perform speech recognition.
PyTorch
An important part of porting this project to the Raspberry Pi is compiling PyTorch for the Pi. At the time of this writing, compiling PyTorch is possible by following the URLs below. You do not need to compile PyTorch before you test the speech recognition, but it is required for later steps.
- http://book.duckietown.org/master/duckiebook/pytorch_install.html
- https://gist.github.com/fgolemo/b973a3fa1aaa67ac61c480ae8440e754
Speech Recognition -- Google
The Google Cloud API is complicated and not all of the things you need to do are covered in this document. I will be as detailed as I can. The basic idea is to install the software on a regular computer to establish your account and permissions. You will need to create a special json authentication file and tell google where to find it on your computer. Then install as much software as possible on the Raspberry Pi along with another special authentication json file. This second file will refer to the same account and will allow google to charge you normally as it would for a regular x86 or x86_64 computer. The speech recognition code in this project should run on the regular computer before you proceed to testing it on the Raspberry Pi.
Install all the recommended python packages on both computers and make sure they install without error. This includes gtts, google-api-python-client, and google-cloud-speech. Install the Google Cloud SDK on the regular computer. The following link shows where to download the SDK.
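A minimal sketch of those installs, run on both the regular computer and the Pi (package names as listed above; pip3 assumed to be present):
pip3 install gtts google-api-python-client google-cloud-speech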
Resources
You may need to set up a billing account with Google for yourself. Here are some resources for using the Google Cloud Platform.
- https://cloud.google.com/sdk/docs/quickstart-linux See this url for details.
- https://cloud.google.com/speech/docs/quickstart See this location for more google setup info.
- https://console.cloud.google.com/apis/ Try this url and see if it works for you. If you see a dashboard where you can manipulate your google cloud account you are ready to proceed. You want to enable 'Cloud Speech API'.
Steps for the cloud
- Use the Google Cloud Platform Console to create a project and download a project json file.
  - Set up a Google Cloud Platform account and project. For a project name I used awesome-sr.
  - Before downloading the json file, make sure the 'Cloud Speech API' is enabled.
- Download and install the Google-Cloud-Sdk. This package has the gcloud command.
  - This download includes the google-cloud-sdk file. Unpack it, and execute the command ./google-cloud-sdk/install.sh
  - You must also restart your terminal.
- I put my project json file in a directory called /home/<myname>/bin.
- Use the gcloud command to set up your authentication. I used the following: gcloud auth activate-service-account --key-file=bin/awesome-sr-*.json
- Use the Google Cloud Platform Console to create a second project json file for the Raspberry Pi. Go to the Downloads folder and identify the Raspberry Pi json file. Transfer the file to the Pi with a command like scp.
- Finally you must set up a bash shell variable for both json files so that google can find the json files when you want to do speech recognition. The process for setting up this shell variable is outlined below.
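As a rough illustration of the transfer step (the json filename and the Pi's hostname are placeholders; the Pi's file will have its own hexadecimal suffix):
scp ~/Downloads/awesome-sr-XXXXXX.json pi@raspberrypi.local:/opt/bot/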
Test google speech recognition with the bot/game_sr.py script. The script may be helpful at different times to tell if your setup attempt is working. To execute the script, switch to the bot/ folder and execute the command python3 game_sr.py.
Setup Bash Variable
- This guide assumes you are using a linux computer. It also assumes that if you downloaded the json file from the internet and it was stored in your Downloads folder, you have moved it to the root of your home directory.
- For convenience I made a folder in my home directory called bin. This will be the folder for the json file on my regular computer.
- On the Raspberry Pi I navigated to the /opt directory and made a folder called bot. I placed the json file at /opt/bot/.
- For simplicity I will refer to the json file on my regular computer as awesome-sr-XXXXXX.json. In this scheme awesome-sr is the name of my project and XXXXXX is the hexadecimal number that google appends to the json file name. Because this name is long and the hex digits are hard to type, I will copy and paste them when possible as I set up the Bash shell variable.
- Edit the .bashrc file with your favorite editor.
- Add the following as the last line of the .bashrc file: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/json/awesome-sr-XXXXXX.json A link follows that might be helpful: https://cloud.google.com/docs/authentication/getting-started#setting_the_environment_variable
- Save the changes.
- You must exit and re-enter the bash shell in a new terminal for the changes to take effect. After that you should be able to run the game_sr.py file. You will be charged for the service.
- On the Raspberry Pi use the same general technique as above. Edit the .bashrc file to contain the line export GOOGLE_APPLICATION_CREDENTIALS=/opt/bot/awesome-sr-XXXXXX.json where XXXXXX is the hexadecimal label on the json file on the Raspberry Pi. This number will be different from the one on your regular computer.
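A quick way to confirm the variable is visible in a new terminal, on either machine, before running the recognition script:
echo $GOOGLE_APPLICATION_CREDENTIALS
ls -l "$GOOGLE_APPLICATION_CREDENTIALS"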