BvSalud PlanTL

INTRODCTION

BvSalud PlanTL is a project to download all articles from bvsalud, LILACS and IBECS and save into the MongoDB. As you will have noticed, they have millions of article. Downloading all articles, always, is a waste of time, when we can just download not existing in our DATABASE.

REQUERIMENT

$ cd BvSalud

$ pip install -r requirements.txt

If you are getting problem with any library while installing requirements, so you can install id individually

BeautifulSoup:

pip: $ pip install beautifulSoup
pymongo:

pip: $ pip install pymongo

TUTORIAL CRAWLER

If you are going to excecute this program for the first time or you don't have any data in MongoDB than you must pass the condition (argument) -m/--mode [all] and also a argument for output folder -o/--output folder_name. Otherwise, you must pass the condition (argument) -m/--mode [new].

All data download will be saved in the directory /home/user_name/bvSalud_downloads/sub_folder (sub_folder is the argument that you pass it (-o/--output [folder name])). If the folder already exits, it will delete it and make a new.

How to download:

$ git clone https://github.com/ankush13r/BvSalud.git

$ cd BvSalud

- First time:

$ python bvs -m/--mode 'all' -o/--output 'folder_name'

For first time you have to run it with the argument "all". It will download all articles to you MongoDb in DB bvs and collection training_collection_All, even it will create the collection training_collection_None_Indexed_t1 with all none_indexed articles.

Result (MongoDB):

bvs (DB):
- all_articles (Collection): This collection will contain all articles.
- none_indexed_t1 (Collection): This collection will just contain none indexed articles.
- errors_training (Collection): Save all errors occurred while saving.

- All next time:

$ python bvs -m/--mode 'new' -o/--output folder_name

The next rounds are just for new articles, not available in MongoDB. In these rounds program will compare all none indexed article if they have been indexed or there is any new article. After all it will save all new article directly and those that have been indexed will update in the collection all articles.

Make sure it's not the first time.

Result (MongoDB):

bvs (DB):
- all_articles (Collection): This collection will contain all articles, new arrived and old.
- none_indexed_t1 (Collection): This collection just containts old none indexed.
- none_indexed_t2 (Collection): This collection will contain all new and old none indexed.
- Update_info (Collection): This collection saves information about none indexed articles, if those are new or just have been updated to indexed, and also save the date.
- errors_training (Collection): Save all errors occurred while saving.

- JUST COMPARE:

If you already have none_indexed_t1 and none_indexed_t2, and just need to compare and update collection all_articles.

$ python bvs -m/--mode 'compare'

With mode "compare" you are just comparing time1 and time2 none indexed collection. And updating all articles collection depending on if there are any new article or modified.

- RESTART from the last failed to finish it:

If the program has been failed while dowinloading or saving your records or just want to restart from last time stopped.

$ python bvs -m/--mode 'must be same as last time' -o/--output 'must be same as last time' --restare

With argument --restart the program will find last stop and start from that point. Warning: -m/--mode and -o/--output must be

TUTORIAL testSet:

$ cd BvSalud

$ python tesSet.py -y 'year' -o output_file_name.json

The program is to make a json file for all none indexed articles. Those article that has been select, will be update by new row {selected: true}, in mongoDB, collection all_articles.

-y 'yyyy': all articles will be greater o equal than year received by argument.

-o 'output_file_name.json': A file_name to create a json file with article selected.

TUTORIAL goalSet:

$ cd BvSalud

$ python goalSet.py -y 'year' -o output_file_name.json

The program is to make a json file for all indexed articles that have {selected: true}.

-y 'yyyy': All articles will be greater o equal than year received by argument.
-o 'output_file_name.json': A file_name to create a json file with article selected.

TUTORIAL get goalSet information as abstract length and language format:

$   python3.7 count_abstrect_toquens.py -o output_file.json input_file1.json [input_file2.json] [+]

PIPILINE

For more information contact:* [email protected]

Name		Name	Last commit message	Last commit date
Latest commit History 231 Commits
BvSalud		BvSalud
Pubmet_training		Pubmet_training
match_mh		match_mh
select_all_before_selected		select_all_before_selected
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
__init__.py		__init__.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BvSalud PlanTL

INTRODCTION

REQUERIMENT

TUTORIAL CRAWLER

How to download:

- First time:

- All next time:

- JUST COMPARE:

- RESTART from the last failed to finish it:

TUTORIAL testSet:

TUTORIAL goalSet:

TUTORIAL get goalSet information as abstract length and language format:

PIPILINE

About

Releases

Packages

Languages

License

ankush13r/BvSalud

Folders and files

Latest commit

History

Repository files navigation

BvSalud PlanTL

INTRODCTION

REQUERIMENT

TUTORIAL CRAWLER

How to download:

- First time:

- All next time:

- JUST COMPARE:

- RESTART from the last failed to finish it:

TUTORIAL testSet:

TUTORIAL goalSet:

TUTORIAL get goalSet information as abstract length and language format:

PIPILINE

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages