Metadata-Version: 2.1
Name: refy
Version: 0.1
Summary: A scientific papers recomendation tool.
Home-page: https://github.com/FedeClaudi/refy
Author: Federico Claudi
License: UNKNOWN
Description: # refy
        A scientific papers recomendation tool.
        
        ## Overview
        `refy` leverages Natural Langual Processing (NLP) machine learning tools to find new papers that might be relevant given the ones that you've read already. 
        There's a few software tools out there that facilitate the exploration of scientific literature, including:
        * [meta.org](https://www.meta.org/) which allows users to set up feeds that identify newly published papers that seem relevant given a set of *keywords*
        * [inciteful](https://inciteful.xyz/) and [scite.ai](https://scite.ai/) let you explore the network of citations around a given paper of interest
        * [connected papers](https://www.connectedpapers.com/) let's you visualize papers related to a given paper of interest
        
        Most currently available software is limited in two key ways:
        1. Tools like [meta.org](https://www.meta.org/) rely on keywords, but keywords (e.g. computational neuroscience, Parkinso's Disease) are often overly general. As a result of that you have to sift through a lot irrelevant literature before you find something interesting
        2. Other tools like [connected papers](https://www.connectedpapers.com/) only work with one input paper at the time: you give it the title of a paper you've read and they give you suggestions. This is limiting: any software that can analyse **all papers you've read** can use a lot more information to find new papers that match more closely your interests.
        
        This is what `refy` is for: **`refy` analyzes the abstracts of several papers of yours and matches them agaist a database of almost ONE MILLION paper abstracts**. By using many papers at once `refy` has a lot more information at its disposal which (hopefully) means that it can better recomend relevant papers. More details about the database used by `refy` can be found at the bottom of this document. 
        
        >**Disclaimer:** The database used by `refy` is focused on neuroscience papers and preprints published in the last 30 years. If you are interested in older papers or work in a different field, please read the instructions below about how to adjust the database to your needs.
        
        ## Usage
        ### Installation
        If you have an environment with `python >= 3.6`, you can install `refy` with:
        ```
        pip install refy
        ```
        
        You will then need to install `tables` separately with `pip install tables`, if [you get an error](https://www.google.com/search?client=firefox-b-d&q=pip+install+tables+not+working) while doing that, a possible workaround is:
        ```
        conda install hdf5
        conda install pytables
        ```
        
        You can check if everything went okay with:
        ```
        refy example
        ```
        which should print something like:
        
        <p align="center">
        <img src="imgs/example.png" width=800px style='margin:auto'></img>
        </p>
        
        
        
        
        ### Usage
        Usage provides a Command Line Interface (CLI) to expose it's functionality directly in your terminal:
        <p align="center">
        <img src="imgs/helptxt.png" width=800px style='margin:auto'></img>
        </p>
        
        >**Note:** the first time you use `refy` it will have to download several files (which you can see [here](https://gin.g-node.org/FedeClaudi/refy/src/master/)) with data it needs to work. This should only take a few minuts and it will require about 3GB of memory/
        
        You can use `refy` in two modes:
        1. In `query` mode you can find papers relevant for a given input string (e.g. `locomotion mouse brainstem`)
        2. In `suggest` mode you give `refy` a [`.bib` bibtext](https://en.wikipedia.org/wiki/BibTeX) with metadata about as many publications as you want. `refy` will use this information to find papers relevant across all of your input papers
        
        For **query mode** you will use the command `refy query STRING` while for `suggest` you'd use `refy suggest PATH`. In both cases you can use optional arguments:
        ```shell
            -N INTEGER            number of recomendations to show  [default: 10]
            -since INTEGER        Only keep papers published after SINCE
            -to INTEGER           Only keep papers published before TO
            -save-path, --s TEXT  Save suggestions to file
            -debug, --d           set debug mode ON/OFF  [default: False]
        ```
        
        For example:
        ```shell
        refy query "locomotion control brainstem" --N 100 --since 2015 --to 2018 --s refs.csv
        ```
        Will show 100 suggested papers published between 2015 and 2018 and will save the results to `refs.csv`.
        
        >**Note:** in `suggest` mode, the content of your `.bib` file **must** include papers abstracts. **Only papers with abstracts** will be used for the analysis. Your entris should look like this:
        ```
        @ARTICLE{Claudi2020-tb,
          title    = "Brainrender. A python based software for visualisation of
                      neuroanatomical and morphological data",
          author   = "Claudi, Federico and Tyson, Adam L and Branco, Tiago",
          abstract = "Abstract Here we present brainrender, an open source python
                      package for rendering three-dimensional neuroanatomical data
                      aligned to the Allen Mouse Atlas. Brainrender can be used to
                      explore, visualise and compare data from publicly available
                      datasets (e.g. from the Mouse Light project from Janelia) as well
                      as data generated within individual laboratories. Brainrender
                      facilitates the exploration of neuroanatomical data with
                      three-dimensional renderings, aiding the design and
                      interpretation of experiments and the dissemination of anatomical
                      findings. Additionally, brainrender can also be used to generate
                      high-quality, publication-ready, figures for scientific
                      publications.",
          journal  = "Cold Spring Harbor Laboratory",
          pages    = "2020.02.23.961748",
          month    =  feb,
          year     =  2020,
          language = "en",
          doi      = "10.1101/2020.02.23.961748"
        }
        ``` 
        
        > **hint:** if you use reference managers like zotero or paperpile you can easily export bibtext data about your papers
        
        The **output** of `refy` comes in two ways:
        1. It will print to terminal a list of N recomended paper, sorted by their recomendation score. In addition to the papers' title and year of publication, a url is shown. Clickling on the url should open the paper's web page in your browser (if your terminal supports links)
        2. Optionally, `refy` can save the list of recomended paper to ` .csv` file so that you may explore these at your leasure.
        
        ### Scripting
        You can of course access all functionality through normal python scripts. For instance:
        ```python
        from refy import suggest
        
        rec = suggest('my_library.bib')
        print(rec.suggestions)
        ```
        
        ## Under the hood
        This section explains how `refy` works. If you just want to use `refy` and don't care about what happens under the hood then feel free to skip this. 
        
        `refy` uses NLP algorithms to estimate semantic similarity across papers based on the content of their abstracts. In particular, it uses [`Doc2Vec`](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) which is an adaptation of `Word2Vec`, a model that embeddings of words in which semantically similar words are closer in the embedding space than semantically dissimilar words are. `Dov2Vec` expands `Word2Vec` to learn vector embedding of entire documents.
        
        The `Doc2Vec` model used here is train on the entire corpus of almost one million papers. When it comes finding recomendations for your papers, `refy` uses `Doc2Vec` to create a vector representation of your paper and find the N closest vectors which, hopefully, are papers that are similar to yours. 
        
        This operation is repeated for each paper in your `.bib` file and then recomendations are pooled and scored: the papers that scored highest for the most number of input papers will be the most strongly recomended ones. 
        
        ## Database
        `refy` uses a curated database of about one million papers metadata to recomend literature for you. TJe data come from two sources:
        * [semantic scholar's Open Corpus](https://www.semanticscholar.org/paper/Construction-of-the-Literature-Graph-in-Semantic-Ammar-Groeneveld/649def34f8be52c8b66281af98ae884c09aef38b) 
        * [biorxiv](https://api.biorxiv.org/)
        
        Data from these two vasts databases (several million publications) are filtered to selectively keep neuoroscience and ML papers written in english and published in the last 30 years. This selection is necessary to keep the compute and memory requirements within reasonable bounds. 
        
        If **you wish to create your custom database**, these are the steps you'll need to follow:
        1. download the compressed data from [semantic scholar's Open Corpus](http://s2-public-api-prod.us-west-2.elasticbeanstalk.com/corpus/download) and save them in a folder. Note the entire database is >100GB in size even while compressed so it might take a while
        2. clone this repository with `git clone https://github.com/FedeClaudi/refy.git`
        3. in the shell `cd refy`
        4. within the cloned repository edit `refy/settings.py`. In it there are a few settings that apply to the creation of database (e.g. select papers based on the year of publication). Set these settings to values that match your need
        5. Install the edited version of `refy` with `pip install . -U`
        6. Create the edited database with `refy update_database FOLDER` where `FOLDER` is the path to where you saved the data downloaded from semantic scholar
        
        Once the database update has finished (should take <5 hours), you can re-train the `Doc2Vec` model to fit it to your database with `refy train` (use `refy train --help` to see the options available). This step will like require ~12 hours but the duration depend on the specs of your machine and the size of the database as you've created it. 
        
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows :: Windows 10
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Requires-Python: >3.6
Description-Content-Type: text/markdown
Provides-Extra: dev
