Make computational (Ph.D.) research reproducible and widespread with R packages

Neeraj Dhanraj
6 min read · Jun 8, 2020

Importance of making research reproducible

How to make research reproducible

Impact of R packages on computational research

A nation is known by its research contributions, and the whole world is moving toward research that benefits society. The quality of research matters more than its quantity, and therefore the impact of research on society matters more than the number of articles published. In this post, I share my observations on computational research and the possibilities it offers.

All over the globe, researchers are working toward their Ph.D. or in other research roles. Thanks to advances in computational capabilities, many of them work on computational research in their respective domains, and many are proposing and publishing excellent computational contributions. But what will be the future of these contributions?

There are several possibilities:

1. For such contributions, researchers earn rewards such as a Ph.D. degree or salary incentives.

2. Further, researchers build a strong research profile and CV.

3. Publishing venues (conferences, journals) get new articles to publish and a wider audience.

4. Universities and institutes gain in reputation and attract new projects, funds, and responsibilities.

5. But what lies beyond all this if the contributed research is not reproducible?

Computational research is mostly categorized as applied research, where various problems are handled and solved with computational methodologies. The literature contains many findings based on excellent computational methodologies that nevertheless failed to attract researchers to build on them, simply because the published research is difficult to reproduce. As a researcher, I feel it is the author’s responsibility to make the research reproducible. It benefits not only those who follow the research but also the authors themselves. I would like to demonstrate this with my own experience.

I hold a Ph.D. in a data science topic, during which I worked on various methods for short-term wind energy forecasting. While working on my research topic, I came across a very interesting article proposing a forecasting method, the Pattern Sequence-based Forecasting (PSF) method, which was used to forecast electricity prices. (It is available here.) The method caught my attention because of its framework and the promising results published in the article. However, it was challenging to use the method in my own research: it was newly published, and the corresponding code was not made available by its authors. So I was left with only one option: to attempt to reproduce the forecasting methodology myself.

As a data science researcher, I had two options for reproducing the method: R and Python. Both are open-source tools and popular for data analysis. I preferred to go with R, and it benefited me in many ways, as discussed in later paragraphs.

I broke the methodology down into several sub-parts and coded them as individual functions. Finally, I integrated all the functions into a tentative function representing the PSF method. Fortunately, it worked, but the output was far from expectations. It took several days to fine-tune the function and obtain acceptable results. Eventually, I succeeded in achieving and publishing outstanding wind energy forecasting results with the method. But that was not the end!

The methodology was excellent and could be used in several research domains, so I decided to make it reproducible for other researchers. I contacted the original authors of the PSF methodology and developed an R package for the PSF method in collaboration with them (a short usage sketch follows the list below). The details of the R package are here. I can see how this package has helped me in several ways:

1. The R package is hosted on the Comprehensive R Archive Network (CRAN), a highly reputed repository in the data research community.

2. This eventually led to the spread of the method across the R community.

3. I drafted an introduction and demonstration of the R package and published it in The R Journal, a well-indexed journal run by the R community. The details of the research paper are here.

4. Now the methodology is within the reach of many Ph.D. researchers, data scientists, data analysts, academicians, economists, and data enthusiasts.

5. These researchers are using the package, and I am earning citations as well as acknowledgments.

6. Through this package, I came across many researchers working in similar domains and got opportunities to collaborate with some of them.

7. The availability of the package made it easy to update and hybridize the PSF methodology, making it more robust and applicable in several domains.

8. Due to the good response to the package, it was noted in the CRAN Task View reports published by the R community (here), and the package received wide recognition in the research community.
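For readers who want to try the method, below is a minimal usage sketch of how I would expect the PSF package to be called from CRAN; the exact argument names may differ across package versions, and the built-in nottem series is used only as a convenient seasonal example.

# Minimal usage sketch of the PSF package (argument names may vary by version).
# install.packages("PSF")
library(PSF)

# nottem: monthly air temperatures at Nottingham, a built-in seasonal series.
model <- psf(nottem, cycle = 12)       # fit a PSF model; k and w are tuned internally
preds <- predict(model, n.ahead = 12)  # forecast the next 12 months
plot(model, preds)                     # visualize the forecast against the series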

[Figure: An R package description]

As I mentioned earlier, I found R more appropriate than Python for the following reasons:

1. R is more convenient than Python for prototyping. Hence, I recommend R if you are a researcher or an academician. Indeed, several data scientists first develop their models and methods in R and then convert them to Python if required.

2. R provides a great facility for hosting packages, the CRAN repository, where like-minded people can easily follow and collaborate.

3. Each R package is designed with proper syntax and documentation (including vignettes and reference manuals), so R packages are very easy to understand and follow (see the sketch after this list). Such documentation is not mandatory for Python packages.

4. Before an R package is hosted in the CRAN repository, it undergoes scrutiny from experts until the authors address all errors. This ensures that errors and warnings are removed from the packages. No comparable service exists for Python.

5. In R, it is easy to trace which other packages depend on yours.

6. R packages with higher influence and downloads get recognized in CRAN Task Views. Such documents are missing in the Python community.

7. The R community cares about researchers. For useful packages that are well accepted by the R community, The R Journal offers the opportunity to publish an introductory report on the package in research journal format; the journal is hosted by the R community itself. (Even if it is not published in the journal, an R package is a citable entity and can be cited in publications.) There is no such facility from the Python community.

8. The R mailing lists are another helpful feature hosted by the R community, allowing R users to share updates to their packages within the community.

9. Excellent visualization tools, R Markdown, and Shiny apps add further advantages to using R.
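To illustrate points 3, 4, and 5 above with a concrete (and purely hypothetical) example, the sketch below shows a small function documented with roxygen2 comments, followed by the devtools calls that generate its reference manual and run the same style of checks CRAN applies; the function and package names are illustrative only.

#' Root mean squared error
#'
#' Computes the RMSE between observed and predicted values.
#'
#' @param actual Numeric vector of observed values.
#' @param predicted Numeric vector of predicted values.
#' @return A single numeric value, the RMSE.
#' @examples
#' rmse(c(1, 2, 3), c(1.1, 1.9, 3.2))
#' @export
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Typical development workflow before a CRAN submission:
# devtools::document()                # build .Rd reference pages from the roxygen comments
# devtools::check()                   # run CRAN-style checks (errors, warnings, notes)
# tools::dependsOnPkgs("yourPackage") # list packages that reverse-depend on yours (point 5)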

[Figure: A package published in The R Journal]

Coming back to reproducible research: after publishing the R package for the PSF method, I found there was scope to automate the procedure of performance evaluation. Performance evaluation means a comparative study of the proposed method against state-of-the-art ones. For example, when comparing prediction methods, the time series dataset is segmented into two parts, training and testing; models are fitted on the training part, and the predictive performance of the methods is measured on the testing part. There are several error metrics (such as RMSE, MAE, and MAPE) and comparison strategies (such as recursive, direct, and Monte Carlo), so many combinations are possible in a performance evaluation. It is also very difficult to replicate such studies and reproduce the results with minimal effort.
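To make the procedure concrete, here is a minimal base-R sketch of one such evaluation (not the ForecastTB implementation itself): the series is split into training and testing parts, two candidate methods from the forecast package are fitted on the training part, and their test-set errors are compared with RMSE, MAE, and MAPE.

# Minimal sketch of a train/test performance evaluation (illustrative only).
library(forecast)                            # provides auto.arima() and ets()

series <- AirPassengers                      # built-in monthly series, 1949-1960
train  <- window(series, end = c(1959, 12))  # fit on 1949-1959
test   <- window(series, start = c(1960, 1)) # evaluate on 1960 (12 observations)

# Fit two candidate methods on the training part only.
fc_arima <- forecast(auto.arima(train), h = length(test))$mean
fc_ets   <- forecast(ets(train), h = length(test))$mean

# Common error metrics used for the comparison.
rmse <- function(a, p) sqrt(mean((a - p)^2))
mae  <- function(a, p) mean(abs(a - p))
mape <- function(a, p) mean(abs((a - p) / a)) * 100

data.frame(method = c("ARIMA", "ETS"),
           RMSE = c(rmse(test, fc_arima), rmse(test, fc_ets)),
           MAE  = c(mae(test, fc_arima),  mae(test, fc_ets)),
           MAPE = c(mape(test, fc_arima), mape(test, fc_ets)))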

To reduce this effort, I automated the procedure of performance evaluation for time series forecasting and missing value imputation methods. ForecastTB and imputeTestbench are packages for comparing the performance of time series forecasting and missing value imputation methods, respectively. Again, these packages are now published in reputed journals (publication links: ForecastTB and imputeTestbench), and they always help me carry out a quick comparative study of newly proposed methods. A few simple lines of code give me a detailed and reproducible comparison report along with great visualizations. It also becomes convenient to share the reproducible code for the results among research collaborators.

My point is to convey the importance of documenting methodologies and procedures in the form of R packages (or other forms) and making computational research reproducible for the benefit of society. For any further details, feel free to comment on this post.

Author:

Dr. Neeraj Dhanraj Bokde,

Postdoctoral Researcher,

Aarhus University, Denmark

https://www.researchgate.net/profile/Neeraj_Bokde
