Platform and infrastructure
This tool runs on Python 3.9.6 and Apache Spark 3.5.0. Installation methods vary depending on your operating system; please refer to the linked websites for more information. The only additional Python library needed is Matplotlib, which can be installed with pip: pip install matplotlib
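If in doubt, the installed versions can be checked from a Python shell. This is only a sanity-check sketch, assuming PySpark is already on the path:

import sys
import pyspark

print(sys.version)          # expect 3.9.x
print(pyspark.__version__)  # expect 3.5.0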
Programming models
The central programming model is PySpark's, combining the Resilient Distributed Dataset (RDD) and the DataFrame (DF). Using both together offers the widest range of possibilities for analysing big data: the DataFrame API for declarative, columnar queries, and the RDD API for row-level control.
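A minimal sketch of how the two models interoperate; the file path and column names here are assumptions for illustration, not the project's actual schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-df-example").getOrCreate()

# DataFrame side: declarative, columnar operations
df = spark.read.csv("stock_market_data/example.csv", header=True, inferSchema=True)
daily = df.select("Date", "Close")

# RDD side: drop down for row-level transformations when needed
rdd = daily.rdd.map(lambda row: (row["Date"], row["Close"]))
print(rdd.take(5))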
The application
The code is organised like a standard Python module, with a main file that is the only one executed by the user. The remaining files are separated according to their functionality, with three folders covering: all-stock performance over a given period of time, market analysis during historical events, and stock-specific analysis.
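The resulting layout looks roughly like the following; the folder names are illustrative, not the actual ones:

smModule/
    main.py              # entry point: menus and task dispatch
    period_performance/  # all-stock performance over a period
    historical_events/   # market analysis during historical events
    stock_analysis/      # stock-specific analysis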
We already covered the necessary environment setup in the previous section. The only thing left is to download the data: go to the GitHub page of the project and follow the link to the data archive in the README.md file. Unzip the archive and place the "stock_market_data" folder at the root of the software. Once everything is in place, navigate to the root of the software code and run: python3 smModule/main.py.
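A quick sanity check that the data landed in the right place could look like this (a hypothetical helper, not part of the tool):

from pathlib import Path

# The archive must be unzipped at the project root before launching the tool
assert Path("stock_market_data").is_dir(), "unzip the data archive at the project root first"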
You will then be guided by the interface through the menus to select which study of the market you want to conduct. Each Spark task runs in a subprocess and its logs are hidden for a better user experience.
The menus are written in the main.py file, whose task is to offer the interface through which the user chooses the options for their study; it then launches the relevant Spark task, as sketched below.
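A minimal sketch of this menu-and-dispatch pattern, with the subprocess's logs suppressed; the script paths and menu labels are assumptions, not the project's actual code:

import subprocess

TASKS = {
    "1": ("All-stock performance over a period", "smModule/period_performance/task.py"),
    "2": ("Market analysis during historical events", "smModule/historical_events/task.py"),
    "3": ("Stock-specific analysis", "smModule/stock_analysis/task.py"),
}

for key, (label, _) in TASKS.items():
    print(f"{key}. {label}")
choice = input("Select a study: ")

label, script = TASKS[choice]
# Run the Spark task in a subprocess with its console output hidden
subprocess.run(
    ["python3", script],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
    check=True,
)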
Speed tests
We tested on two machines, a MacBook Air M2 and a Lenovo Yoga Slim 7i Pro, running the General Market study in its best all-time performance scenario for both markets (2709 files to analyse):
* MacBook Air M2: 4min 27s
* Lenovo Yoga Slim 7i Pro: 4min 15s
Overheads and optimisations
The obvious first point for optimisation was to execute multiple calculations within the same file-reading loop, so that we do not iterate multiple times through the same files. However, there is still room for improvement, as we will see later.
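To illustrate the idea, several metrics can be computed in a single aggregation so that the files are scanned only once. This sketch assumes a PySpark DataFrame approach; the glob pattern and column names are guesses at the dataset layout:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("single-pass-metrics").getOrCreate()
df = spark.read.csv("stock_market_data/*/*.csv", header=True, inferSchema=True)

# One pass over the data yields all three metrics at once,
# instead of one full scan per metric
stats = df.agg(
    F.min("Low").alias("all_time_low"),
    F.max("High").alias("all_time_high"),
    F.avg("Volume").alias("avg_volume"),
)
stats.show()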