The U.S. stock market is one of the oldest and largest financial markets in the world. The New York Stock Exchange (NYSE) traces its roots back to 1792 and, together with its newer counterpart, the NASDAQ, holds an immense quantity of data about the history of the U.S. financial sector, industrial progress, and the general economy. For people who are not involved in or interested in economics, and in market data in particular, it can be quite difficult to read and understand the relevant data.
The history of the stock markets is therefore a very large dataset, and analyzing it efficiently requires modern technology such as Big Data processing. With Big Data tools we can extract valuable information from raw data and use it to analyze historical events and stock market performance over the years, and to connect both to modern world history in general. The historical data covers a multitude of variables and dimensions. Big Data processing makes it possible to analyze this dataset quickly, distill the relevant information, and uncover hidden patterns.
Development of a User-Friendly Interface:
In our endeavor to make the vast landscape of the U.S. stock market accessible to everyone, we have prioritized the creation of a user-friendly interface. Our goal is to empower individuals, regardless of their level of expertise in finance, with the ability to explore and understand the market's historical data comprehensively. The user interface serves as a window into the complexities of the stock market, presenting relevant data in an intuitive and visually appealing manner.
Efficient Data Processing with PySpark:
To support the user interface, we need a robust and efficient backend. Our code uses PySpark and its associated libraries to read, process, and analyze the extensive datasets from both the NYSE and NASDAQ. PySpark distributes the processing across multiple workers, which lets the pipeline scale to large datasets and makes it a good fit for our Big Data project.
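As an illustration of the reading step, the following is a minimal sketch rather than the project's actual code. It assumes that each exchange folder contains one CSV file per traded asset, that the file name is the ticker symbol, and that both folders share the same columns. The function names, the `ticker` and `exchange` columns, and the paths in the usage comment are hypothetical.

```python
import re

# Pattern that isolates the file name (without extension) at the end of a path.
# The same pattern is handed to Spark's regexp_extract below.
TICKER_PATTERN = r"([^/]+)\.csv$"


def ticker_from_path(path: str) -> str:
    """Plain-Python check of the pattern: '.../AAPL.csv' -> 'AAPL', '' if no match."""
    match = re.search(TICKER_PATTERN, path)
    return match.group(1).upper() if match else ""


def load_exchange(spark, folder: str, exchange: str):
    """Read every per-asset CSV under `folder` into a single Spark DataFrame."""
    # Imported lazily so the helper above can be used without a Spark install.
    from pyspark.sql import functions as F

    df = (
        spark.read
        .option("header", True)       # first line of each file holds column names
        .option("inferSchema", True)  # convenient, but costs an extra pass over the data
        .csv(f"{folder}/*.csv")
    )
    # The rows are assumed not to carry the ticker themselves, so it is
    # recovered from the name of the file each row was read from.
    return (
        df.withColumn("source_file", F.input_file_name())
          .withColumn("ticker", F.upper(F.regexp_extract("source_file", TICKER_PATTERN, 1)))
          .withColumn("exchange", F.lit(exchange))
          .drop("source_file")
    )


def load_markets(spark, nyse_dir: str, nasdaq_dir: str):
    """Combine both exchanges; unionByName matches columns by name, not position."""
    nyse = load_exchange(spark, nyse_dir, "NYSE")
    nasdaq = load_exchange(spark, nasdaq_dir, "NASDAQ")
    return nyse.unionByName(nasdaq)


# Hypothetical usage (paths are placeholders, not the dataset's real layout):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("stock-history").getOrCreate()
#   markets = load_markets(spark, "data/nyse", "data/nasdaq")
#   markets.printSchema()
```

For a dataset of this size, an explicit schema would likely be preferable to `inferSchema`, since inference requires Spark to scan the files once more before reading them.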
Differentiating from Existing Work:
While we acknowledge the existing body of work on Big Data applications in the U.S. stock market, our project differs in its primary focus. Most existing work concentrates on forecasting market trends and serves professional or well-informed users with intricate analyses. In contrast, we emphasize inclusivity and simplicity, and we focus on the analysis of past events as an educational tool.
The Dataset Source:
The dataset we use comes from www.kaggle.com, where a user has shared a comprehensive dataset containing records of both the New York Stock Exchange (NYSE) and NASDAQ since their inception. The original contributor used the Yahoo Finance API to extract this dataset. The dataset contains four folders, but we will focus on two of them, NYSE and NASDAQ, as they are directly relevant to our project objectives. We will not use the S&P 500 and Forbes 2000 folders because of their limited relevance.
Dataset Composition:
Each of the chosen folders, NYSE and NASDAQ, contains detailed records of traded assets. The format of the dataset is standardized, with each traded asset having corresponding CSV and JSON files. For our purposes, we will be utilizing the CSV format. Each line within these files represents a trading day and includes columns providing essential stock details such as opening price, closing price, volume, and more. Collectively, the data contained in the NYSE and NASDAQ folders amounts to 6.52 gigabytes.
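To make the file layout concrete, here is a small sketch of how one asset's CSV could be turned into one record per trading day. The column names (`Date`, `Open`, `Close`, `Volume`), the ISO date format, and the sample rows are assumptions made for illustration only; the actual files may use different headers and contain more columns.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime
from typing import List


@dataclass(frozen=True)
class TradingDay:
    """One line of a per-asset CSV file: a single trading day."""
    day: date
    open: float
    close: float
    volume: int


def parse_trading_days(csv_text: str, date_format: str = "%Y-%m-%d") -> List[TradingDay]:
    """Parse CSV text with a header row into a list of TradingDay records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    days = []
    for row in reader:
        days.append(TradingDay(
            day=datetime.strptime(row["Date"], date_format).date(),
            open=float(row["Open"]),
            close=float(row["Close"]),
            # Volume may be written as "1000" or "1000.0"; both are accepted.
            volume=int(float(row["Volume"])),
        ))
    return days


# Made-up sample rows, not taken from the dataset.
SAMPLE = """Date,Open,Close,Volume
2020-01-02,10.0,10.5,1000
2020-01-03,10.5,10.2,1500
"""

for trading_day in parse_trading_days(SAMPLE):
    print(trading_day)
```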
Inclusion of Dow Jones Index Data:
To enhance our understanding and provide a broader context to our analyses, we have decided to incorporate data from the Dow Jones Index. The Dow Jones Index, a stock market indicator comprising 30 prominent companies listed on U.S. stock exchanges, serves as a valuable benchmark. This additional dataset, obtained from [source], is presented in CSV format.
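One way the index could serve as a benchmark is to line up a stock's closing prices with the index's closing values on the dates both have in common. The sketch below shows that alignment in plain Python. The function name and the dictionary-based inputs are hypothetical, and the values in the test data are invented.

```python
from datetime import date
from typing import Dict, List, Tuple


def align_with_index(
    stock_close: Dict[date, float],
    index_close: Dict[date, float],
) -> List[Tuple[date, float, float]]:
    """Return (date, stock close, index close) for every date present in both series."""
    # Set intersection of the keys keeps only shared trading days.
    shared_days = sorted(stock_close.keys() & index_close.keys())
    return [(d, stock_close[d], index_close[d]) for d in shared_days]
```

In a PySpark pipeline, the equivalent step would be an inner join of the two DataFrames on their date column.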