An end-to-end ML web app deployed on an AWS EC2 instance to predict the listing gain (%) of IPOs in the Indian stock market.
Investing in IPOs for listing gains is a lucrative way to potentially make money with relatively little time and effort. Currently, most people rely on brokers for advice on IPO investment choices, and in some cases broker websites are delayed or out of date. This project aims to build a fully automated system that monitors companies going public, analyses their financials and predicts expected listing gains. It was inspired by the paper Experimenting with Multi-modal Information to Predict Success of Indian IPOs; however, the dataset from that paper was not used. Instead, the data was scraped manually.
A fully automated web application that scrapes for companies currently going public and predicts their expected listing gains (%). The system also uses this data to retrain the model. The end user can run each of the ETL, Training and Inference pipelines, with the logs displayed during each run. The end user can also experiment with predicting gains by selecting different models from a drop-down list.
- Successfully completed the project and verified deployment on an AWS EC2 instance. Check here.
- Deleted EC2 instance after use to avoid billing.
- If prospective employers want a live demo of the project, please let me know the day before the meeting so that I can set it up again.
- Model performance is currently poor: the model is overfitting the dataset. The `f1_score` is ~40% on the validation set and ~30% on the test set.
- This project currently showcases my ability to build projects end-to-end.
- Future work will be to improve this iteratively.
- ETL Pipeline: Manages data collection, processing & updating.
- `CompanyListExtractor` scrapes all company data within a time range using `CompanyCrawlConfig` and outputs a list of companies & URLs.
- `CheckDatabase` pulls data from MongoDB using `MongoDBConfig` and filters out existing records, outputting a filtered dataframe of companies & URLs.
- `IPODataExtractor` is an async function that scrapes IPO & GMP data from multiple sites using the crawl4ai package with `GMPCrawlerConfig`, outputting a dataframe with IPO & GMP data.
- `ScreenerExtractor` is an async function that scrapes the Screener website for the selected companies' financial data using `ScreenerHTMLCrawlerConfig`, outputting a dataframe with IPO, GMP & Screener data.
- `ListingPriceExtractor` uses `AngelOneConfig` to make API calls and fetch the listing price of companies that are already listed.
- The `TransformData` main class calls the `DataTransformer` sub class with `TransformationConfig` to transform the data: it updates the subscription columns, creates the target column and reorders the dataframe.
- The `LoadData` main class calls the `DataLoader` sub class with `MongoDBConfig` to load the data into MongoDB, using a parallel upsert operation to update existing records or create new ones.
Below is the ETL Pipeline flowchart. Flowchart images are stored here.
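The parallel upsert in `LoadData` can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the chunking helper and function names are assumptions, and the `upsert_one` callable stands in for the real MongoDB `update_one(..., upsert=True)` / `bulk_write` call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def chunk_records(records: list[dict[str, Any]], n_chunks: int) -> list[list[dict[str, Any]]]:
    """Split records into roughly equal chunks, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i : i + size] for i in range(0, len(records), size)]


def parallel_upsert(
    records: list[dict[str, Any]],
    upsert_one: Callable[[dict[str, Any]], None],
    n_workers: int = 4,
) -> None:
    """Upsert records concurrently. In the real pipeline `upsert_one`
    would wrap a MongoDB upsert; here it is any per-record callable."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for chunk in chunk_records(records, n_workers):
            # Default-arg binding pins the current chunk to this task.
            pool.submit(lambda ch=chunk: [upsert_one(r) for r in ch])
    # Exiting the context manager waits for all chunks to finish.
```

With `pymongo`, `upsert_one` would typically be replaced by a `bulk_write` of `UpdateOne` operations per chunk for fewer round trips.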
- Training Pipeline: Processes data, trains models and uploads artifacts and models to an AWS S3 bucket.
- `DataIngestion`, along with `DataIngestionConfig`, gets data using `MongoDBConfig`. It uses data from July 2012 up to the previous month at runtime. This data is split into train, validation & test sets, and a `DataIngestionArtifact` containing the file paths is output.
- `DataValidation`, along with `DataValidationConfig` & `DataIngestionArtifact`, validates the columns in all three datasets, checks the data drift between the train-validation and train-test distributions, and outputs a `DataValidationArtifact` containing file paths to the valid and invalid datasets and the drift report.
- `DataTransformation`, along with `DataTransformationConfig` & `DataValidationArtifact`, creates a scikit-learn `Pipeline` object with all the data transformation strategies. This object is used to `fit_transform()` the train set and is saved to file; the fitted pipeline is then used to `transform()` the validation and test sets. All three datasets are converted to numpy arrays, and a `DataTransformationArtifact` containing the file paths is output.
- `ModelTrainer`, along with `ModelTrainerConfig` & `DataTransformationArtifact`, trains multiple models using `GridSearchCV` and selects the model that scores highest on the validation set. This model is then evaluated on the test set. The best-fit model, together with the transformation pipeline object, is used to create a `NetworkModel` object for use in the Inference Pipeline. It finally outputs a `ModelTrainerArtifact` containing the file paths.
- `ModelPusher`, along with `ModelPusherConfig` & `ModelTrainerArtifact`, pushes the best-fit model and the artifacts folder to the AWS S3 bucket using the AWS CLI.
Below is the Training Pipeline flowchart. Flowchart images are stored here.
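The `ModelTrainer` step (grid-search each candidate, then keep whichever scores best on the held-out validation set) can be sketched as follows. The candidate models, grids and the use of `f1` as the metric are assumptions based on the `f1_score` reported above, not the project's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV


def train_and_select(X_train, y_train, X_valid, y_valid):
    """Grid-search each candidate, then keep the estimator that
    scores best (f1) on the held-out validation set."""
    candidates = {
        "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
        "forest": (RandomForestClassifier(random_state=42), {"n_estimators": [50, 100]}),
    }
    best_name, best_model, best_score = None, None, -1.0
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="f1", cv=3)
        search.fit(X_train, y_train)
        # Model selection happens on the validation set, not the CV folds.
        score = f1_score(y_valid, search.best_estimator_.predict(X_valid))
        if score > best_score:
            best_name, best_model, best_score = name, search.best_estimator_, score
    return best_name, best_model, best_score
```

The selected model would then be evaluated once on the test set and bundled with the fitted preprocessing pipeline.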
- Inference Pipeline: Gets the current month's data and uses the best-fit model to predict listing gains.
- `MakeIPOPrediction` uses `MongoDBConfig` to get the current month's company data for prediction.
- The `NetworkModel` containing the best-fit model and the data transformation pipeline object is used to transform the prediction dataset, and its `predict()` method is called to make the prediction.
- The predicted output is appended to the prediction dataset and sent back to a dedicated collection in MongoDB for display in the web app.
Below is the Inference Pipeline flowchart. Flowchart images are stored here.
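The idea behind `NetworkModel` (bundling the train-fitted preprocessing pipeline with the best-fit model so inference applies both in one call) can be sketched like this; the attribute names are assumptions, not the project's actual definitions:

```python
class NetworkModel:
    """Minimal sketch: couples a fitted preprocessing pipeline with a
    trained model so raw prediction data goes through the exact same
    transformations that were fitted on the train set."""

    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor  # e.g. a fitted sklearn Pipeline
        self.model = model                # the best-fit estimator

    def predict(self, X):
        # Transform raw features with the train-fitted pipeline, then predict.
        X_transformed = self.preprocessor.transform(X)
        return self.model.predict(X_transformed)
```

Serialising this single object (as in `trained_model/` below) avoids the classic bug of re-fitting or mismatching the preprocessor at inference time.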
- CI/CD Pipeline: Continuously monitors the repository for updates and deploys them to AWS EC2.
- Continuous Integration:
- GitHub Actions is used for CI/CD.
- Code is checked and linted.
- The roadmap includes adding unit tests for the application.
- Continuous Delivery:
- Code is checked and the Ubuntu dependencies are installed.
- AWS credentials are configured and the system logs into AWS ECR to push containerised images.
- Docker is used to containerise the entire app with all dependencies using `python:3.11.14-slim`.
- Dependencies are managed using the UV package manager.
- The playwright package is installed manually using a bash script: `playwright install --with-deps` in the CLI.
- The containerised image is pushed to AWS ECR.
- Continuous Deployment:
- AWS credentials are configured and the system logs into AWS ECR.
- The latest image is pulled from AWS ECR and served to the user, with secrets passed as environment variables.
- The older Docker image is stopped and removed.
Below is the CI/CD Pipeline flowchart. Flowchart images are stored here.
- Logged into the AWS Management Console, created an IAM user group for this app and gave it full admin access.
- Created an EC2 instance with an 18 GB SSD and 8 GB RAM.
- Set up the Ubuntu system, installed updates and installed the GitHub Actions self-hosted runner.
- Made a GitHub Commit to run the CI/CD pipeline.
The models & metrics are tracked in MLflow via DagsHub. The roadmap includes adding DVC to the app. Below is the monitoring dashboard setup. Images are stored here.
An HTML webpage with JavaScript is used for user interactions, served with the Flask framework.
- `AplcOps` is used to call each of the pipelines separately. Upon initialisation, the corresponding logs and callable functions are initialised.
- `AplcOps` is used for the following operations:
  - get the name of the latest corresponding log file when the user runs any of the 3 pipelines.
  - a multi-threaded function that calls the pipeline function and returns a JSON status.
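The multi-threaded pipeline trigger could look something like the sketch below. The class and JSON key names are hypothetical, not the project's actual `AplcOps` API; the point is that the long-running pipeline runs in a background thread while a status JSON is returned immediately.

```python
import json
import threading


class PipelineRunner:
    """Sketch of a threaded pipeline trigger: start the long-running
    pipeline in a background thread and return a JSON status at once,
    refusing to start a second run while one is still alive."""

    def __init__(self):
        self._threads = {}

    def run(self, name, pipeline_fn):
        existing = self._threads.get(name)
        if existing is not None and existing.is_alive():
            return json.dumps({"pipeline": name, "status": "already running"})
        thread = threading.Thread(target=pipeline_fn, daemon=True)
        self._threads[name] = thread
        thread.start()
        return json.dumps({"pipeline": name, "status": "started"})
```

Guarding on `is_alive()` keeps repeated button clicks in the UI from launching overlapping ETL or training runs.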
- `UtilOps` is used to get the current month's predicted data, filter only the required columns, get the log files for each pipeline and package all of this into a JSON for the JavaScript in the HTML page to display to the end user.
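The `UtilOps` packaging step could be sketched as below. The display column names, the log file naming pattern and the function name are assumptions for illustration, not the project's actual code.

```python
import json
from pathlib import Path


def package_display_payload(predictions, log_dir,
                            pipelines=("etl", "training", "inference")):
    """Keep only the columns the UI shows, attach the latest log file
    per pipeline, and return a single JSON string for the front end."""
    display_cols = ("company", "predicted_gain")  # assumed display columns
    rows = [{k: r[k] for k in display_cols if k in r} for r in predictions]
    logs = {}
    for p in pipelines:
        # Assumed log naming: <pipeline>_<timestamp>.log; latest sorts last.
        files = sorted(Path(log_dir).glob(f"{p}_*.log"))
        logs[p] = files[-1].read_text() if files else ""
    return json.dumps({"predictions": rows, "logs": logs})
```

Returning one consolidated JSON keeps the JavaScript side to a single fetch per refresh.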
- The Flask app is initialised and all the API routes are defined for each of the 3 pipelines. The host & port are defined.
Below is a GIF of the GUI. The GIF is stored here.
Multiple APIs were used for this project. The corresponding secrets were set up in GitHub Actions for use in the CI/CD pipeline and are passed as environment variables when the container is run. The following is a list of secrets used.
| Name | Description |
|---|---|
| AWS_ACCESS_KEY_ID | Access key ID to use AWS EC2 Instance |
| AWS_SECRET_ACCESS_KEY | Secret key to use AWS EC2 Instance |
| AWS_REGION | Region code where EC2 Instance is setup |
| AWS_OUTPUT_FORMAT | Output format of communication |
| AWS_ECR_LOGIN_URI | URL of private AWS ECR |
| AWS_ECR_REPO_NAME | Repo name of private AWS ECR |
| ANGEL_ONE_API_KEY | API key for the Angel One personal account |
| ANGEL_ONE_SECRET_KEY | Secret key for the Angel One personal account |
| ANGEL_ONE_CLIENT_ID | Unique client ID for the Angel One personal account |
| ANGEL_ONE_PIN | PIN for the Angel One personal account |
| ANGEL_ONE_QR_TOKEN | Token for OTP generation |
| DAGSHUB_USER_TOKEN | Token for DagsHub authentication |
| MONGO_DB_UN | Username to authenticate with MongoDB |
| MONGO_DB_PW | Password to authenticate with MongoDB |
The artifacts folder contains the raw, ingested, validated & transformed data, along with the drift report and the `NetworkModel` object that is ready to predict.
Below is the visualised folder structure:
Artifacts
├── 2025_10_12_18_22_05 <- Datetime is used for each run
│ ├── data_ingestion
│ │ ├── feature_store
│ │ │ └── ipo_scrn_gmp_EQ.csv <- Original Dataset from MongoDB
│ │ └── ingested
│ │ ├── test.csv
│ │ ├── train.csv
│ │ └── valid.csv
│ ├── data_transformation
│ │ ├── transformed_data <- Transformed datasets in numpy array format
│ │ │ ├── test.npy
│ │ │ ├── train.npy
│ │ │ └── valid.npy
│ │ └── transformed_object <- scikit-learn train set fitted Pipeline object
│ │ └── ppln_prpc.pkl
│ ├── data_validation
│ │ ├── drift_report <- Data drift report wrt train set
│ │ │ ├── drift_report_test.yaml
│ │ │ └── drift_report_vald.yaml
│ │ └── validated <- Only valid data is kept
│ │ ├── test.csv
│ │ ├── train.csv
│ │ └── valid.csv
│ └── model_trainer
│ └── trained_model <- NetworkModel object with 'ppln_prpc.pkl' and best fit model
│ └── 2025-10-12_18-22-05_model.pkl
- Since I am using structured data, an optimised `XGBRegressor` should be enough. I plan to spend more time with the data improving model performance.
- The research paper mentions 25 questions that can be used to engineer new features. This needs to be explored.
- The RHP document and the Screener website contain the company's financial data, and this needs to be efficiently extracted.
- In the future, I'd like to create an agentic AI app with an API that can be called to automatically apply for IPO for a given user.
- This project was built after learning from Krish Nayak's course on Udemy.
- Here is my implementation of that project: Study_Network-Security.
- The original inspiration for this project is this paper: Experimenting with Multi-modal Information to Predict Success of Indian IPOs.
- This project uses crawl4ai for web scraping.