-
What answer do you expect from a screenshot and only a subjective comment, "takes forever"? You can share a Quarto document using the following syntax, i.e. using more backticks than you have in your document (usually four):

````qmd
---
title: "Reproducible Quarto Document"
format: html
---
This is a reproducible Quarto document using `format: html`.
It is written in Markdown and contains embedded R code.
When you run the code, it will produce a plot.
```{r}
plot(cars)
```

The end.
````
-
title: "Credit Card Fraud Detection Project" Goal: To correctly predict fraudulent credit card transaction.Loading required libraries.
Exploratory Data Analysis
Loading and skimming the dataset.
Percentage of fraudulent transactions (coded as 1).
With very few (0.5%) fraudulent transactions, this is an imbalanced dataset, and we cannot use it directly to fit the models unless we treat the imbalance.
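A sketch of the loading and imbalance check (the file name and the is_fraud column name are assumptions):

```r
# Read the raw data (file name assumed) and get a quick overview
fraud <- readRDS("fraud.RDS")
skim(fraud)

# Share of fraudulent transactions (is_fraud coded as 1)
fraud %>%
  count(is_fraud) %>%
  mutate(pct = 100 * n / sum(n))
```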
Exploring the data types and whether any transformations or type conversions would improve prediction. Questions to consider:
Converting the predictor variables "category" (category of merchant) and "job" (job of credit card holder) to factors; eliminating "merchant" (merchant name) and "trans_num" (transaction number), as they have low predictive power or high correlation with other predictors (merchant with merch_lat/merch_long); converting character variables such as "city" and "state" to geospatial data.
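A sketch of the factor conversion and column removal (the geospatial conversion is handled later; column names as given above):

```r
fraud <- fraud %>%
  mutate(
    category = factor(category),   # merchant category
    job      = factor(job)         # card holder's job
  ) %>%
  select(-merchant, -trans_num)    # low predictive power / redundant
```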
Exploring the category factor to understand the types of transactions (count and %).
Gas/transport is the most common category, followed by grocery, while the fewest transactions took place for travel.
Exploring character strings
Both the merchant name (merchant) and the transaction number (trans_num) are string variables. The transaction number, being a unique identifier assigned during transaction processing, should not have an impact on the fraud rate, so we can safely exclude it from our dataset. The merchant name might have a correlation with fraud incidents, for instance if an employee of the company was implicated. Nonetheless, this information is also encapsulated by the location and category data. If a particular location or category is identified as having a higher propensity for fraud, we can then conduct a more thorough investigation of those transactions, which would include examining the merchant name. Therefore, at this stage, we can also remove the merchant name from our dataset.
Exploring geospatial data
The data we have is classified as numeric (latitude and longitude) or character (city/state), but we can identify it as geographical data and handle it accordingly. We have two types of geographical data: the merchant's location and the location where the transaction took place. I am creating separate scatter plots for latitude and longitude because I am interested in examining the relationship between the two types of location (merchant and transaction), and I am creating a common legend as per the instructions in this article.
These two sets of coordinates are highly correlated (0.994 for latitude and 0.999 for longitude) and thus redundant, so I remove the merchant location columns.
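A sketch of the redundancy check and clean-up (coordinate column names are assumptions; the common-legend figure from the linked article is omitted):

```r
# Correlation between the merchant's location and the transaction location
cor(fraud$lat,  fraud$merch_lat)    # ~0.994 reported above
cor(fraud$long, fraud$merch_long)   # ~0.999 reported above

# One of the scatter plots comparing the two longitudes
ggplot(fraud, aes(x = long, y = merch_long)) +
  geom_point(alpha = 0.2) +
  labs(x = "Transaction longitude", y = "Merchant longitude")

# Drop the redundant merchant coordinates
fraud <- fraud %>% select(-merch_lat, -merch_long)
```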
Visualising whether some transaction locations are more prone to fraud.
Some locations have exclusively fraudulent transactions.
The distance between the card holder's home and the location of the transaction was derived and provided in the new dataset available here. The file is named "fraud_processed.RDS".
We can observe that some distances are associated with fraudulent transactions. These may relate to the locations with exclusively fraudulent transactions in figure 4.
Exploring the dob ("Date of Birth of Card Holder") variable
Questions to consider: is the raw date of birth useful as a predictor, or would a derived variable such as the card holder's age be more informative?
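A sketch of deriving age from dob (the column name is assumed, and using the current date rather than the transaction date is a simplification):

```r
fraud <- fraud %>%
  mutate(age = floor(interval(dob, today()) / years(1))) %>%  # age in whole years
  select(-dob)                                                # keep age instead of dob
```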
Age seems to be a more reasonable variable to include than dob.
Exploring date-times
Would processing the date-times yield more useful predictors? First, I want to look at the variation in the number of transactions over time. I chose to use a histogram with bins corresponding to one-month widths.
Breaking the transaction date-time into separate components: the day of the week, the hour, and the date itself. Although I’m using functions from the lubridate package to accomplish this, it’s also possible to perform this operation during the model building phase with the step_date() function in the recipes package. Additionally, I plan to visualize the transactions based on the day of the week.
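A sketch of the lubridate-based split (the timestamp column name trans_date_trans_time is an assumption):

```r
fraud <- fraud %>%
  mutate(
    trans_wday = wday(trans_date_trans_time, week_start = 1),  # 1 = Monday ... 7 = Sunday
    trans_hour = hour(trans_date_trans_time),
    trans_date = as_date(trans_date_trans_time)
  )
```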
Monday has the highest number of transactions; this could be due to businesses processing orders that came in over the weekend. With week_start = 1, lubridate codes the day of the week as a number where 1 means Monday and 7 means Sunday. Now I look at what time of day most transactions occur.
The distribution of transaction times looks odd when viewed in the context of working hours (i.e. 9 AM to 5 PM). This may be an artifact of different time zones, or of the fact that this is synthesized data.
Exploring numerical variables
The log-transformed variable is more symmetrically distributed and will be retained for further use. Correlation plot to explore the association between variables.
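A sketch of the log transformation and correlation plot (the amount column name amt is an assumption; corrplot is an additional package):

```r
# Log-transform the transaction amount; log1p() handles zero amounts safely
fraud <- fraud %>%
  mutate(log_amt = log1p(amt))

# Correlation plot across the numeric variables
fraud %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  corrplot::corrplot()
```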
Tidymodels requires that the outcome be a factor and the positive class be the first level. So I create the factor and relevel it.
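A sketch of the outcome conversion (the factor labels are assumptions):

```r
# Make the outcome a factor with the positive class ("fraud") as the first level
fraud <- fraud %>%
  mutate(is_fraud = factor(is_fraud, levels = c(1, 0), labels = c("fraud", "not_fraud")))
```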
A final glimpse of the dataset before we begin fitting the models.
Finding a high performing model
We shall explore the following models for prediction and methods to handle class imbalance. Classification models:
Methods for handling imbalanced class problems. This link explains dealing with class-imbalanced data in greater detail.
To manage the 4 × 4 different fits and keep track of all the combinations, we have workflow_set() to create all the combinations and workflow_map() to run all the fits.
Splitting the data
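A sketch of the stratified split and resampling folds (the proportion, number of folds, and seed are assumptions):

```r
set.seed(123)

fraud_split <- initial_split(fraud, prop = 0.75, strata = is_fraud)
fraud_train <- training(fraud_split)
fraud_test  <- testing(fraud_split)

# Cross-validation folds used when tuning/fitting the workflow set
fraud_folds <- vfold_cv(fraud_train, v = 5, strata = is_fraud)
```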
Creating recipes
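A sketch of a base recipe plus imbalance-handling variants (only two variants are shown; step_downsample() and step_smote() come from the themis package, and the preprocessing steps are assumptions):

```r
library(themis)

rec_base <- recipe(is_fraud ~ ., data = fraud_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
  # any remaining date/timestamp columns would need step_date()/step_rm() here

rec_down  <- rec_base %>% step_downsample(is_fraud)  # downsample the majority class
rec_smote <- rec_base %>% step_smote(is_fraud)       # oversample the minority class
```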
Setting the model engines
Setting engines for the models and tuning the hyperparameters for certain models (elastic net logistic regression and lightgbm). Avoiding tuning hyperparameters for the random forest, as it may take a while to run and slow down the overall process.
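A sketch of the model specifications (the lightgbm engine requires the bonsai and lightgbm packages; the ranger engine and the tuning choices are assumptions):

```r
library(bonsai)  # provides the "lightgbm" engine for boost_tree()

# Elastic net logistic regression, penalty and mixture tuned
logreg_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Random forest with fixed hyperparameters (no tuning, to keep run time down)
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Gradient boosting via lightgbm, selected hyperparameters tuned
lgbm_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")
```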
Creating a metrics set
In situations where the data is highly skewed, relying on accuracy can be misleading: a model might achieve high accuracy simply by predicting the majority class for all instances. Therefore, alternative metrics such as sensitivity or the j-index are more suitable for evaluating models in these imbalanced class scenarios.
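A sketch of a metric set geared towards imbalanced classes (the exact set of metrics is an assumption beyond sensitivity and the j-index mentioned above):

```r
fraud_metrics <- metric_set(j_index, sens, spec, roc_auc, accuracy)
```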
Creating the workflow_set
Fitting all the models
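A sketch of building the workflow set and mapping the tuning over the folds (this crosses the three recipes and three models sketched above rather than the full 4 × 4 grid; grid size and seed are assumptions):

```r
fraud_wf_set <- workflow_set(
  preproc = list(base = rec_base, down = rec_down, smote = rec_smote),
  models  = list(logreg = logreg_spec, rf = rf_spec, lightgbm = lgbm_spec)
)

fraud_results <- fraud_wf_set %>%
  workflow_map(
    "tune_grid",
    resamples = fraud_folds,
    grid      = 10,
    metrics   = fraud_metrics,
    seed      = 123,
    verbose   = TRUE
  )
```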
Evaluating the models
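A sketch of ranking and plotting the results by j-index:

```r
rank_results(fraud_results, rank_metric = "j_index")
autoplot(fraud_results, metric = "j_index")
```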
The best performing model/recipe pair by j-index is the downsampled lightgbm. To see how this model/recipe performs across the tuning parameters, we can use:
Selecting the best set of hyperparameters.
Using finalize_workflow() and last_fit() to add the best hyperparameters to the workflow, train the model/recipe on the entire training set, and then predict on the entire test set.
Validating the model with test data
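A sketch of finalising the winning workflow; the workflow id "down_lightgbm" is an assumption following from the recipe/model names used in the sketch above:

```r
# Best hyperparameters for the downsampled lightgbm workflow
best_params <- fraud_results %>%
  extract_workflow_set_result("down_lightgbm") %>%
  select_best(metric = "j_index")

# Finalise, refit on the full training set, and predict on the test set
final_fit <- fraud_results %>%
  extract_workflow("down_lightgbm") %>%
  finalize_workflow(best_params) %>%
  last_fit(fraud_split, metrics = fraud_metrics)
```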
Looking at the metrics and ROC curve for the test data.
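A sketch of collecting the test-set metrics, ROC curve, and confusion matrix (the .pred_fraud column name follows from the factor labels assumed earlier):

```r
collect_metrics(final_fit)

final_fit %>%
  collect_predictions() %>%
  roc_curve(is_fraud, .pred_fraud) %>%
  autoplot()

final_fit %>%
  collect_predictions() %>%
  conf_mat(is_fraud, .pred_class)
```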
Here’s how to interpret the confusion matrix:
So, in summary, our model correctly identified 329 fraudulent transactions and 64,857 non-fraudulent transactions. However, it incorrectly flagged 2,724 non-fraudulent transactions as fraudulent (which may cause customer dissatisfaction) and missed 12 fraudulent transactions (which may cause losses to the company). These numbers help us understand the trade-off between precision (how many of the predicted positives are actually positive) and recall (how many of the actual positives were correctly identified), and they can help us fine-tune the model for better performance. The aim is to maximize the true positives and true negatives (correct predictions) while minimizing the false positives and false negatives (incorrect predictions). In the context of credit card fraud detection, false negatives can be particularly costly because they mean the model failed to catch a fraudulent transaction. On the other hand, false positives can lead to customer dissatisfaction as legitimate transactions are flagged as fraudulent.
Calculating savings by the model
The model may potentially improve the company's savings, as the losses under the model were 27% of the potential losses. For more details about the machine learning methods used here in the context of the R programming language, one may refer to these resources: 1) the tidymodels learning platform, 2) the book Tidy Modeling with R, and 3) a useful article on a structured approach to using tidymodels.
-
This is not really helpful if you don't properly format the post as shown. Why are you loading the R package quarto inside a Quarto document?
-
FWIW, here is our bug-report guideline about correct formatting: https://quarto.org/bug-reports.html#formatting-make-githubs-markdown-work-for-us You can also create a GitHub repo to share the doc / project and share the link. Thank you.