
Updates PBM notebook to output to versioned results folders #302

Merged
yiwen-h merged 1 commit into main from 301_update_pbm_notebook_v3-2 on Mar 10, 2025

Conversation

@yiwen-h (Member) commented Feb 21, 2025

Closes #301
Closes #280

Note that to get this notebook to run with the current data, I had to change this function in databricks.py. This will not be needed once this branch is merged into nhp_data:

    def get_demographic_factors(self) -> pd.DataFrame:
        """Get the demographic factors dataframe.

        :return: the demographic factors dataframe
        :rtype: pd.DataFrame
        """
        return (
            # read only the principal projection partition
            self._spark.read.parquet(
                "/Volumes/su_data/nhp/population-projections/demographic_data/projection=principal_proj/"
            )
            # restore the partition column, which the partition-path read drops
            .withColumn("projection", F.lit("principal_proj"))
            # keep English local authority areas (ONS codes E06-E09)
            .filter(F.col("area_code").rlike("^E0[6-9]"))
            .groupBy("projection", "age", "sex")
            # one column per year, summing values within each group
            .pivot("year")
            .agg(F.sum("value").alias("value"))
            .withColumnRenamed("projection", "variant")
            .toPandas()
        )
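For intuition, the Spark pipeline above can be mirrored in plain pandas on a tiny synthetic dataframe (the column names follow the function; the data values are invented):

```python
import pandas as pd

# Hypothetical sample data standing in for the demographic parquet files
df = pd.DataFrame({
    "projection": ["principal_proj"] * 4,
    "age": [0, 0, 1, 1],
    "sex": [1, 2, 1, 2],
    "year": [2025, 2025, 2025, 2025],
    "value": [10.0, 11.0, 12.0, 13.0],
})

# group by projection/age/sex, pivot years into columns, sum the values,
# then rename "projection" to "variant" -- same shape as the Spark version
wide = (
    df.pivot_table(index=["projection", "age", "sex"],
                   columns="year", values="value", aggfunc="sum")
    .reset_index()
    .rename(columns={"projection": "variant"})
)
print(wide)
```

This is only a sketch of the transformation's shape, not a drop-in replacement for the Spark read.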

@yiwen-h yiwen-h requested a review from StatsRhian February 21, 2025 11:16

codecov bot commented Feb 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (97ada8c) to head (b03388b).
Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #302   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines          999       999           
=========================================
  Hits           999       999           


@StatsRhian (Member) commented Mar 6, 2025

I'm currently not able to run this with the following parameters:

  • data_path: /Volumes/su_data/nhp/old_nhp_data
  • data_version: dev
  • params_file: sample_params.json (with hsa: false)
  • sample_rate: 0.01

The error I get is below. Have the population projections changed?

Error while reading file dbfs:/Volumes/su_data/nhp/population-projections/demographic_data/projection=const_fert_no_mort_imp/sex=1/area_code=E06000014/part-00000-tid-4720669455922387179-4125d92f-521e-437b-9366-7cbf5cd6104e-25866-14.c000.snappy.parquet. Data type mismatches when reading Parquet column [value]. Expected Spark type string, actual Parquet type DOUBLE. SQLSTATE: KD001

Full error trace:

Py4JJavaError: An error occurred while calling o914.getResult.
: org.apache.spark.SparkException: Exception thrown in awaitResult: [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH] Error while reading file dbfs:/Volumes/su_data/nhp/population-projections/demographic_data/projection=const_fert_no_mort_imp/sex=1/area_code=E06000014/part-00000-tid-4720669455922387179-4125d92f-521e-437b-9366-7cbf5cd6104e-25866-14.c000.snappy.parquet. Data type mismatches when reading Parquet column [value]. Expected Spark type string, actual Parquet type DOUBLE. SQLSTATE: KD001
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:51)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:519)
	at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:108)
	at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:104)
	at sun.reflect.GeneratedMethodAccessor710.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH] Error while reading file dbfs:/Volumes/su_data/nhp/population-projections/demographic_data/projection=const_fert_no_mort_imp/sex=1/area_code=E06000014/part-00000-tid-4720669455922387179-4125d92f-521e-437b-9366-7cbf5cd6104e-25866-14.c000.snappy.parquet. Data type mismatches when reading Parquet column [value]. Expected Spark type string, actual Parquet type DOUBLE. SQLSTATE: KD001
	at org.apache.spark.sql.errors.QueryExecutionErrors$.parquetColumnDataTypeMismatchError(QueryExecutionErrors.scala:1086)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logErrorFileNameAndThrow(FileScanRDD.scala:781)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:739)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.$anonfun$prepareNextFile$1(FileScanRDD.scala:980)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.$anonfun$run$1(SparkThreadLocalForwardingThreadPoolExecutor.scala:157)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.IdentityClaim$.withClaim(IdentityClaim.scala:48)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.$anonfun$runWithCaptured$4(SparkThreadLocalForwardingThreadPoolExecutor.scala:113)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:112)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper.runWithCaptured$(SparkThreadLocalForwardingThreadPoolExecutor.scala:89)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:154)
	at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:157)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [value], physicalType: DOUBLE, logicalType: string
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1612)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:227)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:222)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatchInternal(VectorizedParquetRecordReader.java:417)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:398)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:301)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:41)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:657)
	... 21 more
File <command-4950770123440603>, line 4
      1 # save_full_model_results set to True
      2 # This creates folders with the results for each of the 256 Monte Carlo simulations in notebooks/results/national/SCENARIONAME/CREATE_DATETIME
----> 4 results_dict["inpatients"] = _run_model(
      5     mdl.InpatientsModel,
      6     params,
      7     nhp_data,
      8     hsa,
      9     run_params,
     10     pcallback,
     11     True,
     12 )
File /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))
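A likely cause of this class of error is that different Parquet files in the partitioned dataset disagree about the physical type of the value column, so the schema inferred from one file fails on another. A rough pandas illustration of the mixed-type situation (synthetic data, not the actual dataset):

```python
import pandas as pd

# Hypothetical illustration: one "partition" wrote value as strings,
# another as doubles -- a reader expecting a single type fails on one of them.
string_partition = pd.DataFrame({"value": ["100", "200"]})   # string-typed
double_partition = pd.DataFrame({"value": [100.0, 200.0]})   # DOUBLE-typed

combined = pd.concat([string_partition, double_partition], ignore_index=True)
print(combined["value"].dtype)  # mixed types collapse to generic "object"
```

Spark is stricter than pandas here: rather than widening to a generic type, it raises PARQUET_COLUMN_DATA_TYPE_MISMATCH when a file's physical type contradicts the expected schema.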

@StatsRhian (Member) left a comment:

Not able to run notebook

@yiwen-h (Member, Author) commented Mar 6, 2025

There's a horrible hack: the population projections other than the principal one are broken. Did you implement the changes mentioned here? It's just replacing one function: #302 (comment)

I understand if we'd prefer to hold off on approving/merging until the issue with the population projections is fixed.

@StatsRhian (Member) commented Mar 6, 2025

Oh sorry, I totally forgot about your earlier comment 🤦🏻. I'll try again with the hack applied and then approve. Sorry!

@StatsRhian (Member) commented Mar 10, 2025

I was banging my head against a wall all of last week because I couldn't get the temporary hack to work. I was convinced Databricks wasn't recognising my updated scripts.

Then I realised I was updating get_demographics() for the normal model, not the national one. (They have the same name.) 🤦🏻

Anyway that all runs nicely for me. Let's merge 😅
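For reference, the versioned layout mentioned in the traceback comment (notebooks/results/national/SCENARIONAME/CREATE_DATETIME) could be built with a small helper like this. The function name and timestamp format are assumptions for illustration, not code from this repo:

```python
from datetime import datetime
from pathlib import Path


def versioned_results_dir(scenario: str,
                          base: str = "notebooks/results/national") -> Path:
    """Hypothetical helper: build a per-run results folder of the form
    <base>/<SCENARIONAME>/<CREATE_DATETIME>."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return Path(base) / scenario / stamp


print(versioned_results_dir("test_scenario"))
```

Stamping each run's folder with its creation datetime keeps results from different Monte Carlo runs of the same scenario from overwriting each other.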

@yiwen-h yiwen-h merged commit f791a64 into main Mar 10, 2025
3 checks passed
@yiwen-h yiwen-h deleted the 301_update_pbm_notebook_v3-2 branch March 10, 2025 17:11


Successfully merging this pull request may close these issues.

  • Upload results from PBM to same version used for the data
  • PBM notebook uses dev outputs app - switch to tagged release?
