{"id":612,"date":"2022-10-05T10:09:59","date_gmt":"2022-10-05T08:09:59","guid":{"rendered":"https:\/\/www.meb.ki.se\/sites\/meb-it\/?p=612"},"modified":"2024-01-30T13:35:13","modified_gmt":"2024-01-30T12:35:13","slug":"loading-temporary-data-into-the-registry-database-and-how-to-do-pass-thru-sql-with-r","status":"publish","type":"post","link":"https:\/\/www.meb.ki.se\/sites\/meb-it\/loading-temporary-data-into-the-registry-database-and-how-to-do-pass-thru-sql-with-r\/","title":{"rendered":"Loading temporary data into the registry database, and how to do pass-thru SQL"},"content":{"rendered":"\n<p>Sometimes you want to do something that involves a lot of data from the registry database, but you do not want to pull all that data into R &#8212; maybe you are only filtering out some subset of data with a join on a small set of keys that you have as a data frame in R, or you need to do a join of two large tables\/views in kosmos, and wants to use the indexes on those tables to speed up the operation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The example data<\/h3>\n\n\n\n<p>The example code uses the <code>LMED<\/code> data in <code>CRIME3<\/code>:  the prescription registry for the year 2015 is about 25M rows, and the entire LMED (years 2005 to 2017) dataset is 800M rows, for a total of about 350GB of data. We will select a (very) small subset of ATC codes from the 2015 data, and then get data for  those ATC codes from the entire dataset.<\/p>\n\n\n\n<p>The example is somewhat contrived, since the whole operation could be run in the database, without fetching the small subset to R\/Python &#8212; but this is to demonstrate how to write a temporary table .<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The connection<\/h3>\n\n\n\n<p>The VDI environment, as well as the compute servers, have a <em>ODBC data source<\/em> named <code>kosmos<\/code> that we will use.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">With R<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Preparations<\/h3>\n\n\n\n<p>The following has been tested in R 4.2, both on <code>vector<\/code> and in RStudio on a Windows machine. <\/p>\n\n\n\n<p>You need two packages: <code><a href=\"https:\/\/cran.r-project.org\/web\/packages\/DBI\/index.html\">DBI<\/a><\/code> and <code><a rel=\"noreferrer noopener\" href=\"https:\/\/cran.r-project.org\/web\/packages\/odbc\/index.html\" target=\"_blank\">odbc<\/a><\/code>. Both are in CRAN, and regularly updated.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Code<\/h3>\n\n\n\n<p>First connect to the database:<\/p>\n\n\n\n<pre title=\"\" class=\"wp-block-code\"><code lang=\"r\" class=\"language-r line-numbers\">&gt; dbhandle &lt;- DBI::dbConnect(odbc::odbc(), dsn = \"kosmos\")<\/code><\/pre>\n\n\n\n<p>&#8230; then run the first query in the database:<\/p>\n\n\n\n<pre title=\"\" class=\"wp-block-code\"><code lang=\"r\" class=\"language-r line-numbers\">&gt; query &lt;- DBI::dbSendQuery(dbhandle, \n+   'select distinct ATC, LAKMED from MGRCRIME3.MGRCRIME3.V_LMED2015 where TKOST &gt; 500000')\n&gt; result &lt;- DBI::dbFetch(query)\n&gt; DBI::dbClearResult(query)\n&gt; result\n\n      ATC                                     LAKMED\n1 J05AX14       Daklinza, filmdragerad tablett 60 mg\n2 R07AX02      Kalydeco, filmdragerad tablett 150 mg\n3   J05AX Harvoni, filmdragerad tablett 90 mg\/400 mg\n4 J05AX15       Sovaldi, filmdragerad tablett 400 mg<\/code><\/pre>\n\n\n\n<p>Upload the results to a temporary table in the registry database &#8212; the <code>#<\/code>-mark tells the database engine to make it temporary:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"r\" class=\"language-r line-numbers\">&gt; DBI::dbWriteTable(dbhandle, \"#expensive_2015\", result)<\/code><\/pre>\n\n\n\n<p>&#8230; and use it in a second query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"r\" class=\"language-r line-numbers\">&gt; query_2 &lt;- DBI::dbSendQuery(dbhandle, \n+   'select distinct a.LAKMED from MGRCRIME3.MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc')\n&gt; result_2 &lt;- DBI::dbFetch(query_2)\n&gt; DBI::dbClearResult(query_2)\n&gt; DBI::dbDisconnect(dbhandle)\n&gt; result_2\n                                              LAKMED\n1               Daklinza, filmdragerad tablett 30 mg\n2         Harvoni, filmdragerad tablett 90 mg\/400 mg\n3              Kalydeco, filmdragerad tablett 150 mg\n4               Exviera, filmdragerad tablett 250 mg\n5               Sovaldi, filmdragerad tablett 400 mg\n6 Viekirax, filmdragerad tablett 12,5 mg\/75 mg\/50 mg\n7               Daklinza, filmdragerad tablett 60 mg<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"alternative_dbplyr\">An alternative without (visible) pass-thru SQL<\/h3>\n\n\n\n<p>There is more than one way to do this exact operation, and this specific example can also be acheived with <code><a rel=\"noreferrer noopener\" href=\"https:\/\/cran.r-project.org\/web\/packages\/dplyr\/\" target=\"_blank\">dplyr<\/a>\/<a rel=\"noreferrer noopener\" href=\"https:\/\/cran.r-project.org\/web\/packages\/dbplyr\/\" target=\"_blank\">dbplyr<\/a><\/code>. This works with local data too, you just have to be <em>careful with the order of inputs<\/em>, since <code>dplyr<\/code> will copy the <strong>second<\/strong> input to a temp table in the database, <strong>regardless<\/strong> of which is the local one, and which is smaller. <\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"r\" class=\"language-r line-numbers\">&gt; library(dplyr)\n&gt; library(dbplyr)\n\n&gt; crime3 &lt;- DBI::dbConnect(odbc::odbc(),\n+                          dsn = \"kosmos\", database = \"MGRCRIME3\")\n&gt; lmed15 &lt;- tbl(crime3, in_schema(\"MGRCRIME3\", \"V_LMED2015\"))\n&gt; lmed &lt;- tbl(crime3, in_schema(\"MGRCRIME3\", \"V_LMED\"))\n&gt;\n&gt; result &lt;- lmed %&gt;% \n+   # either in the database\n+   # inner_join(lmed15 %&gt;% filter(tkost &gt; 500000) %&gt;% select(ATC)) %&gt;%\n+   # or with local data\n+   inner_join(data.frame(\"ATC\" = c(\"J05AX\", \"J05AX14\", \"J05AX15\", \"R07AX02\")), \n+              copy = TRUE) %&gt;% \n+   select(\"LAKMED\") %&gt;% distinct()\nJoining, by = \"ATC\"\n&gt; result\n# Source:   SQL [7 x 1]\n# Database: Microsoft SQL Server 13.00.7016[meb\\joelus@MEB-SQL02\\KOSMOS\/MGRCRIME3]\n  LAKMED                                            \n  &lt;chr&gt;                                             \n1 Harvoni, filmdragerad tablett 90 mg\/400 mg        \n2 Daklinza, filmdragerad tablett 60 mg              \n3 Sovaldi, filmdragerad tablett 400 mg              \n4 Exviera, filmdragerad tablett 250 mg              \n5 Viekirax, filmdragerad tablett 12,5 mg\/75 mg\/50 mg\n6 Kalydeco, filmdragerad tablett 150 mg             \n7 Daklinza, filmdragerad tablett 30 mg   \n\n&gt; # this must come *after* we use the result, since dplyr\/dbplyr is being \"lazy\"\n&gt; DBI::dbDisconnect(dbhandle)<\/code><\/pre>\n\n\n\n<p><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">With Python<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Preparation<\/h3>\n\n\n\n<p>You need the <code>pyodbc<\/code> package &#8212; and for convenience the following code also uses <code>pandas<\/code> &#8230;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Code<\/h3>\n\n\n\n<p>Connecting to the database:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import pandas as pd\nimport pyodbc \n  \nconn = pyodbc.connect('DSN=kosmos;'\n                      'Database=MGRCRIME3;')<\/code><\/pre>\n\n\n\n<p>&#8230; and run the first query in the database:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">df = pd.read_sql_query('select distinct ATC, LAKMED from MGRCRIME3.V_LMED2015 where TKOST &gt; 500000', conn)\n\nprint(df)\n\n       ATC                                      LAKMED\n0  J05AX14        Daklinza, filmdragerad tablett 60 mg\n1    J05AX  Harvoni, filmdragerad tablett 90 mg\/400 mg\n2  R07AX02       Kalydeco, filmdragerad tablett 150 mg\n3  J05AX15        Sovaldi, filmdragerad tablett 400 mg\n<\/code><\/pre>\n\n\n\n<p>Create a temporary table:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">cursor = conn.cursor()\n\ncursor.execute('''CREATE TABLE #expensive_2015 (atc varchar(15),lakmed varchar(100))''')\n\nconn.commit()<\/code><\/pre>\n\n\n\n<p>&#8230; load the dataframe into that:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">for index, row in df.iterrows():\n     cursor.execute(\"INSERT INTO #expensive_2015 (atc,lakmed) values(?,?)\", row.ATC, row.LAKMED)\nconn.commit()<\/code><\/pre>\n\n\n\n<p>And finally, run the join query:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">df2=pd.read_sql_query('select distinct a.LAKMED from MGRCRIME3.V_LMED a join #expensive_2015 e on e.atc = a.atc', conn)\n\nprint(df2)\n\n                                              LAKMED\n0         Harvoni, filmdragerad tablett 90 mg\/400 mg\n1               Daklinza, filmdragerad tablett 30 mg\n2               Sovaldi, filmdragerad tablett 400 mg\n3               Exviera, filmdragerad tablett 250 mg\n4              Kalydeco, filmdragerad tablett 150 mg\n5               Daklinza, filmdragerad tablett 60 mg\n6  Viekirax, filmdragerad tablett 12,5 mg\/75 mg\/5...<\/code><\/pre>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Sometimes you want to do something that involves a lot of data from the registry database, but you do not want to pull all that data into R &#8212; maybe you are only filtering out some subset of data with a join on a small set of keys that you have as a data frame [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[11,30,21,22],"tags":[],"class_list":["post-612","post","type-post","status-publish","format-standard","hentry","category-dba","category-python","category-r","category-sql"],"blocksy_meta":{"styles_descriptor":{"styles":{"desktop":"","tablet":"","mobile":""},"google_fonts":[],"version":6}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/posts\/612","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/comments?post=612"}],"version-history":[{"count":11,"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/posts\/612\/revisions"}],"predecessor-version":[{"id":888,"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/posts\/612\/revisions\/888"}],"wp:attachment":[{"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/media?parent=612"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/categories?post=612"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/meb-it\/wp-json\/wp\/v2\/tags?post=612"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}