articles/storage/common/tape-migration-guide.md
Lines changed: 20 additions & 18 deletions
Original file line number
Diff line number
Diff line change
@@ -19,15 +19,15 @@ Tape is one of the dominant storage media, and it stores a large part of world's
19
19
20
20
Tapes are a great medium for storing cold data. They're fast at sequential reads, but stages requiring mechanical movement (like loading, and unloading of tapes, tape seeks, etc.) are slower. That makes tapes unsuitable for traditional, random access, and is the main reason that even today data stored on tapes is rarely used. In addition, tapes are a magnetic medium that requires special handling. They're sensitive to their environment, particularly temperature, and humidity. If kept within their operating environmental range, they can achieve high durability, and a good restore success rate. However, when kept in an unfriendly environment, they often deteriorate, which renders the tape unreadable.
21
21
22
-
Customers store a lot of data on tapes. Large portion of that data is dark data (data that is collected, and stored, but not used for any other purpose). Dark data brings no value to the data owner. With the increase in AI capability, and accessibility, the trend is changing. Customers are looking closer into how dark data can help them to increase efficiency, open new revenue streams, or increase their competitive advantage. However, data on tapes can't be directly accessed. It must be moved to an online storage (disks) first. This movement requires a manual effort. To take advantage of dark data, many organizations are considering migrating the data from tapes to cloud storage. Cloud storage is also great in storing old data, among other benefits. But, unlike tapes, it also provides an easy way to analyze the data, extract business value, and consuming services like AI, Machine Learning, Azure Search, etc.
22
+
A large portion of tapes stores dark data (data that is created, and stored, but not used for any purpose). Dark data brings no value to the data owner. With the increase in AI capability, and accessibility, the trend is changing. Customers are looking into how dark data can help them increase efficiency, open new revenue streams, or increase their competitive advantage. To take advantage of dark data, many organizations are considering migrating the data from tapes to cloud storage. Cloud storage provides an easy way to analyze the data, extract business value, and consume services like AI, Machine Learning, Azure Search, etc.
23
23
24
24
Some of the major reasons we're seeing an increase in tape-to-cloud migrations are:
25
25
26
26
- Extract business value from dark data,
27
27
- Reduce the effort required for managing data with long-term retention,
28
28
- Avoid migration process from one tape generation to another,
29
29
- Reduce the risk for data loss, particularly for older generations of tapes,
30
-
- Replace off-site tape storage facility,
30
+
- Replace off-site tape storage facilities,
31
31
- Simplify disaster recovery processes,
32
32
- Apply modern tools like AI, and ML to historical data.
33
33
@@ -39,7 +39,7 @@ Before a tape migration process starts, options must be carefully considered. Fi
39
39
40
40
|Approach | Pros | Cons |
41
41
| ------- | ---- | ---- |
42
-
| Customer performed migration | - Data never leaves site <br> - No logistics for shipping tapes | - Requires hardware resources <br> - Adds more work to existing personnel <br> - Requires specific knowledge in handling tapes <br> - Possible unknown costs|
42
+
| Customer performed migration | - Data never leaves the site <br> - No logistics for shipping tapes | - Requires hardware resources <br> - Adds more work to personnel <br> - Requires specific knowledge in handling tapes <br> - Possible unknown costs|
43
43
| Tape migration partner | - Simple pricing, and known cost upfront (paid per tape) <br> - No impact on production <br> - No impact on personnel | - Requires logistics for shipping tapes <br> - Security considerations required due to shipping tapes <br> - Multiple copies needed for data availability during migration |
44
44
45
45
Several major considerations can guide our decision on who executes the migration: the customer, or a partner.
@@ -54,7 +54,7 @@ Resources are the most critical part of the tape migration process, and we divid
54
54
| Hardware | - Different tape generations require different types of hardware <br> - Speed of the migration is proportional to available drives |
55
55
| Software | - Access to software that created the data is needed <br> - Access to encryption keys is needed |
56
56
57
-
Hardware is usually the most challenging part. If we're migrating existing tape generations, hardware is available, but used as part of the existing production. When migrating older tape generations, hardware is often not available anymore, and it's harder to acquire. With older tape generation, using a tape migration partner is preferred option.
57
+
Hardware is usually the most challenging part. If we're migrating existing tape generations, hardware is available, but used as part of the existing production. When migrating older tape generations, hardware is often not available anymore, and it's harder to acquire. With older tape generations, using a tape migration partner is a preferred option.
58
58
When hardware is available, careful planning is needed to make sure migration doesn't interfere with the existing production workloads. Here we can apply three different models:
59
59
60
60
| Model | Pros | Cons |
@@ -78,23 +78,23 @@ If there are no available resources to perform the migration, no matter what typ
78
78
1. **Migration performed on customer's site**, where the tape migration partner ships the hardware, hires people, and performs the work at the customer's location. The customer needs to provide access to the tapes, dedicated space for the equipment, network connections, and access to Azure Storage service. The partner is responsible for all other activities.
79
79
1. **Migration performed on partner's site**, where the customer ships the tapes to the partner, and provides access to Azure Storage service. The tape migration partner performs all the work to migrate the data from tapes to Azure Storage.
80
80
81
-
Second option is easier, and more commonly used. Tape migration partners have facilities that are designed and equipped to perform tape migration on a large scale. This option also reduces the risk, and the timeline since partners have more hardware resources available. Performing migration on customer's site is used only when security, and privacy concerns don't allow the customer to ship the tapes to the partner.
81
+
The second option is easier, and more commonly used. Tape migration partners have facilities that are designed, and equipped to perform tape migration on a large scale. This option also reduces the risk, and the timeline since partners have more hardware resources available. Performing migration on the customer's site is used only when security, and privacy concerns don't allow the customer to ship the tapes to the partner.
82
82
83
83
Several partners can perform tape migrations to Azure. The full list of partners can be found on [offline media import](https://azure.microsoft.com/products/databox/offline-media-import/).
84
84
85
85
Here is a simple flowchart to ease the selection process.
There are other considerations that we need to think about before starting the migration. They don't impact our decision on how we perform the migration, but make a huge impact on the later stages, and on migration design. Format that will be used to store the data is the critical consideration for future usability. Data can be stored in a proprietary, or native format. Proprietary formats are stored as a virtual tape, a raw image from the original tape. Native format requires to restore the data from tapes, and store them as files, or objects.
90
+
Data format has a large impact on migration design, and is the critical consideration for future data usability. Data can be stored in a proprietary, or native format. Proprietary formats are commonly stored as virtual tapes, which are raw images of the original tapes. Native format requires restoring the data from tapes, and storing it as files, or objects.
91
91
92
92
| Model | Pros | Cons |
93
93
| ----- | ---- | ---- |
94
-
|Virtual tapes | - Easier, and faster migration <br> - Can recreate identical tape media as the original <br> - No need to have access to the original software to write the data | - Requires maintaining virtual tape inventory <br> - Data stored in application dependent format <br> - Requires original software to restore the data |
94
+
|Virtual tapes | - Easier, and faster migration <br> - Can recreate identical tape media as the original <br> - No need to have access to the original software to write the data | - Requires maintaining virtual tape inventory <br> - Data stored in application dependent format, requires original software to restore the data <br> - Data not accessible by Azure services (AI / ML) without restore|
95
95
| Native files | - Files accessible by any application, and service (AI / ML) <br> - Possible to monetize the data <br> - No need to have access to original software for restores | - More complex migration <br> - Requires access to original software to write the data |
96
96
97
-
Main criteria for deciding the format is how do we plan to use the migrated data. If data is migrated only for long-term retention, then virtual tapes are a great choice. It simplifies the migration. In any other case, storing data in native format is a preferred option. It allows simple usage of data in the future, and opens up many possibilities with data analysis.
97
+
The main criterion for deciding the format is how we plan to use the migrated data. If data is migrated only for long-term retention, then virtual tapes are a great choice. In any other case, storing data in native format is a preferred option. It allows simple usage of data in the future, and opens up many possibilities with data analysis.
98
98
99
99
## Migration process
100
100
@@ -110,26 +110,28 @@ Information phase is critical for gathering key requirements. Gathered informati
110
110
- What software was used to write the data on tapes, is that software still available?
111
111
- What is the format used to write the data on tapes, is the format open, or proprietary, is compression applied?
112
112
- Was encryption used, and if yes, what is the most secure option to exchange encryption keys?
113
-
- What is the target region, and storage service?
114
-
- What is available network bandwidth for data migration?
113
+
- What is the target region?
114
+
- What storage service is used?
115
+
- What regulatory requirements are critical (HIPAA, GDPR, etc.)? Is chain of custody mandatory?
116
+
- What is the migration deadline? Are there any critical milestones?
117
+
- How much network bandwidth is available for migration?
115
118
- Where are tapes physically stored, and can they be shipped?
116
-
- Are any regulatory requirements critical (HIPAA, GDPR, etc.)? Is chain of custody mandatory?
117
119
- Are tapes needed after migration?
118
-
- What is the migration deadline? Are there any critical milestones?
119
120
- How to maintain temperature, and humidity for tapes during migration / transport?
121
+
- Who are the main stakeholders?
120
122
121
123
### Preparation phase
122
124
123
-
After we gathered basic information, we can prepare for the migration. Every migration is different. Preparation phase can include many different steps, depending on the goals. But there are some common steps most migrations go through:
125
+
After we've gathered the basic information, we can prepare for the migration. The preparation phase can include many different steps, but there are some common steps most migrations go through:
124
126
125
-
1.**Data analysis** provides information on the data that needs to be migrated. Information is critical to estimate how fast data can be read from tapes, and how much parallelism we need to achieve to successfully finish the migration before the deadline. It impacts estimates on the required hardware (libraries, robots, drives). Data analysis is done by sampling multiple tapes that represent the data set to be migrated. Typical information we are looking for is:
127
+
1. **Data analysis** provides information on the data that needs to be migrated. Information is critical to estimate how fast data can be read from tapes, and how much parallelism we need to achieve to successfully finish the migration before the deadline. It impacts estimates on the required hardware (libraries, robots, drives). Data analysis is done by sampling multiple tapes that represent the data set to be migrated. Typical information we are looking for is:
126
128
- file sizes,
127
129
- amount of data stored per tape,
128
130
- number of files per tape,
129
131
- minimum, and maximum file sizes,
130
132
- file types.
131
-
1.**Data quality** helps in estimating final, and unique dataset that needs to be migrated. One of the most common issues with tape migration is duplication of data. Tape migration is ideal time to clean all duplicated data. This process improves data quality for future use, it reduces cost, and the duration of the migration.
132
-
1.**Data prioritization** determines the order in which the data can be migrated. Ideally, we want to achieve direct streaming from each tape instead of randomly reading files from different tapes (to avoid constant loading, and unloading). This approach allows the highest possible throughput, and is always the fastest migration path. Data prioritization takes business requirements and technical feasibility to achieve the best possible migration option.
133
+
1. **Data quality** helps in estimating the final, unique dataset that needs to be migrated. One of the most common issues with tape migration is duplication of data. Tape migration is an ideal time to clean up duplicated data. This process improves data quality for future use, and reduces the cost, and the duration of the migration.
134
+
1. **Data prioritization** determines the order in which the data can be migrated. Ideally, we want to achieve direct streaming from each tape instead of randomly reading files from different tapes (to avoid constant loading, unloading, and seeks). This approach achieves the highest possible throughput, and is always the fastest migration path. Data prioritization takes business requirements, and technical feasibility into account to achieve the best results.
133
135
1. **Migration design** includes all the technical aspects of the migration, and the gathered information to form a final migration process. It's a written document that becomes the source of truth for the remaining stages. It must contain at least:
134
136
135
137
- clear migration process, and migration deadline,
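
The data analysis step ties tape read speed, and parallelism to the migration deadline. A rough way to reason about that relationship is sketched below. This is only a back-of-the-envelope estimate, not a sizing tool: the function name `drives_needed`, the `efficiency` factor (the share of time a drive spends streaming rather than loading, unloading, or seeking), and all numbers in the usage note are assumptions made for illustration, and should be replaced with values measured while sampling real tapes.

```python
import math


def drives_needed(total_tb, drive_mb_per_s, deadline_days,
                  efficiency=0.5, hours_per_day=24):
    """Estimate how many tape drives must read in parallel to meet a deadline.

    total_tb       -- amount of data to migrate, in decimal terabytes
    drive_mb_per_s -- sustained streaming speed of one drive, in MB/s
    deadline_days  -- number of days available for reading tapes
    efficiency     -- assumed fraction of time a drive actually streams data
    hours_per_day  -- hours per day the drives are available for migration
    """
    if min(total_tb, drive_mb_per_s, deadline_days, hours_per_day) <= 0:
        raise ValueError("all inputs must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")

    total_mb = total_tb * 1_000_000                     # decimal TB -> MB
    available_seconds = deadline_days * hours_per_day * 3600
    mb_per_drive = drive_mb_per_s * efficiency * available_seconds

    # Round up: a fraction of a drive still requires a whole drive.
    return math.ceil(total_mb / mb_per_drive)
```

For example, `drives_needed(1000, 300, 30, efficiency=0.5)` models 1,000 TB read at a hypothetical 300 MB/s per drive over 30 days, with drives streaming half of the time. Lowering `hours_per_day` models the case where drives are shared with production workloads.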
@@ -142,7 +144,7 @@ After we gathered basic information, we can prepare for the migration. Every mig
142
144
### Migration phase
143
145
144
146
Once the migration design is final, we start the migration process. Before ramping up to full migration pace, we always perform a test with a smaller sample. The goal of the test is to make sure that the end-to-end process works. It allows us to make tweaks, and improve the process. Once the test is successful, we ramp up fully until the migration is done.
145
-
For each file we migrate, we need to perform data validation to make sure that data wasn't corrupted during the migration process. In ideal situation, source data already contains hash values that can be easily compared to hash values post-migration. If they match, file is marked as migrated. If not, file is discarded, and migrated again. Sometimes. The original data is corrupted on the source tapes, and that is the reason why having the original hash values helps with catching those cases. In cases like that, we can read the data from secondary copy if it exists. Data validation process is a critical component for migration design and process for handling failed validation must be defined. Migration phase is also constantly monitored to make sure we can react to unpredictable situation, and adapt to it. Regular reporting to main stakeholders is important to keep the migration on track.
147
+
For each file we migrate, we need to perform data validation to make sure that data wasn't corrupted during the migration process. In an ideal situation, source data already contains hash values that can be easily compared to hash values post-migration. If hashes don't exist, they must be calculated before the file is migrated. If hashes match, the file is marked as migrated. If not, the file is discarded, and migrated again. Sometimes the data is corrupted on the source tapes. Having the original hash values helps with catching those rare cases. If they happen, we can read the data from a secondary copy if it exists. Data validation process is a critical component of a migration design. The process for handling failed validation must be defined. Migration phase is also constantly monitored to make sure we can react to unpredictable situations, and adapt to them. Regular reporting to main stakeholders is important to keep the migration on track.
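
The validation rule described above (compare the hash recorded for the source with the hash of the migrated copy, mark the file as migrated on a match, otherwise discard, and retry) can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 is chosen only as an example algorithm, the helper names `sha256_of`, and `validate_migrated_file` are hypothetical, and a local file path stands in for the migrated copy. A real migration would read the copy back from the target storage service, or use whatever checksum the backup software recorded.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_migrated_file(source_hash, migrated_path):
    """Compare the source hash with the hash of the migrated copy.

    Returns "migrated" when the hashes match, and "retry" when they don't,
    in which case the copy should be discarded, and the file migrated again.
    """
    migrated_hash = sha256_of(migrated_path)
    return "migrated" if migrated_hash == source_hash.lower() else "retry"
```

If the source has no recorded hash, `sha256_of` can be run on the restored file before it's migrated, and the result kept as `source_hash`. A repeated `"retry"` result for the same file is a hint that the source tape itself may be corrupted, which is when a secondary copy becomes useful.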