Working with large datasets

From Earlham CS Department

Latest revision as of 14:44, 26 November 2022

Distinction between NFS mount and local disk storage

NFS (Network File System) is a distributed file system protocol that lets a server share directories from its own disk storage with client machines on a network. The advantages of a distributed file system are centralized administration, easy sharing of data between clients, and higher security.

The basic architecture of NFS is as follows: a client-side file system and a server-side file system are connected over the network. Whenever a user requests data on a client machine, the request is redirected to the server over the network. The server then retrieves the files from its local disk and sends them to the client.

The NFS server on the cluster is Hopper. Home directories on all cluster machines are mounted via NFS from Hopper. That is why your user directory (which lives in the home directory) is the same on Pollock, Lovelace, or any other cluster machine.

The disadvantage of NFS is that it operates over the network. Reading data over the network is typically much slower than reading it directly from a local disk. That is why we encourage you not to run data-intensive programs in your user directory (unless you are working on Hopper).

Instead, you can copy your data files to the local disk of a machine that you are working on and make use of symlinks so that you can continue working in the same environment.
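If you are unsure whether a directory lives on NFS or on local disk, you can ask the system. The following is a minimal sketch that assumes GNU coreutils stat (as found on most Linux systems); the helper name fs_type is made up for illustration.

```shell
# Print the filesystem type of a path, e.g. "nfs" for an NFS mount.
# Assumption: GNU coreutils stat; -f queries the filesystem, %T prints its type.
fs_type() {
    stat -f -c %T "$1"
}

# Check the current directory and /tmp.
# On the cluster you would pass your user directory or /mounts/<host> instead.
fs_type .
fs_type /tmp
```

A path that reports nfs is served over the network; anything else (for example ext2/ext3, xfs or tmpfs) is local to the machine.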

Example workflow

Suppose your data files are in a directory called data and you’re working on pollock.

First copy the data to the local disk:

 >> cp -r data /mounts/pollock

(for other machines use the same pattern: /mounts/<host>)
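Before removing anything, it is worth checking that the copy is complete. One way to do that, sketched here with temporary stand-in directories so it can be run anywhere, is a recursive diff. SRC and DEST are hypothetical names; on the cluster they would be your data directory and /mounts/pollock.

```shell
# Hypothetical stand-ins for the real locations
SRC=$(mktemp -d)/data      # plays the role of ./data in your user directory
DEST=$(mktemp -d)          # plays the role of /mounts/pollock
mkdir -p "$SRC"
echo "sample" > "$SRC/file.txt"

# Same copy command as above
cp -r "$SRC" "$DEST"

# diff -r compares the two trees recursively; exit status 0 means no differences
if diff -r "$SRC" "$DEST/data" > /dev/null; then
    COPY_OK=yes
else
    COPY_OK=no
fi
echo "copy verified: $COPY_OK"
```

Only when the comparison reports no differences is it safe to delete the original.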

Once you have made sure that your files were copied successfully, you can remove the original directory with the command:

 >> rm -r data

Optional: You can access your files directly from /mounts/pollock, but I prefer to have a symlink in my user directory that points to the data on the local disk.

Use a soft (symbolic) link rather than a hard link: it acts just like a shortcut, an indirect pointer to a file or directory.

To create a soft link, use the command:

 >> ln -s /mounts/pollock/data data

Make sure that the symlink was created:

 >> ls -l 

Instead of having a regular name entry, the line for the symlink in the output of ls should end with something similar to this:

 data -> /mounts/pollock/data
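The same check can be scripted. The sketch below uses temporary stand-in directories (TARGET and WORK are hypothetical names for /mounts/pollock/data and your user directory) and relies only on the standard test operator -L and the readlink command.

```shell
# Hypothetical stand-ins for the real locations
TARGET=$(mktemp -d)/data   # plays the role of /mounts/pollock/data
WORK=$(mktemp -d)          # plays the role of your user directory
mkdir -p "$TARGET"

# Create the soft link, as in the ln -s command above
ln -s "$TARGET" "$WORK/data"

# -L is true only when the name is a symbolic link
if [ -L "$WORK/data" ]; then
    echo "data is a symlink"
fi

# readlink prints the path the link points to
LINK_TARGET=$(readlink "$WORK/data")
echo "data -> $LINK_TARGET"
```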

You can remove the symbolic link using either:

 >> unlink <name of symlink>
 or
 >> rm <name of symlink>

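Note that, even if the symlink points to a directory, you should use the rm command without the recursive option (-r).

Warning: Avoid rm -r on a symlink to a directory. With a trailing slash (rm -r <name of symlink>/), the path resolves to the directory the link points to, and on some systems this deletes the contents of the actual directory instead of the symlink!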
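As a cautious sketch of the removal step, the snippet below confirms that the name really is a symlink before deleting it with plain rm, then checks that the data it pointed to is untouched. REAL and WORK are hypothetical temporary directories standing in for /mounts/pollock/data and your user directory.

```shell
# Hypothetical stand-ins for the real locations
REAL=$(mktemp -d)          # plays the role of /mounts/pollock/data
WORK=$(mktemp -d)          # plays the role of your user directory
echo "keep me" > "$REAL/file.txt"
ln -s "$REAL" "$WORK/data"

# Remove the link only if it is a symlink: no -r and no trailing slash
if [ -L "$WORK/data" ]; then
    rm "$WORK/data"
fi

# The link is gone, the real data is still there
ls "$REAL"
```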

Tested and working 2022